2016-10-14 18:10:29

by Boris Ostrovsky

Subject: [PATCH 0/8] PVH v2 support

(Resending with proper mailing lists included)

PVH v2 support for unprivileged guests.

The previous version was posted a long time ago. Major changes:
1. Drop PVH v1 support.
2. Enable ACPI. This allows us to use much more of the native code and
makes this series much simpler (for example, PV-style VCPU bringup is
no longer necessary).
3. Refactor the 32-bit pagetable setup out of native code (by Matt Fleming).

This has been tested on Intel/AMD, both 32- and 64-bit, including CPU
hotplug and save/restore. Compile-tested for ARM.


Boris Ostrovsky (8):
xen/x86: Remove PVH support
x86/head: Refactor 32-bit pgtable setup
xen/pvh: Import PVH-related Xen public interfaces
xen/pvh: Bootstrap PVH guest
xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC
xen/pvh: Initialize grant table for PVH guests
xen/pvh: PVH guests always have PV devices
xen/pvh: Enable CPU hotplug

arch/x86/Makefile | 2 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/head_32.S | 168 +---------------------
arch/x86/kernel/pgtable_32.S | 196 +++++++++++++++++++++++++
arch/x86/xen/Kconfig | 2 +-
arch/x86/xen/Makefile | 1 +
arch/x86/xen/enlighten.c | 251 ++++++++++++++++-----------------
arch/x86/xen/mmu.c | 21 +--
arch/x86/xen/platform-pci-unplug.c | 4 +-
arch/x86/xen/setup.c | 37 +----
arch/x86/xen/smp.c | 78 ++++------
arch/x86/xen/smp.h | 8 --
arch/x86/xen/xen-head.S | 62 +-------
arch/x86/xen/xen-ops.h | 1 -
arch/x86/xen/xen-pvh.S | 143 +++++++++++++++++++
drivers/xen/cpu_hotplug.c | 2 +-
drivers/xen/events/events_base.c | 1 -
drivers/xen/grant-table.c | 8 +-
include/xen/interface/elfnote.h | 12 +-
include/xen/interface/hvm/hvm_vcpu.h | 143 +++++++++++++++++++
include/xen/interface/hvm/start_info.h | 98 +++++++++++++
include/xen/xen.h | 12 +-
22 files changed, 764 insertions(+), 488 deletions(-)
create mode 100644 arch/x86/kernel/pgtable_32.S
create mode 100644 arch/x86/xen/xen-pvh.S
create mode 100644 include/xen/interface/hvm/hvm_vcpu.h
create mode 100644 include/xen/interface/hvm/start_info.h

--
1.8.3.1


2016-10-14 18:10:41

by Boris Ostrovsky

Subject: [PATCH 1/8] xen/x86: Remove PVH support

We are replacing the existing PVH guest support with a new implementation.

Signed-off-by: Boris Ostrovsky <[email protected]>
---
arch/x86/xen/enlighten.c | 140 ++++++---------------------------------
arch/x86/xen/mmu.c | 21 +-----
arch/x86/xen/setup.c | 37 +----------
arch/x86/xen/smp.c | 78 ++++++++--------------
arch/x86/xen/smp.h | 8 ---
arch/x86/xen/xen-head.S | 62 ++---------------
arch/x86/xen/xen-ops.h | 1 -
drivers/xen/events/events_base.c | 1 -
include/xen/xen.h | 13 +---
9 files changed, 54 insertions(+), 307 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index c0fdd57..dc4ed0c 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1149,10 +1149,11 @@ void xen_setup_vcpu_info_placement(void)
xen_vcpu_setup(cpu);
}

- /* xen_vcpu_setup managed to place the vcpu_info within the
- * percpu area for all cpus, so make use of it. Note that for
- * PVH we want to use native IRQ mechanism. */
- if (have_vcpu_info_placement && !xen_pvh_domain()) {
+ /*
+ * xen_vcpu_setup managed to place the vcpu_info within the
+ * percpu area for all cpus, so make use of it.
+ */
+ if (have_vcpu_info_placement) {
pv_irq_ops.save_fl = __PV_IS_CALLEE_SAVE(xen_save_fl_direct);
pv_irq_ops.restore_fl = __PV_IS_CALLEE_SAVE(xen_restore_fl_direct);
pv_irq_ops.irq_disable = __PV_IS_CALLEE_SAVE(xen_irq_disable_direct);
@@ -1426,49 +1427,9 @@ static void __init xen_boot_params_init_edd(void)
* Set up the GDT and segment registers for -fstack-protector. Until
* we do this, we have to be careful not to call any stack-protected
* function, which is most of the kernel.
- *
- * Note, that it is __ref because the only caller of this after init
- * is PVH which is not going to use xen_load_gdt_boot or other
- * __init functions.
*/
-static void __ref xen_setup_gdt(int cpu)
+static void xen_setup_gdt(int cpu)
{
- if (xen_feature(XENFEAT_auto_translated_physmap)) {
-#ifdef CONFIG_X86_64
- unsigned long dummy;
-
- load_percpu_segment(cpu); /* We need to access per-cpu area */
- switch_to_new_gdt(cpu); /* GDT and GS set */
-
- /* We are switching of the Xen provided GDT to our HVM mode
- * GDT. The new GDT has __KERNEL_CS with CS.L = 1
- * and we are jumping to reload it.
- */
- asm volatile ("pushq %0\n"
- "leaq 1f(%%rip),%0\n"
- "pushq %0\n"
- "lretq\n"
- "1:\n"
- : "=&r" (dummy) : "0" (__KERNEL_CS));
-
- /*
- * While not needed, we also set the %es, %ds, and %fs
- * to zero. We don't care about %ss as it is NULL.
- * Strictly speaking this is not needed as Xen zeros those
- * out (and also MSR_FS_BASE, MSR_GS_BASE, MSR_KERNEL_GS_BASE)
- *
- * Linux zeros them in cpu_init() and in secondary_startup_64
- * (for BSP).
- */
- loadsegment(es, 0);
- loadsegment(ds, 0);
- loadsegment(fs, 0);
-#else
- /* PVH: TODO Implement. */
- BUG();
-#endif
- return; /* PVH does not need any PV GDT ops. */
- }
pv_cpu_ops.write_gdt_entry = xen_write_gdt_entry_boot;
pv_cpu_ops.load_gdt = xen_load_gdt_boot;

@@ -1479,59 +1440,6 @@ static void __ref xen_setup_gdt(int cpu)
pv_cpu_ops.load_gdt = xen_load_gdt;
}

-#ifdef CONFIG_XEN_PVH
-/*
- * A PV guest starts with default flags that are not set for PVH, set them
- * here asap.
- */
-static void xen_pvh_set_cr_flags(int cpu)
-{
-
- /* Some of these are setup in 'secondary_startup_64'. The others:
- * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests
- * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */
- write_cr0(read_cr0() | X86_CR0_MP | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM);
-
- if (!cpu)
- return;
- /*
- * For BSP, PSE PGE are set in probe_page_size_mask(), for APs
- * set them here. For all, OSFXSR OSXMMEXCPT are set in fpu__init_cpu().
- */
- if (boot_cpu_has(X86_FEATURE_PSE))
- cr4_set_bits_and_update_boot(X86_CR4_PSE);
-
- if (boot_cpu_has(X86_FEATURE_PGE))
- cr4_set_bits_and_update_boot(X86_CR4_PGE);
-}
-
-/*
- * Note, that it is ref - because the only caller of this after init
- * is PVH which is not going to use xen_load_gdt_boot or other
- * __init functions.
- */
-void __ref xen_pvh_secondary_vcpu_init(int cpu)
-{
- xen_setup_gdt(cpu);
- xen_pvh_set_cr_flags(cpu);
-}
-
-static void __init xen_pvh_early_guest_init(void)
-{
- if (!xen_feature(XENFEAT_auto_translated_physmap))
- return;
-
- BUG_ON(!xen_feature(XENFEAT_hvm_callback_vector));
-
- xen_pvh_early_cpu_init(0, false);
- xen_pvh_set_cr_flags(0);
-
-#ifdef CONFIG_X86_32
- BUG(); /* PVH: Implement proper support. */
-#endif
-}
-#endif /* CONFIG_XEN_PVH */
-
static void __init xen_dom0_set_legacy_features(void)
{
x86_platform.legacy.rtc = 1;
@@ -1568,24 +1476,17 @@ asmlinkage __visible void __init xen_start_kernel(void)
xen_domain_type = XEN_PV_DOMAIN;

xen_setup_features();
-#ifdef CONFIG_XEN_PVH
- xen_pvh_early_guest_init();
-#endif
+
xen_setup_machphys_mapping();

/* Install Xen paravirt ops */
pv_info = xen_info;
pv_init_ops = xen_init_ops;
- if (!xen_pvh_domain()) {
- pv_cpu_ops = xen_cpu_ops;
+ pv_cpu_ops = xen_cpu_ops;

- x86_platform.get_nmi_reason = xen_get_nmi_reason;
- }
+ x86_platform.get_nmi_reason = xen_get_nmi_reason;

- if (xen_feature(XENFEAT_auto_translated_physmap))
- x86_init.resources.memory_setup = xen_auto_xlated_memory_setup;
- else
- x86_init.resources.memory_setup = xen_memory_setup;
+ x86_init.resources.memory_setup = xen_memory_setup;
x86_init.oem.arch_setup = xen_arch_setup;
x86_init.oem.banner = xen_banner;

@@ -1678,18 +1579,15 @@ asmlinkage __visible void __init xen_start_kernel(void)
/* set the limit of our address space */
xen_reserve_top();

- /* PVH: runs at default kernel iopl of 0 */
- if (!xen_pvh_domain()) {
- /*
- * We used to do this in xen_arch_setup, but that is too late
- * on AMD were early_cpu_init (run before ->arch_setup()) calls
- * early_amd_init which pokes 0xcf8 port.
- */
- set_iopl.iopl = 1;
- rc = HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
- if (rc != 0)
- xen_raw_printk("physdev_op failed %d\n", rc);
- }
+ /*
+ * We used to do this in xen_arch_setup, but that is too late
+ * on AMD where early_cpu_init (run before ->arch_setup()) calls
+ * early_amd_init which pokes 0xcf8 port.
+ */
+ set_iopl.iopl = 1;
+ rc = HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl);
+ if (rc != 0)
+ xen_raw_printk("physdev_op failed %d\n", rc);

#ifdef CONFIG_X86_32
/* set up basic CPUID stuff */
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7d5afdb..f6740b5 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1792,10 +1792,6 @@ static void __init set_page_prot_flags(void *addr, pgprot_t prot,
unsigned long pfn = __pa(addr) >> PAGE_SHIFT;
pte_t pte = pfn_pte(pfn, prot);

- /* For PVH no need to set R/O or R/W to pin them or unpin them. */
- if (xen_feature(XENFEAT_auto_translated_physmap))
- return;
-
if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, flags))
BUG();
}
@@ -1902,8 +1898,7 @@ static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
* level2_ident_pgt, and level2_kernel_pgt. This means that only the
* kernel has a physical mapping to start with - but that's enough to
* get __va working. We need to fill in the rest of the physical
- * mapping once some sort of allocator has been set up. NOTE: for
- * PVH, the page tables are native.
+ * mapping once some sort of allocator has been set up.
*/
void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
{
@@ -2812,16 +2807,6 @@ static int do_remap_gfn(struct vm_area_struct *vma,

BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));

- if (xen_feature(XENFEAT_auto_translated_physmap)) {
-#ifdef CONFIG_XEN_PVH
- /* We need to update the local page tables and the xen HAP */
- return xen_xlate_remap_gfn_array(vma, addr, gfn, nr, err_ptr,
- prot, domid, pages);
-#else
- return -EINVAL;
-#endif
- }
-
rmd.mfn = gfn;
rmd.prot = prot;
/* We use the err_ptr to indicate if there we are doing a contiguous
@@ -2915,10 +2900,6 @@ int xen_unmap_domain_gfn_range(struct vm_area_struct *vma,
if (!pages || !xen_feature(XENFEAT_auto_translated_physmap))
return 0;

-#ifdef CONFIG_XEN_PVH
- return xen_xlate_unmap_gfn_range(vma, numpgs, pages);
-#else
return -EINVAL;
-#endif
}
EXPORT_SYMBOL_GPL(xen_unmap_domain_gfn_range);
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index f8960fc..6999016 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -915,39 +915,6 @@ char * __init xen_memory_setup(void)
}

/*
- * Machine specific memory setup for auto-translated guests.
- */
-char * __init xen_auto_xlated_memory_setup(void)
-{
- struct xen_memory_map memmap;
- int i;
- int rc;
-
- memmap.nr_entries = E820MAX;
- set_xen_guest_handle(memmap.buffer, xen_e820_map);
-
- rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
- if (rc < 0)
- panic("No memory map (%d)\n", rc);
-
- xen_e820_map_entries = memmap.nr_entries;
-
- sanitize_e820_map(xen_e820_map, ARRAY_SIZE(xen_e820_map),
- &xen_e820_map_entries);
-
- for (i = 0; i < xen_e820_map_entries; i++)
- e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
- xen_e820_map[i].type);
-
- /* Remove p2m info, it is not needed. */
- xen_start_info->mfn_list = 0;
- xen_start_info->first_p2m_pfn = 0;
- xen_start_info->nr_p2m_frames = 0;
-
- return "Xen";
-}
-
-/*
* Set the bit indicating "nosegneg" library variants should be used.
* We only need to bother in pure 32-bit mode; compat 32-bit processes
* can have un-truncated segments, so wrapping around is allowed.
@@ -1032,8 +999,8 @@ void __init xen_pvmmu_arch_setup(void)
void __init xen_arch_setup(void)
{
xen_panic_handler_init();
- if (!xen_feature(XENFEAT_auto_translated_physmap))
- xen_pvmmu_arch_setup();
+
+ xen_pvmmu_arch_setup();

#ifdef CONFIG_ACPI
if (!(xen_start_info->flags & SIF_INITDOMAIN)) {
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 9fa27ce..914e320 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -105,18 +105,8 @@ static void cpu_bringup(void)
local_irq_enable();
}

-/*
- * Note: cpu parameter is only relevant for PVH. The reason for passing it
- * is we can't do smp_processor_id until the percpu segments are loaded, for
- * which we need the cpu number! So we pass it in rdi as first parameter.
- */
-asmlinkage __visible void cpu_bringup_and_idle(int cpu)
+asmlinkage __visible void cpu_bringup_and_idle(void)
{
-#ifdef CONFIG_XEN_PVH
- if (xen_feature(XENFEAT_auto_translated_physmap) &&
- xen_feature(XENFEAT_supervisor_mode_kernel))
- xen_pvh_secondary_vcpu_init(cpu);
-#endif
cpu_bringup();
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
}
@@ -410,61 +400,47 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
gdt = get_cpu_gdt_table(cpu);

#ifdef CONFIG_X86_32
- /* Note: PVH is not yet supported on x86_32. */
ctxt->user_regs.fs = __KERNEL_PERCPU;
ctxt->user_regs.gs = __KERNEL_STACK_CANARY;
#endif
memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));

- if (!xen_feature(XENFEAT_auto_translated_physmap)) {
- ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle;
- ctxt->flags = VGCF_IN_KERNEL;
- ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
- ctxt->user_regs.ds = __USER_DS;
- ctxt->user_regs.es = __USER_DS;
- ctxt->user_regs.ss = __KERNEL_DS;
+ ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle;
+ ctxt->flags = VGCF_IN_KERNEL;
+ ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */
+ ctxt->user_regs.ds = __USER_DS;
+ ctxt->user_regs.es = __USER_DS;
+ ctxt->user_regs.ss = __KERNEL_DS;

- xen_copy_trap_info(ctxt->trap_ctxt);
+ xen_copy_trap_info(ctxt->trap_ctxt);

- ctxt->ldt_ents = 0;
+ ctxt->ldt_ents = 0;

- BUG_ON((unsigned long)gdt & ~PAGE_MASK);
+ BUG_ON((unsigned long)gdt & ~PAGE_MASK);

- gdt_mfn = arbitrary_virt_to_mfn(gdt);
- make_lowmem_page_readonly(gdt);
- make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));
+ gdt_mfn = arbitrary_virt_to_mfn(gdt);
+ make_lowmem_page_readonly(gdt);
+ make_lowmem_page_readonly(mfn_to_virt(gdt_mfn));

- ctxt->gdt_frames[0] = gdt_mfn;
- ctxt->gdt_ents = GDT_ENTRIES;
+ ctxt->gdt_frames[0] = gdt_mfn;
+ ctxt->gdt_ents = GDT_ENTRIES;

- ctxt->kernel_ss = __KERNEL_DS;
- ctxt->kernel_sp = idle->thread.sp0;
+ ctxt->kernel_ss = __KERNEL_DS;
+ ctxt->kernel_sp = idle->thread.sp0;

#ifdef CONFIG_X86_32
- ctxt->event_callback_cs = __KERNEL_CS;
- ctxt->failsafe_callback_cs = __KERNEL_CS;
+ ctxt->event_callback_cs = __KERNEL_CS;
+ ctxt->failsafe_callback_cs = __KERNEL_CS;
#else
- ctxt->gs_base_kernel = per_cpu_offset(cpu);
-#endif
- ctxt->event_callback_eip =
- (unsigned long)xen_hypervisor_callback;
- ctxt->failsafe_callback_eip =
- (unsigned long)xen_failsafe_callback;
- ctxt->user_regs.cs = __KERNEL_CS;
- per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
- }
-#ifdef CONFIG_XEN_PVH
- else {
- /*
- * The vcpu comes on kernel page tables which have the NX pte
- * bit set. This means before DS/SS is touched, NX in
- * EFER must be set. Hence the following assembly glue code.
- */
- ctxt->user_regs.eip = (unsigned long)xen_pvh_early_cpu_init;
- ctxt->user_regs.rdi = cpu;
- ctxt->user_regs.rsi = true; /* entry == true */
- }
+ ctxt->gs_base_kernel = per_cpu_offset(cpu);
#endif
+ ctxt->event_callback_eip =
+ (unsigned long)xen_hypervisor_callback;
+ ctxt->failsafe_callback_eip =
+ (unsigned long)xen_failsafe_callback;
+ ctxt->user_regs.cs = __KERNEL_CS;
+ per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
+
ctxt->user_regs.esp = idle->thread.sp0 - sizeof(struct pt_regs);
ctxt->ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(swapper_pg_dir));
if (HYPERVISOR_vcpu_op(VCPUOP_initialise, xen_vcpu_nr(cpu), ctxt))
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index c5c16dc..9beef33 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -21,12 +21,4 @@ static inline int xen_smp_intr_init(unsigned int cpu)
static inline void xen_smp_intr_free(unsigned int cpu) {}
#endif /* CONFIG_SMP */

-#ifdef CONFIG_XEN_PVH
-extern void xen_pvh_early_cpu_init(int cpu, bool entry);
-#else
-static inline void xen_pvh_early_cpu_init(int cpu, bool entry)
-{
-}
-#endif
-
#endif
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 7f8d8ab..37794e4 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -16,25 +16,6 @@
#include <xen/interface/xen-mca.h>
#include <asm/xen/interface.h>

-#ifdef CONFIG_XEN_PVH
-#define PVH_FEATURES_STR "|writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel"
-/* Note the lack of 'hvm_callback_vector'. Older hypervisor will
- * balk at this being part of XEN_ELFNOTE_FEATURES, so we put it in
- * XEN_ELFNOTE_SUPPORTED_FEATURES which older hypervisors will ignore.
- */
-#define PVH_FEATURES ((1 << XENFEAT_writable_page_tables) | \
- (1 << XENFEAT_auto_translated_physmap) | \
- (1 << XENFEAT_supervisor_mode_kernel) | \
- (1 << XENFEAT_hvm_callback_vector))
-/* The XENFEAT_writable_page_tables is not stricly necessary as we set that
- * up regardless whether this CONFIG option is enabled or not, but it
- * clarifies what the right flags need to be.
- */
-#else
-#define PVH_FEATURES_STR ""
-#define PVH_FEATURES (0)
-#endif
-
__INIT
ENTRY(startup_xen)
cld
@@ -54,41 +35,6 @@ ENTRY(startup_xen)

__FINIT

-#ifdef CONFIG_XEN_PVH
-/*
- * xen_pvh_early_cpu_init() - early PVH VCPU initialization
- * @cpu: this cpu number (%rdi)
- * @entry: true if this is a secondary vcpu coming up on this entry
- * point, false if this is the boot CPU being initialized for
- * the first time (%rsi)
- *
- * Note: This is called as a function on the boot CPU, and is the entry point
- * on the secondary CPU.
- */
-ENTRY(xen_pvh_early_cpu_init)
- mov %rsi, %r11
-
- /* Gather features to see if NX implemented. */
- mov $0x80000001, %eax
- cpuid
- mov %edx, %esi
-
- mov $MSR_EFER, %ecx
- rdmsr
- bts $_EFER_SCE, %eax
-
- bt $20, %esi
- jnc 1f /* No NX, skip setting it */
- bts $_EFER_NX, %eax
-1: wrmsr
-#ifdef CONFIG_SMP
- cmp $0, %r11b
- jne cpu_bringup_and_idle
-#endif
- ret
-
-#endif /* CONFIG_XEN_PVH */
-
.pushsection .text
.balign PAGE_SIZE
ENTRY(hypercall_page)
@@ -114,10 +60,10 @@ ENTRY(hypercall_page)
#endif
ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, _ASM_PTR startup_xen)
ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
- ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .ascii "!writable_page_tables|pae_pgdir_above_4gb"; .asciz PVH_FEATURES_STR)
- ELFNOTE(Xen, XEN_ELFNOTE_SUPPORTED_FEATURES, .long (PVH_FEATURES) |
- (1 << XENFEAT_writable_page_tables) |
- (1 << XENFEAT_dom0))
+ ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,
+ .ascii "!writable_page_tables|pae_pgdir_above_4gb")
+ ELFNOTE(Xen, XEN_ELFNOTE_SUPPORTED_FEATURES,
+ .long (1 << XENFEAT_writable_page_tables) | (1 << XENFEAT_dom0))
ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz "yes")
ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz "generic")
ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3cbce3b..3688ecc 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -146,5 +146,4 @@ static inline void __init xen_efi_init(void)

extern int xen_panic_handler_init(void);

-void xen_pvh_secondary_vcpu_init(int cpu);
#endif /* XEN_OPS_H */
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 9ecfcdc..6a0aa5c 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1706,7 +1706,6 @@ void __init xen_init_IRQ(void)
pirq_eoi_map = (void *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
eoi_gmfn.gmfn = virt_to_gfn(pirq_eoi_map);
rc = HYPERVISOR_physdev_op(PHYSDEVOP_pirq_eoi_gmfn_v2, &eoi_gmfn);
- /* TODO: No PVH support for PIRQ EOI */
if (rc != 0) {
free_page((unsigned long) pirq_eoi_map);
pirq_eoi_map = NULL;
diff --git a/include/xen/xen.h b/include/xen/xen.h
index f0f0252..d0f9684 100644
--- a/include/xen/xen.h
+++ b/include/xen/xen.h
@@ -29,17 +29,6 @@ enum xen_domain_type {
#define xen_initial_domain() (0)
#endif /* CONFIG_XEN_DOM0 */

-#ifdef CONFIG_XEN_PVH
-/* This functionality exists only for x86. The XEN_PVHVM support exists
- * only in x86 world - hence on ARM it will be always disabled.
- * N.B. ARM guests are neither PV nor HVM nor PVHVM.
- * It's a bit like PVH but is different also (it's further towards the H
- * end of the spectrum than even PVH).
- */
-#include <xen/features.h>
-#define xen_pvh_domain() (xen_pv_domain() && \
- xen_feature(XENFEAT_auto_translated_physmap))
-#else
#define xen_pvh_domain() (0)
-#endif
+
#endif /* _XEN_XEN_H */
--
1.8.3.1

2016-10-14 18:10:54

by Boris Ostrovsky

Subject: [PATCH 4/8] xen/pvh: Bootstrap PVH guest

Start the PVH guest at the XEN_ELFNOTE_PHYS32_ENTRY address. Set up the
hypercall page, initialize boot_params and enable early page tables.

Since this stub is executed before the kernel entry point, we cannot use
variables in .bss, which is cleared by the kernel. We explicitly place
variables that are initialized here into .data.
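
As a concrete illustration of the .bss constraint (an editorial sketch;
the patch below does exactly this for xen_pvh, pvh_start_info and
pvh_bootparams):

	/* Written by the PVH stub before startup_{32|64} clears .bss,
	 * so it must be forced out of .bss and into .data. */
	int xen_pvh __attribute__((section(".data"))) = 0;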

Signed-off-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/xen/Kconfig | 2 +-
arch/x86/xen/Makefile | 1 +
arch/x86/xen/enlighten.c | 87 +++++++++++++++++++++++++++-
arch/x86/xen/xen-pvh.S | 143 +++++++++++++++++++++++++++++++++++++++++++++++
include/xen/xen.h | 5 ++
5 files changed, 236 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/xen/xen-pvh.S

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index c7b15f3..76b6dbd 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -53,5 +53,5 @@ config XEN_DEBUG_FS

config XEN_PVH
bool "Support for running as a PVH guest"
- depends on X86_64 && XEN && XEN_PVHVM
+ depends on XEN && XEN_PVHVM && ACPI
def_bool n
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index e47e527..cb0164a 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
obj-$(CONFIG_XEN_DOM0) += vga.o
obj-$(CONFIG_SWIOTLB_XEN) += pci-swiotlb-xen.o
obj-$(CONFIG_XEN_EFI) += efi.o
+obj-$(CONFIG_XEN_PVH) += xen-pvh.o
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index dc4ed0c..d38d568 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -45,6 +45,7 @@
#include <xen/interface/memory.h>
#include <xen/interface/nmi.h>
#include <xen/interface/xen-mca.h>
+#include <xen/interface/hvm/start_info.h>
#include <xen/features.h>
#include <xen/page.h>
#include <xen/hvm.h>
@@ -121,7 +122,8 @@
DEFINE_PER_CPU(uint32_t, xen_vcpu_id);
EXPORT_PER_CPU_SYMBOL(xen_vcpu_id);

-enum xen_domain_type xen_domain_type = XEN_NATIVE;
+enum xen_domain_type xen_domain_type
+ __attribute__((section(".data"))) = XEN_NATIVE;
EXPORT_SYMBOL_GPL(xen_domain_type);

unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
@@ -176,6 +178,17 @@ struct tls_descs {
*/
static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);

+#ifdef CONFIG_XEN_PVH
+/*
+ * PVH variables. These need to live in the data segment since they are
+ * initialized before startup_{32|64}, which clears .bss, is invoked.
+ */
+int xen_pvh __attribute__((section(".data"))) = 0;
+struct hvm_start_info pvh_start_info __attribute__((section(".data")));
+uint pvh_start_info_sz = sizeof(pvh_start_info);
+struct boot_params pvh_bootparams __attribute__((section(".data")));
+#endif
+
static void clamp_max_cpus(void)
{
#ifdef CONFIG_SMP
@@ -1669,6 +1682,78 @@ asmlinkage __visible void __init xen_start_kernel(void)
#endif
}

+#ifdef CONFIG_XEN_PVH
+static void __init init_pvh_bootparams(void)
+{
+ struct xen_memory_map memmap;
+ int i;
+
+ memset(&pvh_bootparams, 0, sizeof(pvh_bootparams));
+
+ memmap.nr_entries = ARRAY_SIZE(pvh_bootparams.e820_map);
+ set_xen_guest_handle(memmap.buffer, pvh_bootparams.e820_map);
+ if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
+ xen_raw_console_write("XENMEM_memory_map failed\n");
+ BUG();
+ }
+
+ pvh_bootparams.e820_map[memmap.nr_entries].addr =
+ ISA_START_ADDRESS;
+ pvh_bootparams.e820_map[memmap.nr_entries].size =
+ ISA_END_ADDRESS - ISA_START_ADDRESS;
+ pvh_bootparams.e820_map[memmap.nr_entries++].type =
+ E820_RESERVED;
+
+ sanitize_e820_map(pvh_bootparams.e820_map,
+ ARRAY_SIZE(pvh_bootparams.e820_map),
+ &memmap.nr_entries);
+
+ pvh_bootparams.e820_entries = memmap.nr_entries;
+ for (i = 0; i < pvh_bootparams.e820_entries; i++)
+ e820_add_region(pvh_bootparams.e820_map[i].addr,
+ pvh_bootparams.e820_map[i].size,
+ pvh_bootparams.e820_map[i].type);
+
+ pvh_bootparams.hdr.cmd_line_ptr =
+ pvh_start_info.cmdline_paddr;
+
+ /* The first module is always ramdisk */
+ if (pvh_start_info.nr_modules) {
+ struct hvm_modlist_entry *modaddr =
+ __va(pvh_start_info.modlist_paddr);
+ pvh_bootparams.hdr.ramdisk_image = modaddr->paddr;
+ pvh_bootparams.hdr.ramdisk_size = modaddr->size;
+ }
+
+ /*
+ * See Documentation/x86/boot.txt.
+ *
+ * Version 2.12 supports the Xen entry point, but we will use the
+ * default x86/PC environment (i.e. hardware_subarch 0).
+ */
+ pvh_bootparams.hdr.version = 0x212;
+ pvh_bootparams.hdr.type_of_loader = (9 << 4) | 0; /* Xen loader */
+}
+
+/*
+ * This routine (and those that it might call) should not use
+ * anything that lives in .bss since that segment will be cleared later
+ */
+void __init xen_prepare_pvh(void)
+{
+ u32 eax, ecx, edx, msr;
+ u64 pfn;
+
+ xen_pvh = 1;
+
+ cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
+ pfn = __pa(hypercall_page);
+ wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+
+ init_pvh_bootparams();
+}
+#endif
+
void __ref xen_hvm_init_shared_info(void)
{
int cpu;
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
new file mode 100644
index 0000000..58c477b
--- /dev/null
+++ b/arch/x86/xen/xen-pvh.S
@@ -0,0 +1,143 @@
+/*
+ * Copyright C 2016, Oracle and/or its affiliates. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+ .code32
+ .text
+#define _pa(x) ((x) - __START_KERNEL_map)
+
+#include <linux/elfnote.h>
+#include <linux/init.h>
+#include <linux/linkage.h>
+#include <asm/segment.h>
+#include <asm/asm.h>
+#include <asm/boot.h>
+#include <asm/processor-flags.h>
+#include <asm/msr.h>
+#include <xen/interface/elfnote.h>
+
+ __HEAD
+ .code32
+
+/* Entry point for PVH guests */
+ENTRY(pvh_start_xen)
+ cli
+ cld
+
+ mov $_pa(gdt), %eax
+ lgdt (%eax)
+
+ movl $(__BOOT_DS),%eax
+ movl %eax,%ds
+ movl %eax,%es
+ movl %eax,%ss
+
+ /* Stash hvm_start_info */
+ mov $_pa(pvh_start_info), %edi
+ mov %ebx, %esi
+ mov $_pa(pvh_start_info_sz), %ecx
+ mov (%ecx), %ecx
+ rep
+ movsb
+
+ movl $_pa(early_stack_end), %eax
+ movl %eax, %esp
+
+ /* Enable PAE mode */
+ movl %cr4, %eax
+ orl $X86_CR4_PAE, %eax
+ movl %eax, %cr4
+
+#ifdef CONFIG_X86_64
+ /* Enable Long mode */
+ movl $MSR_EFER, %ecx
+ rdmsr
+ btsl $_EFER_LME, %eax
+ wrmsr
+
+ /* Enable pre-constructed page tables */
+ mov $_pa(init_level4_pgt), %eax
+ movl %eax, %cr3
+ movl $(X86_CR0_PG | X86_CR0_PE), %eax
+ movl %eax, %cr0
+
+ /* Jump to 64-bit mode. */
+ pushl $__KERNEL_CS
+ leal _pa(1f), %eax
+ pushl %eax
+ lret
+
+ /* 64-bit entry point */
+ .code64
+1:
+ call xen_prepare_pvh
+
+ /* startup_64 expects boot_params in %rsi */
+ mov $_pa(pvh_bootparams), %rsi
+ movq $_pa(startup_64), %rax
+ jmp *%rax
+
+#else /* CONFIG_X86_64 */
+
+ call setup_pgtable_32
+
+ mov $_pa(initial_page_table), %eax
+ movl %eax, %cr3
+
+ movl %cr0, %eax
+ orl $(X86_CR0_PG | X86_CR0_PE), %eax
+ movl %eax, %cr0
+
+ ljmp $__BOOT_CS,$1f
+1:
+ call xen_prepare_pvh
+ mov $_pa(pvh_bootparams), %esi
+
+ /* startup_32 doesn't expect paging and PAE to be on */
+ ljmp $__BOOT_CS,$_pa(2f)
+2:
+ movl %cr0, %eax
+ andl $~X86_CR0_PG, %eax
+ movl %eax, %cr0
+ movl %cr4, %eax
+ andl $~X86_CR4_PAE, %eax
+ movl %eax, %cr4
+
+ ljmp $0x10, $_pa(startup_32)
+#endif
+
+ .data
+gdt:
+ .word gdt_end - gdt
+ .long _pa(gdt)
+ .word 0
+ .quad 0x0000000000000000 /* NULL descriptor */
+#ifdef CONFIG_X86_64
+ .quad 0x00af9a000000ffff /* __KERNEL_CS */
+#else
+ .quad 0x00cf9a000000ffff /* __KERNEL_CS */
+#endif
+ .quad 0x00cf92000000ffff /* __KERNEL_DS */
+gdt_end:
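+
+/*
+ * Editorial decode (illustrative, not part of the patch): the GDT begins
+ * with its own pseudo-descriptor, so the descriptors below land at
+ * selectors 0x10 and 0x18, matching __BOOT_CS and __BOOT_DS.
+ *   0x00af9a000000ffff: base 0, limit 0xfffff, G=1, L=1 -> flat 64-bit code
+ *   0x00cf9a000000ffff: base 0, limit 0xfffff, G=1, D=1 -> flat 32-bit code
+ *   0x00cf92000000ffff: base 0, limit 0xfffff, G=1, D=1 -> flat writable data
+ */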
+
+ .bss
+ .balign 4
+early_stack:
+ .fill 16, 1, 0
+early_stack_end:
+
+ ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,
+ _ASM_PTR (pvh_start_xen - __START_KERNEL_map))
diff --git a/include/xen/xen.h b/include/xen/xen.h
index d0f9684..ed3f841 100644
--- a/include/xen/xen.h
+++ b/include/xen/xen.h
@@ -29,6 +29,11 @@ enum xen_domain_type {
#define xen_initial_domain() (0)
#endif /* CONFIG_XEN_DOM0 */

+#ifdef CONFIG_XEN_PVH
+extern int xen_pvh;
+#define xen_pvh_domain() (xen_hvm_domain() && xen_pvh)
+#else
#define xen_pvh_domain() (0)
+#endif

#endif /* _XEN_XEN_H */
--
1.8.3.1

2016-10-14 18:11:04

by Boris Ostrovsky

Subject: [PATCH 8/8] xen/pvh: Enable CPU hotplug

PVH guests don't receive ACPI hotplug interrupts and therefore
need to monitor xenstore for CPU hotplug events.
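
For reference, an illustrative sketch of the mechanism (mirroring the
existing watch in drivers/xen/cpu_hotplug.c; the xenbus_watch callback
wiring is assumed as of this kernel version):

	/* The driver watches the "cpu" subtree in xenstore; changes to
	 * "cpu/<N>/availability" trigger VCPU online/offline handling. */
	static struct xenbus_watch cpu_watch = {
		.node = "cpu",
		.callback = handle_vcpu_hotplug_event,
	};

	(void)register_xenbus_watch(&cpu_watch);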

Signed-off-by: Boris Ostrovsky <[email protected]>
---
drivers/xen/cpu_hotplug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/cpu_hotplug.c b/drivers/xen/cpu_hotplug.c
index 5676aef..0bab60a3 100644
--- a/drivers/xen/cpu_hotplug.c
+++ b/drivers/xen/cpu_hotplug.c
@@ -107,7 +107,7 @@ static int __init setup_vcpu_hotplug_event(void)
.notifier_call = setup_cpu_watcher };

#ifdef CONFIG_X86
- if (!xen_pv_domain())
+ if (!xen_pv_domain() && !xen_pvh_domain())
#else
if (!xen_domain())
#endif
--
1.8.3.1

2016-10-14 18:11:20

by Boris Ostrovsky

Subject: [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

Make sure PVH guests don't use the PIC, RTC or IOAPIC, since these
devices are not emulated for unprivileged PVH guests.

Also, don't initialize the hypercall page for them in init_hvm_pv_info(),
since this has already been done in xen_prepare_pvh().

Signed-off-by: Boris Ostrovsky <[email protected]>
---
arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d38d568..6c1a330 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
minor = eax & 0xffff;
printk(KERN_INFO "Xen version %d.%d.\n", major, minor);

- cpuid(base + 2, &pages, &msr, &ecx, &edx);
+ xen_domain_type = XEN_HVM_DOMAIN;

- pfn = __pa(hypercall_page);
- wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+ /* PVH guests set up the hypercall page earlier, in xen_prepare_pvh(). */
+ if (xen_pvh_domain()) {
+ pv_info.name = "Xen PVH";
+#ifdef CONFIG_ACPI
+ /* No PIC or IOAPIC */
+ acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
+#endif
+ } else {
+ pv_info.name = "Xen HVM";
+ cpuid(base + 2, &pages, &msr, &ecx, &edx);
+ pfn = __pa(hypercall_page);
+ wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
+ }

xen_setup_features();

@@ -1815,10 +1826,6 @@ static void __init init_hvm_pv_info(void)
this_cpu_write(xen_vcpu_id, ebx);
else
this_cpu_write(xen_vcpu_id, smp_processor_id());
-
- pv_info.name = "Xen HVM";
-
- xen_domain_type = XEN_HVM_DOMAIN;
}

static int xen_cpu_up_prepare(unsigned int cpu)
@@ -1892,6 +1899,9 @@ static void __init xen_hvm_guest_init(void)

init_hvm_pv_info();

+ if (xen_pvh_domain())
+ x86_platform.legacy.rtc = 0;
+
xen_hvm_init_shared_info();

xen_panic_handler_init();
--
1.8.3.1

2016-10-14 18:10:51

by Boris Ostrovsky

Subject: [PATCH 3/8] xen/pvh: Import PVH-related Xen public interfaces
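
These mirror the canonical definitions from the Xen public headers. As an
editorial illustration (not part of the patch), a guest consumes the
start-of-day structure roughly as follows, assuming the PVH stub has
already copied the %ebx-relative data into pvh_start_info, as a later
patch in this series does:

	/* Sketch: walk the module list handed over by the toolstack. */
	if (pvh_start_info.magic == XEN_HVM_START_MAGIC_VALUE &&
	    pvh_start_info.nr_modules) {
		struct hvm_modlist_entry *mod =
			__va(pvh_start_info.modlist_paddr);
		unsigned int i;

		for (i = 0; i < pvh_start_info.nr_modules; i++)
			xen_raw_printk("module %u at 0x%llx, %llu bytes\n",
				       i, mod[i].paddr, mod[i].size);
	}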

Signed-off-by: Boris Ostrovsky <[email protected]>
---
include/xen/interface/elfnote.h | 12 ++-
include/xen/interface/hvm/hvm_vcpu.h | 143 +++++++++++++++++++++++++++++++++
include/xen/interface/hvm/start_info.h | 98 ++++++++++++++++++++++
3 files changed, 252 insertions(+), 1 deletion(-)
create mode 100644 include/xen/interface/hvm/hvm_vcpu.h
create mode 100644 include/xen/interface/hvm/start_info.h

diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index f90b034..9e9f9bf 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -193,9 +193,19 @@
#define XEN_ELFNOTE_SUPPORTED_FEATURES 17

/*
+ * Physical entry point into the kernel.
+ *
+ * 32bit entry point into the kernel. When requested to launch the
+ * guest kernel in a HVM container, Xen will use this entry point to
+ * launch the guest in 32bit protected mode with paging disabled.
+ * Ignored otherwise.
+ */
+#define XEN_ELFNOTE_PHYS32_ENTRY 18
+
+/*
* The number of the highest elfnote defined.
*/
-#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY

#endif /* __XEN_PUBLIC_ELFNOTE_H__ */

diff --git a/include/xen/interface/hvm/hvm_vcpu.h b/include/xen/interface/hvm/hvm_vcpu.h
new file mode 100644
index 0000000..32ca83e
--- /dev/null
+++ b/include/xen/interface/hvm/hvm_vcpu.h
@@ -0,0 +1,143 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2015, Roger Pau Monne <[email protected]>
+ */
+
+#ifndef __XEN_PUBLIC_HVM_HVM_VCPU_H__
+#define __XEN_PUBLIC_HVM_HVM_VCPU_H__
+
+#include "../xen.h"
+
+struct vcpu_hvm_x86_32 {
+ uint32_t eax;
+ uint32_t ecx;
+ uint32_t edx;
+ uint32_t ebx;
+ uint32_t esp;
+ uint32_t ebp;
+ uint32_t esi;
+ uint32_t edi;
+ uint32_t eip;
+ uint32_t eflags;
+
+ uint32_t cr0;
+ uint32_t cr3;
+ uint32_t cr4;
+
+ uint32_t pad1;
+
+ /*
+ * EFER should only be used to set the NXE bit (if required)
+ * when starting a vCPU in 32bit mode with paging enabled or
+ * to set the LME/LMA bits in order to start the vCPU in
+ * compatibility mode.
+ */
+ uint64_t efer;
+
+ uint32_t cs_base;
+ uint32_t ds_base;
+ uint32_t ss_base;
+ uint32_t es_base;
+ uint32_t tr_base;
+ uint32_t cs_limit;
+ uint32_t ds_limit;
+ uint32_t ss_limit;
+ uint32_t es_limit;
+ uint32_t tr_limit;
+ uint16_t cs_ar;
+ uint16_t ds_ar;
+ uint16_t ss_ar;
+ uint16_t es_ar;
+ uint16_t tr_ar;
+
+ uint16_t pad2[3];
+};
+
+/*
+ * The layout of the _ar fields of the segment registers is the
+ * following:
+ *
+ * Bits [0,3]: type (bits 40-43).
+ * Bit 4: s (descriptor type, bit 44).
+ * Bit [5,6]: dpl (descriptor privilege level, bits 45-46).
+ * Bit 7: p (segment-present, bit 47).
+ * Bit 8: avl (available for system software, bit 52).
+ * Bit 9: l (64-bit code segment, bit 53).
+ * Bit 10: db (meaning depends on the segment, bit 54).
+ * Bit 11: g (granularity, bit 55)
+ * Bits [12,15]: unused, must be blank.
+ *
+ * A more complete description of the meaning of these fields can be
+ * obtained from the Intel SDM, Volume 3, section 3.4.5.
+ */
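+
+/*
+ * Editorial example (illustrative, not from the original header): a flat
+ * ring-0 32-bit code segment has type=0xb, s=1, dpl=0, p=1, db=1, g=1,
+ * i.e. cs_ar = 0xc9b; the matching flat data segment (type=0x3) gives
+ * ds_ar = 0xc93.
+ */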
+
+struct vcpu_hvm_x86_64 {
+ uint64_t rax;
+ uint64_t rcx;
+ uint64_t rdx;
+ uint64_t rbx;
+ uint64_t rsp;
+ uint64_t rbp;
+ uint64_t rsi;
+ uint64_t rdi;
+ uint64_t rip;
+ uint64_t rflags;
+
+ uint64_t cr0;
+ uint64_t cr3;
+ uint64_t cr4;
+ uint64_t efer;
+
+ /*
+ * Using VCPU_HVM_MODE_64B implies that the vCPU is launched
+ * directly in long mode, so the cached parts of the segment
+ * registers get set to match that environment.
+ *
+ * If the user wants to launch the vCPU in compatibility mode
+ * the 32-bit structure should be used instead.
+ */
+};
+
+struct vcpu_hvm_context {
+#define VCPU_HVM_MODE_32B 0 /* 32bit fields of the structure will be used. */
+#define VCPU_HVM_MODE_64B 1 /* 64bit fields of the structure will be used. */
+ uint32_t mode;
+
+ uint32_t pad;
+
+ /* CPU registers. */
+ union {
+ struct vcpu_hvm_x86_32 x86_32;
+ struct vcpu_hvm_x86_64 x86_64;
+ } cpu_regs;
+};
+typedef struct vcpu_hvm_context vcpu_hvm_context_t;
+
+#endif /* __XEN_PUBLIC_HVM_HVM_VCPU_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/xen/interface/hvm/start_info.h b/include/xen/interface/hvm/start_info.h
new file mode 100644
index 0000000..6484159
--- /dev/null
+++ b/include/xen/interface/hvm/start_info.h
@@ -0,0 +1,98 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2016, Citrix Systems, Inc.
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_HVM_START_INFO_H__
+#define __XEN_PUBLIC_ARCH_X86_HVM_START_INFO_H__
+
+/*
+ * Start of day structure passed to PVH guests and to HVM guests in %ebx.
+ *
+ * NOTE: nothing will be loaded at physical address 0, so a 0 value in any
+ * of the address fields should be treated as not present.
+ *
+ * 0 +----------------+
+ * | magic | Contains the magic value XEN_HVM_START_MAGIC_VALUE
+ * | | ("xEn3" with the 0x80 bit of the "E" set).
+ * 4 +----------------+
+ * | version | Version of this structure. Current version is 0. New
+ * | | versions are guaranteed to be backwards-compatible.
+ * 8 +----------------+
+ * | flags | SIF_xxx flags.
+ * 12 +----------------+
+ * | nr_modules | Number of modules passed to the kernel.
+ * 16 +----------------+
+ * | modlist_paddr | Physical address of an array of modules
+ * | | (layout of the structure below).
+ * 24 +----------------+
+ * | cmdline_paddr | Physical address of the command line,
+ * | | a zero-terminated ASCII string.
+ * 32 +----------------+
+ * | rsdp_paddr | Physical address of the RSDP ACPI data structure.
+ * 40 +----------------+
+ *
+ * The layout of each entry in the module structure is the following:
+ *
+ * 0 +----------------+
+ * | paddr | Physical address of the module.
+ * 8 +----------------+
+ * | size | Size of the module in bytes.
+ * 16 +----------------+
+ * | cmdline_paddr | Physical address of the command line,
+ * | | a zero-terminated ASCII string.
+ * 24 +----------------+
+ * | reserved |
+ * 32 +----------------+
+ *
+ * Addresses and sizes are always 64bit little endian unsigned integers.
+ *
+ * NB: Xen on x86 will always try to place all the data below the 4GiB
+ * boundary.
+ */
+#define XEN_HVM_START_MAGIC_VALUE 0x336ec578
+
+/*
+ * C representation of the x86/HVM start info layout.
+ *
+ * The canonical definition of this layout is above, this is just a way to
+ * represent the layout described there using C types.
+ */
+struct hvm_start_info {
+ uint32_t magic; /* Contains the magic value 0x336ec578 */
+ /* ("xEn3" with the 0x80 bit of the "E" set).*/
+ uint32_t version; /* Version of this structure. */
+ uint32_t flags; /* SIF_xxx flags. */
+ uint32_t nr_modules; /* Number of modules passed to the kernel. */
+ uint64_t modlist_paddr; /* Physical address of an array of */
+ /* hvm_modlist_entry. */
+ uint64_t cmdline_paddr; /* Physical address of the command line. */
+ uint64_t rsdp_paddr; /* Physical address of the RSDP ACPI data */
+ /* structure. */
+};
+
+struct hvm_modlist_entry {
+ uint64_t paddr; /* Physical address of the module. */
+ uint64_t size; /* Size of the module in bytes. */
+ uint64_t cmdline_paddr; /* Physical address of the command line. */
+ uint64_t reserved;
+};
+
+#endif /* __XEN_PUBLIC_ARCH_X86_HVM_START_INFO_H__ */
--
1.8.3.1

2016-10-14 18:11:43

by Boris Ostrovsky

Subject: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

From: Matt Fleming <[email protected]>

The new Xen PVH entry point requires page tables to be set up by the
kernel since it is entered with paging disabled.

Pull the common code out of head_32.S and into pgtable_32.S so that
setup_pgtable_32 can be invoked from both the new Xen entry point and
the existing startup_32 code.
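
The helper assumes it is called with paging disabled, flat segments and
physical addressing in effect; a sketch of a call site, mirroring what
the PVH stub in this series does:

	/* Illustrative call site: still running at physical addresses. */
	call setup_pgtable_32

	mov $_pa(initial_page_table), %eax
	movl %eax, %cr3
	movl %cr0, %eax
	orl $(X86_CR0_PG | X86_CR0_PE), %eax
	movl %eax, %cr0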

Cc: Boris Ostrovsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Matt Fleming <[email protected]>
---
arch/x86/Makefile | 2 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/head_32.S | 168 +------------------------------------
arch/x86/kernel/pgtable_32.S | 196 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 201 insertions(+), 167 deletions(-)
create mode 100644 arch/x86/kernel/pgtable_32.S

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d44933..67cc771 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -204,6 +204,8 @@ head-y += arch/x86/kernel/head$(BITS).o
head-y += arch/x86/kernel/ebda.o
head-y += arch/x86/kernel/platform-quirks.o

+head-$(CONFIG_X86_32) += arch/x86/kernel/pgtable_32.o
+
libs-y += arch/x86/lib/

# See arch/x86/Kbuild for content of core part of the kernel
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4dd5d50..eae85a5 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -8,6 +8,8 @@ extra-y += ebda.o
extra-y += platform-quirks.o
extra-y += vmlinux.lds

+extra-$(CONFIG_X86_32) += pgtable_32.o
+
CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)

ifdef CONFIG_FUNCTION_TRACER
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 5f40126..0db066e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -41,51 +41,6 @@
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id

/*
- * This is how much memory in addition to the memory covered up to
- * and including _end we need mapped initially.
- * We need:
- * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
- * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
- *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- *
- * KERNEL_IMAGE_SIZE should be greater than pa(_end)
- * and small than max_low_pfn, otherwise will waste some page table entries
- */
-
-#if PTRS_PER_PMD > 1
-#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
-#else
-#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
-#endif
-
-/*
- * Number of possible pages in the lowmem region.
- *
- * We shift 2 by 31 instead of 1 by 32 to the left in order to avoid a
- * gas warning about overflowing shift count when gas has been compiled
- * with only a host target support using a 32-bit type for internal
- * representation.
- */
-LOWMEM_PAGES = (((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)
-
-/* Enough space to fit pagetables for the low memory linear map */
-MAPPING_BEYOND_END = PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT
-
-/*
- * Worst-case size of the kernel mapping we need to make:
- * a relocatable kernel can live anywhere in lowmem, so we need to be able
- * to map all of lowmem.
- */
-KERNEL_PAGES = LOWMEM_PAGES
-
-INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
-RESERVE_BRK(pagetables, INIT_MAP_SIZE)
-
-/*
* 32-bit kernel entrypoint; only used by the boot CPU. On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don't depend on
@@ -157,92 +112,7 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

-/*
- * Initialize page tables. This creates a PDE and a set of page
- * tables, which are located immediately beyond __brk_base. The variable
- * _brk_end is set up to point to the first "safe" location.
- * Mappings are created both at virtual address 0 (identity mapping)
- * and PAGE_OFFSET for up to _end.
- */
-#ifdef CONFIG_X86_PAE
-
- /*
- * In PAE mode initial_page_table is statically defined to contain
- * enough entries to cover the VMSPLIT option (that is the top 1, 2 or 3
- * entries). The identity mapping is handled by pointing two PGD entries
- * to the first kernel PMD.
- *
- * Note the upper half of each PMD or PTE are always zero at this stage.
- */
-
-#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs */
-
- xorl %ebx,%ebx /* %ebx is kept at zero */
-
- movl $pa(__brk_base), %edi
- movl $pa(initial_pg_pmd), %edx
- movl $PTE_IDENT_ATTR, %eax
-10:
- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
- movl %ecx,(%edx) /* Store PMD entry */
- /* Upper half already zero */
- addl $8,%edx
- movl $512,%ecx
-11:
- stosl
- xchgl %eax,%ebx
- stosl
- xchgl %eax,%ebx
- addl $0x1000,%eax
- loop 11b
-
- /*
- * End condition: we must map up to the end + MAPPING_BEYOND_END.
- */
- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
- cmpl %ebp,%eax
- jb 10b
-1:
- addl $__PAGE_OFFSET, %edi
- movl %edi, pa(_brk_end)
- shrl $12, %eax
- movl %eax, pa(max_pfn_mapped)
-
- /* Do early initialization of the fixmap area */
- movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
-#else /* Not PAE */
-
-page_pde_offset = (__PAGE_OFFSET >> 20);
-
- movl $pa(__brk_base), %edi
- movl $pa(initial_page_table), %edx
- movl $PTE_IDENT_ATTR, %eax
-10:
- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
- movl %ecx,(%edx) /* Store identity PDE entry */
- movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
- addl $4,%edx
- movl $1024, %ecx
-11:
- stosl
- addl $0x1000,%eax
- loop 11b
- /*
- * End condition: we must map up to the end + MAPPING_BEYOND_END.
- */
- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
- cmpl %ebp,%eax
- jb 10b
- addl $__PAGE_OFFSET, %edi
- movl %edi, pa(_brk_end)
- shrl $12, %eax
- movl %eax, pa(max_pfn_mapped)
-
- /* Do early initialization of the fixmap area */
- movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(initial_page_table+0xffc)
-#endif
+ call setup_pgtable_32

#ifdef CONFIG_PARAVIRT
/* This is can only trip for a broken bootloader... */
@@ -660,47 +530,11 @@ ENTRY(setup_once_ref)
*/
__PAGE_ALIGNED_BSS
.align PAGE_SIZE
-#ifdef CONFIG_X86_PAE
-initial_pg_pmd:
- .fill 1024*KPMDS,4,0
-#else
-ENTRY(initial_page_table)
- .fill 1024,4,0
-#endif
-initial_pg_fixmap:
- .fill 1024,4,0
ENTRY(empty_zero_page)
.fill 4096,1,0
ENTRY(swapper_pg_dir)
.fill 1024,4,0

-/*
- * This starts the data section.
- */
-#ifdef CONFIG_X86_PAE
-__PAGE_ALIGNED_DATA
- /* Page-aligned for the benefit of paravirt? */
- .align PAGE_SIZE
-ENTRY(initial_page_table)
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
-# if KPMDS == 3
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
-# elif KPMDS == 2
- .long 0,0
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
-# elif KPMDS == 1
- .long 0,0
- .long 0,0
- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
-# else
-# error "Kernel PMDs should be 1, 2 or 3"
-# endif
- .align PAGE_SIZE /* needs to be page-sized too */
-#endif
-
.data
.balign 4
ENTRY(initial_stack)
diff --git a/arch/x86/kernel/pgtable_32.S b/arch/x86/kernel/pgtable_32.S
new file mode 100644
index 0000000..aded718
--- /dev/null
+++ b/arch/x86/kernel/pgtable_32.S
@@ -0,0 +1,196 @@
+#include <linux/threads.h>
+#include <linux/init.h>
+#include <linux/linkage.h>
+#include <asm/segment.h>
+#include <asm/page_types.h>
+#include <asm/pgtable_types.h>
+#include <asm/cache.h>
+#include <asm/thread_info.h>
+#include <asm/asm-offsets.h>
+#include <asm/setup.h>
+#include <asm/processor-flags.h>
+#include <asm/msr-index.h>
+#include <asm/cpufeatures.h>
+#include <asm/percpu.h>
+#include <asm/nops.h>
+#include <asm/bootparam.h>
+
+/* Physical address */
+#define pa(X) ((X) - __PAGE_OFFSET)
+
+/*
+ * This is how much memory in addition to the memory covered up to
+ * and including _end we need mapped initially.
+ * We need:
+ * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
+ * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
+ *
+ * Modulo rounding, each megabyte assigned here requires a kilobyte of
+ * memory, which is currently unreclaimed.
+ *
+ * This should be a multiple of a page.
+ *
+ * KERNEL_IMAGE_SIZE should be greater than pa(_end)
+ * and smaller than max_low_pfn, otherwise some page table entries will be wasted
+ */
+
+#if PTRS_PER_PMD > 1
+#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
+#else
+#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
+#endif
+
+/*
+ * Number of possible pages in the lowmem region.
+ *
+ * We shift 2 by 31 instead of 1 by 32 to the left in order to avoid a
+ * gas warning about overflowing shift count when gas has been compiled
+ * with only a host target support using a 32-bit type for internal
+ * representation.
+ */
+LOWMEM_PAGES = (((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)
+
+/* Enough space to fit pagetables for the low memory linear map */
+MAPPING_BEYOND_END = PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT
+
+/*
+ * Worst-case size of the kernel mapping we need to make:
+ * a relocatable kernel can live anywhere in lowmem, so we need to be able
+ * to map all of lowmem.
+ */
+KERNEL_PAGES = LOWMEM_PAGES
+
+INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
+RESERVE_BRK(pagetables, INIT_MAP_SIZE)
+
+/*
+ * Initialize page tables. This creates a PDE and a set of page
+ * tables, which are located immediately beyond __brk_base. The variable
+ * _brk_end is set up to point to the first "safe" location.
+ * Mappings are created both at virtual address 0 (identity mapping)
+ * and PAGE_OFFSET for up to _end.
+ */
+ .text
+ENTRY(setup_pgtable_32)
+#ifdef CONFIG_X86_PAE
+ /*
+ * In PAE mode initial_page_table is statically defined to contain
+ * enough entries to cover the VMSPLIT option (that is the top 1, 2 or 3
+ * entries). The identity mapping is handled by pointing two PGD entries
+ * to the first kernel PMD.
+ *
+ * Note the upper half of each PMD or PTE are always zero at this stage.
+ */
+
+#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs */
+
+ xorl %ebx,%ebx /* %ebx is kept at zero */
+
+ movl $pa(__brk_base), %edi
+ movl $pa(initial_pg_pmd), %edx
+ movl $PTE_IDENT_ATTR, %eax
+10:
+ leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
+ movl %ecx,(%edx) /* Store PMD entry */
+ /* Upper half already zero */
+ addl $8,%edx
+ movl $512,%ecx
+11:
+ stosl
+ xchgl %eax,%ebx
+ stosl
+ xchgl %eax,%ebx
+ addl $0x1000,%eax
+ loop 11b
+
+ /*
+ * End condition: we must map up to the end + MAPPING_BEYOND_END.
+ */
+ movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
+ cmpl %ebp,%eax
+ jb 10b
+1:
+ addl $__PAGE_OFFSET, %edi
+ movl %edi, pa(_brk_end)
+ shrl $12, %eax
+ movl %eax, pa(max_pfn_mapped)
+
+ /* Do early initialization of the fixmap area */
+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
+ movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
+#else /* Not PAE */
+
+page_pde_offset = (__PAGE_OFFSET >> 20);
+
+ movl $pa(__brk_base), %edi
+ movl $pa(initial_page_table), %edx
+ movl $PTE_IDENT_ATTR, %eax
+10:
+ leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
+ movl %ecx,(%edx) /* Store identity PDE entry */
+ movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
+ addl $4,%edx
+ movl $1024, %ecx
+11:
+ stosl
+ addl $0x1000,%eax
+ loop 11b
+ /*
+ * End condition: we must map up to the end + MAPPING_BEYOND_END.
+ */
+ movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
+ cmpl %ebp,%eax
+ jb 10b
+ addl $__PAGE_OFFSET, %edi
+ movl %edi, pa(_brk_end)
+ shrl $12, %eax
+ movl %eax, pa(max_pfn_mapped)
+
+ /* Do early initialization of the fixmap area */
+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
+ movl %eax,pa(initial_page_table+0xffc)
+#endif
+ ret
+ENDPROC(setup_pgtable_32)
+
+/*
+ * BSS section
+ */
+__PAGE_ALIGNED_BSS
+ .align PAGE_SIZE
+#ifdef CONFIG_X86_PAE
+initial_pg_pmd:
+ .fill 1024*KPMDS,4,0
+#else
+ENTRY(initial_page_table)
+ .fill 1024,4,0
+#endif
+initial_pg_fixmap:
+ .fill 1024,4,0
+
+/*
+ * This starts the data section.
+ */
+#ifdef CONFIG_X86_PAE
+__PAGE_ALIGNED_DATA
+ /* Page-aligned for the benefit of paravirt? */
+ .align PAGE_SIZE
+ENTRY(initial_page_table)
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
+# if KPMDS == 3
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
+# elif KPMDS == 2
+ .long 0,0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+# elif KPMDS == 1
+ .long 0,0
+ .long 0,0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+# else
+# error "Kernel PMDs should be 1, 2 or 3"
+# endif
+ .align PAGE_SIZE /* needs to be page-sized too */
+#endif
--
1.8.3.1

2016-10-14 18:11:50

by Boris Ostrovsky

Subject: [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

Signed-off-by: Boris Ostrovsky <[email protected]>
---
drivers/xen/grant-table.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index bb36b1e..d6786b8 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1146,13 +1146,13 @@ int gnttab_init(void)

static int __gnttab_init(void)
{
+ if (!xen_domain())
+ return -ENODEV;
+
/* Delay grant-table initialization in the PV on HVM case */
- if (xen_hvm_domain())
+ if (xen_hvm_domain() && !xen_pvh_domain())
return 0;

- if (!xen_pv_domain())
- return -ENODEV;
-
return gnttab_init();
}
/* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
--
1.8.3.1

2016-10-14 18:12:01

by Boris Ostrovsky

Subject: [PATCH 7/8] xen/pvh: PVH guests always have PV devices

Signed-off-by: Boris Ostrovsky <[email protected]>
---
arch/x86/xen/platform-pci-unplug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c
index 90d1b83..33a783c 100644
--- a/arch/x86/xen/platform-pci-unplug.c
+++ b/arch/x86/xen/platform-pci-unplug.c
@@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
if (!xen_domain())
return false;

- /* PV domains always have them. */
- if (xen_pv_domain())
+ /* PV and PVH domains always have them. */
+ if (xen_pv_domain() || xen_pvh_domain())
return true;

/* And user has xen_platform_pci=0 set in guest config as
--
1.8.3.1

2016-10-14 18:31:42

by H. Peter Anvin

Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On October 14, 2016 11:05:12 AM PDT, Boris Ostrovsky <[email protected]> wrote:
>From: Matt Fleming <[email protected]>
>
>The new Xen PVH entry point requires page tables to be setup by the
>kernel since it is entered with paging disabled.
>
>Pull the common code out of head_32.S and into pgtable_32.S so that
>setup_pgtable_32 can be invoked from both the new Xen entry point and
>the existing startup_32 code.
>
>Cc: Boris Ostrovsky <[email protected]>
>Cc: Thomas Gleixner <[email protected]>
>Cc: Ingo Molnar <[email protected]>
>Cc: "H. Peter Anvin" <[email protected]>
>Cc: [email protected]
>Signed-off-by: Matt Fleming <[email protected]>
>- * entries). The identity mapping is handled by pointing two PGD
>entries
>- * to the first kernel PMD.
>- *
>- * Note the upper half of each PMD or PTE are always zero at this
>stage.
>- */
>-
>-#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs
>*/
>-
>- xorl %ebx,%ebx /* %ebx is kept at zero */
>-
>- movl $pa(__brk_base), %edi
>- movl $pa(initial_pg_pmd), %edx
>- movl $PTE_IDENT_ATTR, %eax
>-10:
>- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
>- movl %ecx,(%edx) /* Store PMD entry */
>- /* Upper half already zero */
>- addl $8,%edx
>- movl $512,%ecx
>-11:
>- stosl
>- xchgl %eax,%ebx
>- stosl
>- xchgl %eax,%ebx
>- addl $0x1000,%eax
>- loop 11b
>-
>- /*
>- * End condition: we must map up to the end + MAPPING_BEYOND_END.
>- */
>- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
>- cmpl %ebp,%eax
>- jb 10b
>-1:
>- addl $__PAGE_OFFSET, %edi
>- movl %edi, pa(_brk_end)
>- shrl $12, %eax
>- movl %eax, pa(max_pfn_mapped)
>-
>- /* Do early initialization of the fixmap area */
>- movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
>- movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
>-#else /* Not PAE */
>-
>-page_pde_offset = (__PAGE_OFFSET >> 20);
>-
>- movl $pa(__brk_base), %edi
>- movl $pa(initial_page_table), %edx
>- movl $PTE_IDENT_ATTR, %eax
>-10:
>- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
>- movl %ecx,(%edx) /* Store identity PDE entry */
>- movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
>- addl $4,%edx
>- movl $1024, %ecx
>-11:
>- stosl
>- addl $0x1000,%eax
>- loop 11b
>- /*
>- * End condition: we must map up to the end + MAPPING_BEYOND_END.
>- */
>- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
>- cmpl %ebp,%eax
>- jb 10b
>- addl $__PAGE_OFFSET, %edi
>- movl %edi, pa(_brk_end)
>- shrl $12, %eax
>- movl %eax, pa(max_pfn_mapped)
>-
>- /* Do early initialization of the fixmap area */
>- movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
>- movl %eax,pa(initial_page_table+0xffc)
>-#endif
>+ call setup_pgtable_32
>
> #ifdef CONFIG_PARAVIRT
> /* This is can only trip for a broken bootloader... */
>@@ -660,47 +530,11 @@ ENTRY(setup_once_ref)
> */
> __PAGE_ALIGNED_BSS
> .align PAGE_SIZE
>-#ifdef CONFIG_X86_PAE
>-initial_pg_pmd:
>- .fill 1024*KPMDS,4,0
>-#else
>-ENTRY(initial_page_table)
>- .fill 1024,4,0
>-#endif
>-initial_pg_fixmap:
>- .fill 1024,4,0
> ENTRY(empty_zero_page)
> .fill 4096,1,0
> ENTRY(swapper_pg_dir)
> .fill 1024,4,0
>
>-/*
>- * This starts the data section.
>- */
>-#ifdef CONFIG_X86_PAE
>-__PAGE_ALIGNED_DATA
>- /* Page-aligned for the benefit of paravirt? */
>- .align PAGE_SIZE
>-ENTRY(initial_page_table)
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
>-# if KPMDS == 3
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
>-# elif KPMDS == 2
>- .long 0,0
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
>-# elif KPMDS == 1
>- .long 0,0
>- .long 0,0
>- .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>-# else
>-# error "Kernel PMDs should be 1, 2 or 3"
>-# endif
>- .align PAGE_SIZE /* needs to be page-sized too */
>-#endif
>-
> .data
> .balign 4
> ENTRY(initial_stack)
>diff --git a/arch/x86/kernel/pgtable_32.S
>b/arch/x86/kernel/pgtable_32.S
>new file mode 100644
>index 0000000..aded718
>--- /dev/null
>+++ b/arch/x86/kernel/pgtable_32.S
>@@ -0,0 +1,196 @@
>+#include <linux/threads.h>
>+#include <linux/init.h>
>+#include <linux/linkage.h>
>+#include <asm/segment.h>
>+#include <asm/page_types.h>
>+#include <asm/pgtable_types.h>
>+#include <asm/cache.h>
>+#include <asm/thread_info.h>
>+#include <asm/asm-offsets.h>
>+#include <asm/setup.h>
>+#include <asm/processor-flags.h>
>+#include <asm/msr-index.h>
>+#include <asm/cpufeatures.h>
>+#include <asm/percpu.h>
>+#include <asm/nops.h>
>+#include <asm/bootparam.h>
>+
>+/* Physical address */
>+#define pa(X) ((X) - __PAGE_OFFSET)
>+
>+/*
>+ * This is how much memory in addition to the memory covered up to
>+ * and including _end we need mapped initially.
>+ * We need:
>+ * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
>+ * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
>+ *
>+ * Modulo rounding, each megabyte assigned here requires a kilobyte of
>+ * memory, which is currently unreclaimed.
>+ *
>+ * This should be a multiple of a page.
>+ *
>+ * KERNEL_IMAGE_SIZE should be greater than pa(_end)
>+ * and small than max_low_pfn, otherwise will waste some page table
>entries
>+ */
>+
>+#if PTRS_PER_PMD > 1
>+#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) +
>PTRS_PER_PGD)
>+#else
>+#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
>+#endif
>+
>+/*
>+ * Number of possible pages in the lowmem region.
>+ *
>+ * We shift 2 by 31 instead of 1 by 32 to the left in order to avoid a
>+ * gas warning about overflowing shift count when gas has been
>compiled
>+ * with only a host target support using a 32-bit type for internal
>+ * representation.
>+ */
>+LOWMEM_PAGES = (((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)
>+
>+/* Enough space to fit pagetables for the low memory linear map */
>+MAPPING_BEYOND_END = PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT
>+
>+/*
>+ * Worst-case size of the kernel mapping we need to make:
>+ * a relocatable kernel can live anywhere in lowmem, so we need to be
>able
>+ * to map all of lowmem.
>+ */
>+KERNEL_PAGES = LOWMEM_PAGES
>+
>+INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
>+RESERVE_BRK(pagetables, INIT_MAP_SIZE)
>+
>+/*
>+ * Initialize page tables. This creates a PDE and a set of page
>+ * tables, which are located immediately beyond __brk_base. The
>variable
>+ * _brk_end is set up to point to the first "safe" location.
>+ * Mappings are created both at virtual address 0 (identity mapping)
>+ * and PAGE_OFFSET for up to _end.
>+ */
>+ .text
>+ENTRY(setup_pgtable_32)
>+#ifdef CONFIG_X86_PAE
>+ /*
>+ * In PAE mode initial_page_table is statically defined to contain
>+ * enough entries to cover the VMSPLIT option (that is the top 1, 2
>or 3
>+ * entries). The identity mapping is handled by pointing two PGD
>entries
>+ * to the first kernel PMD.
>+ *
>+ * Note the upper half of each PMD or PTE are always zero at this
>stage.
>+ */
>+
>+#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs
>*/
>+
>+ xorl %ebx,%ebx /* %ebx is kept at zero */
>+
>+ movl $pa(__brk_base), %edi
>+ movl $pa(initial_pg_pmd), %edx
>+ movl $PTE_IDENT_ATTR, %eax
>+10:
>+ leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
>+ movl %ecx,(%edx) /* Store PMD entry */
>+ /* Upper half already zero */
>+ addl $8,%edx
>+ movl $512,%ecx
>+11:
>+ stosl
>+ xchgl %eax,%ebx
>+ stosl
>+ xchgl %eax,%ebx
>+ addl $0x1000,%eax
>+ loop 11b
>+
>+ /*
>+ * End condition: we must map up to the end + MAPPING_BEYOND_END.
>+ */
>+ movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
>+ cmpl %ebp,%eax
>+ jb 10b
>+1:
>+ addl $__PAGE_OFFSET, %edi
>+ movl %edi, pa(_brk_end)
>+ shrl $12, %eax
>+ movl %eax, pa(max_pfn_mapped)
>+
>+ /* Do early initialization of the fixmap area */
>+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
>+ movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
>+#else /* Not PAE */
>+
>+page_pde_offset = (__PAGE_OFFSET >> 20);
>+
>+ movl $pa(__brk_base), %edi
>+ movl $pa(initial_page_table), %edx
>+ movl $PTE_IDENT_ATTR, %eax
>+10:
>+ leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
>+ movl %ecx,(%edx) /* Store identity PDE entry */
>+ movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
>+ addl $4,%edx
>+ movl $1024, %ecx
>+11:
>+ stosl
>+ addl $0x1000,%eax
>+ loop 11b
>+ /*
>+ * End condition: we must map up to the end + MAPPING_BEYOND_END.
>+ */
>+ movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
>+ cmpl %ebp,%eax
>+ jb 10b
>+ addl $__PAGE_OFFSET, %edi
>+ movl %edi, pa(_brk_end)
>+ shrl $12, %eax
>+ movl %eax, pa(max_pfn_mapped)
>+
>+ /* Do early initialization of the fixmap area */
>+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
>+ movl %eax,pa(initial_page_table+0xffc)
>+#endif
>+ ret
>+ENDPROC(setup_pgtable_32)
>+
>+/*
>+ * BSS section
>+ */
>+__PAGE_ALIGNED_BSS
>+ .align PAGE_SIZE
>+#ifdef CONFIG_X86_PAE
>+initial_pg_pmd:
>+ .fill 1024*KPMDS,4,0
>+#else
>+ENTRY(initial_page_table)
>+ .fill 1024,4,0
>+#endif
>+initial_pg_fixmap:
>+ .fill 1024,4,0
>+
>+/*
>+ * This starts the data section.
>+ */
>+#ifdef CONFIG_X86_PAE
>+__PAGE_ALIGNED_DATA
>+ /* Page-aligned for the benefit of paravirt? */
>+ .align PAGE_SIZE
>+ENTRY(initial_page_table)
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
>+# if KPMDS == 3
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
>+# elif KPMDS == 2
>+ .long 0,0
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
>+# elif KPMDS == 1
>+ .long 0,0
>+ .long 0,0
>+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
>+# else
>+# error "Kernel PMDs should be 1, 2 or 3"
>+# endif
>+ .align PAGE_SIZE /* needs to be page-sized too */
>+#endif

And why does it need a separate entry point as opposed to the plain one?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2016-10-14 18:34:54

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 3/8] xen/pvh: Import PVH-related Xen public interfaces

On Fri, Oct 14, 2016 at 02:05:13PM -0400, Boris Ostrovsky wrote:
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>

2016-10-14 18:38:28

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 1/8] xen/x86: Remove PVH support

On Fri, Oct 14, 2016 at 02:05:11PM -0400, Boris Ostrovsky wrote:
> We are replacing existing PVH guests with new implementation.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>

2016-10-14 18:40:20

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/8] xen/pvh: Bootstrap PVH guest

On 14/10/16 19:05, Boris Ostrovsky wrote:
> diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
> new file mode 100644
> index 0000000..58c477b
> --- /dev/null
> +++ b/arch/x86/xen/xen-pvh.S
> @@ -0,0 +1,143 @@
> +/*
> + * Copyright C 2016, Oracle and/or its affiliates. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> + .code32
> + .text
> +#define _pa(x) ((x) - __START_KERNEL_map)
> +
> +#include <linux/elfnote.h>
> +#include <linux/init.h>
> +#include <linux/linkage.h>
> +#include <asm/segment.h>
> +#include <asm/asm.h>
> +#include <asm/boot.h>
> +#include <asm/processor-flags.h>
> +#include <asm/msr.h>
> +#include <xen/interface/elfnote.h>
> +
> + __HEAD
> + .code32

Duplicated .code32

> +
> +/* Entry point for PVH guests */
> +ENTRY(pvh_start_xen)
> + cli
> + cld

The ABI states that these will be clear.

> +
> + mov $_pa(gdt), %eax
> + lgdt (%eax)

I am fairly sure you can express this without an intermediate in %eax.

> +
> + movl $(__BOOT_DS),%eax
> + movl %eax,%ds
> + movl %eax,%es
> + movl %eax,%ss
> +
> + /* Stash hvm_start_info */
> + mov $_pa(pvh_start_info), %edi
> + mov %ebx, %esi
> + mov $_pa(pvh_start_info_sz), %ecx
> + mov (%ecx), %ecx

No need for an intermediate.

> + rep
> + movsb

Surely we can guarantee the size is a multiple of 4? movsl would be better.

> +
> + movl $_pa(early_stack_end), %eax
> + movl %eax, %esp

You can mov straight into %esp.
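
(Taken together, the lgdt, copy, and stack-pointer suggestions above
would condense the prologue to something like this untested sketch; it
assumes, as discussed, that pvh_start_info_sz is a multiple of 4:)

	lgdt _pa(gdt)                      /* no scratch register needed */

	/* Stash hvm_start_info, four bytes at a time. */
	mov $_pa(pvh_start_info), %edi
	mov %ebx, %esi
	mov _pa(pvh_start_info_sz), %ecx   /* load the size directly */
	shr $2, %ecx                       /* size / 4 longwords */
	rep movsl

	mov $_pa(early_stack_end), %esp    /* write %esp directly */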

> +
> + /* Enable PAE mode */
> + movl %cr4, %eax
> + orl $X86_CR4_PAE, %eax
> + movl %eax, %cr4
> +
> +#ifdef CONFIG_X86_64
> + /* Enable Long mode */
> + movl $MSR_EFER, %ecx
> + rdmsr
> + btsl $_EFER_LME, %eax
> + wrmsr
> +
> + /* Enable pre-constructed page tables */
> + mov $_pa(init_level4_pgt), %eax
> + movl %eax, %cr3
> + movl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl %eax, %cr0
> +
> + /* Jump to 64-bit mode. */
> + pushl $__KERNEL_CS
> + leal _pa(1f), %eax
> + pushl %eax
> + lret

You are still in compat mode, so can ljmp $__KERNEL_CS, $_pa(1f)

> +
> + /* 64-bit entry point */
> + .code64
> +1:
> + call xen_prepare_pvh
> +
> + /* startup_64 expects boot_params in %rsi */
> + mov $_pa(pvh_bootparams), %rsi
> + movq $_pa(startup_64), %rax

You seem to have an inconsistent mix of writing the explicit suffixes
when they aren't required.

> + jmp *%rax
> +
> +#else /* CONFIG_X86_64 */
> +
> + call setup_pgtable_32
> +
> + mov $_pa(initial_page_table), %eax
> + movl %eax, %cr3
> +
> + movl %cr0, %eax
> + orl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl %eax, %cr0
> +
> + ljmp $__BOOT_CS,$1f
> +1:
> + call xen_prepare_pvh

Why does xen_prepare_pvh need paging? I can't spot anything which
should need it, and it feels conceptually wrong.

~Andrew

> + mov $_pa(pvh_bootparams), %esi
> +
> + /* startup_32 doesn't expect paging and PAE to be on */
> + ljmp $__BOOT_CS,$_pa(2f)
> +2:
> + movl %cr0, %eax
> + andl $~X86_CR0_PG, %eax
> + movl %eax, %cr0
> + movl %cr4, %eax
> + andl $~X86_CR4_PAE, %eax
> + movl %eax, %cr4
> +
> + ljmp $0x10, $_pa(startup_32)
> +#endif
> +
> + .data
> +gdt:
> + .word gdt_end - gdt
> + .long _pa(gdt)
> + .word 0
> + .quad 0x0000000000000000 /* NULL descriptor */
> +#ifdef CONFIG_X86_64
> + .quad 0x00af9a000000ffff /* __KERNEL_CS */
> +#else
> + .quad 0x00cf9a000000ffff /* __KERNEL_CS */
> +#endif
> + .quad 0x00cf92000000ffff /* __KERNEL_DS */
> +gdt_end:
> +
> + .bss
> + .balign 4
> +early_stack:
> + .fill 16, 1, 0
> +early_stack_end:
> +
> + ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,
> + _ASM_PTR (pvh_start_xen - __START_KERNEL_map))
>

2016-10-14 18:41:36

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 8/8] xen/pvh: Enable CPU hotplug

On 14/10/16 19:05, Boris Ostrovsky wrote:
> PVH guests don't receive ACPI hotplug interrupts and therefore
> need to monitor xenstore for CPU hotplug events.

Why not? If they don't, they should. As we are providing ACPI anyway,
we should provide all bits of it.

~Andrew

2016-10-14 18:43:58

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On 10/14/2016 02:31 PM, [email protected] wrote:
> On October 14, 2016 11:05:12 AM PDT, Boris Ostrovsky <[email protected]> wrote:
>> From: Matt Fleming <[email protected]>
>>
>> The new Xen PVH entry point requires page tables to be setup by the
>> kernel since it is entered with paging disabled.
>>
>> Pull the common code out of head_32.S and into pgtable_32.S so that
>> setup_pgtable_32 can be invoked from both the new Xen entry point and
>> the existing startup_32 code.
>>
> And why does it need a separate entry point as opposed to the plain one?

One reason is that we need to prepare boot_params before jumping to
startup_{32|64}.

When the guest is loaded (always in 32-bit mode), the only thing we have
is a pointer to a Xen-specific data structure. The early PVH code will
prepare zeropage based on that structure and then jump to regular
startup_*() code.
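
In outline, that preparation (condensed from init_pvh_bootparams() in
patch 4, with the e820 handling left out) amounts to:

	/* Build a boot_params ("zeropage") from the Xen-provided
	 * hvm_start_info so that startup_32/64 can consume it. */
	memset(&pvh_bootparams, 0, sizeof(pvh_bootparams));

	/* Kernel command line comes from the start_info page. */
	pvh_bootparams.hdr.cmd_line_ptr = pvh_start_info.cmdline_paddr;

	/* The first module, if any, is the ramdisk. */
	if (pvh_start_info.nr_modules) {
		struct hvm_modlist_entry *mod =
			__va(pvh_start_info.modlist_paddr);

		pvh_bootparams.hdr.ramdisk_image = mod->paddr;
		pvh_bootparams.hdr.ramdisk_size  = mod->size;
	}

	/* Boot protocol 2.12, loader type "Xen". */
	pvh_bootparams.hdr.version = 0x212;
	pvh_bootparams.hdr.type_of_loader = (9 << 4) | 0;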

-boris

2016-10-14 18:53:58

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/8] xen/pvh: Bootstrap PVH guest

On 10/14/2016 02:38 PM, Andrew Cooper wrote:
>> + jmp *%rax
>> +
>> +#else /* CONFIG_X86_64 */
>> +
>> + call setup_pgtable_32
>> +
>> + mov $_pa(initial_page_table), %eax
>> + movl %eax, %cr3
>> +
>> + movl %cr0, %eax
>> + orl $(X86_CR0_PG | X86_CR0_PE), %eax
>> + movl %eax, %cr0
>> +
>> + ljmp $__BOOT_CS,$1f
>> +1:
>> + call xen_prepare_pvh
> Why does xen_prepare_pvh need paging? I can't spot anything which
> should need it, and it feels conceptually wrong.

xen_prepare_pvh() deals with virtual addresses. How can we run without paging?

(Also, startup_64, which is where we jump from here in 64-bit mode, expects paging to be on).

-boris


2016-10-14 19:00:50

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 8/8] xen/pvh: Enable CPU hotplug

On 10/14/2016 02:41 PM, Andrew Cooper wrote:
> On 14/10/16 19:05, Boris Ostrovsky wrote:
>> PVH guests don't receive ACPI hotplug interrupts and therefore
>> need to monitor xenstore for CPU hotplug events.
> Why not? If they don't, they should. As we are providing ACPI anyway,
> we should provide all bits of it.

We don't have IOAPIC, which is how these interrupts are typically
delivered. I suppose we might be able to specify it as something else.

I'll look into this.
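
For reference, the xenstore-based mechanism the patch relies on boils
down to registering a watch on the "cpu" subtree, along the lines of
this sketch (modeled on drivers/xen/cpu_hotplug.c; callback body
omitted):

	/* The callback runs whenever the toolstack updates
	 * cpu/<N>/availability in xenstore. */
	static void handle_vcpu_hotplug_event(struct xenbus_watch *watch,
					      const char **vec,
					      unsigned int len);

	static struct xenbus_watch cpu_watch = {
		.node = "cpu",
		.callback = handle_vcpu_hotplug_event,
	};

	static int __init setup_cpu_watcher(void)
	{
		/* Must run only once xenstore is available. */
		return register_xenbus_watch(&cpu_watch);
	}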

-boris

2016-10-14 19:05:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On October 14, 2016 11:44:18 AM PDT, Boris Ostrovsky <[email protected]> wrote:
>On 10/14/2016 02:31 PM, [email protected] wrote:
>> On October 14, 2016 11:05:12 AM PDT, Boris Ostrovsky
><[email protected]> wrote:
>>> From: Matt Fleming <[email protected]>
>>>
>>> The new Xen PVH entry point requires page tables to be setup by the
>>> kernel since it is entered with paging disabled.
>>>
>>> Pull the common code out of head_32.S and into pgtable_32.S so that
>>> setup_pgtable_32 can be invoked from both the new Xen entry point
>and
>>> the existing startup_32 code.
>>>
>> And why does it need a separate entry point as opposed to the plain
>one?
>
>One reason is that we need to prepare boot_params before jumping to
>startup_{32|64}.
>
>When the guest is loaded (always in 32-bit mode), the only thing we have
>is a pointer to a Xen-specific data structure. The early PVH code will
>prepare zeropage based on that structure and then jump to regular
>startup_*() code.
>
>-boris

And why not just resume execution at start_32 then?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2016-10-14 19:13:29

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/8] xen/pvh: Bootstrap PVH guest

On 14/10/16 19:55, Boris Ostrovsky wrote:
> On 10/14/2016 02:38 PM, Andrew Cooper wrote:
>>> + jmp *%rax
>>> +
>>> +#else /* CONFIG_X86_64 */
>>> +
>>> + call setup_pgtable_32
>>> +
>>> + mov $_pa(initial_page_table), %eax
>>> + movl %eax, %cr3
>>> +
>>> + movl %cr0, %eax
>>> + orl $(X86_CR0_PG | X86_CR0_PE), %eax
>>> + movl %eax, %cr0
>>> +
>>> + ljmp $__BOOT_CS,$1f
>>> +1:
>>> + call xen_prepare_pvh
>> Why does xen_prepare_pvh need paging? I can't spot anything which
>> should need it, and it feels conceptually wrong.
> xen_prepare_pvh() deals with virtual addresses. How can we run without paging?

Ah yes - with a high-half kernel, that way around doesn't work. Sorry
for the noise - I have been spending too long working with virtual
addresses down round 0, where that specifically can be solved by setting
%ds with a suitable non-zero base.

~Andrew

2016-10-14 19:14:35

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/8] xen/pvh: Bootstrap PVH guest

On Fri, Oct 14, 2016 at 02:05:14PM -0400, Boris Ostrovsky wrote:
> Start PVH guest at XEN_ELFNOTE_PHYS32_ENTRY address. Setup hypercall
> page, initialize boot_params, enable early page tables.
>
> Since this stub is executed before the kernel entry point we cannot use
> variables in .bss, which is cleared by the kernel. We explicitly place
> variables that are initialized here into .data.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/xen/Kconfig | 2 +-
> arch/x86/xen/Makefile | 1 +
> arch/x86/xen/enlighten.c | 87 +++++++++++++++++++++++++++-
> arch/x86/xen/xen-pvh.S | 143 +++++++++++++++++++++++++++++++++++++++++++++++
> include/xen/xen.h | 5 ++
> 5 files changed, 236 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/xen/xen-pvh.S
>
> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> index c7b15f3..76b6dbd 100644
> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -53,5 +53,5 @@ config XEN_DEBUG_FS
>
> config XEN_PVH
> bool "Support for running as a PVH guest"
> - depends on X86_64 && XEN && XEN_PVHVM
> + depends on XEN && XEN_PVHVM && ACPI
> def_bool n
> diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
> index e47e527..cb0164a 100644
> --- a/arch/x86/xen/Makefile
> +++ b/arch/x86/xen/Makefile
> @@ -23,3 +23,4 @@ obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
> obj-$(CONFIG_XEN_DOM0) += vga.o
> obj-$(CONFIG_SWIOTLB_XEN) += pci-swiotlb-xen.o
> obj-$(CONFIG_XEN_EFI) += efi.o
> +obj-$(CONFIG_XEN_PVH) += xen-pvh.o
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index dc4ed0c..d38d568 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -45,6 +45,7 @@
> #include <xen/interface/memory.h>
> #include <xen/interface/nmi.h>
> #include <xen/interface/xen-mca.h>
> +#include <xen/interface/hvm/start_info.h>
> #include <xen/features.h>
> #include <xen/page.h>
> #include <xen/hvm.h>
> @@ -121,7 +122,8 @@
> DEFINE_PER_CPU(uint32_t, xen_vcpu_id);
> EXPORT_PER_CPU_SYMBOL(xen_vcpu_id);
>
> -enum xen_domain_type xen_domain_type = XEN_NATIVE;
> +enum xen_domain_type xen_domain_type
> + __attribute__((section(".data"))) = XEN_NATIVE;
> EXPORT_SYMBOL_GPL(xen_domain_type);
>
> unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
> @@ -176,6 +178,17 @@ struct tls_descs {
> */
> static DEFINE_PER_CPU(struct tls_descs, shadow_tls_desc);
>
> +#ifdef CONFIG_XEN_PVH
> +/*
> + * PVH variables. These need to live in data segment since they are
> + * initialized before startup_{32|64}, which clear .bss, are invoked.
> + */
> +int xen_pvh __attribute__((section(".data"))) = 0;

unsigned int?
> +struct hvm_start_info pvh_start_info __attribute__((section(".data")));
> +uint pvh_start_info_sz = sizeof(pvh_start_info);

unsigned int please. Typedefs in Linux are frowned upon.

> +struct boot_params pvh_bootparams __attribute__((section(".data")));
> +#endif
> +
> static void clamp_max_cpus(void)
> {
> #ifdef CONFIG_SMP
> @@ -1669,6 +1682,78 @@ asmlinkage __visible void __init xen_start_kernel(void)
> #endif
> }
>
> +#ifdef CONFIG_XEN_PVH
> +static void __init init_pvh_bootparams(void)
> +{
> + struct xen_memory_map memmap;
> + int i;

unsigned int?
> +
> + memset(&pvh_bootparams, 0, sizeof(pvh_bootparams));
> +
> + memmap.nr_entries = ARRAY_SIZE(pvh_bootparams.e820_map);
> + set_xen_guest_handle(memmap.buffer, pvh_bootparams.e820_map);
> + if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
> + xen_raw_console_write("XENMEM_memory_map failed\n");

Should we print the error value at least?
> + BUG();
> + }
> +
> + pvh_bootparams.e820_map[memmap.nr_entries].addr =
> + ISA_START_ADDRESS;

What if nr_entries is 128? Should we double-check for that?

> + pvh_bootparams.e820_map[memmap.nr_entries].size =
> + ISA_END_ADDRESS - ISA_START_ADDRESS;
> + pvh_bootparams.e820_map[memmap.nr_entries++].type =
> + E820_RESERVED;
> +
> + sanitize_e820_map(pvh_bootparams.e820_map,
> + ARRAY_SIZE(pvh_bootparams.e820_map),
> + &memmap.nr_entries);
> +
> + pvh_bootparams.e820_entries = memmap.nr_entries;
> + for (i = 0; i < pvh_bootparams.e820_entries; i++)
> + e820_add_region(pvh_bootparams.e820_map[i].addr,
> + pvh_bootparams.e820_map[i].size,
> + pvh_bootparams.e820_map[i].type);
> +
> + pvh_bootparams.hdr.cmd_line_ptr =
> + pvh_start_info.cmdline_paddr;
> +
> + /* The first module is always ramdisk */

Could you add an period at end please?
> + if (pvh_start_info.nr_modules) {
> + struct hvm_modlist_entry *modaddr =
> + __va(pvh_start_info.modlist_paddr);
> + pvh_bootparams.hdr.ramdisk_image = modaddr->paddr;
> + pvh_bootparams.hdr.ramdisk_size = modaddr->size;
> + }
> +
> + /*
> + * See Documentation/x86/boot.txt.
> + *
> + * Version 2.12 supports Xen entry point but we will use default x86/PC
> + * environment (i.e. hardware_subarch 0).
> + */
> + pvh_bootparams.hdr.version = 0x212;
> + pvh_bootparams.hdr.type_of_loader = (9 << 4) | 0; /* Xen loader */
> +}
> +
> +/*
> + * This routine (and those that it might call) should not use
> + * anything that lives in .bss since that segment will be cleared later

And maybe one here too?
> + */
> +void __init xen_prepare_pvh(void)
> +{
> + u32 eax, ecx, edx, msr;

msr = 0 ?
> + u64 pfn;
> +
> + xen_pvh = 1;
> +
> + cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);

cpuid_ebx? And that way you don't have to have ecx and edx?
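
(Concretely, something like this sketch, which drops the unused
outputs:)

	/* EBX of CPUID leaf base+2 holds the hypercall-page MSR index. */
	msr = cpuid_ebx(xen_cpuid_base() + 2);
	pfn = __pa(hypercall_page);
	wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));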
> + pfn = __pa(hypercall_page);
> + wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
> +
> + init_pvh_bootparams();
> +}
> +#endif
> +
> void __ref xen_hvm_init_shared_info(void)
> {
> int cpu;
> diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
> new file mode 100644
> index 0000000..58c477b
> --- /dev/null
> +++ b/arch/x86/xen/xen-pvh.S
> @@ -0,0 +1,143 @@
> +/*
> + * Copyright C 2016, Oracle and/or its affiliates. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> + .code32
> + .text
> +#define _pa(x) ((x) - __START_KERNEL_map)
> +
> +#include <linux/elfnote.h>
> +#include <linux/init.h>
> +#include <linux/linkage.h>
> +#include <asm/segment.h>
> +#include <asm/asm.h>
> +#include <asm/boot.h>
> +#include <asm/processor-flags.h>
> +#include <asm/msr.h>
> +#include <xen/interface/elfnote.h>
> +
> + __HEAD
> + .code32
> +
> +/* Entry point for PVH guests */
> +ENTRY(pvh_start_xen)

You are missing the ENDPROC macro at the end.

> + cli
> + cld
> +
> + mov $_pa(gdt), %eax
> + lgdt (%eax)
> +
> + movl $(__BOOT_DS),%eax
> + movl %eax,%ds
> + movl %eax,%es
> + movl %eax,%ss
> +
> + /* Stash hvm_start_info */
> + mov $_pa(pvh_start_info), %edi
> + mov %ebx, %esi

Should we dereference the first byte or such to check for the magic
string? Actually I am not even seeing the check in the C code?
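
(For instance, a sketch only; XEN_HVM_START_MAGIC_VALUE is the magic
constant from the hvm/start_info.h header imported in patch 3:)

	/* Validate the stashed start_info early in xen_prepare_pvh()
	 * before trusting any of its contents. */
	if (pvh_start_info.magic != XEN_HVM_START_MAGIC_VALUE)
		BUG();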

> + mov $_pa(pvh_start_info_sz), %ecx
> + mov (%ecx), %ecx
> + rep
> + movsb
> +
> + movl $_pa(early_stack_end), %eax
> + movl %eax, %esp
> +
> + /* Enable PAE mode */

Periods are nice! Truly!

> + movl %cr4, %eax
> + orl $X86_CR4_PAE, %eax
> + movl %eax, %cr4
> +
> +#ifdef CONFIG_X86_64
> + /* Enable Long mode */

:-) I think you know what I am going to say here.

> + movl $MSR_EFER, %ecx
> + rdmsr
> + btsl $_EFER_LME, %eax
> + wrmsr
> +
> + /* Enable pre-constructed page tables */

And here.

> + mov $_pa(init_level4_pgt), %eax
> + movl %eax, %cr3
> + movl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl %eax, %cr0
> +
> + /* Jump to 64-bit mode. */
> + pushl $__KERNEL_CS
> + leal _pa(1f), %eax
> + pushl %eax
> + lret
> +
> + /* 64-bit entry point */

And right here.
> + .code64
> +1:
> + call xen_prepare_pvh
> +
> + /* startup_64 expects boot_params in %rsi */

..
> + mov $_pa(pvh_bootparams), %rsi
> + movq $_pa(startup_64), %rax
> + jmp *%rax
> +
> +#else /* CONFIG_X86_64 */
> +
> + call setup_pgtable_32
> +
> + mov $_pa(initial_page_table), %eax
> + movl %eax, %cr3
> +
> + movl %cr0, %eax
> + orl $(X86_CR0_PG | X86_CR0_PE), %eax
> + movl %eax, %cr0
> +
> + ljmp $__BOOT_CS,$1f
> +1:
> + call xen_prepare_pvh
> + mov $_pa(pvh_bootparams), %esi
> +
> + /* startup_32 doesn't expect paging and PAE to be on */

Should 'startup_32' be documented with this?


> + ljmp $__BOOT_CS,$_pa(2f)
> +2:
> + movl %cr0, %eax
> + andl $~X86_CR0_PG, %eax
> + movl %eax, %cr0
> + movl %cr4, %eax
> + andl $~X86_CR4_PAE, %eax
> + movl %eax, %cr4
> +
> + ljmp $0x10, $_pa(startup_32)
> +#endif
> +
> + .data
> +gdt:
> + .word gdt_end - gdt
> + .long _pa(gdt)
> + .word 0
> + .quad 0x0000000000000000 /* NULL descriptor */
> +#ifdef CONFIG_X86_64
> + .quad 0x00af9a000000ffff /* __KERNEL_CS */
> +#else
> + .quad 0x00cf9a000000ffff /* __KERNEL_CS */
> +#endif
> + .quad 0x00cf92000000ffff /* __KERNEL_DS */
> +gdt_end:
> +
> + .bss
> + .balign 4
> +early_stack:
> + .fill 16, 1, 0
> +early_stack_end:
> +
> + ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,
> + _ASM_PTR (pvh_start_xen - __START_KERNEL_map))
> diff --git a/include/xen/xen.h b/include/xen/xen.h
> index d0f9684..ed3f841 100644
> --- a/include/xen/xen.h
> +++ b/include/xen/xen.h
> @@ -29,6 +29,11 @@ enum xen_domain_type {
> #define xen_initial_domain() (0)
> #endif /* CONFIG_XEN_DOM0 */
>
> +#ifdef CONFIG_XEN_PVH
> +extern int xen_pvh;
> +#define xen_pvh_domain() (xen_hvm_domain() && xen_pvh)
> +#else
> #define xen_pvh_domain() (0)
> +#endif
>
> #endif /* _XEN_XEN_H */
> --
> 1.8.3.1

2016-10-14 19:18:01

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
> Make sure they don't use these devices since they are not emulated
> for unprivileged PVH guests.

Which means they would just return 0? Or would it get worse since
the in/out would go to the hypervisor, which would kill the guest?
>
> Also don't initialize hypercall page for them in init_hvm_pv_info()
> since this has already been done.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>
> ---
> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index d38d568..6c1a330 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
> minor = eax & 0xffff;
> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>
> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
> + xen_domain_type = XEN_HVM_DOMAIN;
>
> - pfn = __pa(hypercall_page);
> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */

A period at the end?
> + if (xen_pvh_domain()) {
> + pv_info.name = "Xen PVH";
> +#ifdef CONFIG_ACPI
> + /* No PIC or IOAPIC */

Here?
> + acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
> +#endif
> + } else {
> + pv_info.name = "Xen HVM";
> + cpuid(base + 2, &pages, &msr, &ecx, &edx);

Could you use cpuid_ebx ?
> + pfn = __pa(hypercall_page);
> + wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
> + }
>
> xen_setup_features();
>
> @@ -1815,10 +1826,6 @@ static void __init init_hvm_pv_info(void)
> this_cpu_write(xen_vcpu_id, ebx);
> else
> this_cpu_write(xen_vcpu_id, smp_processor_id());
> -
> - pv_info.name = "Xen HVM";
> -
> - xen_domain_type = XEN_HVM_DOMAIN;
> }
>
> static int xen_cpu_up_prepare(unsigned int cpu)
> @@ -1892,6 +1899,9 @@ static void __init xen_hvm_guest_init(void)
>
> init_hvm_pv_info();
>
> + if (xen_pvh_domain())
> + x86_platform.legacy.rtc = 0;
> +
> xen_hvm_init_shared_info();
>
> xen_panic_handler_init();
> --
> 1.8.3.1

2016-10-14 19:19:42

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:

Perhaps add in here:

PVH is like PV in that there are no PCI devices - which HVM
code would piggyback on to find the Xen PCI platform device and
use its MMIO space to stash the grants in.

For PVH we balloon out memory and stash the grants in there.

(Which begs the next question - where and when do we balloon out the
normal memory back in?)
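
(For the record, the balloon-and-stash step looks roughly like the
following; a simplified sketch of xlated_setup_gnttab_pages() from
arch/x86/xen/grant-table.c, with error unwinding omitted:)

	/* Take pages out of normal guest memory via the balloon driver
	 * and let the grant-table core map grant frames over them,
	 * instead of using platform-device MMIO space as HVM does. */
	static int __init sketch_pvh_gnttab_setup(void)
	{
		unsigned int i, nr = gnttab_max_grant_frames();
		struct page **pages;
		xen_pfn_t *pfns;
		void *vaddr;

		pages = kcalloc(nr, sizeof(pages[0]), GFP_KERNEL);
		pfns = kcalloc(nr, sizeof(pfns[0]), GFP_KERNEL);
		if (!pages || !pfns)
			return -ENOMEM;

		/* Balloon out nr frames of guest RAM... */
		if (alloc_xenballooned_pages(nr, pages))
			return -ENOMEM;

		for (i = 0; i < nr; i++)
			pfns[i] = page_to_pfn(pages[i]);

		/* ...and hand the frames to the grant-table core, which
		 * asks Xen to back them with the grant table. */
		vaddr = vmap(pages, nr, 0, PAGE_KERNEL);
		if (!vaddr)
			return -ENOMEM;

		xen_auto_xlat_grant_frames.pfn = pfns;
		xen_auto_xlat_grant_frames.count = nr;
		xen_auto_xlat_grant_frames.vaddr = vaddr;
		kfree(pages);

		return 0;
	}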

> Signed-off-by: Boris Ostrovsky <[email protected]>
> ---
> drivers/xen/grant-table.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index bb36b1e..d6786b8 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -1146,13 +1146,13 @@ int gnttab_init(void)
>
> static int __gnttab_init(void)
> {
> + if (!xen_domain())
> + return -ENODEV;
> +
> /* Delay grant-table initialization in the PV on HVM case */
> - if (xen_hvm_domain())
> + if (xen_hvm_domain() && !xen_pvh_domain())
> return 0;
>
> - if (!xen_pv_domain())
> - return -ENODEV;
> -
> return gnttab_init();
> }
> /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
> --
> 1.8.3.1

2016-10-14 19:18:02

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On 10/14/2016 03:04 PM, [email protected] wrote:
> On October 14, 2016 11:44:18 AM PDT, Boris Ostrovsky <[email protected]> wrote:
>> On 10/14/2016 02:31 PM, [email protected] wrote:
>>> On October 14, 2016 11:05:12 AM PDT, Boris Ostrovsky
>> <[email protected]> wrote:
>>>> From: Matt Fleming <[email protected]>
>>>>
>>>> The new Xen PVH entry point requires page tables to be setup by the
>>>> kernel since it is entered with paging disabled.
>>>>
>>>> Pull the common code out of head_32.S and into pgtable_32.S so that
>>>> setup_pgtable_32 can be invoked from both the new Xen entry point
>> and
>>>> the existing startup_32 code.
>>>>
>>> And why does it need a separate entry point as opposed to the plain
>> one?
>>
>> One reason is that we need to prepare boot_params before jumping to
>> startup_{32|64}.
>>
>> When the guest is loaded (always in 32-bit mode), the only thing we have
>> is a pointer to a Xen-specific data structure. The early PVH code will
>> prepare zeropage based on that structure and then jump to regular
>> startup_*() code.
>>
>> -boris
> And why not just resume execution at start_32 then?

I am not sure what start_32 is.

If you meant startup_32 then that's exactly what we do (for 32-bit
guests) once zeropage is set up.

-boris

2016-10-14 19:17:59

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 7/8] xen/pvh: PVH guests always have PV devices

On Fri, Oct 14, 2016 at 02:05:17PM -0400, Boris Ostrovsky wrote:
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/xen/platform-pci-unplug.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c
> index 90d1b83..33a783c 100644
> --- a/arch/x86/xen/platform-pci-unplug.c
> +++ b/arch/x86/xen/platform-pci-unplug.c
> @@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
> if (!xen_domain())
> return false;
>
> - /* PV domains always have them. */
> - if (xen_pv_domain())
> + /* PV and PVH domains always have them. */
> + if (xen_pv_domain() || xen_pvh_domain())
> return true;
>
> /* And user has xen_platform_pci=0 set in guest config as
> --
> 1.8.3.1

2016-10-14 19:34:09

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 4/8] xen/pvh: Bootstrap PVH guest

On 10/14/2016 03:14 PM, Konrad Rzeszutek Wilk wrote:
>
>> +
>> + memset(&pvh_bootparams, 0, sizeof(pvh_bootparams));
>> +
>> + memmap.nr_entries = ARRAY_SIZE(pvh_bootparams.e820_map);
>> + set_xen_guest_handle(memmap.buffer, pvh_bootparams.e820_map);
>> + if (HYPERVISOR_memory_op(XENMEM_memory_map, &memmap)) {
>> + xen_raw_console_write("XENMEM_memory_map failed\n");
> Should we print the error value at least?

I will have to check again but IIRC there was something about not being
able to format strings properly this early. But if we can --- sure.

>> + BUG();
>> + }
>> +
>> + pvh_bootparams.e820_map[memmap.nr_entries].addr =
>> + ISA_START_ADDRESS;
> What if nr_entries is 128? Should we double-check for that?
>

OK.
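
(Something along these lines, as a sketch:)

	/* Bail out if the hypervisor-provided map already fills every
	 * e820 slot, leaving no room for the reserved ISA region. */
	if (memmap.nr_entries >= ARRAY_SIZE(pvh_bootparams.e820_map)) {
		xen_raw_console_write("e820 map full\n");
		BUG();
	}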



>> + */
>> +void __init xen_prepare_pvh(void)
>> +{
>> + u32 eax, ecx, edx, msr;
> msr = 0 ?

Won't cpuid() (or cpuid_ebx()) overwrite it anyway?

>> + u64 pfn;
>> +
>> + xen_pvh = 1;
>> +
>> + cpuid(xen_cpuid_base() + 2, &eax, &msr, &ecx, &edx);
> cpuid_ebx? And that way you don't have to have ecx and edx?



>> + cli
>> + cld
>> +
>> + mov $_pa(gdt), %eax
>> + lgdt (%eax)
>> +
>> + movl $(__BOOT_DS),%eax
>> + movl %eax,%ds
>> + movl %eax,%es
>> + movl %eax,%ss
>> +
>> + /* Stash hvm_start_info */
>> + mov $_pa(pvh_start_info), %edi
>> + mov %ebx, %esi
> Should we dereference the first byte or such to check for the magic
> string? Actually I am not even seeing the check in the C code?


Yes, good idea.


>> + .code64
>> +1:
>> + call xen_prepare_pvh
>> +
>> + /* startup_64 expects boot_params in %rsi */
> ..
>> + mov $_pa(pvh_bootparams), %rsi
>> + movq $_pa(startup_64), %rax
>> + jmp *%rax
>> +
>> +#else /* CONFIG_X86_64 */
>> +
>> + call setup_pgtable_32
>> +
>> + mov $_pa(initial_page_table), %eax
>> + movl %eax, %cr3
>> +
>> + movl %cr0, %eax
>> + orl $(X86_CR0_PG | X86_CR0_PE), %eax
>> + movl %eax, %cr0
>> +
>> + ljmp $__BOOT_CS,$1f
>> +1:
>> + call xen_prepare_pvh
>> + mov $_pa(pvh_bootparams), %esi
>> +
>> + /* startup_32 doesn't expect paging and PAE to be on */
> Should 'startup_32' be documented with this?

It is documented in Documentation/x86/boot.txt and in the startup_64 code.


-boris

2016-10-14 19:37:01

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On 10/14/2016 03:16 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
>> Make sure they don't use these devices since they are not emulated
>> for unprivileged PVH guests.
> Which means they would just return 0? Or would it get worse since
> the in/out would go to the hypervisor, which would kill the guest?

For PIC and IOAPIC (and I think RTC too) we will get a warning (with a
splat) that SCI cannot be initialized. But the guest will continue
running without problems.

-boris

>> Also don't initialize hypercall page for them in init_hvm_pv_info()
>> since this has already been done.
>>
>> Signed-off-by: Boris Ostrovsky <[email protected]>
>> ---
>> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index d38d568..6c1a330 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
>> minor = eax & 0xffff;
>> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>>
>> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
>> + xen_domain_type = XEN_HVM_DOMAIN;
>>
>> - pfn = __pa(hypercall_page);
>> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
>> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */
> A period at the end?
>> + if (xen_pvh_domain()) {
>> + pv_info.name = "Xen PVH";
>> +#ifdef CONFIG_ACPI
>> + /* No PIC or IOAPIC */
> Here?
>> + acpi_irq_model = ACPI_IRQ_MODEL_PLATFORM;
>> +#endif
>> + } else {
>> + pv_info.name = "Xen HVM";
>> + cpuid(base + 2, &pages, &msr, &ecx, &edx);
> Could you use cpuid_ebx ?
>> + pfn = __pa(hypercall_page);
>> + wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
>> + }
>>
>> xen_setup_features();
>>
>> @@ -1815,10 +1826,6 @@ static void __init init_hvm_pv_info(void)
>> this_cpu_write(xen_vcpu_id, ebx);
>> else
>> this_cpu_write(xen_vcpu_id, smp_processor_id());
>> -
>> - pv_info.name = "Xen HVM";
>> -
>> - xen_domain_type = XEN_HVM_DOMAIN;
>> }
>>
>> static int xen_cpu_up_prepare(unsigned int cpu)
>> @@ -1892,6 +1899,9 @@ static void __init xen_hvm_guest_init(void)
>>
>> init_hvm_pv_info();
>>
>> + if (xen_pvh_domain())
>> + x86_platform.legacy.rtc = 0;
>> +
>> xen_hvm_init_shared_info();
>>
>> xen_panic_handler_init();
>> --
>> 1.8.3.1

2016-10-14 19:41:45

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
>
> Perhaps add in here:
>
> PVH is like PV in that there are no PCI devices - which HVM
> code would piggyback on to find the Xen PCI platform device and
> use its MMIO space to stash the grants in.
>
> For PVH we balloon out memory and stash the grants in there.
>
> (Which begs the next question - where and when do we balloon out the
> normal memory back in?)

Are you saying that we should get back memory that we gave to grant tables?

-boris


>
>> Signed-off-by: Boris Ostrovsky <[email protected]>
>> ---
>> drivers/xen/grant-table.c | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>> index bb36b1e..d6786b8 100644
>> --- a/drivers/xen/grant-table.c
>> +++ b/drivers/xen/grant-table.c
>> @@ -1146,13 +1146,13 @@ int gnttab_init(void)
>>
>> static int __gnttab_init(void)
>> {
>> + if (!xen_domain())
>> + return -ENODEV;
>> +
>> /* Delay grant-table initialization in the PV on HVM case */
>> - if (xen_hvm_domain())
>> + if (xen_hvm_domain() && !xen_pvh_domain())
>> return 0;
>>
>> - if (!xen_pv_domain())
>> - return -ENODEV;
>> -
>> return gnttab_init();
>> }
>> /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
>> --
>> 1.8.3.1

2016-10-14 19:51:52

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On Fri, Oct 14, 2016 at 03:43:19PM -0400, Boris Ostrovsky wrote:
> On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
> >
> > Perhaps add in here:
> >
> > PVH is like PV in that there are no PCI devices - which HVM
> > code would piggyback on to find the Xen PCI platform device and
> > use its MMIO space to stash the grants in.
> >
> > For PVH we balloon out memory and stash the grants in there.
> >
> > (Which begs the next question - where and when do we balloon out the
> > normal memory back in?)
>
> Are you saying that we should get back memory that we gave to grant tables?

Yes.

In pure HVM that area is MMIO - which hvmloader has ballooned out.

The hvmloader then balloons that number of pages back at the end of
guest memory (after 4GB).

2016-10-14 20:01:16

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On 10/14/2016 03:51 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 14, 2016 at 03:43:19PM -0400, Boris Ostrovsky wrote:
>> On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
>>>
>>> Perhaps add in here:
>>>
>>> PVH is like PV in that there are no PCI devices - which HVM
>>> code would piggyback on to find the Xen PCI platform device and
>>> use its MMIO space to stash the grants in.
>>>
>>> For PVH we balloon out memory and stash the grants in there.
>>>
>>> (Which begs the next question - where and when do we balloon out the
>>> normal memory back in?)
>> Are you saying that we should get back memory that we gave to grant tables?
> Yes.
>
> In pure HVM that area is MMIO - which hvmloader has ballooned out.
>
> The hvmloader then balloons that number of pages back at the end of
> guest memory (after 4GB).

We don't do this for PV though, do we?

-boris

2016-10-18 13:47:10

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH 1/8] xen/x86: Remove PVH support

On 14/10/16 20:05, Boris Ostrovsky wrote:
> We are replacing existing PVH guests with new implementation.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>

with the following addressed:

> diff --git a/include/xen/xen.h b/include/xen/xen.h
> index f0f0252..d0f9684 100644
> --- a/include/xen/xen.h
> +++ b/include/xen/xen.h
> @@ -29,17 +29,6 @@ enum xen_domain_type {
> #define xen_initial_domain() (0)
> #endif /* CONFIG_XEN_DOM0 */
>
> -#ifdef CONFIG_XEN_PVH
> -/* This functionality exists only for x86. The XEN_PVHVM support exists
> - * only in x86 world - hence on ARM it will be always disabled.
> - * N.B. ARM guests are neither PV nor HVM nor PVHVM.
> - * It's a bit like PVH but is different also (it's further towards the H
> - * end of the spectrum than even PVH).
> - */
> -#include <xen/features.h>
> -#define xen_pvh_domain() (xen_pv_domain() && \
> - xen_feature(XENFEAT_auto_translated_physmap))
> -#else
> #define xen_pvh_domain() (0)

Any reason you don't remove this, too (together with its last user in
arch/x86/xen/grant-table.c) ?


Juergen

2016-10-18 14:45:27

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 1/8] xen/x86: Remove PVH support

On 10/18/2016 09:46 AM, Juergen Gross wrote:
> On 14/10/16 20:05, Boris Ostrovsky wrote:
>> We are replacing existing PVH guests with new implementation.
>>
>> Signed-off-by: Boris Ostrovsky <[email protected]>
> Reviewed-by: Juergen Gross <[email protected]>
>
> with the following addressed:
>
>> diff --git a/include/xen/xen.h b/include/xen/xen.h
>> index f0f0252..d0f9684 100644
>> --- a/include/xen/xen.h
>> +++ b/include/xen/xen.h
>> @@ -29,17 +29,6 @@ enum xen_domain_type {
>> #define xen_initial_domain() (0)
>> #endif /* CONFIG_XEN_DOM0 */
>>
>> -#ifdef CONFIG_XEN_PVH
>> -/* This functionality exists only for x86. The XEN_PVHVM support exists
>> - * only in x86 world - hence on ARM it will be always disabled.
>> - * N.B. ARM guests are neither PV nor HVM nor PVHVM.
>> - * It's a bit like PVH but is different also (it's further towards the H
>> - * end of the spectrum than even PVH).
>> - */
>> -#include <xen/features.h>
>> -#define xen_pvh_domain() (xen_pv_domain() && \
>> - xen_feature(XENFEAT_auto_translated_physmap))
>> -#else
>> #define xen_pvh_domain() (0)
> Any reason you don't remove this, too (together with its last user in
> arch/x86/xen/grant-table.c) ?

grant-table.c is in fact one of the reasons: we will be using that code
for PVHv2 again so I kept it to avoid unnecessary code churn.

Also, we want to have a nop definition of xen_pvh_domain() for
!CONFIG_XEN_PVH.

-boris


2016-10-18 15:33:53

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH 1/8] xen/x86: Remove PVH support

On 18/10/16 16:45, Boris Ostrovsky wrote:
> On 10/18/2016 09:46 AM, Juergen Gross wrote:
>> On 14/10/16 20:05, Boris Ostrovsky wrote:
>>> We are replacing existing PVH guests with new implementation.
>>>
>>> Signed-off-by: Boris Ostrovsky <[email protected]>
>> Reviewed-by: Juergen Gross <[email protected]>
>>
>> with the following addressed:
>>
>>> diff --git a/include/xen/xen.h b/include/xen/xen.h
>>> index f0f0252..d0f9684 100644
>>> --- a/include/xen/xen.h
>>> +++ b/include/xen/xen.h
>>> @@ -29,17 +29,6 @@ enum xen_domain_type {
>>> #define xen_initial_domain() (0)
>>> #endif /* CONFIG_XEN_DOM0 */
>>>
>>> -#ifdef CONFIG_XEN_PVH
>>> -/* This functionality exists only for x86. The XEN_PVHVM support exists
>>> - * only in x86 world - hence on ARM it will be always disabled.
>>> - * N.B. ARM guests are neither PV nor HVM nor PVHVM.
>>> - * It's a bit like PVH but is different also (it's further towards the H
>>> - * end of the spectrum than even PVH).
>>> - */
>>> -#include <xen/features.h>
>>> -#define xen_pvh_domain() (xen_pv_domain() && \
>>> - xen_feature(XENFEAT_auto_translated_physmap))
>>> -#else
>>> #define xen_pvh_domain() (0)
>> Any reason you don't remove this, too (together with its last user in
>> arch/x86/xen/grant-table.c) ?
>
> grant-table.c is in fact one of the reasons: we will be using that code
> for PVHv2 again so I kept it to avoid unnecessary code churn.
>
> Also, we want to have a nop definition of xen_pvh_domain() for
> !CONFIG_XEN_PVH.

Okay, could you mention this in the commit message, please?


Juergen

2016-10-18 15:41:55

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 1/8] xen/x86: Remove PVH support

On 10/18/2016 11:33 AM, Juergen Gross wrote:
> On 18/10/16 16:45, Boris Ostrovsky wrote:
>> On 10/18/2016 09:46 AM, Juergen Gross wrote:
>>> On 14/10/16 20:05, Boris Ostrovsky wrote:
>>>> We are replacing existing PVH guests with new implementation.
>>>>
>>>> Signed-off-by: Boris Ostrovsky <[email protected]>
>>> Reviewed-by: Juergen Gross <[email protected]>
>>>
>>> with the following addressed:
>>>
>>>> diff --git a/include/xen/xen.h b/include/xen/xen.h
>>>> index f0f0252..d0f9684 100644
>>>> --- a/include/xen/xen.h
>>>> +++ b/include/xen/xen.h
>>>> @@ -29,17 +29,6 @@ enum xen_domain_type {
>>>> #define xen_initial_domain() (0)
>>>> #endif /* CONFIG_XEN_DOM0 */
>>>>
>>>> -#ifdef CONFIG_XEN_PVH
>>>> -/* This functionality exists only for x86. The XEN_PVHVM support exists
>>>> - * only in x86 world - hence on ARM it will be always disabled.
>>>> - * N.B. ARM guests are neither PV nor HVM nor PVHVM.
>>>> - * It's a bit like PVH but is different also (it's further towards the H
>>>> - * end of the spectrum than even PVH).
>>>> - */
>>>> -#include <xen/features.h>
>>>> -#define xen_pvh_domain() (xen_pv_domain() && \
>>>> - xen_feature(XENFEAT_auto_translated_physmap))
>>>> -#else
>>>> #define xen_pvh_domain() (0)
>>> Any reason you don't remove this, too (together with its last user in
>>> arch/x86/xen/grant-table.c) ?
>> grant-table.c is in fact one of the reasons: we will be using that code
>> for PVHv2 again so I kept it to avoid unnecessary code churn.
>>
>> Also, we want to have a nop definition of xen_pvh_domain() for
>> !CONFIG_XEN_PVH.
> Okay, could you mention this in the commit message, please?


Will do.

-boris

2016-10-18 15:54:59

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH 7/8] xen/pvh: PVH guests always have PV devices

On 14/10/16 20:05, Boris Ostrovsky wrote:
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>

> ---
> arch/x86/xen/platform-pci-unplug.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/xen/platform-pci-unplug.c b/arch/x86/xen/platform-pci-unplug.c
> index 90d1b83..33a783c 100644
> --- a/arch/x86/xen/platform-pci-unplug.c
> +++ b/arch/x86/xen/platform-pci-unplug.c
> @@ -73,8 +73,8 @@ bool xen_has_pv_devices(void)
> if (!xen_domain())
> return false;
>
> - /* PV domains always have them. */
> - if (xen_pv_domain())
> + /* PV and PVH domains always have them. */
> + if (xen_pv_domain() || xen_pvh_domain())
> return true;
>
> /* And user has xen_platform_pci=0 set in guest config as
>

2016-10-18 16:08:44

by Jürgen Groß

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On 14/10/16 22:02, Boris Ostrovsky wrote:
> On 10/14/2016 03:51 PM, Konrad Rzeszutek Wilk wrote:
>> On Fri, Oct 14, 2016 at 03:43:19PM -0400, Boris Ostrovsky wrote:
>>> On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
>>>>
>>>> Perhaps add in here:
>>>>
>>>> PVH is like PV in that there are no PCI devices - which HVM
>>>> code would piggyback on to find the Xen PCI platform device and
>>>> use its MMIO space to stash the grants in.
>>>>
>>>> For PVH we balloon out memory and stash the grants in there.
>>>>
>>>> (Which begs the next question - where and when do we balloon out the
>>>> normal memory back in?)
>>> Are you saying that we should get back memory that we gave to grant tables?
>> Yes.
>>
>> In pure HVM that area is MMIO - which hvmloader has ballooned out.
>>
>> The hvmloader then balloons that number of pages back at the end of
>> guest memory (after 4GB).
>
> We don't do this for PV though, do we?

Uuh, kind of. We try to allocate granted pages from the ballooned area.
See gnttab_alloc_pages().

So for PV(H) we don't need to balloon this memory back in as it was
never shadowed by a grant.
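
(For reference, this is how a backend typically obtains such pages; a
minimal sketch around the gnttab_alloc_pages()/gnttab_free_pages()
pair:)

	/* On PV(H) these pages come from the ballooned area when
	 * available, so no RAM page ends up shadowed by a grant. */
	static int sketch_backend_get_pages(struct page **pages, int nr)
	{
		int rc = gnttab_alloc_pages(nr, pages);

		if (rc)
			return rc;

		/* ... map grant references into the pages here ... */

		gnttab_free_pages(nr, pages);
		return 0;
	}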


Juergen

2016-10-18 16:39:25

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On 10/18/2016 12:08 PM, Juergen Gross wrote:
> On 14/10/16 22:02, Boris Ostrovsky wrote:
>> On 10/14/2016 03:51 PM, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Oct 14, 2016 at 03:43:19PM -0400, Boris Ostrovsky wrote:
>>>> On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
>>>>>
>>>>> Perhaps add in here:
>>>>>
>>>>> PVH is like PV in that there are no PCI devices - which HVM
>>>>> code would piggyback on to find the Xen PCI platform device and
>>>>> use its MMIO space to stash the grants in.
>>>>>
>>>>> For PVH we balloon out memory and stash the grants in there.
>>>>>
>>>>> (Which begs the next question - where and when do we balloon the
>>>>> normal memory back in?)
>>>> Are you saying that we should get back memory that we gave to grant tables?
>>> Yes.
>>>
>>> In pure HVM that area is MMIO - which hvmloader has ballooned out.
>>>
>>> The hvmloader then balloons that number of pages back at the end of
>>> guest memory (after 4GB).
>> We don't do this for PV though, do we?
> Uuh, kind of. We try to allocate granted pages from the ballooned area.
> See gnttab_alloc_pages().


I meant that we don't give memory back for PV.


>
> So for PV(H) we don't need to balloon this memory back in as it was
> never shadowed by a grant.


Is it *never* or *may or may not be* shadowed? (I assume "shadowed"
means "used for" here.)

-boris

2016-10-19 05:47:22

by Jürgen Groß

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 6/8] xen/pvh: Initialize grant table for PVH guests

On 18/10/16 18:40, Boris Ostrovsky wrote:
> On 10/18/2016 12:08 PM, Juergen Gross wrote:
>> On 14/10/16 22:02, Boris Ostrovsky wrote:
>>> On 10/14/2016 03:51 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Fri, Oct 14, 2016 at 03:43:19PM -0400, Boris Ostrovsky wrote:
>>>>> On 10/14/2016 03:19 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Fri, Oct 14, 2016 at 02:05:16PM -0400, Boris Ostrovsky wrote:
>>>>>>
>>>>>> Perhaps add in here:
>>>>>>
>>>>>> PVH is like PV in that there are no PCI devices - which HVM
>>>>>> code would piggyback on to find the Xen PCI platform device and
>>>>>> use its MMIO space to stash the grants in.
>>>>>>
>>>>>> For PVH we balloon out memory and stash the grants in there.
>>>>>>
>>>>>> (Which begs the next question - where and when do we balloon the
>>>>>> normal memory back in?)
>>>>> Are you saying that we should get back memory that we gave to grant tables?
>>>> Yes.
>>>>
>>>> In pure HVM that area is MMIO - which hvmloader has ballooned out.
>>>>
>>>> The hvmloader then balloons that number of pages back at the end of
>>>> guest memory (after 4GB).
>>> We don't do this for PV though, do we?
>> Uuh, kind of. We try to allocate granted pages from the ballooned area.
>> See gnttab_alloc_pages().
>
>
> I meant that we don't give memory back for PV.

That's right AFAIK.

>> So for PV(H) we don't need to balloon this memory back in as it was
>> never shadowed by a grant.
>
>
> Is it *never* or *may or may not be* shadowed? (I assume "shadowed"
> means "used for" here.)

"shadowed" means a pte is being used for a granted page which was
referencing a RAM page before. So the RAM page is unusable as long as
the grant is active.

A page is shadowed by a grant only if there is no ballooning space
available, so ballooning that page out would serve no purpose as we
would have no way to balloon it in at another address.


Juergen

2016-10-21 10:58:20

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH 3/8] xen/pvh: Import PVH-related Xen public interfaces

On 14/10/16 20:05, Boris Ostrovsky wrote:
> Signed-off-by: Boris Ostrovsky <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>


Juergen

2016-10-26 10:42:16

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
> Make sure they don't use these devices since they are not emulated
> for unprivileged PVH guests.
>
> Also don't initialize hypercall page for them in init_hvm_pv_info()
> since this has already been done.
>
> Signed-off-by: Boris Ostrovsky <[email protected]>
> ---
> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index d38d568..6c1a330 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
> minor = eax & 0xffff;
> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>
> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
> + xen_domain_type = XEN_HVM_DOMAIN;
>
> - pfn = __pa(hypercall_page);
> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */
> + if (xen_pvh_domain()) {
> + pv_info.name = "Xen PVH";
> +#ifdef CONFIG_ACPI
> + /* No PIC or IOAPIC */

Shouldn't this be fetched from the MADT ACPI table if ACPI is available
(rsdp_paddr != 0 in start_info)?

Roger.

2016-10-26 14:48:55

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On 10/26/2016 06:42 AM, Roger Pau Monné wrote:
> On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
>> Make sure they don't use these devices since they are not emulated
>> for unprivileged PVH guests.
>>
>> Also don't initialize hypercall page for them in init_hvm_pv_info()
>> since this has already been done.
>>
>> Signed-off-by: Boris Ostrovsky <[email protected]>
>> ---
>> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index d38d568..6c1a330 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
>> minor = eax & 0xffff;
>> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>>
>> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
>> + xen_domain_type = XEN_HVM_DOMAIN;
>>
>> - pfn = __pa(hypercall_page);
>> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
>> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */
>> + if (xen_pvh_domain()) {
>> + pv_info.name = "Xen PVH";
>> +#ifdef CONFIG_ACPI
>> + /* No PIC or IOAPIC */
> Shouldn't this be fetched from the MADT ACPI table if ACPI is available
> (rsdp_paddr != 0 in start_info)?


At this point we haven't parsed ACPI yet (with or without rsdp_paddr,
which we don't set anyway for domU), so we don't know whether we have a
PIC or an IOAPIC.

Having said that, I will probably remove this ("acpi_irq_model =
ACPI_IRQ_MODEL_PLATFORM", together with the comment) since I am working
on adding SCI support via an event channel, and the ACPI_IRQ_MODEL_PIC
model, which is the default, seems to work OK.

Alternatively, I may keep ACPI_IRQ_MODEL_PLATFORM but then I'd need to
make changes to acpi_gsi_to_irq().

-boris



2016-10-26 15:18:34

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On Wed, Oct 26, 2016 at 10:50:21AM -0400, Boris Ostrovsky wrote:
> On 10/26/2016 06:42 AM, Roger Pau Monné wrote:
> > On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
> >> Make sure they don't use these devices since they are not emulated
> >> for unprivileged PVH guests.
> >>
> >> Also don't initialize hypercall page for them in init_hvm_pv_info()
> >> since this has already been done.
> >>
> >> Signed-off-by: Boris Ostrovsky <[email protected]>
> >> ---
> >> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
> >> 1 file changed, 17 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index d38d568..6c1a330 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
> >> minor = eax & 0xffff;
> >> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
> >>
> >> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
> >> + xen_domain_type = XEN_HVM_DOMAIN;
> >>
> >> - pfn = __pa(hypercall_page);
> >> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
> >> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */
> >> + if (xen_pvh_domain()) {
> >> + pv_info.name = "Xen PVH";
> >> +#ifdef CONFIG_ACPI
> >> + /* No PIC or IOAPIC */
> > Shouldn't this be fetched from the MADT ACPI table if ACPI is available
> > (rsdp_paddr != 0 in start_info)?
>
>
> At this point we haven't parsed ACPI yet (with or without rsdp_paddr,
> which we don't set anyway for domU) so we don't know whether we have PIC
> or IOAPIC.

I guess I'm missing something, but if we are providing ACPI tables to a PVH
DomU, rsdp_paddr should be set, or else we are failing to comply with our own
start info specification.

> Having said that, I will probably remove this ("acpi_irq_model =
> ACPI_IRQ_MODEL_PLATFORM", together with the comment) since I am working
> on adding SCI support via an event channel, and the ACPI_IRQ_MODEL_PIC
> model, which is the default, seems to work OK.

Hm, right, that might be an option. Do you know how hardware-reduced ACPI
implementations (which IIRC also don't have a SCI interrupt) deliver events
to the OS? Or is it simply not possible in that case?

Roger.

2016-10-26 16:05:09

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 5/8] xen/pvh: Prevent PVH guests from using PIC, RTC and IOAPIC

On 10/26/2016 11:18 AM, Roger Pau Monné wrote:
> On Wed, Oct 26, 2016 at 10:50:21AM -0400, Boris Ostrovsky wrote:
>> On 10/26/2016 06:42 AM, Roger Pau Monné wrote:
>>> On Fri, Oct 14, 2016 at 02:05:15PM -0400, Boris Ostrovsky wrote:
>>>> Make sure they don't use these devices since they are not emulated
>>>> for unprivileged PVH guests.
>>>>
>>>> Also don't initialize hypercall page for them in init_hvm_pv_info()
>>>> since this has already been done.
>>>>
>>>> Signed-off-by: Boris Ostrovsky <[email protected]>
>>>> ---
>>>> arch/x86/xen/enlighten.c | 24 +++++++++++++++++-------
>>>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>> index d38d568..6c1a330 100644
>>>> --- a/arch/x86/xen/enlighten.c
>>>> +++ b/arch/x86/xen/enlighten.c
>>>> @@ -1803,10 +1803,21 @@ static void __init init_hvm_pv_info(void)
>>>> minor = eax & 0xffff;
>>>> printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>>>>
>>>> - cpuid(base + 2, &pages, &msr, &ecx, &edx);
>>>> + xen_domain_type = XEN_HVM_DOMAIN;
>>>>
>>>> - pfn = __pa(hypercall_page);
>>>> - wrmsr_safe(msr, (u32)pfn, (u32)(pfn >> 32));
>>>> + /* PVH set up hypercall page earlier in xen_prepare_pvh() */
>>>> + if (xen_pvh_domain()) {
>>>> + pv_info.name = "Xen PVH";
>>>> +#ifdef CONFIG_ACPI
>>>> + /* No PIC or IOAPIC */
>>> Shouldn't this be fetched from the MADT ACPI table if ACPI is available
>>> (rsdp_paddr != 0 in start_info)?
>>
>> At this point we haven't parsed ACPI yet (with or without rsdp_paddr,
>> which we don't set anyway for domU) so we don't know whether we have PIC
>> or IOAPIC.
> I guess I'm missing something, but if we are providing ACPI tables to a PVH
> DomU, rsdp_paddr should be set, or else we are failing to comply with our own
> start info specification.

acpi_find_root_pointer() searches the low megabyte of memory for the RSDP
signature. This is ACPICA's standard method of finding it; I'd think
that FreeBSD does the same thing. And that's how a non-UEFI system is
expected to find it, as required by the ACPI spec (the "Root System
Description Pointer (RSDP)" section).

The RSDP structure is placed at RSDP_ADDRESS (which is just under 1MB at
0xfffc0) by libxl__dom_load_acpi().
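
For reference, the legacy scan amounts to roughly the following (a simplified
sketch: the real ACPICA code also checks the first KB of the EBDA, and the
region is assumed identity-mapped, as it is at early boot):

#include <linux/string.h>
#include <linux/types.h>

/* Look for "RSD PTR " on 16-byte boundaries in 0xE0000-0xFFFFF and
 * verify the ACPI 1.0 checksum (first 20 bytes must sum to 0 mod 256).
 */
static u8 *find_rsdp(void)
{
        u8 *p;

        for (p = (u8 *)0xe0000; p < (u8 *)0x100000; p += 16) {
                if (!memcmp(p, "RSD PTR ", 8)) {
                        u8 sum = 0;
                        int i;

                        for (i = 0; i < 20; i++)
                                sum += p[i];
                        if (sum == 0)
                                return p;
                }
        }
        return NULL;
}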


>
>> Having said that, I will probably remove this ("acpi_irq_model =
>> ACPI_IRQ_MODEL_PLATFORM", together with the comment) since I am working
>> on adding SCI support via an event channel, and the ACPI_IRQ_MODEL_PIC
>> model, which is the default, seems to work OK.
> Hm, right, that might be an option.

I'd actually prefer not to use ACPI_IRQ_MODEL_PIC since it implies that
we have a PIC (which we don't). It just so happens that using it in
Linux works for PVH as well.

> Do you know how hardware-reduced ACPI
> implementations (which IIRC also don't have a SCI interrupt) deliver events
> to the OS? Or is it simply not possible in that case?

I believe you need an _AEI object for that. At least that's what section
"4.1.1.1 GPIO-Signaled Events" suggests.


-boris


2016-10-27 14:25:32

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 8/8] xen/pvh: Enable CPU hotplug



On 10/14/2016 03:01 PM, Boris Ostrovsky wrote:
> On 10/14/2016 02:41 PM, Andrew Cooper wrote:
>> On 14/10/16 19:05, Boris Ostrovsky wrote:
>>> PVH guests don't receive ACPI hotplug interrupts and therefore
>>> need to monitor xenstore for CPU hotplug events.
>> Why not? If they don't, they should. As we are providing ACPI anyway,
>> we should provide all bits of it.
>
> We don't have IOAPIC, which is how these interrupts are typically
> delivered. I suppose we might be able to specify it as something else.
>
> I'll look into this.


(+Jan)

Yes, we can do this. The main issue is how to deal with the event registers
(i.e. FADT.x_pm1a_evt_blk) and AML's PRST region (which specifies the
online CPU map).

Currently these are accessed via IO space and are handled by qemu.

There are a couple of ways to deal with this that I can see.

1. We can implement ioreq handling in the hypervisor; there are only a
few addresses that need handling.

2. We can implement those registers in memory space and have libxl
update them on a hotplug command. This appears to be possible
because these registers mostly just consume writes without side effects,
so they can be simple memory locations. The one exception is updating
status bits (they are cleared by writing 1s), but I think we can do this
from the AML.
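
(For illustration, the write-1s-to-clear behaviour referred to above, with
the status register modelled as a plain memory location; the names here are
made up:)

#include <linux/types.h>

static u16 pm1a_status;

/* libxl would OR in status bits on a hotplug event ... */
static void pm1a_status_set(u16 bits)
{
        pm1a_status |= bits;
}

/* ... and the guest clears the bits it has handled by writing 1s back */
static void pm1a_status_write(u16 val)
{
        pm1a_status &= ~val;
}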

Other than that the only other thing is setting up an event channel
between the toolstack and the guest (either via xenstore or perhaps by
having a reserved port for SCI).

I have a prototype with (2) (except for the bit clearing part) but I
want to hear comments on this approach before I write proper patches.

-boris
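
For reference, the xenstore monitoring mentioned in the patch follows the
existing drivers/xen/cpu_hotplug.c pattern; a rough sketch, using the watch
callback signature of the time and with the callback body elided:

#include <xen/xenbus.h>

static void vcpu_hotplug_cb(struct xenbus_watch *watch,
                            const char **vec, unsigned int len)
{
        /* vec[XS_WATCH_PATH] is e.g. "cpu/3/availability": parse the
         * cpu number, read the node and online/offline the vcpu.
         */
}

static struct xenbus_watch cpu_watch = {
        .node = "cpu",
        .callback = vcpu_hotplug_cb,
};

/* registered once xenstore is available */
static int setup_cpu_watch(void)
{
        return register_xenbus_watch(&cpu_watch);
}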

2016-10-27 15:00:32

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 8/8] xen/pvh: Enable CPU hotplug

On 27/10/16 15:25, Boris Ostrovsky wrote:
>
>
> On 10/14/2016 03:01 PM, Boris Ostrovsky wrote:
>> On 10/14/2016 02:41 PM, Andrew Cooper wrote:
>>> On 14/10/16 19:05, Boris Ostrovsky wrote:
>>>> PVH guests don't receive ACPI hotplug interrupts and therefore
>>>> need to monitor xenstore for CPU hotplug events.
>>> Why not? If they don't, they should. As we are providing ACPI anyway,
>>> we should provide all bits of it.
>>
>> We don't have IOAPIC, which is how these interrupts are typically
>> delivered. I suppose we might be able to specify it as something else.
>>
>> I'll look into this.
>
>
> (+Jan)
>
> Yes, we can do this. The main issue is how to deal with the event
> registers (i.e. FADT.x_pm1a_evt_blk) and AML's PRST region (which
> specifies the online CPU map).
>
> Currently these are accessed via IO space and are handled by qemu.
>
> There are a couple of ways to deal with this that I can see.
>
> 1. We can implement ioreq handling in the hypervisor; there are only a
> few addresses that need handling.
>
> 2. We can implement those registers in memory space and have libxl
> update them on a hotplug command. This appears to be possible
> because these registers mostly just consume writes without side
> effects, so they can be simple memory locations. The one exception is
> updating status bits (they are cleared by writing 1s), but I think we
> can do this from the AML.
>
> Other than that the only other thing is setting up an event channel
> between the toolstack and the guest (either via xenstore or perhaps by
> having a reserved port for SCI).
>
> I have a prototype with (2) (except for the bit clearing part) but I
> want to hear comments on this approach before I write proper patches.

Xen already deals with 1 for HVM guests. We should do the same for PVH
guests as well.

-1 to anything involving looping a PVH dom0 back around to some entity
running inside dom0.

~Andrew


2016-10-27 16:24:18

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 8/8] xen/pvh: Enable CPU hotplug



On 10/27/2016 11:00 AM, Andrew Cooper wrote:
> On 27/10/16 15:25, Boris Ostrovsky wrote:
>>
>>
>> On 10/14/2016 03:01 PM, Boris Ostrovsky wrote:
>>> On 10/14/2016 02:41 PM, Andrew Cooper wrote:
>>>> On 14/10/16 19:05, Boris Ostrovsky wrote:
>>>>> PVH guests don't receive ACPI hotplug interrupts and therefore
>>>>> need to monitor xenstore for CPU hotplug events.
>>>> Why not? If they don't, they should. As we are providing ACPI anyway,
>>>> we should provide all bits of it.
>>>
>>> We don't have IOAPIC, which is how these interrupts are typically
>>> delivered. I suppose we might be able to specify it as something else.
>>>
>>> I'll look into this.
>>
>>
>> (+Jan)
>>
>> Yes, we can do this. The main issue is how to deal with the event
>> registers (i.e. FADT.x_pm1a_evt_blk) and AML's PRST region (which
>> specifies the online CPU map).
>>
>> Currently these are accessed via IO space and are handled by qemu.
>>
>> There are a couple of ways to deal with this that I can see.
>>
>> 1. We can implement ioreq handling in the hypervisor; there are only a
>> few addresses that need handling.
>>
>> 2. We can implement those registers in memory space and have libxl
>> update them on a hotplug command. This appears to be possible
>> because these registers mostly just consume writes without side
>> effects, so they can be simple memory locations. The one exception is
>> updating status bits (they are cleared by writing 1s), but I think we
>> can do this from the AML.
>>
>> Other than that the only other thing is setting up an event channel
>> between the toolstack and the guest (either via xenstore or perhaps by
>> having a reserved port for SCI).
>>
>> I have a prototype with (2) (except for the bit clearing part) but I
>> want to hear comments on this approach before I write proper patches.
>
> Xen already deals with 1 for HVM guests. We should do the same for PVH
> guests as well.

OK. The reason I was a little hesitant to go that route was that I
didn't want to add more code to the hypervisor.

But this is much simpler than (2) --- we can keep all ACPI data common
with HVM.

>
> -1 to anything involving looping a PVH dom0 back around to some entity
> running inside dom0.


This hotplug path would only be used by an unprivileged domain (i.e. the
one whose ACPI data is created by the toolstack). But TBH I haven't
thought about dom0 at all. And for that (1) is probably the only option.

-boris


2016-10-31 12:35:07

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup



On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
> From: Matt Fleming <[email protected]>
>
> The new Xen PVH entry point requires page tables to be setup by the
> kernel since it is entered with paging disabled.
>
> Pull the common code out of head_32.S and into pgtable_32.S so that
> setup_pgtable_32 can be invoked from both the new Xen entry point and
> the existing startup_32 code.


Ping to x86 maintainers.

Peter, you had questions about this patch. Did I answer them?

-boris


>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Signed-off-by: Matt Fleming <[email protected]>
> ---
> arch/x86/Makefile | 2 +
> arch/x86/kernel/Makefile | 2 +
> arch/x86/kernel/head_32.S | 168 +------------------------------------
> arch/x86/kernel/pgtable_32.S | 196 +++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 201 insertions(+), 167 deletions(-)
> create mode 100644 arch/x86/kernel/pgtable_32.S
>
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 2d44933..67cc771 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -204,6 +204,8 @@ head-y += arch/x86/kernel/head$(BITS).o
> head-y += arch/x86/kernel/ebda.o
> head-y += arch/x86/kernel/platform-quirks.o
>
> +head-$(CONFIG_X86_32) += arch/x86/kernel/pgtable_32.o
> +
> libs-y += arch/x86/lib/
>
> # See arch/x86/Kbuild for content of core part of the kernel
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 4dd5d50..eae85a5 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -8,6 +8,8 @@ extra-y += ebda.o
> extra-y += platform-quirks.o
> extra-y += vmlinux.lds
>
> +extra-$(CONFIG_X86_32) += pgtable_32.o
> +
> CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
>
> ifdef CONFIG_FUNCTION_TRACER
> diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
> index 5f40126..0db066e 100644
> --- a/arch/x86/kernel/head_32.S
> +++ b/arch/x86/kernel/head_32.S
> @@ -41,51 +41,6 @@
> #define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id
>
> /*
> - * This is how much memory in addition to the memory covered up to
> - * and including _end we need mapped initially.
> - * We need:
> - * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
> - * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
> - *
> - * Modulo rounding, each megabyte assigned here requires a kilobyte of
> - * memory, which is currently unreclaimed.
> - *
> - * This should be a multiple of a page.
> - *
> - * KERNEL_IMAGE_SIZE should be greater than pa(_end)
> - * and small than max_low_pfn, otherwise will waste some page table entries
> - */
> -
> -#if PTRS_PER_PMD > 1
> -#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
> -#else
> -#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
> -#endif
> -
> -/*
> - * Number of possible pages in the lowmem region.
> - *
> - * We shift 2 by 31 instead of 1 by 32 to the left in order to avoid a
> - * gas warning about overflowing shift count when gas has been compiled
> - * with only a host target support using a 32-bit type for internal
> - * representation.
> - */
> -LOWMEM_PAGES = (((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)
> -
> -/* Enough space to fit pagetables for the low memory linear map */
> -MAPPING_BEYOND_END = PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT
> -
> -/*
> - * Worst-case size of the kernel mapping we need to make:
> - * a relocatable kernel can live anywhere in lowmem, so we need to be able
> - * to map all of lowmem.
> - */
> -KERNEL_PAGES = LOWMEM_PAGES
> -
> -INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
> -RESERVE_BRK(pagetables, INIT_MAP_SIZE)
> -
> -/*
> * 32-bit kernel entrypoint; only used by the boot CPU. On entry,
> * %esi points to the real-mode code as a 32-bit pointer.
> * CS and DS must be 4 GB flat segments, but we don't depend on
> @@ -157,92 +112,7 @@ ENTRY(startup_32)
> call load_ucode_bsp
> #endif
>
> -/*
> - * Initialize page tables. This creates a PDE and a set of page
> - * tables, which are located immediately beyond __brk_base. The variable
> - * _brk_end is set up to point to the first "safe" location.
> - * Mappings are created both at virtual address 0 (identity mapping)
> - * and PAGE_OFFSET for up to _end.
> - */
> -#ifdef CONFIG_X86_PAE
> -
> - /*
> - * In PAE mode initial_page_table is statically defined to contain
> - * enough entries to cover the VMSPLIT option (that is the top 1, 2 or 3
> - * entries). The identity mapping is handled by pointing two PGD entries
> - * to the first kernel PMD.
> - *
> - * Note the upper half of each PMD or PTE are always zero at this stage.
> - */
> -
> -#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs */
> -
> - xorl %ebx,%ebx /* %ebx is kept at zero */
> -
> - movl $pa(__brk_base), %edi
> - movl $pa(initial_pg_pmd), %edx
> - movl $PTE_IDENT_ATTR, %eax
> -10:
> - leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
> - movl %ecx,(%edx) /* Store PMD entry */
> - /* Upper half already zero */
> - addl $8,%edx
> - movl $512,%ecx
> -11:
> - stosl
> - xchgl %eax,%ebx
> - stosl
> - xchgl %eax,%ebx
> - addl $0x1000,%eax
> - loop 11b
> -
> - /*
> - * End condition: we must map up to the end + MAPPING_BEYOND_END.
> - */
> - movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
> - cmpl %ebp,%eax
> - jb 10b
> -1:
> - addl $__PAGE_OFFSET, %edi
> - movl %edi, pa(_brk_end)
> - shrl $12, %eax
> - movl %eax, pa(max_pfn_mapped)
> -
> - /* Do early initialization of the fixmap area */
> - movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
> - movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
> -#else /* Not PAE */
> -
> -page_pde_offset = (__PAGE_OFFSET >> 20);
> -
> - movl $pa(__brk_base), %edi
> - movl $pa(initial_page_table), %edx
> - movl $PTE_IDENT_ATTR, %eax
> -10:
> - leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
> - movl %ecx,(%edx) /* Store identity PDE entry */
> - movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
> - addl $4,%edx
> - movl $1024, %ecx
> -11:
> - stosl
> - addl $0x1000,%eax
> - loop 11b
> - /*
> - * End condition: we must map up to the end + MAPPING_BEYOND_END.
> - */
> - movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
> - cmpl %ebp,%eax
> - jb 10b
> - addl $__PAGE_OFFSET, %edi
> - movl %edi, pa(_brk_end)
> - shrl $12, %eax
> - movl %eax, pa(max_pfn_mapped)
> -
> - /* Do early initialization of the fixmap area */
> - movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
> - movl %eax,pa(initial_page_table+0xffc)
> -#endif
> + call setup_pgtable_32
>
> #ifdef CONFIG_PARAVIRT
> /* This is can only trip for a broken bootloader... */
> @@ -660,47 +530,11 @@ ENTRY(setup_once_ref)
> */
> __PAGE_ALIGNED_BSS
> .align PAGE_SIZE
> -#ifdef CONFIG_X86_PAE
> -initial_pg_pmd:
> - .fill 1024*KPMDS,4,0
> -#else
> -ENTRY(initial_page_table)
> - .fill 1024,4,0
> -#endif
> -initial_pg_fixmap:
> - .fill 1024,4,0
> ENTRY(empty_zero_page)
> .fill 4096,1,0
> ENTRY(swapper_pg_dir)
> .fill 1024,4,0
>
> -/*
> - * This starts the data section.
> - */
> -#ifdef CONFIG_X86_PAE
> -__PAGE_ALIGNED_DATA
> - /* Page-aligned for the benefit of paravirt? */
> - .align PAGE_SIZE
> -ENTRY(initial_page_table)
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
> -# if KPMDS == 3
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
> -# elif KPMDS == 2
> - .long 0,0
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
> -# elif KPMDS == 1
> - .long 0,0
> - .long 0,0
> - .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> -# else
> -# error "Kernel PMDs should be 1, 2 or 3"
> -# endif
> - .align PAGE_SIZE /* needs to be page-sized too */
> -#endif
> -
> .data
> .balign 4
> ENTRY(initial_stack)
> diff --git a/arch/x86/kernel/pgtable_32.S b/arch/x86/kernel/pgtable_32.S
> new file mode 100644
> index 0000000..aded718
> --- /dev/null
> +++ b/arch/x86/kernel/pgtable_32.S
> @@ -0,0 +1,196 @@
> +#include <linux/threads.h>
> +#include <linux/init.h>
> +#include <linux/linkage.h>
> +#include <asm/segment.h>
> +#include <asm/page_types.h>
> +#include <asm/pgtable_types.h>
> +#include <asm/cache.h>
> +#include <asm/thread_info.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/setup.h>
> +#include <asm/processor-flags.h>
> +#include <asm/msr-index.h>
> +#include <asm/cpufeatures.h>
> +#include <asm/percpu.h>
> +#include <asm/nops.h>
> +#include <asm/bootparam.h>
> +
> +/* Physical address */
> +#define pa(X) ((X) - __PAGE_OFFSET)
> +
> +/*
> + * This is how much memory in addition to the memory covered up to
> + * and including _end we need mapped initially.
> + * We need:
> + * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
> + * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
> + *
> + * Modulo rounding, each megabyte assigned here requires a kilobyte of
> + * memory, which is currently unreclaimed.
> + *
> + * This should be a multiple of a page.
> + *
> + * KERNEL_IMAGE_SIZE should be greater than pa(_end)
> + * and small than max_low_pfn, otherwise will waste some page table entries
> + */
> +
> +#if PTRS_PER_PMD > 1
> +#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
> +#else
> +#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
> +#endif
> +
> +/*
> + * Number of possible pages in the lowmem region.
> + *
> + * We shift 2 by 31 instead of 1 by 32 to the left in order to avoid a
> + * gas warning about overflowing shift count when gas has been compiled
> + * with only a host target support using a 32-bit type for internal
> + * representation.
> + */
> +LOWMEM_PAGES = (((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)
> +
> +/* Enough space to fit pagetables for the low memory linear map */
> +MAPPING_BEYOND_END = PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT
> +
> +/*
> + * Worst-case size of the kernel mapping we need to make:
> + * a relocatable kernel can live anywhere in lowmem, so we need to be able
> + * to map all of lowmem.
> + */
> +KERNEL_PAGES = LOWMEM_PAGES
> +
> +INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
> +RESERVE_BRK(pagetables, INIT_MAP_SIZE)
> +
> +/*
> + * Initialize page tables. This creates a PDE and a set of page
> + * tables, which are located immediately beyond __brk_base. The variable
> + * _brk_end is set up to point to the first "safe" location.
> + * Mappings are created both at virtual address 0 (identity mapping)
> + * and PAGE_OFFSET for up to _end.
> + */
> + .text
> +ENTRY(setup_pgtable_32)
> +#ifdef CONFIG_X86_PAE
> + /*
> + * In PAE mode initial_page_table is statically defined to contain
> + * enough entries to cover the VMSPLIT option (that is the top 1, 2 or 3
> + * entries). The identity mapping is handled by pointing two PGD entries
> + * to the first kernel PMD.
> + *
> + * Note the upper half of each PMD or PTE are always zero at this stage.
> + */
> +
> +#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs */
> +
> + xorl %ebx,%ebx /* %ebx is kept at zero */
> +
> + movl $pa(__brk_base), %edi
> + movl $pa(initial_pg_pmd), %edx
> + movl $PTE_IDENT_ATTR, %eax
> +10:
> + leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
> + movl %ecx,(%edx) /* Store PMD entry */
> + /* Upper half already zero */
> + addl $8,%edx
> + movl $512,%ecx
> +11:
> + stosl
> + xchgl %eax,%ebx
> + stosl
> + xchgl %eax,%ebx
> + addl $0x1000,%eax
> + loop 11b
> +
> + /*
> + * End condition: we must map up to the end + MAPPING_BEYOND_END.
> + */
> + movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
> + cmpl %ebp,%eax
> + jb 10b
> +1:
> + addl $__PAGE_OFFSET, %edi
> + movl %edi, pa(_brk_end)
> + shrl $12, %eax
> + movl %eax, pa(max_pfn_mapped)
> +
> + /* Do early initialization of the fixmap area */
> + movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
> + movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
> +#else /* Not PAE */
> +
> +page_pde_offset = (__PAGE_OFFSET >> 20);
> +
> + movl $pa(__brk_base), %edi
> + movl $pa(initial_page_table), %edx
> + movl $PTE_IDENT_ATTR, %eax
> +10:
> + leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
> + movl %ecx,(%edx) /* Store identity PDE entry */
> + movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
> + addl $4,%edx
> + movl $1024, %ecx
> +11:
> + stosl
> + addl $0x1000,%eax
> + loop 11b
> + /*
> + * End condition: we must map up to the end + MAPPING_BEYOND_END.
> + */
> + movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
> + cmpl %ebp,%eax
> + jb 10b
> + addl $__PAGE_OFFSET, %edi
> + movl %edi, pa(_brk_end)
> + shrl $12, %eax
> + movl %eax, pa(max_pfn_mapped)
> +
> + /* Do early initialization of the fixmap area */
> + movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
> + movl %eax,pa(initial_page_table+0xffc)
> +#endif
> + ret
> +ENDPROC(setup_pgtable_32)
> +
> +/*
> + * BSS section
> + */
> +__PAGE_ALIGNED_BSS
> + .align PAGE_SIZE
> +#ifdef CONFIG_X86_PAE
> +initial_pg_pmd:
> + .fill 1024*KPMDS,4,0
> +#else
> +ENTRY(initial_page_table)
> + .fill 1024,4,0
> +#endif
> +initial_pg_fixmap:
> + .fill 1024,4,0
> +
> +/*
> + * This starts the data section.
> + */
> +#ifdef CONFIG_X86_PAE
> +__PAGE_ALIGNED_DATA
> + /* Page-aligned for the benefit of paravirt? */
> + .align PAGE_SIZE
> +ENTRY(initial_page_table)
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
> +# if KPMDS == 3
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
> +# elif KPMDS == 2
> + .long 0,0
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
> +# elif KPMDS == 1
> + .long 0,0
> + .long 0,0
> + .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
> +# else
> +# error "Kernel PMDs should be 1, 2 or 3"
> +# endif
> + .align PAGE_SIZE /* needs to be page-sized too */
> +#endif
>

2016-12-01 15:32:42

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
>
>
> On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
>> From: Matt Fleming <[email protected]>
>>
>> The new Xen PVH entry point requires page tables to be setup by the
>> kernel since it is entered with paging disabled.
>>
>> Pull the common code out of head_32.S and into pgtable_32.S so that
>> setup_pgtable_32 can be invoked from both the new Xen entry point and
>> the existing startup_32 code.
>
>
> Ping to x86 maintainers.

Pinging again.

I will be re-sending this series at some point (it has been delayed by
some hypervisor changes that will be needed) but I'd like to hear from
x86 maintainers whether this will be acceptable before I post this again.

Thanks.
-boris

>
> Peter, you had questions about this patch. Did I answer them?
>
> -boris
>
>
>> [...]

2016-12-02 09:46:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup


* Boris Ostrovsky <[email protected]> wrote:

> On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
> >
> >
> > On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
> >> From: Matt Fleming <[email protected]>
> >>
> >> The new Xen PVH entry point requires page tables to be setup by the
> >> kernel since it is entered with paging disabled.
> >>
> >> Pull the common code out of head_32.S and into pgtable_32.S so that
> >> setup_pgtable_32 can be invoked from both the new Xen entry point and
> >> the existing startup_32 code.
> >
> >
> > Ping to x86 maintainers.
>
> Pinging again.
>
> I will be re-sending this series at some point (it has been delayed by
> some hypervisor changes that will be needed) but I'd like to hear from
> x86 maintainers whether this will be acceptable before I post this again.

Could this be done in C?

Thanks,

Ingo

2016-12-02 14:03:33

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On 12/02/2016 04:45 AM, Ingo Molnar wrote:
> * Boris Ostrovsky <[email protected]> wrote:
>
>> On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
>>>
>>> On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
>>>> From: Matt Fleming <[email protected]>
>>>>
>>>> The new Xen PVH entry point requires page tables to be setup by the
>>>> kernel since it is entered with paging disabled.
>>>>
>>>> Pull the common code out of head_32.S and into pgtable_32.S so that
>>>> setup_pgtable_32 can be invoked from both the new Xen entry point and
>>>> the existing startup_32 code.
>>>
>>> Ping to x86 maintainers.
>> Pinging again.
>>
>> I will be re-sending this series at some point (it has been delayed by
>> some hypervisor changes that will be needed) but I'd like to hear from
>> x86 maintainers whether this will be acceptable before I post this again.
> Could this be done in C?

I suppose it could be; I haven't thought about it.

The goal here was to simply make existing startup code available to
others (Xen guest) without changes. Are you suggesting building page
tables in C for the Xen guest only, or making startup_32 call new C code
as well?

-boris

2016-12-02 16:09:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup


* Boris Ostrovsky <[email protected]> wrote:

> On 12/02/2016 04:45 AM, Ingo Molnar wrote:
> > * Boris Ostrovsky <[email protected]> wrote:
> >
> >> On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
> >>>
> >>> On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
> >>>> From: Matt Fleming <[email protected]>
> >>>>
> >>>> The new Xen PVH entry point requires page tables to be setup by the
> >>>> kernel since it is entered with paging disabled.
> >>>>
> >>>> Pull the common code out of head_32.S and into pgtable_32.S so that
> >>>> setup_pgtable_32 can be invoked from both the new Xen entry point and
> >>>> the existing startup_32 code.
> >>>
> >>> Ping to x86 maintainers.
> >> Pinging again.
> >>
> >> I will be re-sending this series at some point (it has been delayed by
> >> some hypervisor changes that will be needed) but I'd like to hear from
> >> x86 maintainers whether this will be acceptable before I post this again.
> > Could this be done in C?
>
> I suppose it could be; I haven't thought about it.
>
> The goal here was to simply make existing startup code available to others (Xen
> guest) without changes. Are you suggesting building page tables in C for the Xen
> guest only, or making startup_32 call new C code as well?

My suggestion would be to transform the factored out assembly code to C.

Thanks,

Ingo

2016-12-02 17:53:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On December 2, 2016 8:08:55 AM PST, Ingo Molnar <[email protected]> wrote:
>
>* Boris Ostrovsky <[email protected]> wrote:
>
>> On 12/02/2016 04:45 AM, Ingo Molnar wrote:
>> > * Boris Ostrovsky <[email protected]> wrote:
>> >
>> >> On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
>> >>>
>> >>> On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
>> >>>> From: Matt Fleming <[email protected]>
>> >>>>
>> >>>> The new Xen PVH entry point requires page tables to be setup by the
>> >>>> kernel since it is entered with paging disabled.
>> >>>>
>> >>>> Pull the common code out of head_32.S and into pgtable_32.S so that
>> >>>> setup_pgtable_32 can be invoked from both the new Xen entry point and
>> >>>> the existing startup_32 code.
>> >>>
>> >>> Ping to x86 maintainers.
>> >> Pinging again.
>> >>
>> >> I will be re-sending this series at some point (it has been delayed by
>> >> some hypervisor changes that will be needed) but I'd like to hear from
>> >> x86 maintainers whether this will be acceptable before I post this again.
>> > Could this be done in C?
>>
>> I suppose it could be; I haven't thought about it.
>>
>> The goal here was to simply make existing startup code available to
>> others (Xen guest) without changes. Are you suggesting building page
>> tables in C for the Xen guest only, or making startup_32 call new C code
>> as well?
>
>My suggestion would be to transform the factored out assembly code to C.
>
>Thanks,
>
> Ingo

It is tricky to do so safely, because at this stage almost nothing of the C execution environment has been set up.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2016-12-02 19:48:32

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On 12/02/2016 12:52 PM, [email protected] wrote:
> On December 2, 2016 8:08:55 AM PST, Ingo Molnar <[email protected]> wrote:
>> * Boris Ostrovsky <[email protected]> wrote:
>>
>>> On 12/02/2016 04:45 AM, Ingo Molnar wrote:
>>>> * Boris Ostrovsky <[email protected]> wrote:
>>>>
>>>>> On 10/31/2016 08:33 AM, Boris Ostrovsky wrote:
>>>>>> On 10/14/2016 02:05 PM, Boris Ostrovsky wrote:
>>>>>>> From: Matt Fleming <[email protected]>
>>>>>>>
>>>>>>> The new Xen PVH entry point requires page tables to be setup by the
>>>>>>> kernel since it is entered with paging disabled.
>>>>>>>
>>>>>>> Pull the common code out of head_32.S and into pgtable_32.S so that
>>>>>>> setup_pgtable_32 can be invoked from both the new Xen entry point and
>>>>>>> the existing startup_32 code.
>>>>>> Ping to x86 maintainers.
>>>>> Pinging again.
>>>>>
>>>>> I will be re-sending this series at some point (it has been delayed by
>>>>> some hypervisor changes that will be needed) but I'd like to hear from
>>>>> x86 maintainers whether this will be acceptable before I post this again.
>>>> Could this be done in C?
>>> I suppose it could be; I haven't thought about it.
>>>
>>> The goal here was to simply make existing startup code available to
>>> others (Xen guest) without changes. Are you suggesting building page
>>> tables in C for the Xen guest only, or making startup_32 call new C code
>>> as well?
>> My suggestion would be to transform the factored out assembly code to C.
>>
>> Thanks,
>>
>> Ingo
> It is tricky to do so safely, because at this stage almost nothing of the C execution environment has been set up.


I can still give it a try but I'd rather not tie it to this (Xen PVH)
patch series. Which would leave me with two options: either keep what
this patch does, leaving it as assembly (requires your ack), or have Xen
code build the pages on its own.

-boris

2016-12-03 05:49:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup


* Boris Ostrovsky <[email protected]> wrote:

> > It is tricky to do so safely, because at this stage almost nothing of the C
> > execution environment has been set up.

Yeah - but we do have a fair amount of early C code though.

> I can still give it a try but I'd rather not tie it to this (Xen PVH) patch
> series. Which would leave me with two options: either keep what this patch does,
> leaving it as assembly (requires your ack), or have Xen code build the pages on
> its own.

If you give it a try in a subsequent patch (please Cc: me) then it's OK to me:

Acked-by: Ingo Molnar <[email protected]>

Feel free to carry it in the Xen tree.

Thanks,

Ingo

2016-12-03 06:48:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/8] x86/head: Refactor 32-bit pgtable setup

On December 2, 2016 9:49:50 PM PST, Ingo Molnar <[email protected]> wrote:
>
>* Boris Ostrovsky <[email protected]> wrote:
>
>> > It is tricky to do so safely, because at this stage almost nothing of the C
>> > execution environment has been set up.
>
>Yeah - but we do have a fair amount of early C code though.
>
>> I can still give it a try but I'd rather not tie it to this (Xen PVH) patch
>> series. Which would leave me with two options: either keep what this patch does,
>> leaving it as assembly (requires your ack), or have Xen code build the pages on
>> its own.
>
>If you give it a try in a subsequent patch (please Cc: me) then it's OK to me:
>
> Acked-by: Ingo Molnar <[email protected]>
>
>Feel free to carry it in the Xen tree.
>
>Thanks,
>
> Ingo

It's true that it is now possible to run pre-paging C code. It would be so much better if Xen could simply go through the normal code path like any civilized machine.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
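
For a sense of what the C conversion would involve, a hypothetical sketch of
the non-PAE half of setup_pgtable_32. It has to run before paging is enabled,
so every global must be reached through its physical address (the trickiness
noted above); MAPPING_BEYOND_END is assumed to be made visible to C, and the
fixmap, _brk_end and max_pfn_mapped bookkeeping is omitted:

#include <linux/init.h>
#include <asm/page_types.h>
#include <asm/pgtable_types.h>

/* Physical address of a kernel symbol, mirroring the asm pa() macro */
#define PA(x)   ((unsigned long)(x) - PAGE_OFFSET)

extern char __brk_base[], _end[];
extern unsigned long initial_page_table[1024];

void __init setup_pgtable_32_c(void)
{
        unsigned long *pte = (unsigned long *)PA(__brk_base);
        unsigned long *pgd = (unsigned long *)PA(initial_page_table);
        unsigned long limit = PA(_end) + MAPPING_BEYOND_END;
        unsigned long paddr = 0;
        int i;

        while (paddr < limit) {
                /* pte already holds a physical address, so this is the PDE */
                unsigned long pde = (unsigned long)pte + PDE_IDENT_ATTR;

                pgd[0] = pde;                           /* identity entry */
                pgd[PAGE_OFFSET >> 22] = pde;           /* kernel entry */
                pgd++;

                for (i = 0; i < 1024; i++) {            /* fill one table */
                        *pte++ = paddr | PTE_IDENT_ATTR;
                        paddr += PAGE_SIZE;
                }
        }
}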