2015-02-18 06:52:17

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 00/13] xen: support pv-domains larger than 512GB

Support 64 bit pv-domains with more than 512GB of memory.

Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
8GB machine. Conflicts between E820 map and different hypervisor populated
memory areas have been tested via a fake E820 map reserved area on the
8GB machine.

Juergen Gross (13):
xen: sync with xen header
xen: anchor linear p2m list in shared info structure
xen: eliminate scalability issues from initial mapping setup
xen: move static e820 map to global scope
xen: simplify xen_set_identity_and_remap() by using global variables
xen: detect pre-allocated memory interfering with e820 map
xen: find unused contiguous memory area
xen: add service function to copy physical memory areas
xen: check for kernel memory conflicting with memory layout
xen: check pre-allocated page tables for conflict with memory map
xen: move initrd away from e820 non-ram area
xen: if p2m list located in to be remapped region delay remapping
xen: allow more than 512 GB of RAM for 64 bit pv-domains

Documentation/kernel-parameters.txt | 7 +
arch/x86/include/asm/xen/interface.h | 96 ++++++-
arch/x86/include/asm/xen/page.h | 4 -
arch/x86/xen/Kconfig | 31 +-
arch/x86/xen/enlighten.c | 22 ++
arch/x86/xen/mmu.c | 141 ++++++++-
arch/x86/xen/p2m.c | 23 +-
arch/x86/xen/setup.c | 536 ++++++++++++++++++++++++++++-------
arch/x86/xen/xen-head.S | 2 +
arch/x86/xen/xen-ops.h | 4 +
10 files changed, 717 insertions(+), 149 deletions(-)

--
2.1.4


2015-02-18 06:54:52

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 01/13] xen: sync with xen header

Use the newest header from the xen tree to get some new structure
layouts.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/include/asm/xen/interface.h | 96 ++++++++++++++++++++++++++++++++----
1 file changed, 87 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index 3400dba..3b88eea 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -3,12 +3,38 @@
*
* Guest OS interface to x86 Xen.
*
- * Copyright (c) 2004, K A Fraser
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
*/

#ifndef _ASM_X86_XEN_INTERFACE_H
#define _ASM_X86_XEN_INTERFACE_H

+/*
+ * XEN_GUEST_HANDLE represents a guest pointer, when passed as a field
+ * in a struct in memory.
+ * XEN_GUEST_HANDLE_PARAM represent a guest pointer, when passed as an
+ * hypercall argument.
+ * XEN_GUEST_HANDLE_PARAM and XEN_GUEST_HANDLE are the same on X86 but
+ * they might not be on other architectures.
+ */
#ifdef __XEN__
#define __DEFINE_GUEST_HANDLE(name, type) \
typedef struct { type *p; } __guest_handle_ ## name
@@ -88,13 +114,16 @@ DEFINE_GUEST_HANDLE(xen_ulong_t);
* start of the GDT because some stupid OSes export hard-coded selector values
* in their ABI. These hard-coded values are always near the start of the GDT,
* so Xen places itself out of the way, at the far end of the GDT.
+ *
+ * NB The LDT is set using the MMUEXT_SET_LDT op of HYPERVISOR_mmuext_op
*/
#define FIRST_RESERVED_GDT_PAGE 14
#define FIRST_RESERVED_GDT_BYTE (FIRST_RESERVED_GDT_PAGE * 4096)
#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)

/*
- * Send an array of these to HYPERVISOR_set_trap_table()
+ * Send an array of these to HYPERVISOR_set_trap_table().
+ * Terminate the array with a sentinel entry, with traps[].address==0.
* The privilege level specifies which modes may enter a trap via a software
* interrupt. On x86/64, since rings 1 and 2 are unavailable, we allocate
* privilege levels as follows:
@@ -118,10 +147,41 @@ struct trap_info {
DEFINE_GUEST_HANDLE_STRUCT(trap_info);

struct arch_shared_info {
- unsigned long max_pfn; /* max pfn that appears in table */
- /* Frame containing list of mfns containing list of mfns containing p2m. */
- unsigned long pfn_to_mfn_frame_list_list;
- unsigned long nmi_reason;
+ /*
+ * Number of valid entries in the p2m table(s) anchored at
+ * pfn_to_mfn_frame_list_list and/or p2m_vaddr.
+ */
+ unsigned long max_pfn;
+ /*
+ * Frame containing list of mfns containing list of mfns containing p2m.
+ * A value of 0 indicates it has not yet been set up, ~0 indicates it
+ * has been set to invalid e.g. due to the p2m being too large for the
+ * 3-level p2m tree. In this case the linear mapper p2m list anchored
+ * at p2m_vaddr is to be used.
+ */
+ xen_pfn_t pfn_to_mfn_frame_list_list;
+ unsigned long nmi_reason;
+ /*
+ * Following three fields are valid if p2m_cr3 contains a value
+ * different from 0.
+ * p2m_cr3 is the root of the address space where p2m_vaddr is valid.
+ * p2m_cr3 is in the same format as a cr3 value in the vcpu register
+ * state and holds the folded machine frame number (via xen_pfn_to_cr3)
+ * of a L3 or L4 page table.
+ * p2m_vaddr holds the virtual address of the linear p2m list. All
+ * entries in the range [0...max_pfn[ are accessible via this pointer.
+ * p2m_generation will be incremented by the guest before and after each
+ * change of the mappings of the p2m list. p2m_generation starts at 0
+ * and a value with the least significant bit set indicates that a
+ * mapping update is in progress. This allows guest external software
+ * (e.g. in Dom0) to verify that read mappings are consistent and
+ * whether they have changed since the last check.
+ * Modifying a p2m element in the linear p2m list is allowed via an
+ * atomic write only.
+ */
+ unsigned long p2m_cr3; /* cr3 value of the p2m address space */
+ unsigned long p2m_vaddr; /* virtual address of the p2m list */
+ unsigned long p2m_generation; /* generation count of p2m mapping */
};
#endif /* !__ASSEMBLY__ */

@@ -137,13 +197,31 @@ struct arch_shared_info {
/*
* The following is all CPU context. Note that the fpu_ctxt block is filled
* in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ *
+ * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
+ * for HVM and PVH guests, not all information in this structure is updated:
+ *
+ * - For HVM guests, the structures read include: fpu_ctxt (if
+ * VGCT_I387_VALID is set), flags, user_regs, debugreg[*]
+ *
+ * - PVH guests are the same as HVM guests, but additionally use ctrlreg[3] to
+ * set cr3. All other fields not used should be set to 0.
*/
struct vcpu_guest_context {
/* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
struct { char x[512]; } fpu_ctxt; /* User-level FPU registers */
-#define VGCF_I387_VALID (1<<0)
-#define VGCF_HVM_GUEST (1<<1)
-#define VGCF_IN_KERNEL (1<<2)
+#define VGCF_I387_VALID (1<<0)
+#define VGCF_IN_KERNEL (1<<2)
+#define _VGCF_i387_valid 0
+#define VGCF_i387_valid (1<<_VGCF_i387_valid)
+#define _VGCF_in_kernel 2
+#define VGCF_in_kernel (1<<_VGCF_in_kernel)
+#define _VGCF_failsafe_disables_events 3
+#define VGCF_failsafe_disables_events (1<<_VGCF_failsafe_disables_events)
+#define _VGCF_syscall_disables_events 4
+#define VGCF_syscall_disables_events (1<<_VGCF_syscall_disables_events)
+#define _VGCF_online 5
+#define VGCF_online (1<<_VGCF_online)
unsigned long flags; /* VGCF_* flags */
struct cpu_user_regs user_regs; /* User-level CPU registers */
struct trap_info trap_ctxt[256]; /* Virtual IDT */
--
2.1.4

2015-02-18 06:52:16

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 02/13] xen: anchor linear p2m list in shared info structure

The linear p2m list should be anchored in the shared info structure
read by the Xen tools to be able to support 64 bit pv-domains larger
than 512 MB. Additionally the linear p2m list interface includes a
generation count which is changed prior to and after each mapping
change of the p2m list. Reading the generation count the Xen tools can
detect changes of the mappings and re-read the p2m list eventually.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/p2m.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index f18fd1d..df73cc5 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
virt_to_mfn(p2m_top_mfn);
HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
+ HYPERVISOR_shared_info->arch.p2m_generation = 0;
+ HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
+ HYPERVISOR_shared_info->arch.p2m_cr3 =
+ xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
}

/* Set up p2m_top to point to the domain-builder provided p2m pages */
@@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)

ptechk = lookup_address(vaddr, &level);
if (ptechk == pte_pg) {
+ HYPERVISOR_shared_info->arch.p2m_generation++;
set_pmd(pmdp,
__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+ HYPERVISOR_shared_info->arch.p2m_generation++;
pte_newpg[i] = NULL;
}

@@ -568,8 +574,10 @@ static bool alloc_p2m(unsigned long pfn)
spin_lock_irqsave(&p2m_update_lock, flags);

if (pte_pfn(*ptep) == p2m_pfn) {
+ HYPERVISOR_shared_info->arch.p2m_generation++;
set_pte(ptep,
pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
+ HYPERVISOR_shared_info->arch.p2m_generation++;
if (mid_mfn)
mid_mfn[mididx] = virt_to_mfn(p2m);
p2m = NULL;
@@ -621,6 +629,11 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
return true;
}

+ /*
+ * The interface requires atomic updates on p2m elements.
+ * xen_safe_write_ulong() is using __put_user which does an atomic
+ * store via asm().
+ */
if (likely(!xen_safe_write_ulong(xen_p2m_addr + pfn, mfn)))
return true;

--
2.1.4

2015-02-18 06:54:53

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 03/13] xen: eliminate scalability issues from initial mapping setup

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and it's initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB memory.

Signed-off-by: Juergen Gross <[email protected]>
Based-on-patch-by: Jan Beulich <[email protected]>
---
arch/x86/xen/mmu.c | 126 ++++++++++++++++++++++++++++++++++++++++++++----
arch/x86/xen/setup.c | 67 ++++++++++++++-----------
arch/x86/xen/xen-head.S | 2 +
3 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index adca9e2..1ca5197 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1114,6 +1114,77 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
xen_mc_flush();
}

+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+ void *vaddr = __va(paddr);
+ void *vaddr_end = vaddr + size;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+ make_lowmem_page_readwrite(vaddr);
+
+ memblock_free(paddr, size);
+}
+
+static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+ unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+ ClearPagePinned(virt_to_page(__va(pa)));
+ xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+ unsigned long va = vaddr & PMD_MASK;
+ unsigned long pa;
+ pgd_t *pgd = pgd_offset_k(va);
+ pud_t *pud_page = pud_offset(pgd, 0);
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned int i;
+
+ set_pgd(pgd, __pgd(0));
+ do {
+ pud = pud_page + pud_index(va);
+ if (pud_none(*pud)) {
+ va += PUD_SIZE;
+ } else if (pud_large(*pud)) {
+ pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PUD_SIZE);
+ va += PUD_SIZE;
+ } else {
+ pmd = pmd_offset(pud, va);
+ if (pmd_large(*pmd)) {
+ pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PMD_SIZE);
+ } else if (!pmd_none(*pmd)) {
+ pte = pte_offset_kernel(pmd, va);
+ for (i = 0; i < PTRS_PER_PTE; ++i) {
+ if (pte_none(pte[i]))
+ break;
+ pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+ xen_cleanmfnmap_free_pgtbl(pte);
+ }
+ va += PMD_SIZE;
+ if (pmd_index(va))
+ continue;
+ xen_cleanmfnmap_free_pgtbl(pmd);
+ }
+
+ } while (pud_index(va) || pmd_index(va));
+ xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
static void __init xen_pagetable_p2m_free(void)
{
unsigned long size;
@@ -1128,18 +1199,25 @@ static void __init xen_pagetable_p2m_free(void)
/* using __ka address and sticking INVALID_P2M_ENTRY! */
memset((void *)xen_start_info->mfn_list, 0xff, size);

- /* We should be in __ka space. */
- BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
addr = xen_start_info->mfn_list;
- /* We roundup to the PMD, which means that if anybody at this stage is
- * using the __ka address of xen_start_info or xen_start_info->shared_info
- * they are in going to crash. Fortunatly we have already revectored
- * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+ /*
+ * We could be in __ka space.
+ * We roundup to the PMD, which means that if anybody at this stage is
+ * using the __ka address of xen_start_info or
+ * xen_start_info->shared_info they are in going to crash. Fortunatly
+ * we have already revectored in xen_setup_kernel_pagetable and in
+ * xen_setup_shared_info.
+ */
size = roundup(size, PMD_SIZE);
- xen_cleanhighmap(addr, addr + size);

- size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
- memblock_free(__pa(xen_start_info->mfn_list), size);
+ if (addr >= __START_KERNEL_map) {
+ xen_cleanhighmap(addr, addr + size);
+ size = PAGE_ALIGN(xen_start_info->nr_pages *
+ sizeof(unsigned long));
+ memblock_free(__pa(addr), size);
+ } else {
+ xen_cleanmfnmap(addr);
+ }

/* At this stage, cleanup_highmap has already cleaned __ka space
* from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1461,6 +1539,24 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
#else /* CONFIG_X86_64 */
static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
{
+ unsigned long pfn;
+
+ if (xen_feature(XENFEAT_writable_page_tables) ||
+ xen_feature(XENFEAT_auto_translated_physmap) ||
+ xen_start_info->mfn_list >= __START_KERNEL_map)
+ return pte;
+
+ /*
+ * Pages belonging to the initial p2m list mapped outside the default
+ * address range must be mapped read-only. This region contains the
+ * page tables for mapping the p2m list, too, and page tables MUST be
+ * mapped read-only.
+ */
+ pfn = pte_pfn(pte);
+ if (pfn >= xen_start_info->first_p2m_pfn &&
+ pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+ pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
return pte;
}
#endif /* CONFIG_X86_64 */
@@ -1815,7 +1911,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* mappings. Considering that on Xen after the kernel mappings we
* have the mappings of some pages that don't exist in pfn space, we
* set max_pfn_mapped to the last real pfn mapped. */
- max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+ if (xen_start_info->mfn_list < __START_KERNEL_map)
+ max_pfn_mapped = xen_start_info->first_p2m_pfn;
+ else
+ max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));

pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1855,6 +1954,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Graft it onto L4[511][510] */
copy_page(level2_kernel_pgt, l2);

+ /* Copy the initial P->M table mappings if necessary. */
+ i = pgd_index(xen_start_info->mfn_list);
+ if (i && i < pgd_index(__START_KERNEL_map))
+ init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1895,6 +1999,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Our (by three pages) smaller Xen pagetable that we are using */
memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+ /* protect xen_start_info */
+ memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
}
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 55f388e..adad417 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -560,6 +560,41 @@ static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
}
}

+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * in the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * neccessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so lets just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+ if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+ memblock_reserve(__pa(xen_start_info->mfn_list),
+ xen_start_info->pt_base -
+ xen_start_info->mfn_list);
+ return;
+ }
+
+ memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+ PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
/**
* machine_specific_memory_setup - Hook for machine specific memory setup.
**/
@@ -684,35 +719,10 @@ char * __init xen_memory_setup(void)
e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
E820_RESERVED);

- /*
- * Reserve Xen bits:
- * - mfn_list
- * - xen_start_info
- * See comment above "struct start_info" in <xen/interface/xen.h>
- * We tried to make the the memblock_reserve more selective so
- * that it would be clear what region is reserved. Sadly we ran
- * in the problem wherein on a 64-bit hypervisor with a 32-bit
- * initial domain, the pt_base has the cr3 value which is not
- * neccessarily where the pagetable starts! As Jan put it: "
- * Actually, the adjustment turns out to be correct: The page
- * tables for a 32-on-64 dom0 get allocated in the order "first L1",
- * "first L2", "first L3", so the offset to the page table base is
- * indeed 2. When reading xen/include/public/xen.h's comment
- * very strictly, this is not a violation (since there nothing is said
- * that the first thing in the page table space is pointed to by
- * pt_base; I admit that this seems to be implied though, namely
- * do I think that it is implied that the page table space is the
- * range [pt_base, pt_base + nt_pt_frames), whereas that
- * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
- * which - without a priori knowledge - the kernel would have
- * difficulty to figure out)." - so lets just fall back to the
- * easy way and reserve the whole region.
- */
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
-
sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);

+ xen_reserve_xen_mfnlist();
+
return "Xen";
}

@@ -739,8 +749,7 @@ char * __init xen_auto_xlated_memory_setup(void)
for (i = 0; i < memmap.nr_entries; i++)
e820_add_region(map[i].addr, map[i].size, map[i].type);

- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
+ xen_reserve_xen_mfnlist();

return "Xen";
}
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 674b2225..e89e816 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -147,6 +147,8 @@ NEXT_HYPERCALL(arch_6)
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __PAGE_OFFSET)
#else
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __START_KERNEL_map)
+ /* Map the p2m table to a 512GB-aligned user address. */
+ ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M, .quad PGDIR_SIZE)
#endif
ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, _ASM_PTR startup_xen)
ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
--
2.1.4

2015-02-18 06:54:51

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 04/13] xen: move static e820 map to global scope

Instead of using a function local static e820 map in xen_memory_setup()
and calling various functions in the same source with the map as a
parameter use a map directly accessible by all functions in the source.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 96 +++++++++++++++++++++++++++-------------------------
1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index adad417..ab6c36e 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -38,6 +38,10 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
/* Number of pages released from the initial allocation. */
unsigned long xen_released_pages;

+/* E820 map used during setting up memory. */
+static struct e820entry xen_e820_map[E820MAX] __initdata;
+static u32 xen_e820_map_entries __initdata;
+
/*
* Buffer used to remap identity mapped pages. We only need the virtual space.
* The physical page behind this address is remapped as needed to different
@@ -164,15 +168,13 @@ void __init xen_inv_extra_mem(void)
* This function updates min_pfn with the pfn found and returns
* the size of that range or zero if not found.
*/
-static unsigned long __init xen_find_pfn_range(
- const struct e820entry *list, size_t map_size,
- unsigned long *min_pfn)
+static unsigned long __init xen_find_pfn_range(unsigned long *min_pfn)
{
- const struct e820entry *entry;
+ const struct e820entry *entry = xen_e820_map;
unsigned int i;
unsigned long done = 0;

- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
unsigned long s_pfn;
unsigned long e_pfn;

@@ -356,9 +358,9 @@ static void __init xen_do_set_identity_and_remap_chunk(
* to Xen and not remapped.
*/
static unsigned long __init xen_set_identity_and_remap_chunk(
- const struct e820entry *list, size_t map_size, unsigned long start_pfn,
- unsigned long end_pfn, unsigned long nr_pages, unsigned long remap_pfn,
- unsigned long *released, unsigned long *remapped)
+ unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
+ unsigned long remap_pfn, unsigned long *released,
+ unsigned long *remapped)
{
unsigned long pfn;
unsigned long i = 0;
@@ -379,8 +381,7 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
if (cur_pfn + size > nr_pages)
size = nr_pages - cur_pfn;

- remap_range_size = xen_find_pfn_range(list, map_size,
- &remap_pfn);
+ remap_range_size = xen_find_pfn_range(&remap_pfn);
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
@@ -411,13 +412,12 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
return remap_pfn;
}

-static void __init xen_set_identity_and_remap(
- const struct e820entry *list, size_t map_size, unsigned long nr_pages,
- unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long nr_pages,
+ unsigned long *released, unsigned long *remapped)
{
phys_addr_t start = 0;
unsigned long last_pfn = nr_pages;
- const struct e820entry *entry;
+ const struct e820entry *entry = xen_e820_map;
unsigned long num_released = 0;
unsigned long num_remapped = 0;
int i;
@@ -433,9 +433,9 @@ static void __init xen_set_identity_and_remap(
* example) the DMI tables in a reserved region that begins on
* a non-page boundary.
*/
- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
phys_addr_t end = entry->addr + entry->size;
- if (entry->type == E820_RAM || i == map_size - 1) {
+ if (entry->type == E820_RAM || i == xen_e820_map_entries - 1) {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

@@ -444,9 +444,9 @@ static void __init xen_set_identity_and_remap(

if (start_pfn < end_pfn)
last_pfn = xen_set_identity_and_remap_chunk(
- list, map_size, start_pfn,
- end_pfn, nr_pages, last_pfn,
- &num_released, &num_remapped);
+ start_pfn, end_pfn, nr_pages,
+ last_pfn, &num_released,
+ &num_remapped);
start = end;
}
}
@@ -549,12 +549,12 @@ static void __init xen_align_and_add_e820_region(phys_addr_t start,
e820_add_region(start, end - start, type);
}

-static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
+static void __init xen_ignore_unusable(void)
{
- struct e820entry *entry;
+ struct e820entry *entry = xen_e820_map;
unsigned int i;

- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
if (entry->type == E820_UNUSABLE)
entry->type = E820_RAM;
}
@@ -600,8 +600,6 @@ static void __init xen_reserve_xen_mfnlist(void)
**/
char * __init xen_memory_setup(void)
{
- static struct e820entry map[E820MAX] __initdata;
-
unsigned long max_pfn = xen_start_info->nr_pages;
phys_addr_t mem_end;
int rc;
@@ -616,7 +614,7 @@ char * __init xen_memory_setup(void)
mem_end = PFN_PHYS(max_pfn);

memmap.nr_entries = E820MAX;
- set_xen_guest_handle(memmap.buffer, map);
+ set_xen_guest_handle(memmap.buffer, xen_e820_map);

op = xen_initial_domain() ?
XENMEM_machine_memory_map :
@@ -625,15 +623,16 @@ char * __init xen_memory_setup(void)
if (rc == -ENOSYS) {
BUG_ON(xen_initial_domain());
memmap.nr_entries = 1;
- map[0].addr = 0ULL;
- map[0].size = mem_end;
+ xen_e820_map[0].addr = 0ULL;
+ xen_e820_map[0].size = mem_end;
/* 8MB slack (to balance backend allocations). */
- map[0].size += 8ULL << 20;
- map[0].type = E820_RAM;
+ xen_e820_map[0].size += 8ULL << 20;
+ xen_e820_map[0].type = E820_RAM;
rc = 0;
}
BUG_ON(rc);
BUG_ON(memmap.nr_entries == 0);
+ xen_e820_map_entries = memmap.nr_entries;

/*
* Xen won't allow a 1:1 mapping to be created to UNUSABLE
@@ -644,10 +643,11 @@ char * __init xen_memory_setup(void)
* a patch in the future.
*/
if (xen_initial_domain())
- xen_ignore_unusable(map, memmap.nr_entries);
+ xen_ignore_unusable();

/* Make sure the Xen-supplied memory map is well-ordered. */
- sanitize_e820_map(map, memmap.nr_entries, &memmap.nr_entries);
+ sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
+ &xen_e820_map_entries);

max_pages = xen_get_max_pages();
if (max_pages > max_pfn)
@@ -657,8 +657,8 @@ char * __init xen_memory_setup(void)
* Set identity map on non-RAM pages and prepare remapping the
* underlying RAM.
*/
- xen_set_identity_and_remap(map, memmap.nr_entries, max_pfn,
- &xen_released_pages, &remapped_pages);
+ xen_set_identity_and_remap(max_pfn, &xen_released_pages,
+ &remapped_pages);

extra_pages += xen_released_pages;
extra_pages += remapped_pages;
@@ -677,10 +677,10 @@ char * __init xen_memory_setup(void)
extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
extra_pages);
i = 0;
- while (i < memmap.nr_entries) {
- phys_addr_t addr = map[i].addr;
- phys_addr_t size = map[i].size;
- u32 type = map[i].type;
+ while (i < xen_e820_map_entries) {
+ phys_addr_t addr = xen_e820_map[i].addr;
+ phys_addr_t size = xen_e820_map[i].size;
+ u32 type = xen_e820_map[i].type;

if (type == E820_RAM) {
if (addr < mem_end) {
@@ -696,9 +696,9 @@ char * __init xen_memory_setup(void)

xen_align_and_add_e820_region(addr, size, type);

- map[i].addr += size;
- map[i].size -= size;
- if (map[i].size == 0)
+ xen_e820_map[i].addr += size;
+ xen_e820_map[i].size -= size;
+ if (xen_e820_map[i].size == 0)
i++;
}

@@ -709,7 +709,7 @@ char * __init xen_memory_setup(void)
* PFNs above MAX_P2M_PFN are considered identity mapped as
* well.
*/
- set_phys_range_identity(map[i-1].addr / PAGE_SIZE, ~0ul);
+ set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);

/*
* In domU, the ISA region is normal, usable memory, but we
@@ -731,23 +731,25 @@ char * __init xen_memory_setup(void)
*/
char * __init xen_auto_xlated_memory_setup(void)
{
- static struct e820entry map[E820MAX] __initdata;
-
struct xen_memory_map memmap;
int i;
int rc;

memmap.nr_entries = E820MAX;
- set_xen_guest_handle(memmap.buffer, map);
+ set_xen_guest_handle(memmap.buffer, xen_e820_map);

rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
if (rc < 0)
panic("No memory map (%d)\n", rc);

- sanitize_e820_map(map, ARRAY_SIZE(map), &memmap.nr_entries);
+ xen_e820_map_entries = memmap.nr_entries;
+
+ sanitize_e820_map(xen_e820_map, ARRAY_SIZE(xen_e820_map),
+ &xen_e820_map_entries);

- for (i = 0; i < memmap.nr_entries; i++)
- e820_add_region(map[i].addr, map[i].size, map[i].type);
+ for (i = 0; i < xen_e820_map_entries; i++)
+ e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
+ xen_e820_map[i].type);

xen_reserve_xen_mfnlist();

--
2.1.4

2015-02-18 06:53:50

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables

xen_set_identity_and_remap() is used to prepare remapping of memory
conflicting with the E820 map. It is tracking the pfn where to remap
new memory via a local variable which is passed to a subfunction
which in turn returns the new value for that variable.

Additionally the targeted maximum pfn is passed as a parameter to
sub functions.

Simplify that construct by using just global variables in the
source for that purpose. This will make things simpler when we need
those values later, too.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 63 +++++++++++++++++++++++++---------------------------
1 file changed, 30 insertions(+), 33 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index ab6c36e..0dda131 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -56,6 +56,9 @@ static struct {
} xen_remap_buf __initdata __aligned(PAGE_SIZE);
static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;

+static unsigned long xen_remap_pfn;
+static unsigned long xen_max_pfn;
+
/*
* The maximum amount of extra memory compared to the base size. The
* main scaling factor is the size of struct page. At extreme ratios
@@ -223,7 +226,7 @@ static int __init xen_free_mfn(unsigned long mfn)
* as a fallback if the remapping fails.
*/
static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
- unsigned long end_pfn, unsigned long nr_pages, unsigned long *released)
+ unsigned long end_pfn, unsigned long *released)
{
unsigned long pfn, end;
int ret;
@@ -231,7 +234,7 @@ static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
WARN_ON(start_pfn > end_pfn);

/* Release pages first. */
- end = min(end_pfn, nr_pages);
+ end = min(end_pfn, xen_max_pfn);
for (pfn = start_pfn; pfn < end; pfn++) {
unsigned long mfn = pfn_to_mfn(pfn);

@@ -302,7 +305,7 @@ static void __init xen_update_mem_tables(unsigned long pfn, unsigned long mfn)
* its callers.
*/
static void __init xen_do_set_identity_and_remap_chunk(
- unsigned long start_pfn, unsigned long size, unsigned long remap_pfn)
+ unsigned long start_pfn, unsigned long size)
{
unsigned long buf = (unsigned long)&xen_remap_buf;
unsigned long mfn_save, mfn;
@@ -317,7 +320,7 @@ static void __init xen_do_set_identity_and_remap_chunk(

mfn_save = virt_to_mfn(buf);

- for (ident_pfn_iter = start_pfn, remap_pfn_iter = remap_pfn;
+ for (ident_pfn_iter = start_pfn, remap_pfn_iter = xen_remap_pfn;
ident_pfn_iter < ident_end_pfn;
ident_pfn_iter += REMAP_SIZE, remap_pfn_iter += REMAP_SIZE) {
chunk = (left < REMAP_SIZE) ? left : REMAP_SIZE;
@@ -350,17 +353,16 @@ static void __init xen_do_set_identity_and_remap_chunk(
* This function takes a contiguous pfn range that needs to be identity mapped
* and:
*
- * 1) Finds a new range of pfns to use to remap based on E820 and remap_pfn.
+ * 1) Finds a new range of pfns to use to remap based on E820 and
+ * xen_remap_pfn.
* 2) Calls the do_ function to actually do the mapping/remapping work.
*
* The goal is to not allocate additional memory but to remap the existing
* pages. In the case of an error the underlying memory is simply released back
* to Xen and not remapped.
*/
-static unsigned long __init xen_set_identity_and_remap_chunk(
- unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
- unsigned long remap_pfn, unsigned long *released,
- unsigned long *remapped)
+static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
+ unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
{
unsigned long pfn;
unsigned long i = 0;
@@ -373,30 +375,30 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
unsigned long remap_range_size;

/* Do not remap pages beyond the current allocation */
- if (cur_pfn >= nr_pages) {
+ if (cur_pfn >= xen_max_pfn) {
/* Identity map remaining pages */
set_phys_range_identity(cur_pfn, cur_pfn + size);
break;
}
- if (cur_pfn + size > nr_pages)
- size = nr_pages - cur_pfn;
+ if (cur_pfn + size > xen_max_pfn)
+ size = xen_max_pfn - cur_pfn;

- remap_range_size = xen_find_pfn_range(&remap_pfn);
+ remap_range_size = xen_find_pfn_range(&xen_remap_pfn);
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
- cur_pfn + left, nr_pages, released);
+ cur_pfn + left, released);
break;
}
/* Adjust size to fit in current e820 RAM region */
if (size > remap_range_size)
size = remap_range_size;

- xen_do_set_identity_and_remap_chunk(cur_pfn, size, remap_pfn);
+ xen_do_set_identity_and_remap_chunk(cur_pfn, size);

/* Update variables to reflect new mappings. */
i += size;
- remap_pfn += size;
+ xen_remap_pfn += size;
*remapped += size;
}

@@ -408,20 +410,19 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
(void)HYPERVISOR_update_va_mapping(
(unsigned long)__va(pfn << PAGE_SHIFT),
mfn_pte(pfn, PAGE_KERNEL_IO), 0);
-
- return remap_pfn;
}

-static void __init xen_set_identity_and_remap(unsigned long nr_pages,
- unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long *released,
+ unsigned long *remapped)
{
phys_addr_t start = 0;
- unsigned long last_pfn = nr_pages;
const struct e820entry *entry = xen_e820_map;
unsigned long num_released = 0;
unsigned long num_remapped = 0;
int i;

+ xen_remap_pfn = xen_max_pfn;
+
/*
* Combine non-RAM regions and gaps until a RAM region (or the
* end of the map) is reached, then set the 1:1 map and
@@ -443,10 +444,8 @@ static void __init xen_set_identity_and_remap(unsigned long nr_pages,
end_pfn = PFN_UP(entry->addr);

if (start_pfn < end_pfn)
- last_pfn = xen_set_identity_and_remap_chunk(
- start_pfn, end_pfn, nr_pages,
- last_pfn, &num_released,
- &num_remapped);
+ xen_set_identity_and_remap_chunk(start_pfn,
+ end_pfn, &num_released, &num_remapped);
start = end;
}
}
@@ -600,7 +599,6 @@ static void __init xen_reserve_xen_mfnlist(void)
**/
char * __init xen_memory_setup(void)
{
- unsigned long max_pfn = xen_start_info->nr_pages;
phys_addr_t mem_end;
int rc;
struct xen_memory_map memmap;
@@ -610,8 +608,8 @@ char * __init xen_memory_setup(void)
int i;
int op;

- max_pfn = min(MAX_DOMAIN_PAGES, max_pfn);
- mem_end = PFN_PHYS(max_pfn);
+ xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
+ mem_end = PFN_PHYS(xen_max_pfn);

memmap.nr_entries = E820MAX;
set_xen_guest_handle(memmap.buffer, xen_e820_map);
@@ -650,15 +648,14 @@ char * __init xen_memory_setup(void)
&xen_e820_map_entries);

max_pages = xen_get_max_pages();
- if (max_pages > max_pfn)
- extra_pages += max_pages - max_pfn;
+ if (max_pages > xen_max_pfn)
+ extra_pages += max_pages - xen_max_pfn;

/*
* Set identity map on non-RAM pages and prepare remapping the
* underlying RAM.
*/
- xen_set_identity_and_remap(max_pfn, &xen_released_pages,
- &remapped_pages);
+ xen_set_identity_and_remap(&xen_released_pages, &remapped_pages);

extra_pages += xen_released_pages;
extra_pages += remapped_pages;
@@ -674,7 +671,7 @@ char * __init xen_memory_setup(void)
* the initial memory is also very large with respect to
* lowmem, but we won't try to deal with that here.
*/
- extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
+ extra_pages = min(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
extra_pages);
i = 0;
while (i < xen_e820_map_entries) {
--
2.1.4

2015-02-18 06:53:51

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

Currently especially for dom0 guest memory with guest pfns not
matching host areas populated with RAM are remapped to areas which
are RAM native as well. This is done to be able to use identity
mappings (pfn == mfn) for I/O areas.

Up to now it is not checked whether the remapped memory is already
in use. Remapping used memory will probably result in data corruption,
as the remapped memory will no longer be reserved. Any memory
allocation after the remap can claim that memory.

Add an infrastructure to check for conflicts of reserved memory areas
and in case of a conflict to react via an area specific function.

This function has 3 options:
- Panic
- Handle the conflict by moving the data to another memory area.
This is indicated by a return value other than 0.
- Just return 0. This will delay invalidating the conflicting memory
area to just before doing the remap. This option will be usable for
cases only where the memory will no longer be needed when the remap
operation will be started, e.g. for the p2m list, which is already
copied then.

When doing the remap, check for not remapping a reserved page.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++--
arch/x86/xen/xen-ops.h | 2 +
2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0dda131..a0af554 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
static unsigned long xen_remap_pfn;
static unsigned long xen_max_pfn;

+/*
+ * Areas with memblock_reserve()d memory to be checked against final E820 map.
+ * Each area has an associated function to handle conflicts (by either
+ * removing the conflict or by just crashing with an appropriate message).
+ * The array has a fixed size as there are only few areas of interest which are
+ * well known: kernel, page tables, p2m, initrd.
+ */
+#define XEN_N_RESERVED_AREAS 4
+static struct {
+ phys_addr_t start;
+ phys_addr_t size;
+ int (*func)(phys_addr_t start, phys_addr_t size);
+} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
+
/*
* The maximum amount of extra memory compared to the base size. The
* main scaling factor is the size of struct page. At extreme ratios
@@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
{
unsigned long pfn;
- unsigned long i = 0;
+ unsigned long i;
unsigned long n = end_pfn - start_pfn;

- while (i < n) {
+ for (i = 0; i < n; ) {
unsigned long cur_pfn = start_pfn + i;
unsigned long left = n - i;
unsigned long size = left;
@@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
(unsigned long)__va(pfn << PAGE_SHIFT),
mfn_pte(pfn, PAGE_KERNEL_IO), 0);
}
+/* Check to be remapped memory area for conflicts with reserved areas.
+ *
+ * Skip regions known to be reserved which are handled later. For these
+ * regions we have to increase the remapped counter in order to reserve
+ * extra memory space.
+ *
+ * In case a memory page already in use is to be remapped, just BUG().
+ */
+static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
+ unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
+{
+ unsigned long pfn;
+ unsigned long area_start, area_end;
+ unsigned i;
+
+ for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+
+ if (!xen_reserved_area[i].size)
+ break;
+
+ area_start = PFN_DOWN(xen_reserved_area[i].start);
+ area_end = PFN_UP(xen_reserved_area[i].start +
+ xen_reserved_area[i].size);
+ if (area_start >= end_pfn || area_end <= start_pfn)
+ continue;
+
+ if (area_start > start_pfn)
+ xen_set_identity_and_remap_chunk(start_pfn, area_start,
+ released, remapped);
+
+ if (area_end < end_pfn)
+ xen_set_identity_and_remap_chunk(area_end, end_pfn,
+ released, remapped);
+
+ *remapped += min(area_end, end_pfn) -
+ max(area_start, start_pfn);
+
+ return;
+ }
+
+ /* Test for memory already in use */
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ BUG_ON(memblock_is_reserved(PFN_PHYS(pfn)));
+
+ xen_set_identity_and_remap_chunk(start_pfn, end_pfn,
+ released, remapped);
+}

static void __init xen_set_identity_and_remap(unsigned long *released,
unsigned long *remapped)
@@ -444,7 +505,7 @@ static void __init xen_set_identity_and_remap(unsigned long *released,
end_pfn = PFN_UP(entry->addr);

if (start_pfn < end_pfn)
- xen_set_identity_and_remap_chunk(start_pfn,
+ xen_set_identity_and_remap_chunk_chk(start_pfn,
end_pfn, &num_released, &num_remapped);
start = end;
}
@@ -456,6 +517,45 @@ static void __init xen_set_identity_and_remap(unsigned long *released,
pr_info("Released %ld page(s)\n", num_released);
}

+static void __init xen_late_set_identity_and_remap(void)
+{
+ const struct e820entry *entry = xen_e820_map;
+ int i, e;
+ unsigned long num_released = 0;
+ unsigned long num_remapped = 0;
+
+ for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
+ unsigned long area_start, area_end;
+
+ if (!xen_reserved_area[i].size)
+ return;
+
+ area_start = PFN_DOWN(xen_reserved_area[i].start);
+ area_end = PFN_UP(xen_reserved_area[i].start +
+ xen_reserved_area[i].size);
+
+ for (e = 0; e < xen_e820_map_entries; e++, entry++) {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+
+ if (entry->type == E820_RAM)
+ continue;
+
+ start_pfn = PFN_DOWN(entry->addr);
+ end_pfn = PFN_UP(entry->addr + entry->size);
+
+ if (area_start >= end_pfn || area_end <= start_pfn)
+ continue;
+
+ start_pfn = max(area_start, start_pfn);
+ end_pfn = min(area_end, end_pfn);
+
+ xen_set_identity_and_remap_chunk(start_pfn, end_pfn,
+ &num_released, &num_remapped);
+ }
+ }
+}
+
/*
* Remap the memory prepared in xen_do_set_identity_and_remap_chunk().
* The remap information (which mfn remap to which pfn) is contained in the
@@ -472,6 +572,8 @@ void __init xen_remap_memory(void)
unsigned long pfn_s = ~0UL;
unsigned long len = 0;

+ xen_late_set_identity_and_remap();
+
mfn_save = virt_to_mfn(buf);

while (xen_remap_mfn != INVALID_P2M_ENTRY) {
@@ -560,6 +662,76 @@ static void __init xen_ignore_unusable(void)
}

/*
+ * Check reserved memory areas for conflicts with E820 map.
+ */
+static void __init xen_chk_e820_reserved(void)
+{
+ struct e820entry *entry;
+ unsigned areacnt, mapcnt;
+ phys_addr_t start, end;
+ int ok;
+
+ for (areacnt = 0; areacnt < XEN_N_RESERVED_AREAS; areacnt++) {
+ start = xen_reserved_area[areacnt].start;
+ end = start + xen_reserved_area[areacnt].size;
+ if (start == end)
+ return;
+
+ ok = 0;
+ entry = xen_e820_map;
+
+ for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++) {
+ if (entry->type == E820_RAM && entry->addr <= start &&
+ (entry->addr + entry->size) >= end) {
+ ok = 1;
+ break;
+ }
+ entry++;
+ }
+
+ if (ok || !xen_reserved_area[areacnt].func(start, end - start))
+ continue;
+
+ for (mapcnt = areacnt; mapcnt < XEN_N_RESERVED_AREAS - 1;
+ mapcnt++)
+ xen_reserved_area[mapcnt] =
+ xen_reserved_area[mapcnt + 1];
+ xen_reserved_area[mapcnt].start = 0;
+ xen_reserved_area[mapcnt].size = 0;
+
+ areacnt--;
+ }
+}
+
+void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
+ int (*func)(phys_addr_t, phys_addr_t), int reserve)
+{
+ unsigned idx;
+
+ if (!size)
+ return;
+
+ BUG_ON(xen_reserved_area[XEN_N_RESERVED_AREAS - 1].size);
+
+ for (idx = XEN_N_RESERVED_AREAS - 1; idx > 0; idx--) {
+ if (!xen_reserved_area[idx - 1].size)
+ continue;
+
+ if (start > xen_reserved_area[idx - 1].start)
+ break;
+
+ xen_reserved_area[idx] = xen_reserved_area[idx - 1];
+ }
+
+ xen_reserved_area[idx].start = start;
+ xen_reserved_area[idx].size = size;
+ xen_reserved_area[idx].func = func;
+
+ if (reserve)
+ memblock_reserve(start, size);
+}
+
+/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
* We tried to make the the memblock_reserve more selective so
@@ -608,6 +780,8 @@ char * __init xen_memory_setup(void)
int i;
int op;

+ xen_reserve_xen_mfnlist();
+
xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
mem_end = PFN_PHYS(xen_max_pfn);

@@ -647,6 +821,9 @@ char * __init xen_memory_setup(void)
sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
&xen_e820_map_entries);

+ /* Check for conflicts between used memory and memory map. */
+ xen_chk_e820_reserved();
+
max_pages = xen_get_max_pages();
if (max_pages > xen_max_pfn)
extra_pages += max_pages - xen_max_pfn;
@@ -718,8 +895,6 @@ char * __init xen_memory_setup(void)

sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);

- xen_reserve_xen_mfnlist();
-
return "Xen";
}

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 9e195c6..fee4f70 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -42,6 +42,8 @@ void xen_mm_unpin_all(void);
unsigned long __ref xen_chk_extra_mem(unsigned long pfn);
void __init xen_inv_extra_mem(void);
void __init xen_remap_memory(void);
+void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
+ int (*func)(phys_addr_t, phys_addr_t), int reserve);
char * __init xen_memory_setup(void);
char * xen_auto_xlated_memory_setup(void);
void __init xen_arch_setup(void);
--
2.1.4

2015-02-18 06:53:47

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 07/13] xen: find unused contiguous memory area

For being able to relocate pre-allocated data areas like initrd or
p2m list it is mandatory to find a contiguous memory area which is
not yet in use and doesn't conflict with the memory map we want to
be in effect.

In case such an area is found reserve it at once as this will be
required to be done in any case.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 1 +
2 files changed, 35 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a0af554..9c49d71 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -732,6 +732,40 @@ void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
}

/*
+ * Find a free area in physical memory not yet reserved and compliant with
+ * E820 map.
+ * Used to relocate pre-allocated areas like initrd or p2m list which are in
+ * conflict with the to be used E820 map.
+ * In case no area is found, return 0. Otherwise return the physical address
+ * of the area which is already reserved for convenience.
+ */
+phys_addr_t __init xen_find_free_area(phys_addr_t size)
+{
+ unsigned mapcnt;
+ phys_addr_t addr, start;
+ struct e820entry *entry = xen_e820_map;
+
+ for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++, entry++) {
+ if (entry->type != E820_RAM || entry->size < size)
+ continue;
+ start = entry->addr;
+ for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+ if (!memblock_is_reserved(addr))
+ continue;
+ start = addr + PAGE_SIZE;
+ if (start + size > entry->addr + entry->size)
+ break;
+ }
+ if (addr >= start + size) {
+ memblock_reserve(start, size);
+ return start;
+ }
+ }
+
+ return 0;
+}
+
+/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
* We tried to make the the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fee4f70..8181e01 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -44,6 +44,7 @@ void __init xen_inv_extra_mem(void);
void __init xen_remap_memory(void);
void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
int (*func)(phys_addr_t, phys_addr_t), int reserve);
+phys_addr_t __init xen_find_free_area(phys_addr_t size);
char * __init xen_memory_setup(void);
char * xen_auto_xlated_memory_setup(void);
void __init xen_arch_setup(void);
--
2.1.4

2015-02-18 06:53:48

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 08/13] xen: add service function to copy physical memory areas

In case a pre-allocated memory area is to be moved in order to avoid
a conflict with the target E820 map we need a way to copy data between
physical addresses.

Add a function doing this via early_memremap().

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 29 +++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 1 +
2 files changed, 30 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 9c49d71..eb219c1 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
}

/*
+ * Like memcpy, but with physical addresses for dest and src.
+ */
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
+{
+ phys_addr_t dest_off, src_off, dest_len, src_len, len;
+ void *from, *to;
+
+ while (n) {
+ dest_off = dest & ~PAGE_MASK;
+ src_off = src & ~PAGE_MASK;
+ dest_len = n;
+ if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
+ dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
+ src_len = n;
+ if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
+ src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
+ len = min(dest_len, src_len);
+ to = early_memremap(dest - dest_off, dest_len + dest_off);
+ from = early_memremap(src - src_off, src_len + src_off);
+ memcpy(to, from, len);
+ early_iounmap(to, dest_len + dest_off);
+ early_iounmap(from, src_len + src_off);
+ n -= len;
+ dest += len;
+ src += len;
+ }
+}
+
+/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
* We tried to make the the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 8181e01..9bf9df8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void __init xen_remap_memory(void);
void __init xen_add_reserved_area(phys_addr_t start, phys_addr_t size,
int (*func)(phys_addr_t, phys_addr_t), int reserve);
phys_addr_t __init xen_find_free_area(phys_addr_t size);
+void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n);
char * __init xen_memory_setup(void);
char * xen_auto_xlated_memory_setup(void);
void __init xen_arch_setup(void);
--
2.1.4

2015-02-18 06:52:45

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 09/13] xen: check for kernel memory conflicting with memory layout

Checks whether the pre-allocated memory of the loaded kernel is in
conflict with the target memory map. If this is the case, just panic
instead of run into problems later.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index eb219c1..37a34f9 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -829,6 +829,12 @@ static void __init xen_reserve_xen_mfnlist(void)
PFN_PHYS(xen_start_info->nr_p2m_frames));
}

+static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+ panic("kernel is located at position conflicting with E820 map!\n");
+ return 0;
+}
+
/**
* machine_specific_memory_setup - Hook for machine specific memory setup.
**/
@@ -843,6 +849,10 @@ char * __init xen_memory_setup(void)
int i;
int op;

+ xen_add_reserved_area(__pa_symbol(_text),
+ __pa_symbol(__bss_stop) - __pa_symbol(_text),
+ xen_kernel_mem_conflict, 0);
+
xen_reserve_xen_mfnlist();

xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
--
2.1.4

2015-02-18 06:52:46

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map

Check whether the page tables built by the domain builder are at
memory addresses which are in conflict with the target memory map.
If this is the case just panic instead of running into problems
later.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/mmu.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 1ca5197..6641459 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1863,6 +1863,12 @@ void __init xen_setup_machphys_mapping(void)
#endif
}

+static int __init xen_pt_memory_conflict(phys_addr_t start, phys_addr_t size)
+{
+ panic("page tables are located at position conflicting with E820 map!\n");
+ return 0;
+}
+
#ifdef CONFIG_X86_64
static void __init convert_pfn_mfn(void *v)
{
@@ -1998,7 +2004,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
check_pt_base(&pt_base, &pt_end, addr[i]);

/* Our (by three pages) smaller Xen pagetable that we are using */
- memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+ xen_add_reserved_area(PFN_PHYS(pt_base),
+ (pt_end - pt_base) * PAGE_SIZE,
+ xen_pt_memory_conflict, 1);
/* protect xen_start_info */
memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
@@ -2074,8 +2082,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
PFN_DOWN(__pa(initial_page_table)));
xen_write_cr3(__pa(initial_page_table));

- memblock_reserve(__pa(xen_start_info->pt_base),
- xen_start_info->nr_pt_frames * PAGE_SIZE);
+ xen_add_reserved_area(__pa(xen_start_info->pt_base),
+ xen_start_info->nr_pt_frames * PAGE_SIZE,
+ xen_pt_memory_conflict, 1);
}
#endif /* CONFIG_X86_64 */

--
2.1.4

2015-02-18 06:52:43

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 11/13] xen: move initrd away from e820 non-ram area

When adapting the dom0 memory layout to that of the host make sure
the initrd isn't moved to another pfn range, as it won't be found
there any more.

The easiest way to accomplish that is by copying the initrd to an
area which is RAM according to the E820 map.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/enlighten.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 78a881b..21c82dfd 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1530,6 +1530,25 @@ static void __init xen_pvh_early_guest_init(void)
}
#endif /* CONFIG_XEN_PVH */

+static int __init xen_initrd_mem_conflict(phys_addr_t start, phys_addr_t size)
+{
+ phys_addr_t new;
+
+ new = xen_find_free_area(size);
+ if (!new)
+ panic("initrd is located at position conflicting with E820 map!\n");
+
+ xen_phys_memcpy(new, start, size);
+ pr_info("initrd moved from [mem %#010llx-%#010llx] to [mem %#010llx-%#010llx]\n",
+ start, start + size, new, new + size);
+ memblock_free(start, size);
+
+ boot_params.hdr.ramdisk_image = new;
+ boot_params.ext_ramdisk_image = new >> 32;
+
+ return 1;
+}
+
/* First C function to be called on Xen boot */
asmlinkage __visible void __init xen_start_kernel(void)
{
@@ -1691,6 +1710,9 @@ asmlinkage __visible void __init xen_start_kernel(void)
boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);

+ xen_add_reserved_area(initrd_start, xen_start_info->mod_len,
+ xen_initrd_mem_conflict, 0);
+
if (!xen_initial_domain()) {
add_preferred_console("xenboot", 0, NULL);
add_preferred_console("tty", 0, NULL);
--
2.1.4

2015-02-18 06:52:42

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping

With adapting the memory layout of dom0 to that of the host care must
be taken not to remap the initial p2m list supported by the hypervisor.

If the p2m map is detected to be in a region which is going to be
remapped, delay the remapping of that area. Not doing so can either
crash the system very early, or lead to clobbered data as the target
memory area of the remap operation will no longer be reserved.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 37a34f9..84a6473 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -794,6 +794,20 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
}
}

+#ifdef CONFIG_X86_64
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+ /* Delay invalidating memory. */
+ return 0;
+}
+#else
+static int __init xen_p2m_conflict(phys_addr_t start, phys_addr_t size)
+{
+ panic("p2m list is located at position conflicting with E820 map!\n");
+ return 0;
+}
+#endif
+
/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
@@ -819,14 +833,16 @@ void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
static void __init xen_reserve_xen_mfnlist(void)
{
if (xen_start_info->mfn_list >= __START_KERNEL_map) {
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base -
- xen_start_info->mfn_list);
+ xen_add_reserved_area(__pa(xen_start_info->mfn_list),
+ xen_start_info->pt_base -
+ xen_start_info->mfn_list,
+ xen_p2m_conflict, 1);
return;
}

- memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
- PFN_PHYS(xen_start_info->nr_p2m_frames));
+ xen_add_reserved_area(PFN_PHYS(xen_start_info->first_p2m_pfn),
+ PFN_PHYS(xen_start_info->nr_p2m_frames),
+ xen_p2m_conflict, 1);
}

static int __init xen_kernel_mem_conflict(phys_addr_t start, phys_addr_t size)
--
2.1.4

2015-02-18 06:52:20

by Juergen Gross

[permalink] [raw]
Subject: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
main reason has been the 3 level p2m tree, which was replaced by the
virtual mapped linear p2m list. Parallel to the p2m list which is
being used by the kernel itself there is a 3 level mfn tree for usage
by the Xen tools and eventually for crash dump analysis. For this tree
the linear p2m list can serve as a replacement, too. As the kernel
can't know whether the tools are capable of dealing with the p2m list
instead of the mfn tree, the limit of 512 GB can't be dropped in all
cases.

This patch replaces the hard limit by a kernel parameter which tells
the kernel to obey the 512 GB limit or not. The default is selected by
a configuration parameter which specifies whether the 512 GB limit
should be active per default for dom0 (only crash dump analysis is
affected) and/or for domUs (additionally domain save/restore/migration
are affected).

Memory above the domain limit is returned to the hypervisor instead of
being identity mapped, which was wrong anyways.

The kernel configuration parameter to specify the maximum size of a
domain can be deleted, as it is not relevant any more.

Signed-off-by: Juergen Gross <[email protected]>
---
Documentation/kernel-parameters.txt | 7 ++++
arch/x86/include/asm/xen/page.h | 4 ---
arch/x86/xen/Kconfig | 31 +++++++++++-----
arch/x86/xen/p2m.c | 10 +++---
arch/x86/xen/setup.c | 72 ++++++++++++++++++++++++++++++-------
5 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a89e326..7bf6342 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3959,6 +3959,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
plus one apbt timer for broadcast timer.
x86_intel_mid_timer=apbt_only | lapic_and_apbt

+ xen_512gb_limit [KNL,X86-64,XEN]
+ Restricts the kernel running paravirtualized under Xen
+ to use only up to 512 GB of RAM. The reason to do so is
+ crash analysis tools and Xen tools for doing domain
+ save/restore/migration must be enabled to handle larger
+ domains.
+
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 358dcd3..18a11f2 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -35,10 +35,6 @@ typedef struct xpaddr {
#define FOREIGN_FRAME(m) ((m) | FOREIGN_FRAME_BIT)
#define IDENTITY_FRAME(m) ((m) | IDENTITY_FRAME_BIT)

-/* Maximum amount of memory we can handle in a domain in pages */
-#define MAX_DOMAIN_PAGES \
- ((unsigned long)((u64)CONFIG_XEN_MAX_DOMAIN_MEMORY * 1024 * 1024 * 1024 / PAGE_SIZE))
-
extern unsigned long *machine_to_phys_mapping;
extern unsigned long machine_to_phys_nr;
extern unsigned long *xen_p2m_addr;
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index e88fda8..b61a15e 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,29 @@ config XEN_PVHVM
def_bool y
depends on XEN && PCI && X86_LOCAL_APIC

-config XEN_MAX_DOMAIN_MEMORY
- int
- default 500 if X86_64
- default 64 if X86_32
- depends on XEN
- help
- This only affects the sizing of some bss arrays, the unused
- portions of which are freed.
+if X86_64
+choice
+ prompt "Support pv-domains larger than 512GB"
+ default XEN_512GB_NONE
+ help
+ Support paravirtualized domains with more than 512GB of RAM.
+
+ The Xen tools and crash dump analysis tools might not support
+ pv-domains with more than 512 GB of RAM. This option controls the
+ default setting of the kernel to use only up to 512 GB or more.
+ It is always possible to change the default via specifying the
+ boot parameter "xen_512gb_limit".
+
+ config XEN_512GB_NONE
+ bool "neither dom0 nor domUs can be larger than 512GB"
+ config XEN_512GB_DOM0
+ bool "dom0 can be larger than 512GB, domUs not"
+ config XEN_512GB_DOMU
+ bool "domUs can be larger than 512GB, dom0 not"
+ config XEN_512GB_ALL
+ bool "dom0 and domUs can be larger than 512GB"
+endchoice
+endif

config XEN_SAVE_RESTORE
bool
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index df73cc5..12a1e98 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -502,7 +502,7 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
*/
static bool alloc_p2m(unsigned long pfn)
{
- unsigned topidx, mididx;
+ unsigned topidx;
unsigned long *top_mfn_p, *mid_mfn;
pte_t *ptep, *pte_pg;
unsigned int level;
@@ -510,9 +510,6 @@ static bool alloc_p2m(unsigned long pfn)
unsigned long addr = (unsigned long)(xen_p2m_addr + pfn);
unsigned long p2m_pfn;

- topidx = p2m_top_index(pfn);
- mididx = p2m_mid_index(pfn);
-
ptep = lookup_address(addr, &level);
BUG_ON(!ptep || level != PG_LEVEL_4K);
pte_pg = (pte_t *)((unsigned long)ptep & ~(PAGE_SIZE - 1));
@@ -524,7 +521,8 @@ static bool alloc_p2m(unsigned long pfn)
return false;
}

- if (p2m_top_mfn) {
+ if (p2m_top_mfn && pfn < MAX_P2M_PFN) {
+ topidx = p2m_top_index(pfn);
top_mfn_p = &p2m_top_mfn[topidx];
mid_mfn = ACCESS_ONCE(p2m_top_mfn_p[topidx]);

@@ -579,7 +577,7 @@ static bool alloc_p2m(unsigned long pfn)
pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
HYPERVISOR_shared_info->arch.p2m_generation++;
if (mid_mfn)
- mid_mfn[mididx] = virt_to_mfn(p2m);
+ mid_mfn[p2m_mid_index(pfn)] = virt_to_mfn(p2m);
p2m = NULL;
}

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 84a6473..16d94de 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -32,6 +32,8 @@
#include "p2m.h"
#include "mmu.h"

+#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
+
/* Amount of extra memory space we add to the e820 ranges */
struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;

@@ -85,6 +87,27 @@ static struct {
*/
#define EXTRA_MEM_RATIO (10)

+static bool xen_dom0_512gb_limit __initdata =
+ IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
+static bool xen_domu_512gb_limit __initdata =
+ IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
+
+static int __init xen_parse_512gb(char *arg)
+{
+ bool val = false;
+
+ if (!arg)
+ val = true;
+ else if (strtobool(arg, &val))
+ return 1;
+
+ xen_dom0_512gb_limit = val;
+ xen_domu_512gb_limit = val;
+
+ return 0;
+}
+early_param("xen_512gb_limit", xen_parse_512gb);
+
static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
{
int i;
@@ -242,14 +265,13 @@ static int __init xen_free_mfn(unsigned long mfn)
static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
unsigned long end_pfn, unsigned long *released)
{
- unsigned long pfn, end;
+ unsigned long pfn;
int ret;

WARN_ON(start_pfn > end_pfn);

/* Release pages first. */
- end = min(end_pfn, xen_max_pfn);
- for (pfn = start_pfn; pfn < end; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
unsigned long mfn = pfn_to_mfn(pfn);

/* Make sure pfn exists to start with */
@@ -390,8 +412,9 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,

/* Do not remap pages beyond the current allocation */
if (cur_pfn >= xen_max_pfn) {
- /* Identity map remaining pages */
- set_phys_range_identity(cur_pfn, cur_pfn + size);
+ /* Release remaining pages */
+ xen_set_identity_and_release_chunk(cur_pfn,
+ cur_pfn + size, released);
break;
}
if (cur_pfn + size > xen_max_pfn)
@@ -612,12 +635,34 @@ void __init xen_remap_memory(void)
pr_info("Remapped %ld page(s)\n", remapped);
}

+static unsigned long __init xen_get_pages_limit(void)
+{
+ unsigned long limit;
+
+#ifdef CONFIG_X86_32
+ limit = GB(64) / PAGE_SIZE;
+#else
+ limit = ~0ul;
+ if (xen_initial_domain()) {
+ if (xen_dom0_512gb_limit)
+ limit = GB(512) / PAGE_SIZE;
+ } else {
+ if (xen_domu_512gb_limit)
+ limit = GB(512) / PAGE_SIZE;
+ }
+#endif
+ return limit;
+}
+
static unsigned long __init xen_get_max_pages(void)
{
- unsigned long max_pages = MAX_DOMAIN_PAGES;
+ unsigned long max_pages, limit;
domid_t domid = DOMID_SELF;
int ret;

+ limit = xen_get_pages_limit();
+ max_pages = limit;
+
/*
* For the initial domain we use the maximum reservation as
* the maximum page.
@@ -633,7 +678,7 @@ static unsigned long __init xen_get_max_pages(void)
max_pages = ret;
}

- return min(max_pages, MAX_DOMAIN_PAGES);
+ return min(max_pages, limit);
}

static void __init xen_align_and_add_e820_region(phys_addr_t start,
@@ -871,7 +916,8 @@ char * __init xen_memory_setup(void)

xen_reserve_xen_mfnlist();

- xen_max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
+ xen_max_pfn = xen_get_pages_limit();
+ xen_max_pfn = min(xen_max_pfn, xen_start_info->nr_pages);
mem_end = PFN_PHYS(xen_max_pfn);

memmap.nr_entries = E820MAX;
@@ -933,12 +979,15 @@ char * __init xen_memory_setup(void)
* is limited to the max size of lowmem, so that it doesn't
* get completely filled.
*
+ * Make sure we have no memory above max_pages, as this area
+ * isn't handled by the p2m management.
+ *
* In principle there could be a problem in lowmem systems if
* the initial memory is also very large with respect to
* lowmem, but we won't try to deal with that here.
*/
- extra_pages = min(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
- extra_pages);
+ extra_pages = min3(EXTRA_MEM_RATIO * min(xen_max_pfn, PFN_DOWN(MAXMEM)),
+ extra_pages, max_pages - xen_max_pfn);
i = 0;
while (i < xen_e820_map_entries) {
phys_addr_t addr = xen_e820_map[i].addr;
@@ -968,9 +1017,6 @@ char * __init xen_memory_setup(void)
/*
* Set the rest as identity mapped, in case PCI BARs are
* located here.
- *
- * PFNs above MAX_P2M_PFN are considered identity mapped as
- * well.
*/
set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);

--
2.1.4

2015-02-18 09:21:38

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
> 64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
> main reason has been the 3 level p2m tree, which was replaced by the
> virtual mapped linear p2m list. Parallel to the p2m list which is
> being used by the kernel itself there is a 3 level mfn tree for usage
> by the Xen tools and eventually for crash dump analysis. For this tree
> the linear p2m list can serve as a replacement, too. As the kernel
> can't know whether the tools are capable of dealing with the p2m list
> instead of the mfn tree, the limit of 512 GB can't be dropped in all
> cases.
>
> This patch replaces the hard limit by a kernel parameter which tells
> the kernel to obey the 512 GB limit or not. The default is selected by
> a configuration parameter which specifies whether the 512 GB limit
> should be active per default for dom0 (only crash dump analysis is
> affected) and/or for domUs (additionally domain save/restore/migration
> are affected).
>
> Memory above the domain limit is returned to the hypervisor instead of
> being identity mapped, which was wrong anyways.
>
> The kernel configuration parameter to specify the maximum size of a
> domain can be deleted, as it is not relevant any more.
>
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 7 ++++
> arch/x86/include/asm/xen/page.h | 4 ---
> arch/x86/xen/Kconfig | 31 +++++++++++-----
> arch/x86/xen/p2m.c | 10 +++---
> arch/x86/xen/setup.c | 72 ++++++++++++++++++++++++++++++-------
> 5 files changed, 93 insertions(+), 31 deletions(-)

[...]

> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -23,14 +23,29 @@ config XEN_PVHVM
> def_bool y
> depends on XEN && PCI && X86_LOCAL_APIC
>
> -config XEN_MAX_DOMAIN_MEMORY
> - int
> - default 500 if X86_64
> - default 64 if X86_32
> - depends on XEN
> - help
> - This only affects the sizing of some bss arrays, the unused
> - portions of which are freed.
> +if X86_64

Not
&& XEN
?

> +choice
> + prompt "Support pv-domains larger than 512GB"
> + default XEN_512GB_NONE
> + help
> + Support paravirtualized domains with more than 512GB of RAM.
> +
> + The Xen tools and crash dump analysis tools might not support
> + pv-domains with more than 512 GB of RAM. This option controls the
> + default setting of the kernel to use only up to 512 GB or more.
> + It is always possible to change the default via specifying the
> + boot parameter "xen_512gb_limit".
> +
> + config XEN_512GB_NONE
> + bool "neither dom0 nor domUs can be larger than 512GB"
> + config XEN_512GB_DOM0
> + bool "dom0 can be larger than 512GB, domUs not"
> + config XEN_512GB_DOMU
> + bool "domUs can be larger than 512GB, dom0 not"
> + config XEN_512GB_ALL
> + bool "dom0 and domUs can be larger than 512GB"
> +endchoice

So there are actually two independent limits, configured through a
choice with four entries. Would using just two separate Kconfig symbols
(XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?
Because ...

> +endif
>
> config XEN_SAVE_RESTORE
> bool

[...]

> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 84a6473..16d94de 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -32,6 +32,8 @@
> #include "p2m.h"
> #include "mmu.h"
>
> +#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
> +
> /* Amount of extra memory space we add to the e820 ranges */
> struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
>
> @@ -85,6 +87,27 @@ static struct {
> */
> #define EXTRA_MEM_RATIO (10)
>
> +static bool xen_dom0_512gb_limit __initdata =
> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);

... then this could be something like:
static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);

> +static bool xen_domu_512gb_limit __initdata =
> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> +

and this likewise:
static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);

Correct?

> +static int __init xen_parse_512gb(char *arg)
> +{
> + bool val = false;
> +
> + if (!arg)
> + val = true;
> + else if (strtobool(arg, &val))
> + return 1;
> +
> + xen_dom0_512gb_limit = val;
> + xen_domu_512gb_limit = val;
> +
> + return 0;
> +}
> +early_param("xen_512gb_limit", xen_parse_512gb);
> +
> static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
> {
> int i;

So one can configure these two limits separately, but the kernel
parameter is used for both. Any particular reason?

Thanks,


Paul Bolle

2015-02-18 09:37:54

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

On 02/18/2015 10:21 AM, Paul Bolle wrote:
> On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
>> 64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
>> main reason has been the 3 level p2m tree, which was replaced by the
>> virtual mapped linear p2m list. Parallel to the p2m list which is
>> being used by the kernel itself there is a 3 level mfn tree for usage
>> by the Xen tools and eventually for crash dump analysis. For this tree
>> the linear p2m list can serve as a replacement, too. As the kernel
>> can't know whether the tools are capable of dealing with the p2m list
>> instead of the mfn tree, the limit of 512 GB can't be dropped in all
>> cases.
>>
>> This patch replaces the hard limit by a kernel parameter which tells
>> the kernel to obey the 512 GB limit or not. The default is selected by
>> a configuration parameter which specifies whether the 512 GB limit
>> should be active per default for dom0 (only crash dump analysis is
>> affected) and/or for domUs (additionally domain save/restore/migration
>> are affected).
>>
>> Memory above the domain limit is returned to the hypervisor instead of
>> being identity mapped, which was wrong anyways.
>>
>> The kernel configuration parameter to specify the maximum size of a
>> domain can be deleted, as it is not relevant any more.
>>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> Documentation/kernel-parameters.txt | 7 ++++
>> arch/x86/include/asm/xen/page.h | 4 ---
>> arch/x86/xen/Kconfig | 31 +++++++++++-----
>> arch/x86/xen/p2m.c | 10 +++---
>> arch/x86/xen/setup.c | 72 ++++++++++++++++++++++++++++++-------
>> 5 files changed, 93 insertions(+), 31 deletions(-)
>
> [...]
>
>> --- a/arch/x86/xen/Kconfig
>> +++ b/arch/x86/xen/Kconfig
>> @@ -23,14 +23,29 @@ config XEN_PVHVM
>> def_bool y
>> depends on XEN && PCI && X86_LOCAL_APIC
>>
>> -config XEN_MAX_DOMAIN_MEMORY
>> - int
>> - default 500 if X86_64
>> - default 64 if X86_32
>> - depends on XEN
>> - help
>> - This only affects the sizing of some bss arrays, the unused
>> - portions of which are freed.
>> +if X86_64
>
> Not
> && XEN
> ?

The complete directory is made only if CONFIG_XEN is set.

>
>> +choice
>> + prompt "Support pv-domains larger than 512GB"
>> + default XEN_512GB_NONE
>> + help
>> + Support paravirtualized domains with more than 512GB of RAM.
>> +
>> + The Xen tools and crash dump analysis tools might not support
>> + pv-domains with more than 512 GB of RAM. This option controls the
>> + default setting of the kernel to use only up to 512 GB or more.
>> + It is always possible to change the default via specifying the
>> + boot parameter "xen_512gb_limit".
>> +
>> + config XEN_512GB_NONE
>> + bool "neither dom0 nor domUs can be larger than 512GB"
>> + config XEN_512GB_DOM0
>> + bool "dom0 can be larger than 512GB, domUs not"
>> + config XEN_512GB_DOMU
>> + bool "domUs can be larger than 512GB, dom0 not"
>> + config XEN_512GB_ALL
>> + bool "dom0 and domUs can be larger than 512GB"
>> +endchoice
>
> So there are actually two independent limits, configured through a
> choice with four entries. Would using just two separate Kconfig symbols
> (XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?

Yes.

> Because ...
>
>> +endif
>>
>> config XEN_SAVE_RESTORE
>> bool
>
> [...]
>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 84a6473..16d94de 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -32,6 +32,8 @@
>> #include "p2m.h"
>> #include "mmu.h"
>>
>> +#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
>> +
>> /* Amount of extra memory space we add to the e820 ranges */
>> struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
>>
>> @@ -85,6 +87,27 @@ static struct {
>> */
>> #define EXTRA_MEM_RATIO (10)
>>
>> +static bool xen_dom0_512gb_limit __initdata =
>> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
>
> ... then this could be something like:
> static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);
>
>> +static bool xen_domu_512gb_limit __initdata =
>> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
>> +
>
> and this likewise:
> static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);
>
> Correct?

Yes.

That's a matter of taste, I think.

>
>> +static int __init xen_parse_512gb(char *arg)
>> +{
>> + bool val = false;
>> +
>> + if (!arg)
>> + val = true;
>> + else if (strtobool(arg, &val))
>> + return 1;
>> +
>> + xen_dom0_512gb_limit = val;
>> + xen_domu_512gb_limit = val;
>> +
>> + return 0;
>> +}
>> +early_param("xen_512gb_limit", xen_parse_512gb);
>> +
>> static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
>> {
>> int i;
>
> So one can configure these two limits separately, but the kernel
> parameter is used for both. Any particular reason?

Yes. A kernel is running only either as Dom0 or as domU at a given time.
Having two parameters here would be nonsense, as only one could apply.

And being able to configure both limits separately does make sense,
of course.


Juergen

2015-02-18 09:49:38

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

>>> On 18.02.15 at 10:37, <[email protected]> wrote:
> On 02/18/2015 10:21 AM, Paul Bolle wrote:
>> On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -23,14 +23,29 @@ config XEN_PVHVM
>>> def_bool y
>>> depends on XEN && PCI && X86_LOCAL_APIC
>>>
>>> -config XEN_MAX_DOMAIN_MEMORY
>>> - int
>>> - default 500 if X86_64
>>> - default 64 if X86_32
>>> - depends on XEN
>>> - help
>>> - This only affects the sizing of some bss arrays, the unused
>>> - portions of which are freed.
>>> +if X86_64
>>
>> Not
>> && XEN
>> ?
>
> The complete directory is made only if CONFIG_XEN is set.

But that doesn't mean this file gets used only when XEN is enabled.
I would think though that an eventual "if XEN" should have wider
scope than just this option (i.e. likely almost the entire file).

Jan

2015-02-18 10:32:23

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 18/02/15 06:51, Juergen Gross wrote:
> The linear p2m list should be anchored in the shared info structure

I'm not really sure what you mean by "anchored".

> read by the Xen tools to be able to support 64 bit pv-domains larger
> than 512 MB. Additionally the linear p2m list interface includes a
> generation count which is changed prior to and after each mapping
> change of the p2m list. Reading the generation count the Xen tools can
> detect changes of the mappings and re-read the p2m list eventually.
[...]
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
> HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
> virt_to_mfn(p2m_top_mfn);
> HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
> + HYPERVISOR_shared_info->arch.p2m_generation = 0;
> + HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
> + HYPERVISOR_shared_info->arch.p2m_cr3 =
> + xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
> }
>
> /* Set up p2m_top to point to the domain-builder provided p2m pages */
> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
>
> ptechk = lookup_address(vaddr, &level);
> if (ptechk == pte_pg) {
> + HYPERVISOR_shared_info->arch.p2m_generation++;
> set_pmd(pmdp,
> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
> + HYPERVISOR_shared_info->arch.p2m_generation++;

Do these increments of p2m_generation need to be atomic?

David

2015-02-18 10:35:15

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

On Wed, 2015-02-18 at 10:37 +0100, Juergen Gross wrote:
> On 02/18/2015 10:21 AM, Paul Bolle wrote:
> > On Wed, 2015-02-18 at 07:52 +0100, Juergen Gross wrote:
> >> +choice
> >> + prompt "Support pv-domains larger than 512GB"
> >> + default XEN_512GB_NONE
> >> + help
> >> + Support paravirtualized domains with more than 512GB of RAM.
> >> +
> >> + The Xen tools and crash dump analysis tools might not support
> >> + pv-domains with more than 512 GB of RAM. This option controls the
> >> + default setting of the kernel to use only up to 512 GB or more.
> >> + It is always possible to change the default via specifying the
> >> + boot parameter "xen_512gb_limit".
> >> +
> >> + config XEN_512GB_NONE
> >> + bool "neither dom0 nor domUs can be larger than 512GB"
> >> + config XEN_512GB_DOM0
> >> + bool "dom0 can be larger than 512GB, domUs not"
> >> + config XEN_512GB_DOMU
> >> + bool "domUs can be larger than 512GB, dom0 not"
> >> + config XEN_512GB_ALL
> >> + bool "dom0 and domUs can be larger than 512GB"
> >> +endchoice
> >
> > So there are actually two independent limits, configured through a
> > choice with four entries. Would using just two separate Kconfig symbols
> > (XEN_512GB_DOM0 and XEN_512GB_DOMU) without a choice wrapper also work?
>
> Yes.
>
> > Because ...
> >
> >> +endif

[...]

> >> @@ -85,6 +87,27 @@ static struct {
> >> */
> >> #define EXTRA_MEM_RATIO (10)
> >>
> >> +static bool xen_dom0_512gb_limit __initdata =
> >> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOMU);
> >
> > ... then this could be something like:
> > static bool xen_dom0_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> >
> >> +static bool xen_domu_512gb_limit __initdata =
> >> + IS_ENABLED(CONFIG_XEN_512GB_NONE) || IS_ENABLED(CONFIG_XEN_512GB_DOM0);
> >> +
> >
> > and this likewise:
> > static bool xen_domu_512gb_limit __initdata = !IS_ENABLED(CONFIG_XEN_512GB_DOMU);
> >
> > Correct?
>
> Yes.
>
> That's a matter of taste, I think.

Well, my suggestion does look simpler. Anyhow, I'll be glad to let the
maintainers decide.

> >
> >> +static int __init xen_parse_512gb(char *arg)
> >> +{
> >> + bool val = false;
> >> +
> >> + if (!arg)
> >> + val = true;
> >> + else if (strtobool(arg, &val))
> >> + return 1;
> >> +
> >> + xen_dom0_512gb_limit = val;
> >> + xen_domu_512gb_limit = val;
> >> +
> >> + return 0;
> >> +}
> >> +early_param("xen_512gb_limit", xen_parse_512gb);
> >> +
> >> static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
> >> {
> >> int i;
> >
> > So one can configure these two limits separately, but the kernel
> > parameter is used for both. Any particular reason?
>
> Yes. A kernel is running only either as Dom0 or as domU at a given time.
> Having two parameters here would be nonsense, as only one could apply.

I see.

> And being able to configure both limits separately does make sense,
> of course.

Thanks,


Paul Bolle

2015-02-18 10:43:00

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 02/18/2015 11:32 AM, David Vrabel wrote:
> On 18/02/15 06:51, Juergen Gross wrote:
>> The linear p2m list should be anchored in the shared info structure
>
> I'm not really sure what you mean by "anchored".

Bad wording? What about:

The virtual address of the linear p2m list should be stored in the
shared info structure.

>
>> read by the Xen tools to be able to support 64 bit pv-domains larger
>> than 512 MB. Additionally the linear p2m list interface includes a
>> generation count which is changed prior to and after each mapping
>> change of the p2m list. Reading the generation count the Xen tools can
>> detect changes of the mappings and re-read the p2m list eventually.
> [...]
>> --- a/arch/x86/xen/p2m.c
>> +++ b/arch/x86/xen/p2m.c
>> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
>> HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
>> virt_to_mfn(p2m_top_mfn);
>> HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
>> + HYPERVISOR_shared_info->arch.p2m_generation = 0;
>> + HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
>> + HYPERVISOR_shared_info->arch.p2m_cr3 =
>> + xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
>> }
>>
>> /* Set up p2m_top to point to the domain-builder provided p2m pages */
>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
>>
>> ptechk = lookup_address(vaddr, &level);
>> if (ptechk == pte_pg) {
>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>> set_pmd(pmdp,
>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>
> Do these increments of p2m_generation need to be atomic?

Hmm, they are done under lock. I don't think the compiler is allowed to
reorder the writes to p2m_generation across set_pmd().


Juergen

2015-02-18 10:50:13

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 18/02/15 10:42, Juergen Gross wrote:
>
>>> /* Set up p2m_top to point to the domain-builder provided p2m
>>> pages */
>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>> pte_t *pte_pg)
>>>
>>> ptechk = lookup_address(vaddr, &level);
>>> if (ptechk == pte_pg) {
>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>> set_pmd(pmdp,
>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>
>> Do these increments of p2m_generation need to be atomic?
>
> Hmm, they are done under lock. I don't think the compiler is allowed to
> reorder the writes to p2m_generation across set_pmd().

They do need smp_wmb() to guarantee that the increment is visible before
the update occurs, just as the toolstack will need smp_rmb() to read.

They also need to be protected from concurrent update inside the kernel,
for which a lock should appear to suffice.

~Andrew

2015-02-18 10:51:47

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 18/02/15 10:42, Juergen Gross wrote:
> On 02/18/2015 11:32 AM, David Vrabel wrote:
>> On 18/02/15 06:51, Juergen Gross wrote:
>>> The linear p2m list should be anchored in the shared info structure
>>
>> I'm not really sure what you mean by "anchored".
>
> Bad wording? What about:
>
> The virtual address of the linear p2m list should be stored in the
> shared info structure.

This is better.

>>> read by the Xen tools to be able to support 64 bit pv-domains larger
>>> than 512 MB. Additionally the linear p2m list interface includes a
>>> generation count which is changed prior to and after each mapping
>>> change of the p2m list. Reading the generation count the Xen tools can
>>> detect changes of the mappings and re-read the p2m list eventually.
>> [...]
>>> --- a/arch/x86/xen/p2m.c
>>> +++ b/arch/x86/xen/p2m.c
>>> @@ -256,6 +256,10 @@ void xen_setup_mfn_list_list(void)
>>> HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
>>> virt_to_mfn(p2m_top_mfn);
>>> HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
>>> + HYPERVISOR_shared_info->arch.p2m_generation = 0;
>>> + HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned
>>> long)xen_p2m_addr;
>>> + HYPERVISOR_shared_info->arch.p2m_cr3 =
>>> + xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
>>> }
>>>
>>> /* Set up p2m_top to point to the domain-builder provided p2m pages */
>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>> pte_t *pte_pg)
>>>
>>> ptechk = lookup_address(vaddr, &level);
>>> if (ptechk == pte_pg) {
>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>> set_pmd(pmdp,
>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>
>> Do these increments of p2m_generation need to be atomic?
>
> Hmm, they are done under lock.

Ok, atomic isn't necessary.

> I don't think the compiler is allowed to
> reorder the writes to p2m_generation across set_pmd().

Ok. but I think you also need to prevent the processor reordering the
writes so I think some write barriers are needed here. (The toolstack
would then need the corresponding read barriers when checking the
p2m_generation.)

David

2015-02-18 10:54:11

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 18/02/15 10:50, Andrew Cooper wrote:
> On 18/02/15 10:42, Juergen Gross wrote:
>>
>>>> /* Set up p2m_top to point to the domain-builder provided p2m
>>>> pages */
>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>> pte_t *pte_pg)
>>>>
>>>> ptechk = lookup_address(vaddr, &level);
>>>> if (ptechk == pte_pg) {
>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>> set_pmd(pmdp,
>>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>
>>> Do these increments of p2m_generation need to be atomic?
>>
>> Hmm, they are done under lock. I don't think the compiler is allowed to
>> reorder the writes to p2m_generation across set_pmd().
>
> They do need smp_wmb() to guarantee that the increment is visible before
> the update occurs, just as the toolstack will need smp_rmb() to read.

smp_wmb() isn't good enough since you need the barrier even on non-smp
-- you need a wmb().

David

2015-02-18 10:56:18

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 02/18/2015 11:50 AM, Andrew Cooper wrote:
> On 18/02/15 10:42, Juergen Gross wrote:
>>
>>>> /* Set up p2m_top to point to the domain-builder provided p2m
>>>> pages */
>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>> pte_t *pte_pg)
>>>>
>>>> ptechk = lookup_address(vaddr, &level);
>>>> if (ptechk == pte_pg) {
>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>> set_pmd(pmdp,
>>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>
>>> Do these increments of p2m_generation need to be atomic?
>>
>> Hmm, they are done under lock. I don't think the compiler is allowed to
>> reorder the writes to p2m_generation across set_pmd().
>
> They do need smp_wmb() to guarantee that the increment is visible before
> the update occurs, just as the toolstack will need smp_rmb() to read.

Okay, I'll add smp_wmb() before and after calling set_pmd().


Juergen

2015-02-18 10:57:34

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 18/02/15 10:54, David Vrabel wrote:
> On 18/02/15 10:50, Andrew Cooper wrote:
>> On 18/02/15 10:42, Juergen Gross wrote:
>>>>> /* Set up p2m_top to point to the domain-builder provided p2m
>>>>> pages */
>>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>>> pte_t *pte_pg)
>>>>>
>>>>> ptechk = lookup_address(vaddr, &level);
>>>>> if (ptechk == pte_pg) {
>>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>> set_pmd(pmdp,
>>>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>> Do these increments of p2m_generation need to be atomic?
>>> Hmm, they are done under lock. I don't think the compiler is allowed to
>>> reorder the writes to p2m_generation across set_pmd().
>> They do need smp_wmb() to guarantee that the increment is visible before
>> the update occurs, just as the toolstack will need smp_rmb() to read.
> smp_wmb() isn't good enough since you need the barrier even on non-smp
> -- you need a wmb().

Ah yes. I was thinking in the wrong context for smp. In this case we
need to guarantee interdomain consistency even with a UP guest kernel.

~Andrew

2015-02-18 10:57:47

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 02/13] xen: anchor linear p2m list in shared info structure

On 02/18/2015 11:54 AM, David Vrabel wrote:
> On 18/02/15 10:50, Andrew Cooper wrote:
>> On 18/02/15 10:42, Juergen Gross wrote:
>>>
>>>>> /* Set up p2m_top to point to the domain-builder provided p2m
>>>>> pages */
>>>>> @@ -469,8 +473,10 @@ static pte_t *alloc_p2m_pmd(unsigned long addr,
>>>>> pte_t *pte_pg)
>>>>>
>>>>> ptechk = lookup_address(vaddr, &level);
>>>>> if (ptechk == pte_pg) {
>>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>> set_pmd(pmdp,
>>>>> __pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
>>>>> + HYPERVISOR_shared_info->arch.p2m_generation++;
>>>>
>>>> Do these increments of p2m_generation need to be atomic?
>>>
>>> Hmm, they are done under lock. I don't think the compiler is allowed to
>>> reorder the writes to p2m_generation across set_pmd().
>>
>> They do need smp_wmb() to guarantee that the increment is visible before
>> the update occurs, just as the toolstack will need smp_rmb() to read.
>
> smp_wmb() isn't good enough since you need the barrier even on non-smp
> -- you need a wmb().

Okay, will do.

Juergen

2015-02-18 11:19:00

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 13/13] xen: allow more than 512 GB of RAM for 64 bit pv-domains

On 18/02/15 06:52, Juergen Gross wrote:
>
> +if X86_64
> +choice
> + prompt "Support pv-domains larger than 512GB"
> + default XEN_512GB_NONE
> + help
> + Support paravirtualized domains with more than 512GB of RAM.
> +
> + The Xen tools and crash dump analysis tools might not support
> + pv-domains with more than 512 GB of RAM. This option controls the
> + default setting of the kernel to use only up to 512 GB or more.
> + It is always possible to change the default via specifying the
> + boot parameter "xen_512gb_limit".
> +
> + config XEN_512GB_NONE
> + bool "neither dom0 nor domUs can be larger than 512GB"
> + config XEN_512GB_DOM0
> + bool "dom0 can be larger than 512GB, domUs not"
> + config XEN_512GB_DOMU
> + bool "domUs can be larger than 512GB, dom0 not"
> + config XEN_512GB_ALL
> + bool "dom0 and domUs can be larger than 512GB"
> +endchoice
> +endif

This configuration option doesn't look useful to me. Can we get rid of
it with runtime checking. e.g.,

If dom0, enable >512G.
If domU, enable >512G if requested by command line option /or/ toolstack
indicates that it supports the linear p2m.

And

If max_pfn < 512G, populate 3-level p2m /unless/ toolstack indicates it
supports the linear p2m.

People using crash analysis tools that need the 3-level p2m can clamp
dom0 memory with the Xen command line option. FWIW, the tool we use
doesn't need this.

David

2015-02-19 17:27:40

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 04/13] xen: move static e820 map to global scope

On 18/02/2015 06:51, Juergen Gross wrote:
> Instead of using a function local static e820 map in xen_memory_setup()
> and calling various functions in the same source with the map as a
> parameter use a map directly accessible by all functions in the source.

Reviewed-by: David Vrabel <[email protected]>

David

2015-02-19 17:31:57

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 07/13] xen: find unused contiguous memory area

On 18/02/2015 06:52, Juergen Gross wrote:
> For being able to relocate pre-allocated data areas like initrd or
> p2m list it is mandatory to find a contiguous memory area which is
> not yet in use and doesn't conflict with the memory map we want to
> be in effect.
>
> In case such an area is found reserve it at once as this will be
> required to be done in any case.

Reviewed-by: David Vrabel <[email protected]>

David

2015-02-19 17:35:23

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 08/13] xen: add service function to copy physical memory areas

On 18/02/2015 06:52, Juergen Gross wrote:
> In case a pre-allocated memory area is to be moved in order to avoid
> a conflict with the target E820 map we need a way to copy data between
> physical addresses.
>
> Add a function doing this via early_memremap().
[...]
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
> }
>
> /*
> + * Like memcpy, but with physical addresses for dest and src.
> + */
> +void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src, phys_addr_t n)
> +{
> + phys_addr_t dest_off, src_off, dest_len, src_len, len;
> + void *from, *to;
> +
> + while (n) {
> + dest_off = dest & ~PAGE_MASK;
> + src_off = src & ~PAGE_MASK;
> + dest_len = n;
> + if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
> + dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
> + src_len = n;
> + if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
> + src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
> + len = min(dest_len, src_len);
> + to = early_memremap(dest - dest_off, dest_len + dest_off);
> + from = early_memremap(src - src_off, src_len + src_off);
> + memcpy(to, from, len);
> + early_iounmap(to, dest_len + dest_off);
> + early_iounmap(from, src_len + src_off);

early_memunmap surely?

Otherwise,

Reviewed-by: David Vrabel <[email protected]>

David

2015-02-19 17:36:23

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 09/13] xen: check for kernel memory conflicting with memory layout

On 18/02/2015 06:52, Juergen Gross wrote:
> Checks whether the pre-allocated memory of the loaded kernel is in
> conflict with the target memory map. If this is the case, just panic
> instead of run into problems later.

What ensures this doesn't actually happen?

David

2015-02-19 17:37:34

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map



On 18/02/2015 06:52, Juergen Gross wrote:
> Check whether the page tables built by the domain builder are at
> memory addresses which are in conflict with the target memory map.
> If this is the case just panic instead of running into problems
> later.

Again, what ensures this never actually happens?

David

2015-02-19 17:42:46

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 11/13] xen: move initrd away from e820 non-ram area



On 18/02/2015 06:52, Juergen Gross wrote:
> When adapting the dom0 memory layout to that of the host make sure
> the initrd isn't moved to another pfn range, as it won't be found
> there any more.
>
> The easiest way to accomplish that is by copying the initrd to an
> area which is RAM according to the E820 map.

Reviewed-by: David Vrabel <[email protected]>

David

2015-02-19 17:45:04

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping

On 18/02/2015 06:52, Juergen Gross wrote:
> With adapting the memory layout of dom0 to that of the host care must
> be taken not to remap the initial p2m list supported by the hypervisor.

"...supplied by the hypervisor" ?

> If the p2m map is detected to be in a region which is going to be
> remapped, delay the remapping of that area. Not doing so can either
> crash the system very early, or lead to clobbered data as the target
> memory area of the remap operation will no longer be reserved.

Would it be better to relocate the p2m before remapping memory? If not,
explain why in the commit message.

David

2015-02-19 18:07:38

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

On 18/02/2015 06:51, Juergen Gross wrote:
> Currently especially for dom0 guest memory with guest pfns not
> matching host areas populated with RAM are remapped to areas which
> are RAM native as well. This is done to be able to use identity
> mappings (pfn == mfn) for I/O areas.
>
> Up to now it is not checked whether the remapped memory is already
> in use. Remapping used memory will probably result in data corruption,
> as the remapped memory will no longer be reserved. Any memory
> allocation after the remap can claim that memory.
>
> Add an infrastructure to check for conflicts of reserved memory areas
> and in case of a conflict to react via an area specific function.
>
> This function has 3 options:
> - Panic
> - Handle the conflict by moving the data to another memory area.
> This is indicated by a return value other than 0.
> - Just return 0. This will delay invalidating the conflicting memory
> area to just before doing the remap. This option will be usable for
> cases only where the memory will no longer be needed when the remap
> operation will be started, e.g. for the p2m list, which is already
> copied then.
>
> When doing the remap, check for not remapping a reserved page.
>
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> arch/x86/xen/setup.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/xen/xen-ops.h | 2 +
> 2 files changed, 182 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 0dda131..a0af554 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
> static unsigned long xen_remap_pfn;
> static unsigned long xen_max_pfn;
>
> +/*
> + * Areas with memblock_reserve()d memory to be checked against final E820 map.
> + * Each area has an associated function to handle conflicts (by either
> + * removing the conflict or by just crashing with an appropriate message).
> + * The array has a fixed size as there are only few areas of interest which are
> + * well known: kernel, page tables, p2m, initrd.
> + */
> +#define XEN_N_RESERVED_AREAS 4
> +static struct {
> + phys_addr_t start;
> + phys_addr_t size;
> + int (*func)(phys_addr_t start, phys_addr_t size);
> +} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
> +
> /*
> * The maximum amount of extra memory compared to the base size. The
> * main scaling factor is the size of struct page. At extreme ratios
> @@ -365,10 +379,10 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
> unsigned long end_pfn, unsigned long *released, unsigned long *remapped)
> {
> unsigned long pfn;
> - unsigned long i = 0;
> + unsigned long i;
> unsigned long n = end_pfn - start_pfn;
>
> - while (i < n) {
> + for (i = 0; i < n; ) {
> unsigned long cur_pfn = start_pfn + i;
> unsigned long left = n - i;
> unsigned long size = left;
> @@ -411,6 +425,53 @@ static void __init xen_set_identity_and_remap_chunk(unsigned long start_pfn,
> (unsigned long)__va(pfn << PAGE_SHIFT),
> mfn_pte(pfn, PAGE_KERNEL_IO), 0);
> }
> +/* Check to be remapped memory area for conflicts with reserved areas.
> + *
> + * Skip regions known to be reserved which are handled later. For these
> + * regions we have to increase the remapped counter in order to reserve
> + * extra memory space.
> + *
> + * In case a memory page already in use is to be remapped, just BUG().
> + */
> +static void __init xen_set_identity_and_remap_chunk_chk(unsigned long start_pfn,
> + unsigned long end_pfn, unsigned long *released, unsigned long *remapped)

...remap_chunk_checked() ?

> +{
> + unsigned long pfn;
> + unsigned long area_start, area_end;
> + unsigned i;
> +
> + for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
> +
> + if (!xen_reserved_area[i].size)
> + break;
> +
> + area_start = PFN_DOWN(xen_reserved_area[i].start);
> + area_end = PFN_UP(xen_reserved_area[i].start +
> + xen_reserved_area[i].size);
> + if (area_start >= end_pfn || area_end <= start_pfn)
> + continue;
> +
> + if (area_start > start_pfn)
> + xen_set_identity_and_remap_chunk(start_pfn, area_start,
> + released, remapped);
> +
> + if (area_end < end_pfn)
> + xen_set_identity_and_remap_chunk(area_end, end_pfn,
> + released, remapped);
> +
> + *remapped += min(area_end, end_pfn) -
> + max(area_start, start_pfn);
> +
> + return;

Why not defer the whole chunk that conflicts? Or for that matter defer
all this remapping to the last minute.

David

2015-02-19 18:10:29

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables



On 18/02/2015 06:51, Juergen Gross wrote:
> xen_set_identity_and_remap() is used to prepare remapping of memory
> conflicting with the E820 map. It is tracking the pfn where to remap
> new memory via a local variable which is passed to a subfunction
> which in turn returns the new value for that variable.
>
> Additionally the targeted maximum pfn is passed as a parameter to
> sub functions.
>
> Simplify that construct by using just global variables in the
> source for that purpose. This will make things simpler when we need
> those values later, too.

I'm not convinced this actually simplifies anything.

David

2015-02-24 06:15:04

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 05/13] xen: simplify xen_set_identity_and_remap() by using global variables

On 02/19/2015 07:10 PM, David Vrabel wrote:
>
>
> On 18/02/2015 06:51, Juergen Gross wrote:
>> xen_set_identity_and_remap() is used to prepare remapping of memory
>> conflicting with the E820 map. It is tracking the pfn where to remap
>> new memory via a local variable which is passed to a subfunction
>> which in turn returns the new value for that variable.
>>
>> Additionally the targeted maximum pfn is passed as a parameter to
>> sub functions.
>>
>> Simplify that construct by using just global variables in the
>> source for that purpose. This will make things simpler when we need
>> those values later, too.
>
> I'm not convinced this actually simplifies anything.

Perhaps I should have emphasised the last sentence a bit more. I really
need the global variables when deferring the remap operation.

Juergen

2015-02-24 06:27:54

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

On 02/19/2015 07:07 PM, David Vrabel wrote:
> On 18/02/2015 06:51, Juergen Gross wrote:
>> Currently especially for dom0 guest memory with guest pfns not
>> matching host areas populated with RAM are remapped to areas which
>> are RAM native as well. This is done to be able to use identity
>> mappings (pfn == mfn) for I/O areas.
>>
>> Up to now it is not checked whether the remapped memory is already
>> in use. Remapping used memory will probably result in data corruption,
>> as the remapped memory will no longer be reserved. Any memory
>> allocation after the remap can claim that memory.
>>
>> Add an infrastructure to check for conflicts of reserved memory areas
>> and in case of a conflict to react via an area specific function.
>>
>> This function has 3 options:
>> - Panic
>> - Handle the conflict by moving the data to another memory area.
>> This is indicated by a return value other than 0.
>> - Just return 0. This will delay invalidating the conflicting memory
>> area to just before doing the remap. This option will be usable for
>> cases only where the memory will no longer be needed when the remap
>> operation will be started, e.g. for the p2m list, which is already
>> copied then.
>>
>> When doing the remap, check for not remapping a reserved page.
>>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> arch/x86/xen/setup.c | 185
>> +++++++++++++++++++++++++++++++++++++++++++++++--
>> arch/x86/xen/xen-ops.h | 2 +
>> 2 files changed, 182 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 0dda131..a0af554 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -59,6 +59,20 @@ static unsigned long xen_remap_mfn __initdata =
>> INVALID_P2M_ENTRY;
>> static unsigned long xen_remap_pfn;
>> static unsigned long xen_max_pfn;
>>
>> +/*
>> + * Areas with memblock_reserve()d memory to be checked against final
>> E820 map.
>> + * Each area has an associated function to handle conflicts (by either
>> + * removing the conflict or by just crashing with an appropriate
>> message).
>> + * The array has a fixed size as there are only few areas of interest
>> which are
>> + * well known: kernel, page tables, p2m, initrd.
>> + */
>> +#define XEN_N_RESERVED_AREAS 4
>> +static struct {
>> + phys_addr_t start;
>> + phys_addr_t size;
>> + int (*func)(phys_addr_t start, phys_addr_t size);
>> +} xen_reserved_area[XEN_N_RESERVED_AREAS] __initdata;
>> +
>> /*
>> * The maximum amount of extra memory compared to the base size. The
>> * main scaling factor is the size of struct page. At extreme ratios
>> @@ -365,10 +379,10 @@ static void __init
>> xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>> unsigned long end_pfn, unsigned long *released, unsigned long
>> *remapped)
>> {
>> unsigned long pfn;
>> - unsigned long i = 0;
>> + unsigned long i;
>> unsigned long n = end_pfn - start_pfn;
>>
>> - while (i < n) {
>> + for (i = 0; i < n; ) {
>> unsigned long cur_pfn = start_pfn + i;
>> unsigned long left = n - i;
>> unsigned long size = left;
>> @@ -411,6 +425,53 @@ static void __init
>> xen_set_identity_and_remap_chunk(unsigned long start_pfn,
>> (unsigned long)__va(pfn << PAGE_SHIFT),
>> mfn_pte(pfn, PAGE_KERNEL_IO), 0);
>> }
>> +/* Check to be remapped memory area for conflicts with reserved areas.
>> + *
>> + * Skip regions known to be reserved which are handled later. For these
>> + * regions we have to increase the remapped counter in order to reserve
>> + * extra memory space.
>> + *
>> + * In case a memory page already in use is to be remapped, just BUG().
>> + */
>> +static void __init xen_set_identity_and_remap_chunk_chk(unsigned long
>> start_pfn,
>> + unsigned long end_pfn, unsigned long *released, unsigned long
>> *remapped)
>
> ...remap_chunk_checked() ?

I just wanted to avoid the function name to be even longer. OTOH I
really don't mind to use your suggestion. :-)

>
>> +{
>> + unsigned long pfn;
>> + unsigned long area_start, area_end;
>> + unsigned i;
>> +
>> + for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>> +
>> + if (!xen_reserved_area[i].size)
>> + break;
>> +
>> + area_start = PFN_DOWN(xen_reserved_area[i].start);
>> + area_end = PFN_UP(xen_reserved_area[i].start +
>> + xen_reserved_area[i].size);
>> + if (area_start >= end_pfn || area_end <= start_pfn)
>> + continue;
>> +
>> + if (area_start > start_pfn)
>> + xen_set_identity_and_remap_chunk(start_pfn, area_start,
>> + released, remapped);
>> +
>> + if (area_end < end_pfn)
>> + xen_set_identity_and_remap_chunk(area_end, end_pfn,
>> + released, remapped);
>> +
>> + *remapped += min(area_end, end_pfn) -
>> + max(area_start, start_pfn);
>> +
>> + return;
>
> Why not defer the whole chunk that conflicts? Or for that matter defer
> all this remapping to the last minute.

There are two problems arising from this:

- In the initrd case remapping would be deferred too long: the initrd
data is still in use when device initialization is running. And we
really want the remap to have happened before PCI space is being used.

- Delaying all remapping to the point where the new p2m list is in place
would either result in a p2m list with all memory holes covered with
individual entries as the new list is built with those holes still
populated, or we would have to perform the check for being able to use
the pre-built "invalid" or "identity" p2m page in set_phys_to_machine.
The first option could easily waste significant amounts of memory (on
my test machine with 1TB RAM this would have been about 1GB), while
the second option would be performance critical.


Juergen

2015-02-24 06:34:25

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 08/13] xen: add service function to copy physical memory areas

On 02/19/2015 06:35 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> In case a pre-allocated memory area is to be moved in order to avoid
>> a conflict with the target E820 map we need a way to copy data between
>> physical addresses.
>>
>> Add a function doing this via early_memremap().
> [...]
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -766,6 +766,35 @@ phys_addr_t __init xen_find_free_area(phys_addr_t
>> size)
>> }
>>
>> /*
>> + * Like memcpy, but with physical addresses for dest and src.
>> + */
>> +void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src,
>> phys_addr_t n)
>> +{
>> + phys_addr_t dest_off, src_off, dest_len, src_len, len;
>> + void *from, *to;
>> +
>> + while (n) {
>> + dest_off = dest & ~PAGE_MASK;
>> + src_off = src & ~PAGE_MASK;
>> + dest_len = n;
>> + if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
>> + dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
>> + src_len = n;
>> + if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
>> + src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
>> + len = min(dest_len, src_len);
>> + to = early_memremap(dest - dest_off, dest_len + dest_off);
>> + from = early_memremap(src - src_off, src_len + src_off);
>> + memcpy(to, from, len);
>> + early_iounmap(to, dest_len + dest_off);
>> + early_iounmap(from, src_len + src_off);
>
> early_memunmap surely?

Hmm, yes, sure. I'll update the patch and send another one for
correcting the code where I took the usage from (relocate_initrd).

Juergen

>
> Otherwise,
>
> Reviewed-by: David Vrabel <[email protected]>
>
> David
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2015-02-24 06:45:14

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 09/13] xen: check for kernel memory conflicting with memory layout

On 02/19/2015 06:36 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> Checks whether the pre-allocated memory of the loaded kernel is in
>> conflict with the target memory map. If this is the case, just panic
>> instead of run into problems later.
>
> What ensures this doesn't actually happen?

Nothing.

We have basically three options here:

- Die early (my patch).
- Issue a warning that the critical situation has been detected and hope
for the best by not doing the remap, but probably run into strange
situations when the discrepancy between E820 map and memory usage is
becoming problematic.
- Do the remap like today and die eventually when the relocated page is
used for p2m/m2p translations (no real option).

Juergen

2015-02-24 06:45:42

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 10/13] xen: check pre-allocated page tables for conflict with memory map

On 02/19/2015 06:37 PM, David Vrabel wrote:
>
>
> On 18/02/2015 06:52, Juergen Gross wrote:
>> Check whether the page tables built by the domain builder are at
>> memory addresses which are in conflict with the target memory map.
>> If this is the case just panic instead of running into problems
>> later.
>
> Again, what ensures this never actually happens?

Same answer as before: nothing.

Juergen

2015-02-24 07:01:46

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 12/13] xen: if p2m list located in to be remapped region delay remapping

On 02/19/2015 06:44 PM, David Vrabel wrote:
> On 18/02/2015 06:52, Juergen Gross wrote:
>> With adapting the memory layout of dom0 to that of the host care must
>> be taken not to remap the initial p2m list supported by the hypervisor.
>
> "...supplied by the hypervisor" ?

Yes, of course.

>
>> If the p2m map is detected to be in a region which is going to be
>> remapped, delay the remapping of that area. Not doing so can either
>> crash the system very early, or lead to clobbered data as the target
>> memory area of the remap operation will no longer be reserved.
>
> Would it be better to relocate the p2m before remapping memory? If not,
> explain why in the commit message.

Okay, will do.


Juergen

2015-02-25 14:25:48

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

On 24/02/15 06:27, Juergen Gross wrote:
> On 02/19/2015 07:07 PM, David Vrabel wrote:
>> On 18/02/2015 06:51, Juergen Gross wrote:
>>> +{
>>> + unsigned long pfn;
>>> + unsigned long area_start, area_end;
>>> + unsigned i;
>>> +
>>> + for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>>> +
>>> + if (!xen_reserved_area[i].size)
>>> + break;
>>> +
>>> + area_start = PFN_DOWN(xen_reserved_area[i].start);
>>> + area_end = PFN_UP(xen_reserved_area[i].start +
>>> + xen_reserved_area[i].size);
>>> + if (area_start >= end_pfn || area_end <= start_pfn)
>>> + continue;
>>> +
>>> + if (area_start > start_pfn)
>>> + xen_set_identity_and_remap_chunk(start_pfn, area_start,
>>> + released, remapped);
>>> +
>>> + if (area_end < end_pfn)
>>> + xen_set_identity_and_remap_chunk(area_end, end_pfn,
>>> + released, remapped);
>>> +
>>> + *remapped += min(area_end, end_pfn) -
>>> + max(area_start, start_pfn);
>>> +
>>> + return;
>>
>> Why not defer the whole chunk that conflicts? Or for that matter defer
>> all this remapping to the last minute.
>
> There are two problems arising from this:
>
> - In the initrd case remapping would be deferred too long: the initrd
> data is still in use when device initialization is running. And we
> really want the remap to have happened before PCI space is being used.

I'm not sure I understand what you're saying here.

I'm suggesting:

1. Reserve all holes.

2. Relocate (if necessary) all modules (initrd, etc.) to regions that
are RAM in the e820.

3. Rebuild the p2m in RAM.

4. Relocate frames from E820 holes/reserved to the end, free p2m pages
from the holes and replacing them with the read-only 1:1 page (where
possible).

> - Delaying all remapping to the point where the new p2m list is in place
> would either result in a p2m list with all memory holes covered with
> individual entries as the new list is built with those holes still
> populated, ...
> The first option could easily waste significant amounts of memory (on
> my test machine with 1TB RAM this would have been about 1GB), while
> the second option would be performance critical.

I don't understand how this wastes memory. When you relocate the
frames from the holes you can reclaim the p2m pages for the holes (and
replace them with the r/o mapped identity p2m page).

David

2015-02-25 16:00:11

by Juergen Gross

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 06/13] xen: detect pre-allocated memory interfering with e820 map

On 02/25/2015 03:24 PM, David Vrabel wrote:
> On 24/02/15 06:27, Juergen Gross wrote:
>> On 02/19/2015 07:07 PM, David Vrabel wrote:
>>> On 18/02/2015 06:51, Juergen Gross wrote:
>>>> +{
>>>> + unsigned long pfn;
>>>> + unsigned long area_start, area_end;
>>>> + unsigned i;
>>>> +
>>>> + for (i = 0; i < XEN_N_RESERVED_AREAS; i++) {
>>>> +
>>>> + if (!xen_reserved_area[i].size)
>>>> + break;
>>>> +
>>>> + area_start = PFN_DOWN(xen_reserved_area[i].start);
>>>> + area_end = PFN_UP(xen_reserved_area[i].start +
>>>> + xen_reserved_area[i].size);
>>>> + if (area_start >= end_pfn || area_end <= start_pfn)
>>>> + continue;
>>>> +
>>>> + if (area_start > start_pfn)
>>>> + xen_set_identity_and_remap_chunk(start_pfn, area_start,
>>>> + released, remapped);
>>>> +
>>>> + if (area_end < end_pfn)
>>>> + xen_set_identity_and_remap_chunk(area_end, end_pfn,
>>>> + released, remapped);
>>>> +
>>>> + *remapped += min(area_end, end_pfn) -
>>>> + max(area_start, start_pfn);
>>>> +
>>>> + return;
>>>
>>> Why not defer the whole chunk that conflicts? Or for that matter defer
>>> all this remapping to the last minute.
>>
>> There are two problems arising from this:
>>
>> - In the initrd case remapping would be deferred too long: the initrd
>> data is still in use when device initialization is running. And we
>> really want the remap to have happened before PCI space is being used.
>
> I'm not sure I understand what you're saying here.

I thought you wanted to defer the remapping to the point where the
initrd memory is no longer being used. But the suggestion below is
more clear.

>
> I'm suggesting:
>
> 1. Reserve all holes.
>
> 2. Relocate (if necessary) all modules (initrd, etc.) to regions that
> are RAM in the e820.
>
> 3. Rebuild the p2m in RAM.
>
> 4. Relocate frames from E820 holes/reserved to the end, free p2m pages
> from the holes and replacing them with the read-only 1:1 page (where
> possible).
>
>> - Delaying all remapping to the point where the new p2m list is in place
>> would either result in a p2m list with all memory holes covered with
>> individual entries as the new list is built with those holes still
>> populated, ...
>> The first option could easily waste significant amounts of memory (on
>> my test machine with 1TB RAM this would have been about 1GB), while
>> the second option would be performance critical.
>
> I don't understand how this wastes memory. When you relocate the
> frames from the holes you can reclaim the p2m pages for the holes (and
> replace them with the r/o mapped identity p2m page).

Okay, this would work, I guess.

I'll have a try with some new patches...


Juergen