2015-06-08 12:07:21

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 00/16] xen: support pv-domains larger than 512GB

Support 64 bit pv-domains with more than 512GB of memory.

Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
8GB machine. Conflicts between E820 map and different hypervisor populated
memory areas have been tested via a fake E820 map reserved area on the
8GB machine.

Changes in V4:
- new patch 13 (add explicit memblock_reserve() calls for special pages)

Changes in V3:
- rename xen_chk_e820_reserved() to xen_is_e820_reserved() as requested by
David Vrabel
- add __initdata tag to global variables in patch 10
- move initrd conflict checking after reserving p2m memory (patch 11)

Changes in V2:
- some clarifications and better explanations in commit messages
- add header changes of include/xen/interface/xen.h (patch 01)
- add wmb() when incrementing p2m_generation (patch 02)
- add new patch 03 (don't build mfn tree if tools don't need it)
- add new patch 06 (split counting of extra memory pages from remapping)
- add new patch 07 (check memory area against e820 map)
- replace early_iounmap() with early_memunmap() (patch 07->patch 08)
- rework patch 09 (check for kernel memory conflicting with memory layout)
- rework patch 10 (check pre-allocated page tables for conflict with memory map)
- combine old patches 08 and 11 into patch 11
- add new patch 12 (provide early_memremap_ro to establish read-only mapping)
- rework old patch 12 (if p2m list located in to be remapped region delay
remapping) to copy p2m list in case of a conflict (now patch 13)
- correct Kconfig dependency (patch 13->14)
- don't limit dom0 to 512GB (patch 13->14)
- modify parameter parsing to work in very early boot (patch 13->14)
- add new patch 15 to do some cleanup
- remove old patch 05 (simplify xen_set_identity_and_remap() by using global
variables)
- remove old patch 08 (detect pre-allocated memory interfering with e820 map)


Juergen Gross (16):
xen: sync with xen headers
xen: save linear p2m list address in shared info structure
xen: don't build mfn tree if tools don't need it
xen: eliminate scalability issues from initial mapping setup
xen: move static e820 map to global scope
xen: split counting of extra memory pages from remapping
xen: check memory area against e820 map
xen: find unused contiguous memory area
xen: check for kernel memory conflicting with memory layout
xen: check pre-allocated page tables for conflict with memory map
xen: check for initrd conflicting with e820 map
mm: provide early_memremap_ro to establish read-only mapping
xen: add explicit memblock_reserve() calls for special pages
xen: move p2m list if conflicting with e820 map
xen: allow more than 512 GB of RAM for 64 bit pv-domains
xen: remove no longer needed p2m.h

Documentation/kernel-parameters.txt | 7 +
arch/x86/include/asm/xen/interface.h | 96 +++++++-
arch/x86/include/asm/xen/page.h | 8 +-
arch/x86/xen/Kconfig | 20 +-
arch/x86/xen/enlighten.c | 1 +
arch/x86/xen/mmu.c | 376 +++++++++++++++++++++++++++++--
arch/x86/xen/p2m.c | 43 +++-
arch/x86/xen/p2m.h | 15 --
arch/x86/xen/setup.c | 414 ++++++++++++++++++++++++++---------
arch/x86/xen/xen-head.S | 2 +
arch/x86/xen/xen-ops.h | 7 +
include/asm-generic/early_ioremap.h | 2 +
include/asm-generic/fixmap.h | 3 +
include/xen/interface/xen.h | 10 +-
mm/early_ioremap.c | 11 +
15 files changed, 833 insertions(+), 182 deletions(-)
delete mode 100644 arch/x86/xen/p2m.h

--
2.1.4


2015-06-08 12:07:16

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 01/16] xen: sync with xen headers

Use the newest headers from the xen tree to get some new structure
layouts.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/include/asm/xen/interface.h | 96 ++++++++++++++++++++++++++++++++----
include/xen/interface/xen.h | 10 ++--
2 files changed, 93 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index 3400dba..3b88eea 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -3,12 +3,38 @@
*
* Guest OS interface to x86 Xen.
*
- * Copyright (c) 2004, K A Fraser
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
*/

#ifndef _ASM_X86_XEN_INTERFACE_H
#define _ASM_X86_XEN_INTERFACE_H

+/*
+ * XEN_GUEST_HANDLE represents a guest pointer, when passed as a field
+ * in a struct in memory.
+ * XEN_GUEST_HANDLE_PARAM represent a guest pointer, when passed as an
+ * hypercall argument.
+ * XEN_GUEST_HANDLE_PARAM and XEN_GUEST_HANDLE are the same on X86 but
+ * they might not be on other architectures.
+ */
#ifdef __XEN__
#define __DEFINE_GUEST_HANDLE(name, type) \
typedef struct { type *p; } __guest_handle_ ## name
@@ -88,13 +114,16 @@ DEFINE_GUEST_HANDLE(xen_ulong_t);
* start of the GDT because some stupid OSes export hard-coded selector values
* in their ABI. These hard-coded values are always near the start of the GDT,
* so Xen places itself out of the way, at the far end of the GDT.
+ *
+ * NB The LDT is set using the MMUEXT_SET_LDT op of HYPERVISOR_mmuext_op
*/
#define FIRST_RESERVED_GDT_PAGE 14
#define FIRST_RESERVED_GDT_BYTE (FIRST_RESERVED_GDT_PAGE * 4096)
#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)

/*
- * Send an array of these to HYPERVISOR_set_trap_table()
+ * Send an array of these to HYPERVISOR_set_trap_table().
+ * Terminate the array with a sentinel entry, with traps[].address==0.
* The privilege level specifies which modes may enter a trap via a software
* interrupt. On x86/64, since rings 1 and 2 are unavailable, we allocate
* privilege levels as follows:
@@ -118,10 +147,41 @@ struct trap_info {
DEFINE_GUEST_HANDLE_STRUCT(trap_info);

struct arch_shared_info {
- unsigned long max_pfn; /* max pfn that appears in table */
- /* Frame containing list of mfns containing list of mfns containing p2m. */
- unsigned long pfn_to_mfn_frame_list_list;
- unsigned long nmi_reason;
+ /*
+ * Number of valid entries in the p2m table(s) anchored at
+ * pfn_to_mfn_frame_list_list and/or p2m_vaddr.
+ */
+ unsigned long max_pfn;
+ /*
+ * Frame containing list of mfns containing list of mfns containing p2m.
+ * A value of 0 indicates it has not yet been set up, ~0 indicates it
+ * has been set to invalid e.g. due to the p2m being too large for the
+ * 3-level p2m tree. In this case the linear mapper p2m list anchored
+ * at p2m_vaddr is to be used.
+ */
+ xen_pfn_t pfn_to_mfn_frame_list_list;
+ unsigned long nmi_reason;
+ /*
+ * Following three fields are valid if p2m_cr3 contains a value
+ * different from 0.
+ * p2m_cr3 is the root of the address space where p2m_vaddr is valid.
+ * p2m_cr3 is in the same format as a cr3 value in the vcpu register
+ * state and holds the folded machine frame number (via xen_pfn_to_cr3)
+ * of a L3 or L4 page table.
+ * p2m_vaddr holds the virtual address of the linear p2m list. All
+ * entries in the range [0...max_pfn[ are accessible via this pointer.
+ * p2m_generation will be incremented by the guest before and after each
+ * change of the mappings of the p2m list. p2m_generation starts at 0
+ * and a value with the least significant bit set indicates that a
+ * mapping update is in progress. This allows guest external software
+ * (e.g. in Dom0) to verify that read mappings are consistent and
+ * whether they have changed since the last check.
+ * Modifying a p2m element in the linear p2m list is allowed via an
+ * atomic write only.
+ */
+ unsigned long p2m_cr3; /* cr3 value of the p2m address space */
+ unsigned long p2m_vaddr; /* virtual address of the p2m list */
+ unsigned long p2m_generation; /* generation count of p2m mapping */
};
#endif /* !__ASSEMBLY__ */

@@ -137,13 +197,31 @@ struct arch_shared_info {
/*
* The following is all CPU context. Note that the fpu_ctxt block is filled
* in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ *
+ * Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
+ * for HVM and PVH guests, not all information in this structure is updated:
+ *
+ * - For HVM guests, the structures read include: fpu_ctxt (if
+ * VGCT_I387_VALID is set), flags, user_regs, debugreg[*]
+ *
+ * - PVH guests are the same as HVM guests, but additionally use ctrlreg[3] to
+ * set cr3. All other fields not used should be set to 0.
*/
struct vcpu_guest_context {
/* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
struct { char x[512]; } fpu_ctxt; /* User-level FPU registers */
-#define VGCF_I387_VALID (1<<0)
-#define VGCF_HVM_GUEST (1<<1)
-#define VGCF_IN_KERNEL (1<<2)
+#define VGCF_I387_VALID (1<<0)
+#define VGCF_IN_KERNEL (1<<2)
+#define _VGCF_i387_valid 0
+#define VGCF_i387_valid (1<<_VGCF_i387_valid)
+#define _VGCF_in_kernel 2
+#define VGCF_in_kernel (1<<_VGCF_in_kernel)
+#define _VGCF_failsafe_disables_events 3
+#define VGCF_failsafe_disables_events (1<<_VGCF_failsafe_disables_events)
+#define _VGCF_syscall_disables_events 4
+#define VGCF_syscall_disables_events (1<<_VGCF_syscall_disables_events)
+#define _VGCF_online 5
+#define VGCF_online (1<<_VGCF_online)
unsigned long flags; /* VGCF_* flags */
struct cpu_user_regs user_regs; /* User-level CPU registers */
struct trap_info trap_ctxt[256]; /* Virtual IDT */
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index a483789..da16a73 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -641,10 +641,12 @@ struct start_info {
};

/* These flags are passed in the 'flags' field of start_info_t. */
-#define SIF_PRIVILEGED (1<<0) /* Is the domain privileged? */
-#define SIF_INITDOMAIN (1<<1) /* Is this the initial control domain? */
-#define SIF_MULTIBOOT_MOD (1<<2) /* Is mod_start a multiboot module? */
-#define SIF_MOD_START_PFN (1<<3) /* Is mod_start a PFN? */
+#define SIF_PRIVILEGED (1<<0) /* Is the domain privileged? */
+#define SIF_INITDOMAIN (1<<1) /* Is this the initial control domain? */
+#define SIF_MULTIBOOT_MOD (1<<2) /* Is mod_start a multiboot module? */
+#define SIF_MOD_START_PFN (1<<3) /* Is mod_start a PFN? */
+#define SIF_VIRT_P2M_4TOOLS (1<<4) /* Do Xen tools understand a virt. mapped */
+ /* P->M making the 3 level tree obsolete? */
#define SIF_PM_MASK (0xFF<<8) /* reserve 1 byte for xen-pm options */

/*
--
2.1.4

2015-06-08 12:07:10

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 02/16] xen: save linear p2m list address in shared info structure

The virtual address of the linear p2m list should be stored in the
shared info structure read by the Xen tools to be able to support
64 bit pv-domains larger than 512 GB. Additionally the linear p2m
list interface includes a generation count which is changed prior
to and after each mapping change of the p2m list. Reading the
generation count the Xen tools can detect changes of the mappings
and re-read the p2m list eventually.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/p2m.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index b47124d..703f803 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -262,6 +262,10 @@ void xen_setup_mfn_list_list(void)
HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
virt_to_mfn(p2m_top_mfn);
HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
+ HYPERVISOR_shared_info->arch.p2m_generation = 0;
+ HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
+ HYPERVISOR_shared_info->arch.p2m_cr3 =
+ xen_pfn_to_cr3(virt_to_mfn(swapper_pg_dir));
}

/* Set up p2m_top to point to the domain-builder provided p2m pages */
@@ -477,8 +481,12 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)

ptechk = lookup_address(vaddr, &level);
if (ptechk == pte_pg) {
+ HYPERVISOR_shared_info->arch.p2m_generation++;
+ wmb(); /* Tools are synchronizing via p2m_generation. */
set_pmd(pmdp,
__pmd(__pa(pte_newpg[i]) | _KERNPG_TABLE));
+ wmb(); /* Tools are synchronizing via p2m_generation. */
+ HYPERVISOR_shared_info->arch.p2m_generation++;
pte_newpg[i] = NULL;
}

@@ -576,8 +584,12 @@ static bool alloc_p2m(unsigned long pfn)
spin_lock_irqsave(&p2m_update_lock, flags);

if (pte_pfn(*ptep) == p2m_pfn) {
+ HYPERVISOR_shared_info->arch.p2m_generation++;
+ wmb(); /* Tools are synchronizing via p2m_generation. */
set_pte(ptep,
pfn_pte(PFN_DOWN(__pa(p2m)), PAGE_KERNEL));
+ wmb(); /* Tools are synchronizing via p2m_generation. */
+ HYPERVISOR_shared_info->arch.p2m_generation++;
if (mid_mfn)
mid_mfn[mididx] = virt_to_mfn(p2m);
p2m = NULL;
@@ -629,6 +641,11 @@ bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn)
return true;
}

+ /*
+ * The interface requires atomic updates on p2m elements.
+ * xen_safe_write_ulong() is using __put_user which does an atomic
+ * store via asm().
+ */
if (likely(!xen_safe_write_ulong(xen_p2m_addr + pfn, mfn)))
return true;

--
2.1.4

2015-06-08 12:10:18

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 03/16] xen: don't build mfn tree if tools don't need it

In case the Xen tools indicate they don't need the p2m 3 level tree
as they support the virtual mapped linear p2m list, just omit building
the tree.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/p2m.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 703f803..6f80cd3 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -198,7 +198,8 @@ void __ref xen_build_mfn_list_list(void)
unsigned int level, topidx, mididx;
unsigned long *mid_mfn_p;

- if (xen_feature(XENFEAT_auto_translated_physmap))
+ if (xen_feature(XENFEAT_auto_translated_physmap) ||
+ xen_start_info->flags & SIF_VIRT_P2M_4TOOLS)
return;

/* Pre-initialize p2m_top_mfn to be completely missing */
@@ -259,8 +260,11 @@ void xen_setup_mfn_list_list(void)

BUG_ON(HYPERVISOR_shared_info == &xen_dummy_shared_info);

- HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
- virt_to_mfn(p2m_top_mfn);
+ if (xen_start_info->flags & SIF_VIRT_P2M_4TOOLS)
+ HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list = ~0UL;
+ else
+ HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list =
+ virt_to_mfn(p2m_top_mfn);
HYPERVISOR_shared_info->arch.max_pfn = xen_max_p2m_pfn;
HYPERVISOR_shared_info->arch.p2m_generation = 0;
HYPERVISOR_shared_info->arch.p2m_vaddr = (unsigned long)xen_p2m_addr;
--
2.1.4

2015-06-08 12:10:08

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 04/16] xen: eliminate scalability issues from initial mapping setup

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and it's initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB memory.

Signed-off-by: Juergen Gross <[email protected]>
Based-on-patch-by: Jan Beulich <[email protected]>
---
arch/x86/xen/mmu.c | 126 ++++++++++++++++++++++++++++++++++++++++++++----
arch/x86/xen/setup.c | 67 ++++++++++++++-----------
arch/x86/xen/xen-head.S | 2 +
3 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dd151b2..c04e14e 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1114,6 +1114,77 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
xen_mc_flush();
}

+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+ void *vaddr = __va(paddr);
+ void *vaddr_end = vaddr + size;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+ make_lowmem_page_readwrite(vaddr);
+
+ memblock_free(paddr, size);
+}
+
+static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+ unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+ ClearPagePinned(virt_to_page(__va(pa)));
+ xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+ unsigned long va = vaddr & PMD_MASK;
+ unsigned long pa;
+ pgd_t *pgd = pgd_offset_k(va);
+ pud_t *pud_page = pud_offset(pgd, 0);
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned int i;
+
+ set_pgd(pgd, __pgd(0));
+ do {
+ pud = pud_page + pud_index(va);
+ if (pud_none(*pud)) {
+ va += PUD_SIZE;
+ } else if (pud_large(*pud)) {
+ pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PUD_SIZE);
+ va += PUD_SIZE;
+ } else {
+ pmd = pmd_offset(pud, va);
+ if (pmd_large(*pmd)) {
+ pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PMD_SIZE);
+ } else if (!pmd_none(*pmd)) {
+ pte = pte_offset_kernel(pmd, va);
+ for (i = 0; i < PTRS_PER_PTE; ++i) {
+ if (pte_none(pte[i]))
+ break;
+ pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+ xen_cleanmfnmap_free_pgtbl(pte);
+ }
+ va += PMD_SIZE;
+ if (pmd_index(va))
+ continue;
+ xen_cleanmfnmap_free_pgtbl(pmd);
+ }
+
+ } while (pud_index(va) || pmd_index(va));
+ xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
static void __init xen_pagetable_p2m_free(void)
{
unsigned long size;
@@ -1128,18 +1199,25 @@ static void __init xen_pagetable_p2m_free(void)
/* using __ka address and sticking INVALID_P2M_ENTRY! */
memset((void *)xen_start_info->mfn_list, 0xff, size);

- /* We should be in __ka space. */
- BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
addr = xen_start_info->mfn_list;
- /* We roundup to the PMD, which means that if anybody at this stage is
- * using the __ka address of xen_start_info or xen_start_info->shared_info
- * they are in going to crash. Fortunatly we have already revectored
- * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+ /*
+ * We could be in __ka space.
+ * We roundup to the PMD, which means that if anybody at this stage is
+ * using the __ka address of xen_start_info or
+ * xen_start_info->shared_info they are in going to crash. Fortunatly
+ * we have already revectored in xen_setup_kernel_pagetable and in
+ * xen_setup_shared_info.
+ */
size = roundup(size, PMD_SIZE);
- xen_cleanhighmap(addr, addr + size);

- size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
- memblock_free(__pa(xen_start_info->mfn_list), size);
+ if (addr >= __START_KERNEL_map) {
+ xen_cleanhighmap(addr, addr + size);
+ size = PAGE_ALIGN(xen_start_info->nr_pages *
+ sizeof(unsigned long));
+ memblock_free(__pa(addr), size);
+ } else {
+ xen_cleanmfnmap(addr);
+ }

/* At this stage, cleanup_highmap has already cleaned __ka space
* from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1461,6 +1539,24 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
#else /* CONFIG_X86_64 */
static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
{
+ unsigned long pfn;
+
+ if (xen_feature(XENFEAT_writable_page_tables) ||
+ xen_feature(XENFEAT_auto_translated_physmap) ||
+ xen_start_info->mfn_list >= __START_KERNEL_map)
+ return pte;
+
+ /*
+ * Pages belonging to the initial p2m list mapped outside the default
+ * address range must be mapped read-only. This region contains the
+ * page tables for mapping the p2m list, too, and page tables MUST be
+ * mapped read-only.
+ */
+ pfn = pte_pfn(pte);
+ if (pfn >= xen_start_info->first_p2m_pfn &&
+ pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+ pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
return pte;
}
#endif /* CONFIG_X86_64 */
@@ -1815,7 +1911,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* mappings. Considering that on Xen after the kernel mappings we
* have the mappings of some pages that don't exist in pfn space, we
* set max_pfn_mapped to the last real pfn mapped. */
- max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+ if (xen_start_info->mfn_list < __START_KERNEL_map)
+ max_pfn_mapped = xen_start_info->first_p2m_pfn;
+ else
+ max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));

pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1855,6 +1954,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Graft it onto L4[511][510] */
copy_page(level2_kernel_pgt, l2);

+ /* Copy the initial P->M table mappings if necessary. */
+ i = pgd_index(xen_start_info->mfn_list);
+ if (i && i < pgd_index(__START_KERNEL_map))
+ init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1895,6 +1999,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)

/* Our (by three pages) smaller Xen pagetable that we are using */
memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+ /* protect xen_start_info */
+ memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
}
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 55f388e..adad417 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -560,6 +560,41 @@ static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
}
}

+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * in the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * neccessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so lets just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+ if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+ memblock_reserve(__pa(xen_start_info->mfn_list),
+ xen_start_info->pt_base -
+ xen_start_info->mfn_list);
+ return;
+ }
+
+ memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+ PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
/**
* machine_specific_memory_setup - Hook for machine specific memory setup.
**/
@@ -684,35 +719,10 @@ char * __init xen_memory_setup(void)
e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
E820_RESERVED);

- /*
- * Reserve Xen bits:
- * - mfn_list
- * - xen_start_info
- * See comment above "struct start_info" in <xen/interface/xen.h>
- * We tried to make the the memblock_reserve more selective so
- * that it would be clear what region is reserved. Sadly we ran
- * in the problem wherein on a 64-bit hypervisor with a 32-bit
- * initial domain, the pt_base has the cr3 value which is not
- * neccessarily where the pagetable starts! As Jan put it: "
- * Actually, the adjustment turns out to be correct: The page
- * tables for a 32-on-64 dom0 get allocated in the order "first L1",
- * "first L2", "first L3", so the offset to the page table base is
- * indeed 2. When reading xen/include/public/xen.h's comment
- * very strictly, this is not a violation (since there nothing is said
- * that the first thing in the page table space is pointed to by
- * pt_base; I admit that this seems to be implied though, namely
- * do I think that it is implied that the page table space is the
- * range [pt_base, pt_base + nt_pt_frames), whereas that
- * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
- * which - without a priori knowledge - the kernel would have
- * difficulty to figure out)." - so lets just fall back to the
- * easy way and reserve the whole region.
- */
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
-
sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);

+ xen_reserve_xen_mfnlist();
+
return "Xen";
}

@@ -739,8 +749,7 @@ char * __init xen_auto_xlated_memory_setup(void)
for (i = 0; i < memmap.nr_entries; i++)
e820_add_region(map[i].addr, map[i].size, map[i].type);

- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
+ xen_reserve_xen_mfnlist();

return "Xen";
}
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 8afdfcc..b65f59a 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -104,6 +104,8 @@ ENTRY(hypercall_page)
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __PAGE_OFFSET)
#else
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __START_KERNEL_map)
+ /* Map the p2m table to a 512GB-aligned user address. */
+ ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M, .quad PGDIR_SIZE)
#endif
ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, _ASM_PTR startup_xen)
ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
--
2.1.4

2015-06-08 12:10:01

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 05/16] xen: move static e820 map to global scope

Instead of using a function local static e820 map in xen_memory_setup()
and calling various functions in the same source with the map as a
parameter use a map directly accessible by all functions in the source.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/setup.c | 96 +++++++++++++++++++++++++++-------------------------
1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index adad417..ab6c36e 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -38,6 +38,10 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
/* Number of pages released from the initial allocation. */
unsigned long xen_released_pages;

+/* E820 map used during setting up memory. */
+static struct e820entry xen_e820_map[E820MAX] __initdata;
+static u32 xen_e820_map_entries __initdata;
+
/*
* Buffer used to remap identity mapped pages. We only need the virtual space.
* The physical page behind this address is remapped as needed to different
@@ -164,15 +168,13 @@ void __init xen_inv_extra_mem(void)
* This function updates min_pfn with the pfn found and returns
* the size of that range or zero if not found.
*/
-static unsigned long __init xen_find_pfn_range(
- const struct e820entry *list, size_t map_size,
- unsigned long *min_pfn)
+static unsigned long __init xen_find_pfn_range(unsigned long *min_pfn)
{
- const struct e820entry *entry;
+ const struct e820entry *entry = xen_e820_map;
unsigned int i;
unsigned long done = 0;

- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
unsigned long s_pfn;
unsigned long e_pfn;

@@ -356,9 +358,9 @@ static void __init xen_do_set_identity_and_remap_chunk(
* to Xen and not remapped.
*/
static unsigned long __init xen_set_identity_and_remap_chunk(
- const struct e820entry *list, size_t map_size, unsigned long start_pfn,
- unsigned long end_pfn, unsigned long nr_pages, unsigned long remap_pfn,
- unsigned long *released, unsigned long *remapped)
+ unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
+ unsigned long remap_pfn, unsigned long *released,
+ unsigned long *remapped)
{
unsigned long pfn;
unsigned long i = 0;
@@ -379,8 +381,7 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
if (cur_pfn + size > nr_pages)
size = nr_pages - cur_pfn;

- remap_range_size = xen_find_pfn_range(list, map_size,
- &remap_pfn);
+ remap_range_size = xen_find_pfn_range(&remap_pfn);
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
@@ -411,13 +412,12 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
return remap_pfn;
}

-static void __init xen_set_identity_and_remap(
- const struct e820entry *list, size_t map_size, unsigned long nr_pages,
- unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long nr_pages,
+ unsigned long *released, unsigned long *remapped)
{
phys_addr_t start = 0;
unsigned long last_pfn = nr_pages;
- const struct e820entry *entry;
+ const struct e820entry *entry = xen_e820_map;
unsigned long num_released = 0;
unsigned long num_remapped = 0;
int i;
@@ -433,9 +433,9 @@ static void __init xen_set_identity_and_remap(
* example) the DMI tables in a reserved region that begins on
* a non-page boundary.
*/
- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
phys_addr_t end = entry->addr + entry->size;
- if (entry->type == E820_RAM || i == map_size - 1) {
+ if (entry->type == E820_RAM || i == xen_e820_map_entries - 1) {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

@@ -444,9 +444,9 @@ static void __init xen_set_identity_and_remap(

if (start_pfn < end_pfn)
last_pfn = xen_set_identity_and_remap_chunk(
- list, map_size, start_pfn,
- end_pfn, nr_pages, last_pfn,
- &num_released, &num_remapped);
+ start_pfn, end_pfn, nr_pages,
+ last_pfn, &num_released,
+ &num_remapped);
start = end;
}
}
@@ -549,12 +549,12 @@ static void __init xen_align_and_add_e820_region(phys_addr_t start,
e820_add_region(start, end - start, type);
}

-static void __init xen_ignore_unusable(struct e820entry *list, size_t map_size)
+static void __init xen_ignore_unusable(void)
{
- struct e820entry *entry;
+ struct e820entry *entry = xen_e820_map;
unsigned int i;

- for (i = 0, entry = list; i < map_size; i++, entry++) {
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
if (entry->type == E820_UNUSABLE)
entry->type = E820_RAM;
}
@@ -600,8 +600,6 @@ static void __init xen_reserve_xen_mfnlist(void)
**/
char * __init xen_memory_setup(void)
{
- static struct e820entry map[E820MAX] __initdata;
-
unsigned long max_pfn = xen_start_info->nr_pages;
phys_addr_t mem_end;
int rc;
@@ -616,7 +614,7 @@ char * __init xen_memory_setup(void)
mem_end = PFN_PHYS(max_pfn);

memmap.nr_entries = E820MAX;
- set_xen_guest_handle(memmap.buffer, map);
+ set_xen_guest_handle(memmap.buffer, xen_e820_map);

op = xen_initial_domain() ?
XENMEM_machine_memory_map :
@@ -625,15 +623,16 @@ char * __init xen_memory_setup(void)
if (rc == -ENOSYS) {
BUG_ON(xen_initial_domain());
memmap.nr_entries = 1;
- map[0].addr = 0ULL;
- map[0].size = mem_end;
+ xen_e820_map[0].addr = 0ULL;
+ xen_e820_map[0].size = mem_end;
/* 8MB slack (to balance backend allocations). */
- map[0].size += 8ULL << 20;
- map[0].type = E820_RAM;
+ xen_e820_map[0].size += 8ULL << 20;
+ xen_e820_map[0].type = E820_RAM;
rc = 0;
}
BUG_ON(rc);
BUG_ON(memmap.nr_entries == 0);
+ xen_e820_map_entries = memmap.nr_entries;

/*
* Xen won't allow a 1:1 mapping to be created to UNUSABLE
@@ -644,10 +643,11 @@ char * __init xen_memory_setup(void)
* a patch in the future.
*/
if (xen_initial_domain())
- xen_ignore_unusable(map, memmap.nr_entries);
+ xen_ignore_unusable();

/* Make sure the Xen-supplied memory map is well-ordered. */
- sanitize_e820_map(map, memmap.nr_entries, &memmap.nr_entries);
+ sanitize_e820_map(xen_e820_map, xen_e820_map_entries,
+ &xen_e820_map_entries);

max_pages = xen_get_max_pages();
if (max_pages > max_pfn)
@@ -657,8 +657,8 @@ char * __init xen_memory_setup(void)
* Set identity map on non-RAM pages and prepare remapping the
* underlying RAM.
*/
- xen_set_identity_and_remap(map, memmap.nr_entries, max_pfn,
- &xen_released_pages, &remapped_pages);
+ xen_set_identity_and_remap(max_pfn, &xen_released_pages,
+ &remapped_pages);

extra_pages += xen_released_pages;
extra_pages += remapped_pages;
@@ -677,10 +677,10 @@ char * __init xen_memory_setup(void)
extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
extra_pages);
i = 0;
- while (i < memmap.nr_entries) {
- phys_addr_t addr = map[i].addr;
- phys_addr_t size = map[i].size;
- u32 type = map[i].type;
+ while (i < xen_e820_map_entries) {
+ phys_addr_t addr = xen_e820_map[i].addr;
+ phys_addr_t size = xen_e820_map[i].size;
+ u32 type = xen_e820_map[i].type;

if (type == E820_RAM) {
if (addr < mem_end) {
@@ -696,9 +696,9 @@ char * __init xen_memory_setup(void)

xen_align_and_add_e820_region(addr, size, type);

- map[i].addr += size;
- map[i].size -= size;
- if (map[i].size == 0)
+ xen_e820_map[i].addr += size;
+ xen_e820_map[i].size -= size;
+ if (xen_e820_map[i].size == 0)
i++;
}

@@ -709,7 +709,7 @@ char * __init xen_memory_setup(void)
* PFNs above MAX_P2M_PFN are considered identity mapped as
* well.
*/
- set_phys_range_identity(map[i-1].addr / PAGE_SIZE, ~0ul);
+ set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);

/*
* In domU, the ISA region is normal, usable memory, but we
@@ -731,23 +731,25 @@ char * __init xen_memory_setup(void)
*/
char * __init xen_auto_xlated_memory_setup(void)
{
- static struct e820entry map[E820MAX] __initdata;
-
struct xen_memory_map memmap;
int i;
int rc;

memmap.nr_entries = E820MAX;
- set_xen_guest_handle(memmap.buffer, map);
+ set_xen_guest_handle(memmap.buffer, xen_e820_map);

rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
if (rc < 0)
panic("No memory map (%d)\n", rc);

- sanitize_e820_map(map, ARRAY_SIZE(map), &memmap.nr_entries);
+ xen_e820_map_entries = memmap.nr_entries;
+
+ sanitize_e820_map(xen_e820_map, ARRAY_SIZE(xen_e820_map),
+ &xen_e820_map_entries);

- for (i = 0; i < memmap.nr_entries; i++)
- e820_add_region(map[i].addr, map[i].size, map[i].type);
+ for (i = 0; i < xen_e820_map_entries; i++)
+ e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
+ xen_e820_map[i].type);

xen_reserve_xen_mfnlist();

--
2.1.4

2015-06-08 12:09:55

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 06/16] xen: split counting of extra memory pages from remapping

Memory pages in the initial memory setup done by the Xen hypervisor
conflicting with the target E820 map are remapped. In order to do this
those pages are counted and remapped in xen_set_identity_and_remap().

Split the counting from the remapping operation to be able to setup
the needed memory sizes in time but doing the remap operation at a
later time. This enables us to simplify the interface to
xen_set_identity_and_remap() as the number of remapped and released
pages is no longer needed here.

Finally move the remapping further down to prepare relocating
conflicting memory contents before the memory might be clobbered by
xen_set_identity_and_remap(). This requires to not destroy the Xen
E820 map when the one for the system is being constructed.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/setup.c | 98 +++++++++++++++++++++++++++++++---------------------
1 file changed, 58 insertions(+), 40 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index ab6c36e..87251b4 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -223,7 +223,7 @@ static int __init xen_free_mfn(unsigned long mfn)
* as a fallback if the remapping fails.
*/
static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
- unsigned long end_pfn, unsigned long nr_pages, unsigned long *released)
+ unsigned long end_pfn, unsigned long nr_pages)
{
unsigned long pfn, end;
int ret;
@@ -243,7 +243,7 @@ static void __init xen_set_identity_and_release_chunk(unsigned long start_pfn,
WARN(ret != 1, "Failed to release pfn %lx err=%d\n", pfn, ret);

if (ret == 1) {
- (*released)++;
+ xen_released_pages++;
if (!__set_phys_to_machine(pfn, INVALID_P2M_ENTRY))
break;
} else
@@ -359,8 +359,7 @@ static void __init xen_do_set_identity_and_remap_chunk(
*/
static unsigned long __init xen_set_identity_and_remap_chunk(
unsigned long start_pfn, unsigned long end_pfn, unsigned long nr_pages,
- unsigned long remap_pfn, unsigned long *released,
- unsigned long *remapped)
+ unsigned long remap_pfn)
{
unsigned long pfn;
unsigned long i = 0;
@@ -385,7 +384,7 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
if (!remap_range_size) {
pr_warning("Unable to find available pfn range, not remapping identity pages\n");
xen_set_identity_and_release_chunk(cur_pfn,
- cur_pfn + left, nr_pages, released);
+ cur_pfn + left, nr_pages);
break;
}
/* Adjust size to fit in current e820 RAM region */
@@ -397,7 +396,6 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
/* Update variables to reflect new mappings. */
i += size;
remap_pfn += size;
- *remapped += size;
}

/*
@@ -412,14 +410,11 @@ static unsigned long __init xen_set_identity_and_remap_chunk(
return remap_pfn;
}

-static void __init xen_set_identity_and_remap(unsigned long nr_pages,
- unsigned long *released, unsigned long *remapped)
+static void __init xen_set_identity_and_remap(unsigned long nr_pages)
{
phys_addr_t start = 0;
unsigned long last_pfn = nr_pages;
const struct e820entry *entry = xen_e820_map;
- unsigned long num_released = 0;
- unsigned long num_remapped = 0;
int i;

/*
@@ -445,16 +440,12 @@ static void __init xen_set_identity_and_remap(unsigned long nr_pages,
if (start_pfn < end_pfn)
last_pfn = xen_set_identity_and_remap_chunk(
start_pfn, end_pfn, nr_pages,
- last_pfn, &num_released,
- &num_remapped);
+ last_pfn);
start = end;
}
}

- *released = num_released;
- *remapped = num_remapped;
-
- pr_info("Released %ld page(s)\n", num_released);
+ pr_info("Released %ld page(s)\n", xen_released_pages);
}

/*
@@ -560,6 +551,28 @@ static void __init xen_ignore_unusable(void)
}
}

+static unsigned long __init xen_count_remap_pages(unsigned long max_pfn)
+{
+ unsigned long extra = 0;
+ const struct e820entry *entry = xen_e820_map;
+ int i;
+
+ for (i = 0; i < xen_e820_map_entries; i++, entry++) {
+ unsigned long start_pfn = PFN_DOWN(entry->addr);
+ unsigned long end_pfn = PFN_UP(entry->addr + entry->size);
+
+ if (start_pfn >= max_pfn)
+ break;
+ if (entry->type == E820_RAM)
+ continue;
+ if (end_pfn >= max_pfn)
+ end_pfn = max_pfn;
+ extra += end_pfn - start_pfn;
+ }
+
+ return extra;
+}
+
/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
@@ -601,12 +614,12 @@ static void __init xen_reserve_xen_mfnlist(void)
char * __init xen_memory_setup(void)
{
unsigned long max_pfn = xen_start_info->nr_pages;
- phys_addr_t mem_end;
+ phys_addr_t mem_end, addr, size, chunk_size;
+ u32 type;
int rc;
struct xen_memory_map memmap;
unsigned long max_pages;
unsigned long extra_pages = 0;
- unsigned long remapped_pages;
int i;
int op;

@@ -653,15 +666,8 @@ char * __init xen_memory_setup(void)
if (max_pages > max_pfn)
extra_pages += max_pages - max_pfn;

- /*
- * Set identity map on non-RAM pages and prepare remapping the
- * underlying RAM.
- */
- xen_set_identity_and_remap(max_pfn, &xen_released_pages,
- &remapped_pages);
-
- extra_pages += xen_released_pages;
- extra_pages += remapped_pages;
+ /* How many extra pages do we need due to remapping? */
+ extra_pages += xen_count_remap_pages(max_pfn);

/*
* Clamp the amount of extra memory to a EXTRA_MEM_RATIO
@@ -677,29 +683,35 @@ char * __init xen_memory_setup(void)
extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
extra_pages);
i = 0;
+ addr = xen_e820_map[0].addr;
+ size = xen_e820_map[0].size;
while (i < xen_e820_map_entries) {
- phys_addr_t addr = xen_e820_map[i].addr;
- phys_addr_t size = xen_e820_map[i].size;
- u32 type = xen_e820_map[i].type;
+ chunk_size = size;
+ type = xen_e820_map[i].type;

if (type == E820_RAM) {
if (addr < mem_end) {
- size = min(size, mem_end - addr);
+ chunk_size = min(size, mem_end - addr);
} else if (extra_pages) {
- size = min(size, PFN_PHYS(extra_pages));
- extra_pages -= PFN_DOWN(size);
- xen_add_extra_mem(addr, size);
- xen_max_p2m_pfn = PFN_DOWN(addr + size);
+ chunk_size = min(size, PFN_PHYS(extra_pages));
+ extra_pages -= PFN_DOWN(chunk_size);
+ xen_add_extra_mem(addr, chunk_size);
+ xen_max_p2m_pfn = PFN_DOWN(addr + chunk_size);
} else
type = E820_UNUSABLE;
}

- xen_align_and_add_e820_region(addr, size, type);
+ xen_align_and_add_e820_region(addr, chunk_size, type);

- xen_e820_map[i].addr += size;
- xen_e820_map[i].size -= size;
- if (xen_e820_map[i].size == 0)
+ addr += chunk_size;
+ size -= chunk_size;
+ if (size == 0) {
i++;
+ if (i < xen_e820_map_entries) {
+ addr = xen_e820_map[i].addr;
+ size = xen_e820_map[i].size;
+ }
+ }
}

/*
@@ -709,7 +721,7 @@ char * __init xen_memory_setup(void)
* PFNs above MAX_P2M_PFN are considered identity mapped as
* well.
*/
- set_phys_range_identity(xen_e820_map[i - 1].addr / PAGE_SIZE, ~0ul);
+ set_phys_range_identity(addr / PAGE_SIZE, ~0ul);

/*
* In domU, the ISA region is normal, usable memory, but we
@@ -723,6 +735,12 @@ char * __init xen_memory_setup(void)

xen_reserve_xen_mfnlist();

+ /*
+ * Set identity map on non-RAM pages and prepare remapping the
+ * underlying RAM.
+ */
+ xen_set_identity_and_remap(max_pfn);
+
return "Xen";
}

--
2.1.4

2015-06-08 12:09:42

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 07/16] xen: check memory area against e820 map

Provide a service routine to check a physical memory area against the
E820 map. The routine will return false if the complete area is RAM
according to the E820 map and true otherwise.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/setup.c | 23 +++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 1 +
2 files changed, 24 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 87251b4..99ef82c 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -573,6 +573,29 @@ static unsigned long __init xen_count_remap_pages(unsigned long max_pfn)
return extra;
}

+bool __init xen_is_e820_reserved(phys_addr_t start, phys_addr_t size)
+{
+ struct e820entry *entry;
+ unsigned mapcnt;
+ phys_addr_t end;
+
+ if (!size)
+ return false;
+
+ end = start + size;
+ entry = xen_e820_map;
+
+ for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++) {
+ if (entry->type == E820_RAM && entry->addr <= start &&
+ (entry->addr + entry->size) >= end)
+ return false;
+
+ entry++;
+ }
+
+ return true;
+}
+
/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 9e195c6..c1385b8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -39,6 +39,7 @@ void xen_reserve_top(void);
void xen_mm_pin_all(void);
void xen_mm_unpin_all(void);

+bool __init xen_is_e820_reserved(phys_addr_t start, phys_addr_t size);
unsigned long __ref xen_chk_extra_mem(unsigned long pfn);
void __init xen_inv_extra_mem(void);
void __init xen_remap_memory(void);
--
2.1.4

2015-06-08 12:09:23

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 08/16] xen: find unused contiguous memory area

For being able to relocate pre-allocated data areas like initrd or
p2m list it is mandatory to find a contiguous memory area which is
not yet in use and doesn't conflict with the memory map we want to
be in effect.

In case such an area is found reserve it at once as this will be
required to be done in any case.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/setup.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 1 +
2 files changed, 35 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 99ef82c..973d294 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -597,6 +597,40 @@ bool __init xen_is_e820_reserved(phys_addr_t start, phys_addr_t size)
}

/*
+ * Find a free area in physical memory not yet reserved and compliant with
+ * E820 map.
+ * Used to relocate pre-allocated areas like initrd or p2m list which are in
+ * conflict with the to be used E820 map.
+ * In case no area is found, return 0. Otherwise return the physical address
+ * of the area which is already reserved for convenience.
+ */
+phys_addr_t __init xen_find_free_area(phys_addr_t size)
+{
+ unsigned mapcnt;
+ phys_addr_t addr, start;
+ struct e820entry *entry = xen_e820_map;
+
+ for (mapcnt = 0; mapcnt < xen_e820_map_entries; mapcnt++, entry++) {
+ if (entry->type != E820_RAM || entry->size < size)
+ continue;
+ start = entry->addr;
+ for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+ if (!memblock_is_reserved(addr))
+ continue;
+ start = addr + PAGE_SIZE;
+ if (start + size > entry->addr + entry->size)
+ break;
+ }
+ if (addr >= start + size) {
+ memblock_reserve(start, size);
+ return start;
+ }
+ }
+
+ return 0;
+}
+
+/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
* We tried to make the the memblock_reserve more selective so
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index c1385b8..3f1669c 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -43,6 +43,7 @@ bool __init xen_is_e820_reserved(phys_addr_t start, phys_addr_t size);
unsigned long __ref xen_chk_extra_mem(unsigned long pfn);
void __init xen_inv_extra_mem(void);
void __init xen_remap_memory(void);
+phys_addr_t __init xen_find_free_area(phys_addr_t size);
char * __init xen_memory_setup(void);
char * xen_auto_xlated_memory_setup(void);
void __init xen_arch_setup(void);
--
2.1.4

2015-06-08 12:09:30

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 09/16] xen: check for kernel memory conflicting with memory layout

Checks whether the pre-allocated memory of the loaded kernel is in
conflict with the target memory map. If this is the case, just panic
instead of run into problems later, as there is nothing we can do
to repair this situation.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/setup.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 973d294..9bd3f35 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -27,6 +27,7 @@
#include <xen/interface/memory.h>
#include <xen/interface/physdev.h>
#include <xen/features.h>
+#include <xen/hvc-console.h>
#include "xen-ops.h"
#include "vdso.h"
#include "p2m.h"
@@ -790,6 +791,17 @@ char * __init xen_memory_setup(void)

sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);

+ /*
+ * Check whether the kernel itself conflicts with the target E820 map.
+ * Failing now is better than running into weird problems later due
+ * to relocating (and even reusing) pages with kernel text or data.
+ */
+ if (xen_is_e820_reserved(__pa_symbol(_text),
+ __pa_symbol(__bss_stop) - __pa_symbol(_text))) {
+ xen_raw_console_write("Xen hypervisor allocated kernel memory conflicts with E820 map\n");
+ BUG();
+ }
+
xen_reserve_xen_mfnlist();

/*
--
2.1.4

2015-06-08 12:09:36

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 10/16] xen: check pre-allocated page tables for conflict with memory map

Check whether the page tables built by the domain builder are at
memory addresses which are in conflict with the target memory map.
If this is the case just panic instead of running into problems
later.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/mmu.c | 19 ++++++++++++++++---
arch/x86/xen/setup.c | 6 ++++++
arch/x86/xen/xen-ops.h | 1 +
3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c04e14e..1982617 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -116,6 +116,7 @@ static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
DEFINE_PER_CPU(unsigned long, xen_cr3); /* cr3 stored as physaddr */
DEFINE_PER_CPU(unsigned long, xen_current_cr3); /* actual vcpu cr3 */

+static phys_addr_t xen_pt_base, xen_pt_size __initdata;

/*
* Just beyond the highest usermode address. STACK_TOP_MAX has a
@@ -1998,7 +1999,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
check_pt_base(&pt_base, &pt_end, addr[i]);

/* Our (by three pages) smaller Xen pagetable that we are using */
- memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+ xen_pt_base = PFN_PHYS(pt_base);
+ xen_pt_size = (pt_end - pt_base) * PAGE_SIZE;
+ memblock_reserve(xen_pt_base, xen_pt_size);
/* protect xen_start_info */
memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
@@ -2074,11 +2077,21 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
PFN_DOWN(__pa(initial_page_table)));
xen_write_cr3(__pa(initial_page_table));

- memblock_reserve(__pa(xen_start_info->pt_base),
- xen_start_info->nr_pt_frames * PAGE_SIZE);
+ xen_pt_base = __pa(xen_start_info->pt_base);
+ xen_pt_size = xen_start_info->nr_pt_frames * PAGE_SIZE;
+
+ memblock_reserve(xen_pt_base, xen_pt_size);
}
#endif /* CONFIG_X86_64 */

+void __init xen_pt_check_e820(void)
+{
+ if (xen_is_e820_reserved(xen_pt_base, xen_pt_size)) {
+ xen_raw_console_write("Xen hypervisor allocated page table memory conflicts with E820 map\n");
+ BUG();
+ }
+}
+
static unsigned char dummy_mapping[PAGE_SIZE] __page_aligned_bss;

static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 9bd3f35..3fca9c1 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -802,6 +802,12 @@ char * __init xen_memory_setup(void)
BUG();
}

+ /*
+ * Check for a conflict of the hypervisor supplied page tables with
+ * the target E820 map.
+ */
+ xen_pt_check_e820();
+
xen_reserve_xen_mfnlist();

/*
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3f1669c..553abd8 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -35,6 +35,7 @@ void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
void xen_reserve_top(void);
+void __init xen_pt_check_e820(void);

void xen_mm_pin_all(void);
void xen_mm_unpin_all(void);
--
2.1.4

2015-06-08 12:09:08

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 11/16] xen: check for initrd conflicting with e820 map

Check whether the initrd is placed at a location which is conflicting
with the target E820 map. If this is the case relocate it to a new
area unused up to now and compliant to the E820 map.

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
---
arch/x86/xen/setup.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 3fca9c1..0d7f881 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -632,6 +632,36 @@ phys_addr_t __init xen_find_free_area(phys_addr_t size)
}

/*
+ * Like memcpy, but with physical addresses for dest and src.
+ */
+static void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src,
+ phys_addr_t n)
+{
+ phys_addr_t dest_off, src_off, dest_len, src_len, len;
+ void *from, *to;
+
+ while (n) {
+ dest_off = dest & ~PAGE_MASK;
+ src_off = src & ~PAGE_MASK;
+ dest_len = n;
+ if (dest_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off)
+ dest_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - dest_off;
+ src_len = n;
+ if (src_len > (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off)
+ src_len = (NR_FIX_BTMAPS << PAGE_SHIFT) - src_off;
+ len = min(dest_len, src_len);
+ to = early_memremap(dest - dest_off, dest_len + dest_off);
+ from = early_memremap(src - src_off, src_len + src_off);
+ memcpy(to, from, len);
+ early_memunmap(to, dest_len + dest_off);
+ early_memunmap(from, src_len + src_off);
+ n -= len;
+ dest += len;
+ src += len;
+ }
+}
+
+/*
* Reserve Xen mfn_list.
* See comment above "struct start_info" in <xen/interface/xen.h>
* We tried to make the the memblock_reserve more selective so
@@ -810,6 +840,27 @@ char * __init xen_memory_setup(void)

xen_reserve_xen_mfnlist();

+ /* Check for a conflict of the initrd with the target E820 map. */
+ if (xen_is_e820_reserved(boot_params.hdr.ramdisk_image,
+ boot_params.hdr.ramdisk_size)) {
+ phys_addr_t new_area, start, size;
+
+ new_area = xen_find_free_area(boot_params.hdr.ramdisk_size);
+ if (!new_area) {
+ xen_raw_console_write("Can't find new memory area for initrd needed due to E820 map conflict\n");
+ BUG();
+ }
+
+ start = boot_params.hdr.ramdisk_image;
+ size = boot_params.hdr.ramdisk_size;
+ xen_phys_memcpy(new_area, start, size);
+ pr_info("initrd moved from [mem %#010llx-%#010llx] to [mem %#010llx-%#010llx]\n",
+ start, start + size, new_area, new_area + size);
+ memblock_free(start, size);
+ boot_params.hdr.ramdisk_image = new_area;
+ boot_params.ext_ramdisk_image = new_area >> 32;
+ }
+
/*
* Set identity map on non-RAM pages and prepare remapping the
* underlying RAM.
--
2.1.4

2015-06-08 12:09:12

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 12/16] mm: provide early_memremap_ro to establish read-only mapping

During early boot as Xen pv domain the kernel needs to map some page
tables supplied by the hypervisor read only. This is needed to be
able to relocate some data structures conflicting with the physical
memory map especially on systems with huge RAM (above 512GB).

Provide the function early_memremap_ro() to provide this read only
mapping.

Signed-off-by: Juergen Gross <[email protected]>
---
include/asm-generic/early_ioremap.h | 2 ++
include/asm-generic/fixmap.h | 3 +++
mm/early_ioremap.c | 11 +++++++++++
3 files changed, 16 insertions(+)

diff --git a/include/asm-generic/early_ioremap.h b/include/asm-generic/early_ioremap.h
index a5de55c..316bd04 100644
--- a/include/asm-generic/early_ioremap.h
+++ b/include/asm-generic/early_ioremap.h
@@ -11,6 +11,8 @@ extern void __iomem *early_ioremap(resource_size_t phys_addr,
unsigned long size);
extern void *early_memremap(resource_size_t phys_addr,
unsigned long size);
+extern void *early_memremap_ro(resource_size_t phys_addr,
+ unsigned long size);
extern void early_iounmap(void __iomem *addr, unsigned long size);
extern void early_memunmap(void *addr, unsigned long size);

diff --git a/include/asm-generic/fixmap.h b/include/asm-generic/fixmap.h
index f23174f..d8cc637 100644
--- a/include/asm-generic/fixmap.h
+++ b/include/asm-generic/fixmap.h
@@ -46,6 +46,9 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
#ifndef FIXMAP_PAGE_NORMAL
#define FIXMAP_PAGE_NORMAL PAGE_KERNEL
#endif
+#ifndef FIXMAP_PAGE_RO
+#define FIXMAP_PAGE_RO PAGE_KERNEL_RO
+#endif
#ifndef FIXMAP_PAGE_NOCACHE
#define FIXMAP_PAGE_NOCACHE PAGE_KERNEL_NOCACHE
#endif
diff --git a/mm/early_ioremap.c b/mm/early_ioremap.c
index e10ccd2..e4ffaac 100644
--- a/mm/early_ioremap.c
+++ b/mm/early_ioremap.c
@@ -217,6 +217,12 @@ early_memremap(resource_size_t phys_addr, unsigned long size)
return (__force void *)__early_ioremap(phys_addr, size,
FIXMAP_PAGE_NORMAL);
}
+void __init *
+early_memremap_ro(resource_size_t phys_addr, unsigned long size)
+{
+ return (__force void *)__early_ioremap(phys_addr, size,
+ FIXMAP_PAGE_RO);
+}
#else /* CONFIG_MMU */

void __init __iomem *
@@ -231,6 +237,11 @@ early_memremap(resource_size_t phys_addr, unsigned long size)
{
return (void *)phys_addr;
}
+void __init *
+early_memremap_ro(resource_size_t phys_addr, unsigned long size)
+{
+ return (void *)phys_addr;
+}

void __init early_iounmap(void __iomem *addr, unsigned long size)
{
--
2.1.4

2015-06-08 12:08:05

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 13/16] xen: add explicit memblock_reserve() calls for special pages

Some special pages containing interfaces to xen are being reserved
implicitly only today. The memblock_reserve() call to reserve them is
meant to reserve the p2m list supplied by xen. It is just reserving
not only the p2m list itself, but some more pages up to the start of
the xen built page tables.

To be able to move the p2m list to another pfn range, which is needed
for support of huge RAM, this memblock_reserve() must be split up to
cover all affected reserved pages explicitly.

The affected pages are:
- start_info page
- xenstore ring
- console ring
- shared_info page

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/enlighten.c | 1 +
arch/x86/xen/mmu.c | 11 +++++++++++
arch/x86/xen/xen-ops.h | 1 +
3 files changed, 13 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 46957ea..a29bcdb 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1569,6 +1569,7 @@ asmlinkage __visible void __init xen_start_kernel(void)

xen_raw_console_write("mapping kernel into physical memory\n");
xen_setup_kernel_pagetable((pgd_t *)xen_start_info->pt_base, xen_start_info->nr_pages);
+ xen_reserve_special_pages();

/*
* Modify the cache mode translation tables to match Xen's PAT
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 1982617..a286953 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2084,6 +2084,17 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
}
#endif /* CONFIG_X86_64 */

+void __init xen_reserve_special_pages(void)
+{
+ memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
+ memblock_reserve(PFN_PHYS(mfn_to_pfn(xen_start_info->store_mfn)),
+ PAGE_SIZE);
+ if (!xen_initial_domain())
+ memblock_reserve(PFN_PHYS(mfn_to_pfn(
+ xen_start_info->console.domU.mfn)), PAGE_SIZE);
+ memblock_reserve(__pa(HYPERVISOR_shared_info), PAGE_SIZE);
+}
+
void __init xen_pt_check_e820(void)
{
if (xen_is_e820_reserved(xen_pt_base, xen_pt_size)) {
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 553abd8..88bd15f 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -35,6 +35,7 @@ void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
void xen_reserve_top(void);
+void __init xen_reserve_special_pages(void);
void __init xen_pt_check_e820(void);

void xen_mm_pin_all(void);
--
2.1.4

2015-06-08 12:08:14

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 14/16] xen: move p2m list if conflicting with e820 map

Check whether the hypervisor supplied p2m list is placed at a location
which is conflicting with the target E820 map. If this is the case
relocate it to a new area unused up to now and compliant to the E820
map.

As the p2m list might by huge (up to several GB) and is required to be
mapped virtually, set up a temporary mapping for the copied list.

For pvh domains just delete the p2m related information from start
info instead of reserving the p2m memory, as we don't need it at all.

For 32 bit kernels adjust the memblock_reserve() parameters in order
to cover the page tables only. This requires to memblock_reserve() the
start_info page on it's own.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/mmu.c | 234 +++++++++++++++++++++++++++++++++++++++++++++----
arch/x86/xen/setup.c | 51 +++++------
arch/x86/xen/xen-ops.h | 3 +
3 files changed, 247 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a286953..06b8b64 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1094,6 +1094,16 @@ static void xen_exit_mmap(struct mm_struct *mm)

static void xen_post_allocator_init(void);

+static void __init pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
+{
+ struct mmuext_op op;
+
+ op.cmd = cmd;
+ op.arg1.mfn = pfn_to_mfn(pfn);
+ if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
+ BUG();
+}
+
#ifdef CONFIG_X86_64
static void __init xen_cleanhighmap(unsigned long vaddr,
unsigned long vaddr_end)
@@ -1129,10 +1139,12 @@ static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
memblock_free(paddr, size);
}

-static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl, bool unpin)
{
unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;

+ if (unpin)
+ pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(pa));
ClearPagePinned(virt_to_page(__va(pa)));
xen_free_ro_pages(pa, PAGE_SIZE);
}
@@ -1151,7 +1163,9 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
pmd_t *pmd;
pte_t *pte;
unsigned int i;
+ bool unpin;

+ unpin = (vaddr == 2 * PGDIR_SIZE);
set_pgd(pgd, __pgd(0));
do {
pud = pud_page + pud_index(va);
@@ -1168,22 +1182,24 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
xen_free_ro_pages(pa, PMD_SIZE);
} else if (!pmd_none(*pmd)) {
pte = pte_offset_kernel(pmd, va);
+ set_pmd(pmd, __pmd(0));
for (i = 0; i < PTRS_PER_PTE; ++i) {
if (pte_none(pte[i]))
break;
pa = pte_pfn(pte[i]) << PAGE_SHIFT;
xen_free_ro_pages(pa, PAGE_SIZE);
}
- xen_cleanmfnmap_free_pgtbl(pte);
+ xen_cleanmfnmap_free_pgtbl(pte, unpin);
}
va += PMD_SIZE;
if (pmd_index(va))
continue;
- xen_cleanmfnmap_free_pgtbl(pmd);
+ set_pud(pud, __pud(0));
+ xen_cleanmfnmap_free_pgtbl(pmd, unpin);
}

} while (pud_index(va) || pmd_index(va));
- xen_cleanmfnmap_free_pgtbl(pud_page);
+ xen_cleanmfnmap_free_pgtbl(pud_page, unpin);
}

static void __init xen_pagetable_p2m_free(void)
@@ -1219,6 +1235,12 @@ static void __init xen_pagetable_p2m_free(void)
} else {
xen_cleanmfnmap(addr);
}
+}
+
+static void __init xen_pagetable_cleanhighmap(void)
+{
+ unsigned long size;
+ unsigned long addr;

/* At this stage, cleanup_highmap has already cleaned __ka space
* from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1251,6 +1273,8 @@ static void __init xen_pagetable_p2m_setup(void)

#ifdef CONFIG_X86_64
xen_pagetable_p2m_free();
+
+ xen_pagetable_cleanhighmap();
#endif
/* And revector! Bye bye old array */
xen_start_info->mfn_list = (unsigned long)xen_p2m_addr;
@@ -1586,15 +1610,6 @@ static void __init xen_set_pte_init(pte_t *ptep, pte_t pte)
native_set_pte(ptep, pte);
}

-static void __init pin_pagetable_pfn(unsigned cmd, unsigned long pfn)
-{
- struct mmuext_op op;
- op.cmd = cmd;
- op.arg1.mfn = pfn_to_mfn(pfn);
- if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
- BUG();
-}
-
/* Early in boot, while setting up the initial pagetable, assume
everything is pinned. */
static void __init xen_alloc_pte_init(struct mm_struct *mm, unsigned long pfn)
@@ -2002,11 +2017,189 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
xen_pt_base = PFN_PHYS(pt_base);
xen_pt_size = (pt_end - pt_base) * PAGE_SIZE;
memblock_reserve(xen_pt_base, xen_pt_size);
- /* protect xen_start_info */
- memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
+
/* Revector the xen_start_info */
xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
}
+
+/*
+ * Read a value from a physical address.
+ */
+static unsigned long __init xen_read_phys_ulong(phys_addr_t addr)
+{
+ unsigned long *vaddr;
+ unsigned long val;
+
+ vaddr = early_memremap_ro(addr, sizeof(val));
+ val = *vaddr;
+ early_memunmap(vaddr, sizeof(val));
+ return val;
+}
+
+/*
+ * Translate a virtual address to a physical one without relying on mapped
+ * page tables.
+ */
+static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
+{
+ phys_addr_t pa;
+ pgd_t pgd;
+ pud_t pud;
+ pmd_t pmd;
+ pte_t pte;
+
+ pa = read_cr3();
+ pgd = native_make_pgd(xen_read_phys_ulong(pa + pgd_index(vaddr) *
+ sizeof(pgd)));
+ if (!pgd_present(pgd))
+ return 0;
+
+ pa = pgd_val(pgd) & PTE_PFN_MASK;
+ pud = native_make_pud(xen_read_phys_ulong(pa + pud_index(vaddr) *
+ sizeof(pud)));
+ if (!pud_present(pud))
+ return 0;
+ pa = pud_pfn(pud) << PAGE_SHIFT;
+ if (pud_large(pud))
+ return pa + (vaddr & ~PUD_MASK);
+
+ pmd = native_make_pmd(xen_read_phys_ulong(pa + pmd_index(vaddr) *
+ sizeof(pmd)));
+ if (!pmd_present(pmd))
+ return 0;
+ pa = pmd_pfn(pmd) << PAGE_SHIFT;
+ if (pmd_large(pmd))
+ return pa + (vaddr & ~PMD_MASK);
+
+ pte = native_make_pte(xen_read_phys_ulong(pa + pte_index(vaddr) *
+ sizeof(pte)));
+ if (!pte_present(pte))
+ return 0;
+ pa = pte_pfn(pte) << PAGE_SHIFT;
+
+ return pa | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * Find a new area for the hypervisor supplied p2m list and relocate the p2m to
+ * this area.
+ */
+void __init xen_relocate_p2m(void)
+{
+ phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys;
+ unsigned long p2m_pfn, p2m_pfn_end, n_frames, pfn, pfn_end;
+ int n_pte, n_pt, n_pmd, n_pud, idx_pte, idx_pt, idx_pmd, idx_pud;
+ pte_t *pt;
+ pmd_t *pmd;
+ pud_t *pud;
+ pgd_t *pgd;
+ unsigned long *new_p2m;
+
+ size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+ n_pte = roundup(size, PAGE_SIZE) >> PAGE_SHIFT;
+ n_pt = roundup(size, PMD_SIZE) >> PMD_SHIFT;
+ n_pmd = roundup(size, PUD_SIZE) >> PUD_SHIFT;
+ n_pud = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
+ n_frames = n_pte + n_pt + n_pmd + n_pud;
+
+ new_area = xen_find_free_area(PFN_PHYS(n_frames));
+ if (!new_area) {
+ xen_raw_console_write("Can't find new memory area for p2m needed due to E820 map conflict\n");
+ BUG();
+ }
+
+ /*
+ * Setup the page tables for addressing the new p2m list.
+ * We have asked the hypervisor to map the p2m list at the user address
+ * PUD_SIZE. It may have done so, or it may have used a kernel space
+ * address depending on the Xen version.
+ * To avoid any possible virtual address collision, just use
+ * 2 * PUD_SIZE for the new area.
+ */
+ pud_phys = new_area;
+ pmd_phys = pud_phys + PFN_PHYS(n_pud);
+ pt_phys = pmd_phys + PFN_PHYS(n_pmd);
+ p2m_pfn = PFN_DOWN(pt_phys) + n_pt;
+
+ pgd = __va(read_cr3());
+ new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
+ for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
+ pud = early_memremap(pud_phys, PAGE_SIZE);
+ clear_page(pud);
+ for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
+ idx_pmd++) {
+ pmd = early_memremap(pmd_phys, PAGE_SIZE);
+ clear_page(pmd);
+ for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
+ idx_pt++) {
+ pt = early_memremap(pt_phys, PAGE_SIZE);
+ clear_page(pt);
+ for (idx_pte = 0;
+ idx_pte < min(n_pte, PTRS_PER_PTE);
+ idx_pte++) {
+ set_pte(pt + idx_pte,
+ pfn_pte(p2m_pfn, PAGE_KERNEL));
+ p2m_pfn++;
+ }
+ n_pte -= PTRS_PER_PTE;
+ early_memunmap(pt, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pt_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
+ PFN_DOWN(pt_phys));
+ set_pmd(pmd + idx_pt,
+ __pmd(_PAGE_TABLE | pt_phys));
+ pt_phys += PAGE_SIZE;
+ }
+ n_pt -= PTRS_PER_PMD;
+ early_memunmap(pmd, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pmd_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
+ PFN_DOWN(pmd_phys));
+ set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
+ pmd_phys += PAGE_SIZE;
+ }
+ n_pmd -= PTRS_PER_PUD;
+ early_memunmap(pud, PAGE_SIZE);
+ make_lowmem_page_readonly(__va(pud_phys));
+ pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
+ set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
+ pud_phys += PAGE_SIZE;
+ }
+
+ /* Now copy the old p2m info to the new area. */
+ memcpy(new_p2m, xen_p2m_addr, size);
+ xen_p2m_addr = new_p2m;
+
+ /* Release the old p2m list and set new list info. */
+ p2m_pfn = PFN_DOWN(xen_early_virt_to_phys(xen_start_info->mfn_list));
+ BUG_ON(!p2m_pfn);
+ p2m_pfn_end = p2m_pfn + PFN_DOWN(size);
+
+ if (xen_start_info->mfn_list < __START_KERNEL_map) {
+ pfn = xen_start_info->first_p2m_pfn;
+ pfn_end = xen_start_info->first_p2m_pfn +
+ xen_start_info->nr_p2m_frames;
+ set_pgd(pgd + 1, __pgd(0));
+ } else {
+ pfn = p2m_pfn;
+ pfn_end = p2m_pfn_end;
+ }
+
+ memblock_free(PFN_PHYS(pfn), PAGE_SIZE * (pfn_end - pfn));
+ while (pfn < pfn_end) {
+ if (pfn == p2m_pfn) {
+ pfn = p2m_pfn_end;
+ continue;
+ }
+ make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+ pfn++;
+ }
+
+ xen_start_info->mfn_list = (unsigned long)xen_p2m_addr;
+ xen_start_info->first_p2m_pfn = PFN_DOWN(new_area);
+ xen_start_info->nr_p2m_frames = n_frames;
+}
+
#else /* !CONFIG_X86_64 */
static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
static RESERVE_BRK_ARRAY(pmd_t, swapper_kernel_pmd, PTRS_PER_PMD);
@@ -2077,7 +2270,16 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
PFN_DOWN(__pa(initial_page_table)));
xen_write_cr3(__pa(initial_page_table));

- xen_pt_base = __pa(xen_start_info->pt_base);
+ /*
+ * The hypervisor supplied page tables are located just after the
+ * start info page. For 32 bit domains xen_start_info->pt_base is the
+ * pgd address which might be not the first page table in the page
+ * table pool.
+ * So to be sure to reserve all page tables use the knowledge that the
+ * page tables are located just after the start info page (this is a
+ * guaranteed interface).
+ */
+ xen_pt_base = __pa(xen_start_info) + PAGE_SIZE;
xen_pt_size = xen_start_info->nr_pt_frames * PAGE_SIZE;

memblock_reserve(xen_pt_base, xen_pt_size);
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0d7f881..b096d02 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -663,37 +663,35 @@ static void __init xen_phys_memcpy(phys_addr_t dest, phys_addr_t src,

/*
* Reserve Xen mfn_list.
- * See comment above "struct start_info" in <xen/interface/xen.h>
- * We tried to make the the memblock_reserve more selective so
- * that it would be clear what region is reserved. Sadly we ran
- * in the problem wherein on a 64-bit hypervisor with a 32-bit
- * initial domain, the pt_base has the cr3 value which is not
- * neccessarily where the pagetable starts! As Jan put it: "
- * Actually, the adjustment turns out to be correct: The page
- * tables for a 32-on-64 dom0 get allocated in the order "first L1",
- * "first L2", "first L3", so the offset to the page table base is
- * indeed 2. When reading xen/include/public/xen.h's comment
- * very strictly, this is not a violation (since there nothing is said
- * that the first thing in the page table space is pointed to by
- * pt_base; I admit that this seems to be implied though, namely
- * do I think that it is implied that the page table space is the
- * range [pt_base, pt_base + nt_pt_frames), whereas that
- * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
- * which - without a priori knowledge - the kernel would have
- * difficulty to figure out)." - so lets just fall back to the
- * easy way and reserve the whole region.
*/
static void __init xen_reserve_xen_mfnlist(void)
{
+ phys_addr_t start, size;
+
if (xen_start_info->mfn_list >= __START_KERNEL_map) {
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base -
- xen_start_info->mfn_list);
+ start = __pa(xen_start_info->mfn_list);
+ size = PFN_ALIGN(xen_start_info->nr_pages *
+ sizeof(unsigned long));
+ } else {
+ start = PFN_PHYS(xen_start_info->first_p2m_pfn);
+ size = PFN_PHYS(xen_start_info->nr_p2m_frames);
+ }
+
+ if (!xen_is_e820_reserved(start, size)) {
+ memblock_reserve(start, size);
return;
}

- memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
- PFN_PHYS(xen_start_info->nr_p2m_frames));
+#ifdef CONFIG_X86_32
+ /*
+ * Relocating the p2m on 32 bit system to an arbitrary virtual address
+ * is not supported, so just give up.
+ */
+ xen_raw_console_write("Xen hypervisor allocated p2m list conflicts with E820 map\n");
+ BUG();
+#else
+ xen_relocate_p2m();
+#endif
}

/**
@@ -895,7 +893,10 @@ char * __init xen_auto_xlated_memory_setup(void)
e820_add_region(xen_e820_map[i].addr, xen_e820_map[i].size,
xen_e820_map[i].type);

- xen_reserve_xen_mfnlist();
+ /* Remove p2m info, it is not needed. */
+ xen_start_info->mfn_list = 0;
+ xen_start_info->first_p2m_pfn = 0;
+ xen_start_info->nr_p2m_frames = 0;

return "Xen";
}
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 88bd15f..ee01a4e 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -40,6 +40,9 @@ void __init xen_pt_check_e820(void);

void xen_mm_pin_all(void);
void xen_mm_unpin_all(void);
+#ifdef CONFIG_X86_64
+void __init xen_relocate_p2m(void);
+#endif

bool __init xen_is_e820_reserved(phys_addr_t start, phys_addr_t size);
unsigned long __ref xen_chk_extra_mem(unsigned long pfn);
--
2.1.4

2015-06-08 12:07:57

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 15/16] xen: allow more than 512 GB of RAM for 64 bit pv-domains

64 bit pv-domains under Xen are limited to 512 GB of RAM today. The
main reason has been the 3 level p2m tree, which was replaced by the
virtual mapped linear p2m list. Parallel to the p2m list which is
being used by the kernel itself there is a 3 level mfn tree for usage
by the Xen tools and eventually for crash dump analysis. For this tree
the linear p2m list can serve as a replacement, too. As the kernel
can't know whether the tools are capable of dealing with the p2m list
instead of the mfn tree, the limit of 512 GB can't be dropped in all
cases.

This patch replaces the hard limit by a kernel parameter which tells
the kernel to obey the 512 GB limit or not. The default is selected by
a configuration parameter which specifies whether the 512 GB limit
should be active per default for domUs (domain save/restore/migration
and crash dump analysis are affected).

Memory above the domain limit is returned to the hypervisor instead of
being identity mapped, which was wrong anyway.

The kernel configuration parameter to specify the maximum size of a
domain can be deleted, as it is not relevant any more.

Signed-off-by: Juergen Gross <[email protected]>
---
Documentation/kernel-parameters.txt | 7 +++++
arch/x86/include/asm/xen/page.h | 4 ---
arch/x86/xen/Kconfig | 20 ++++++++-----
arch/x86/xen/p2m.c | 10 +++----
arch/x86/xen/setup.c | 59 +++++++++++++++++++++++++++++++------
5 files changed, 73 insertions(+), 27 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 61ab162..2f4a4ca 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -4014,6 +4014,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
plus one apbt timer for broadcast timer.
x86_intel_mid_timer=apbt_only | lapic_and_apbt

+ xen_512gb_limit [KNL,X86-64,XEN]
+ Restricts the kernel running paravirtualized under Xen
+ to use only up to 512 GB of RAM. The reason to do so is
+ crash analysis tools and Xen tools for doing domain
+ save/restore/migration must be enabled to handle larger
+ domains.
+
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index c44a5d5..10ef14b 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -35,10 +35,6 @@ typedef struct xpaddr {
#define FOREIGN_FRAME(m) ((m) | FOREIGN_FRAME_BIT)
#define IDENTITY_FRAME(m) ((m) | IDENTITY_FRAME_BIT)

-/* Maximum amount of memory we can handle in a domain in pages */
-#define MAX_DOMAIN_PAGES \
- ((unsigned long)((u64)CONFIG_XEN_MAX_DOMAIN_MEMORY * 1024 * 1024 * 1024 / PAGE_SIZE))
-
extern unsigned long *machine_to_phys_mapping;
extern unsigned long machine_to_phys_nr;
extern unsigned long *xen_p2m_addr;
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index e88fda8..7bcf21b 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,14 +23,18 @@ config XEN_PVHVM
def_bool y
depends on XEN && PCI && X86_LOCAL_APIC

-config XEN_MAX_DOMAIN_MEMORY
- int
- default 500 if X86_64
- default 64 if X86_32
- depends on XEN
- help
- This only affects the sizing of some bss arrays, the unused
- portions of which are freed.
+config XEN_512GB
+ bool "Limit Xen pv-domain memory to 512GB"
+ depends on XEN && X86_64
+ default y
+ help
+ Limit paravirtualized user domains to 512GB of RAM.
+
+ The Xen tools and crash dump analysis tools might not support
+ pv-domains with more than 512 GB of RAM. This option controls the
+ default setting of the kernel to use only up to 512 GB or more.
+ It is always possible to change the default via specifying the
+ boot parameter "xen_512gb_limit".

config XEN_SAVE_RESTORE
bool
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 6f80cd3..365a64a 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -516,7 +516,7 @@ static pte_t *alloc_p2m_pmd(unsigned long addr, pte_t *pte_pg)
*/
static bool alloc_p2m(unsigned long pfn)
{
- unsigned topidx, mididx;
+ unsigned topidx;
unsigned long *top_mfn_p, *mid_mfn;
pte_t *ptep, *pte_pg;
unsigned int level;
@@ -524,9 +524,6 @@ static bool alloc_p2m(unsigned long pfn)
unsigned long addr = (unsigned long)(xen_p2m_addr + pfn);
unsigned long p2m_pfn;

- topidx = p2m_top_index(pfn);
- mididx = p2m_mid_index(pfn);
-
ptep = lookup_address(addr, &level);
BUG_ON(!ptep || level != PG_LEVEL_4K);
pte_pg = (pte_t *)((unsigned long)ptep & ~(PAGE_SIZE - 1));
@@ -538,7 +535,8 @@ static bool alloc_p2m(unsigned long pfn)
return false;
}

- if (p2m_top_mfn) {
+ if (p2m_top_mfn && pfn < MAX_P2M_PFN) {
+ topidx = p2m_top_index(pfn);
top_mfn_p = &p2m_top_mfn[topidx];
mid_mfn = ACCESS_ONCE(p2m_top_mfn_p[topidx]);

@@ -595,7 +593,7 @@ static bool alloc_p2m(unsigned long pfn)
wmb(); /* Tools are synchronizing via p2m_generation. */
HYPERVISOR_shared_info->arch.p2m_generation++;
if (mid_mfn)
- mid_mfn[mididx] = virt_to_mfn(p2m);
+ mid_mfn[p2m_mid_index(pfn)] = virt_to_mfn(p2m);
p2m = NULL;
}

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index b096d02..f960021 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -33,6 +33,8 @@
#include "p2m.h"
#include "mmu.h"

+#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
+
/* Amount of extra memory space we add to the e820 ranges */
struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;

@@ -69,6 +71,26 @@ static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
*/
#define EXTRA_MEM_RATIO (10)

+static bool xen_512gb_limit __initdata = IS_ENABLED(CONFIG_XEN_512GB);
+
+static void __init xen_parse_512gb(void)
+{
+ bool val = false;
+ char *arg;
+
+ arg = strstr(xen_start_info->cmd_line, "xen_512gb_limit");
+ if (!arg)
+ return;
+
+ arg = strstr(xen_start_info->cmd_line, "xen_512gb_limit=");
+ if (!arg)
+ val = true;
+ else if (strtobool(arg + strlen("xen_512gb_limit="), &val))
+ return;
+
+ xen_512gb_limit = val;
+}
+
static void __init xen_add_extra_mem(phys_addr_t start, phys_addr_t size)
{
int i;
@@ -503,12 +525,29 @@ void __init xen_remap_memory(void)
pr_info("Remapped %ld page(s)\n", remapped);
}

+static unsigned long __init xen_get_pages_limit(void)
+{
+ unsigned long limit;
+
+#ifdef CONFIG_X86_32
+ limit = GB(64) / PAGE_SIZE;
+#else
+ limit = ~0ul;
+ if (!xen_initial_domain() && xen_512gb_limit)
+ limit = GB(512) / PAGE_SIZE;
+#endif
+ return limit;
+}
+
static unsigned long __init xen_get_max_pages(void)
{
- unsigned long max_pages = MAX_DOMAIN_PAGES;
+ unsigned long max_pages, limit;
domid_t domid = DOMID_SELF;
int ret;

+ limit = xen_get_pages_limit();
+ max_pages = limit;
+
/*
* For the initial domain we use the maximum reservation as
* the maximum page.
@@ -524,7 +563,7 @@ static unsigned long __init xen_get_max_pages(void)
max_pages = ret;
}

- return min(max_pages, MAX_DOMAIN_PAGES);
+ return min(max_pages, limit);
}

static void __init xen_align_and_add_e820_region(phys_addr_t start,
@@ -699,7 +738,7 @@ static void __init xen_reserve_xen_mfnlist(void)
**/
char * __init xen_memory_setup(void)
{
- unsigned long max_pfn = xen_start_info->nr_pages;
+ unsigned long max_pfn;
phys_addr_t mem_end, addr, size, chunk_size;
u32 type;
int rc;
@@ -709,7 +748,9 @@ char * __init xen_memory_setup(void)
int i;
int op;

- max_pfn = min(MAX_DOMAIN_PAGES, max_pfn);
+ xen_parse_512gb();
+ max_pfn = xen_get_pages_limit();
+ max_pfn = min(max_pfn, xen_start_info->nr_pages);
mem_end = PFN_PHYS(max_pfn);

memmap.nr_entries = E820MAX;
@@ -762,12 +803,15 @@ char * __init xen_memory_setup(void)
* is limited to the max size of lowmem, so that it doesn't
* get completely filled.
*
+ * Make sure we have no memory above max_pages, as this area
+ * isn't handled by the p2m management.
+ *
* In principle there could be a problem in lowmem systems if
* the initial memory is also very large with respect to
* lowmem, but we won't try to deal with that here.
*/
- extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
- extra_pages);
+ extra_pages = min3(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
+ extra_pages, max_pages - max_pfn);
i = 0;
addr = xen_e820_map[0].addr;
size = xen_e820_map[0].size;
@@ -803,9 +847,6 @@ char * __init xen_memory_setup(void)
/*
* Set the rest as identity mapped, in case PCI BARs are
* located here.
- *
- * PFNs above MAX_P2M_PFN are considered identity mapped as
- * well.
*/
set_phys_range_identity(addr / PAGE_SIZE, ~0ul);

--
2.1.4

2015-06-08 12:07:52

by Jürgen Groß

[permalink] [raw]
Subject: [Patch V4 16/16] xen: remove no longer needed p2m.h

Cleanup by removing arch/x86/xen/p2m.h as it isn't needed any more.

Most definitions in this file are used in p2m.c only. Move those into
p2m.c.

set_phys_range_identity() is already declared in
arch/x86/include/asm/xen/page.h, add __init annotation there.

MAX_REMAP_RANGES isn't used at all, just delete it.

The only define left is P2M_PER_PAGE which is moved to page.h as well.

Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/include/asm/xen/page.h | 6 ++++--
arch/x86/xen/p2m.c | 6 +++++-
arch/x86/xen/p2m.h | 15 ---------------
arch/x86/xen/setup.c | 1 -
4 files changed, 9 insertions(+), 19 deletions(-)
delete mode 100644 arch/x86/xen/p2m.h

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 10ef14b..a3804fb 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -35,6 +35,8 @@ typedef struct xpaddr {
#define FOREIGN_FRAME(m) ((m) | FOREIGN_FRAME_BIT)
#define IDENTITY_FRAME(m) ((m) | IDENTITY_FRAME_BIT)

+#define P2M_PER_PAGE (PAGE_SIZE / sizeof(unsigned long))
+
extern unsigned long *machine_to_phys_mapping;
extern unsigned long machine_to_phys_nr;
extern unsigned long *xen_p2m_addr;
@@ -44,8 +46,8 @@ extern unsigned long xen_max_p2m_pfn;
extern unsigned long get_phys_to_machine(unsigned long pfn);
extern bool set_phys_to_machine(unsigned long pfn, unsigned long mfn);
extern bool __set_phys_to_machine(unsigned long pfn, unsigned long mfn);
-extern unsigned long set_phys_range_identity(unsigned long pfn_s,
- unsigned long pfn_e);
+extern unsigned long __init set_phys_range_identity(unsigned long pfn_s,
+ unsigned long pfn_e);

extern int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
struct gnttab_map_grant_ref *kmap_ops,
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 365a64a..1f63ad2 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -78,10 +78,14 @@
#include <xen/balloon.h>
#include <xen/grant_table.h>

-#include "p2m.h"
#include "multicalls.h"
#include "xen-ops.h"

+#define P2M_MID_PER_PAGE (PAGE_SIZE / sizeof(unsigned long *))
+#define P2M_TOP_PER_PAGE (PAGE_SIZE / sizeof(unsigned long **))
+
+#define MAX_P2M_PFN (P2M_TOP_PER_PAGE * P2M_MID_PER_PAGE * P2M_PER_PAGE)
+
#define PMDS_PER_MID_PAGE (P2M_MID_PER_PAGE / PTRS_PER_PTE)

unsigned long *xen_p2m_addr __read_mostly;
diff --git a/arch/x86/xen/p2m.h b/arch/x86/xen/p2m.h
deleted file mode 100644
index ad8aee2..0000000
--- a/arch/x86/xen/p2m.h
+++ /dev/null
@@ -1,15 +0,0 @@
-#ifndef _XEN_P2M_H
-#define _XEN_P2M_H
-
-#define P2M_PER_PAGE (PAGE_SIZE / sizeof(unsigned long))
-#define P2M_MID_PER_PAGE (PAGE_SIZE / sizeof(unsigned long *))
-#define P2M_TOP_PER_PAGE (PAGE_SIZE / sizeof(unsigned long **))
-
-#define MAX_P2M_PFN (P2M_TOP_PER_PAGE * P2M_MID_PER_PAGE * P2M_PER_PAGE)
-
-#define MAX_REMAP_RANGES 10
-
-extern unsigned long __init set_phys_range_identity(unsigned long pfn_s,
- unsigned long pfn_e);
-
-#endif /* _XEN_P2M_H */
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index f960021..a1a77ea 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -30,7 +30,6 @@
#include <xen/hvc-console.h>
#include "xen-ops.h"
#include "vdso.h"
-#include "p2m.h"
#include "mmu.h"

#define GB(x) ((uint64_t)(x) * 1024 * 1024 * 1024)
--
2.1.4

2015-06-08 12:16:36

by David Vrabel

[permalink] [raw]
Subject: Re: [Patch V4 13/16] xen: add explicit memblock_reserve() calls for special pages

On 08/06/15 13:06, Juergen Gross wrote:
> Some special pages containing interfaces to xen are being reserved
> implicitly only today. The memblock_reserve() call to reserve them is
> meant to reserve the p2m list supplied by xen. It is just reserving
> not only the p2m list itself, but some more pages up to the start of
> the xen built page tables.
>
> To be able to move the p2m list to another pfn range, which is needed
> for support of huge RAM, this memblock_reserve() must be split up to
> cover all affected reserved pages explicitly.
>
> The affected pages are:
> - start_info page
> - xenstore ring
> - console ring
> - shared_info page

Reviewed-by: David Vrabel <[email protected]>

David

2015-06-08 13:42:58

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [Patch V4 00/16] xen: support pv-domains larger than 512GB

On 08/06/15 13:06, Juergen Gross wrote:
> Support 64 bit pv-domains with more than 512GB of memory.
>
> Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
> 8GB machine. Conflicts between E820 map and different hypervisor populated
> memory areas have been tested via a fake E820 map reserved area on the
> 8GB machine.

Do you have a git tree I can pull this from?

David

2015-06-08 13:46:04

by Jürgen Groß

[permalink] [raw]
Subject: Re: [Xen-devel] [Patch V4 00/16] xen: support pv-domains larger than 512GB

On 06/08/2015 03:42 PM, David Vrabel wrote:
> On 08/06/15 13:06, Juergen Gross wrote:
>> Support 64 bit pv-domains with more than 512GB of memory.
>>
>> Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
>> 8GB machine. Conflicts between E820 map and different hypervisor populated
>> memory areas have been tested via a fake E820 map reserved area on the
>> 8GB machine.
>
> Do you have a git tree I can pull this from?

No, I don't. Sorry for that.

Juergen

2015-06-10 15:27:01

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [Patch V4 00/16] xen: support pv-domains larger than 512GB

On 08/06/15 13:06, Juergen Gross wrote:
> Support 64 bit pv-domains with more than 512GB of memory.
>
> Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
> 8GB machine. Conflicts between E820 map and different hypervisor populated
> memory areas have been tested via a fake E820 map reserved area on the
> 8GB machine.

Applied to for-linus-4.2, thanks.

Boris or Konrad, can you kick of a test run with this branch, please?

David

2015-06-10 15:31:42

by Jürgen Groß

[permalink] [raw]
Subject: Re: [Xen-devel] [Patch V4 00/16] xen: support pv-domains larger than 512GB

On 06/10/2015 05:25 PM, David Vrabel wrote:
> On 08/06/15 13:06, Juergen Gross wrote:
>> Support 64 bit pv-domains with more than 512GB of memory.
>>
>> Tested with 64 bit dom0 on machines with 8GB and 1TB and 32 bit dom0 on a
>> 8GB machine. Conflicts between E820 map and different hypervisor populated
>> memory areas have been tested via a fake E820 map reserved area on the
>> 8GB machine.
>
> Applied to for-linus-4.2, thanks.

Thanks. It was a long way to go... :-)


Juergen

>
> Boris or Konrad, can you kick of a test run with this branch, please?
>
> David
>
>

2015-06-11 11:04:02

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [Patch V4 14/16] xen: move p2m list if conflicting with e820 map

On 08/06/15 13:06, Juergen Gross wrote:
> Check whether the hypervisor supplied p2m list is placed at a location
> which is conflicting with the target E820 map. If this is the case
> relocate it to a new area unused up to now and compliant to the E820
> map.
>
> As the p2m list might by huge (up to several GB) and is required to be
> mapped virtually, set up a temporary mapping for the copied list.
>
> For pvh domains just delete the p2m related information from start
> info instead of reserving the p2m memory, as we don't need it at all.
>
> For 32 bit kernels adjust the memblock_reserve() parameters in order
> to cover the page tables only. This requires to memblock_reserve() the
> start_info page on it's own.

This commit breaks 32-bit PV domUs.

I've dropped this whole series for now because:

a) you've clearly not tested this series enough with domUs.
b) patch #12 (mm: provide early_memremap_ro to establish read-only
mapping) is lacking an ack from an MM maintainer (sorry for not noticing
this earlier).

Please try again for 4.3.

David

[ 0.050840] Unpacking initramfs...
[ 0.050979] BUG: unable to handle kernel paging request at c22c2008
[ 0.050986] IP: [<c128373d>] memcpy+0x1d/0x40
[ 0.050992] *pdpt = 0000000001e97027 *pde = 00000000022c3067 *pte =
80000000022c2061
[ 0.051002] Oops: 0003 [#1] SMP
[ 0.051008] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.1.0-rc5.davidvr #11
[ 0.051013] task: df498000 ti: df492000 task.ti: df492000
[ 0.051018] EIP: e019:[<c128373d>] EFLAGS: 00010206 CPU: 0
[ 0.051023] EIP is at memcpy+0x1d/0x40
[ 0.051027] EAX: c22c2000 EBX: 00001000 ECX: 000003fe EDX: c20ca160
[ 0.051032] ESI: c20ca168 EDI: c22c2008 EBP: df493ce0 ESP: df493cd4
[ 0.051037] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: e021
[ 0.051042] CR0: 80050033 CR2: c22c2008 CR3: 016d2000 CR4: 00042660
[ 0.051049] Stack:
[ 0.051052] 00001000 00001000 00000000 df493d04 c1288df8 c111f6c5
df493de8 c22c2000
[ 0.051062] c22c3000 00001000 00001000 00000000 df493d58 c110a5e5
00001000 00000000
[ 0.052004] 00001000 00000001 df493d44 df493d48 df498000 00000001
c146fbe0 df5a7000
[ 0.052004] Call Trace:
[ 0.052004] [<c1288df8>] iov_iter_copy_from_user_atomic+0xd8/0x190
[ 0.052004] [<c111f6c5>] ? shmem_write_begin+0x55/0x90
[ 0.052004] [<c110a5e5>] generic_perform_write+0xc5/0x1a0
[ 0.052004] [<c110a852>] __generic_file_write_iter+0x192/0x1d0
[ 0.052004] [<c110a99c>] generic_file_write_iter+0x10c/0x300
[ 0.052004] [<c145e2f7>] ? __mutex_unlock_slowpath+0xb7/0x1f0
[ 0.052004] [<c10a3ae1>] ? lock_acquire+0xc1/0x240
[ 0.052004] [<c114a8f9>] __vfs_write+0xa9/0xe0
[ 0.052004] [<c114aaa0>] vfs_write+0x80/0x150
[ 0.052004] [<c11674f2>] ? __fdget+0x12/0x20
[ 0.052004] [<c114ac88>] SyS_write+0x58/0xc0
[ 0.052004] [<c162db20>] xwrite+0x27/0x53
[ 0.052004] [<c162db70>] do_copy+0x24/0xde
[ 0.052004] [<c162d6a3>] write_buffer+0x1e/0x2d
[ 0.052004] [<c162d931>] unpack_to_rootfs+0xe6/0x2ae
[ 0.052004] [<c10ad997>] ? vprintk_default+0x37/0x40
[ 0.052004] [<c162e049>] populate_rootfs+0x51/0x9b
[ 0.052004] [<c162cda4>] do_one_initcall+0x110/0x1bb
[ 0.052004] [<c162dff8>] ? maybe_link.part.1+0xe9/0xe9
[ 0.052004] [<c107c068>] ? parameq+0x18/0x70
[ 0.052004] [<c162c5ac>] ? repair_env_string+0x12/0x51
[ 0.052004] [<c162dff8>] ? maybe_link.part.1+0xe9/0xe9
[ 0.052004] [<c107c349>] ? parse_args+0x289/0x4b0
[ 0.052004] [<c14601de>] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[ 0.052004] [<c1072ce3>] ? __usermodehelper_set_disable_depth+0x43/0x50