The observed (by the human eye) performance difference of early boot
between native and PV-on-Xen was just too large to not look into. As
it turns out, gaining performance back wasn't all that difficult.
While the series (re)introduces a small number of PTWR emulations on
the boot path (from phys_pte_init()), there has been a much larger
number of them post-boot. Hence I think if this was of concern, the
post-boot instances would want eliminating first.
Some of the later changes aren't directly related to the main goal of
the series; these address aspects noticed while doing the investigation.
1: streamline set_pte_mfn()
2: restore (fix) xen_set_pte_init() behavior
3: adjust xen_set_fixmap()
4: adjust handling of the L3 user vsyscall special page table
5: there's no highmem anymore in PV mode
6: restrict PV Dom0 identity mapping
Jan
Considerations for highmem are a leftover from when 32-bit was still
supported; there is no highmem in 64-bit PV mode.
Signed-off-by: Jan Beulich <[email protected]>
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -306,10 +306,6 @@ static void __init xen_update_mem_tables
BUG();
}
- /* Update kernel mapping, but not for highmem. */
- if (pfn >= PFN_UP(__pa(high_memory - 1)))
- return;
-
if (HYPERVISOR_update_va_mapping((unsigned long)__va(pfn << PAGE_SHIFT),
mfn_pte(mfn, PAGE_KERNEL), 0)) {
WARN(1, "Failed to update kernel mapping for mfn=%ld pfn=%ld\n",
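For context: the removed check could only matter with CONFIG_HIGHMEM on
32-bit, where pages above high_memory have no permanent kernel mapping.
A minimal sketch of the function's shape (P2M/M2P updates elided; the
helper name and the IS_ENABLED() guard are purely illustrative of the
old intent, not a suggested alternative):

static void __init update_mem_tables_sketch(unsigned long pfn,
					    unsigned long mfn)
{
	/* ... P2M and M2P updates, unchanged by this patch ... */

	/* Removed check: can only exclude pfns on 32-bit with highmem;
	 * on 64-bit PV all RAM is covered by the direct map. */
	if (IS_ENABLED(CONFIG_HIGHMEM) &&
	    pfn >= PFN_UP(__pa(high_memory - 1)))
		return;

	if (HYPERVISOR_update_va_mapping((unsigned long)__va(pfn << PAGE_SHIFT),
					 mfn_pte(mfn, PAGE_KERNEL), 0))
		WARN(1, "Failed to update kernel mapping for mfn=%ld pfn=%ld\n",
		     mfn, pfn);
}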
Using __native_set_fixmap() here means guaranteed trap-and-emulate
instances which the hypervisor has to deal with. Since the virtual
address covered by the to-be-adjusted page table entry is easy to
determine (and is actually already obtained in a special case), simply
use an available, easy-to-invoke hypercall instead.
Signed-off-by: Jan Beulich <[email protected]>
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2010,6 +2010,7 @@ static unsigned char dummy_mapping[PAGE_
static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
{
pte_t pte;
+ unsigned long vaddr;
phys >>= PAGE_SHIFT;
@@ -2050,15 +2051,15 @@ static void xen_set_fixmap(unsigned idx,
break;
}
- __native_set_fixmap(idx, pte);
+ vaddr = __fix_to_virt(idx);
+ if (HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG))
+ BUG();
#ifdef CONFIG_X86_VSYSCALL_EMULATION
/* Replicate changes to map the vsyscall page into the user
pagetable vsyscall mapping. */
- if (idx == VSYSCALL_PAGE) {
- unsigned long vaddr = __fix_to_virt(idx);
+ if (idx == VSYSCALL_PAGE)
set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
- }
#endif
}
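To make the "easy to determine" point concrete: fixmap slots hang
downwards from FIXADDR_TOP, so converting an index to its VA is plain
arithmetic and needs no page table walk. This is the existing generic
definition, quoted for reference:

#define __fix_to_virt(x)	(FIXADDR_TOP - ((x) << PAGE_SHIFT))

__native_set_fixmap(), by contrast, resolves and writes the PTE
directly, which for a PV guest (whose page tables are mapped read-only)
means a fault that Xen has to trap and emulate.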
Marking the page table as pinned without ever actually pinning it was
probably an oversight in the first place. The main reason for the change
is more subtle, though: the writes of the single present entry, both
here and in the subsequently allocated L2 table, engage a code path in
the hypervisor which exists only for thought-to-be-broken guests: an
mmu-update operation to a page which is neither a page table nor marked
writable. The hypervisor merely assumes (or should I say "hopes") that
the fact that a writable reference to the page can be obtained means it
is okay to actually write to that page in response to such a hypercall.
While there, make all involved code and data dependent upon
X86_VSYSCALL_EMULATION (some of the code already was).
Signed-off-by: Jan Beulich <[email protected]>
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -86,8 +86,10 @@
#include "mmu.h"
#include "debugfs.h"
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
/* l3 pud for userspace vsyscall mapping */
static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
+#endif
/*
* Protects atomic reservation decrease/increase against concurrent increases.
@@ -791,7 +793,9 @@ static void __init xen_mark_pinned(struc
static void __init xen_after_bootmem(void)
{
static_branch_enable(&xen_struct_pages_ready);
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
SetPagePinned(virt_to_page(level3_user_vsyscall));
+#endif
xen_pgd_walk(&init_mm, xen_mark_pinned, FIXADDR_TOP);
}
@@ -1761,7 +1765,6 @@ void __init xen_setup_kernel_pagetable(p
set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
- set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
@@ -1778,6 +1781,13 @@ void __init xen_setup_kernel_pagetable(p
/* Unpin Xen-provided one */
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+ /* Pin user vsyscall L3 */
+ set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+ pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
+ PFN_DOWN(__pa_symbol(level3_user_vsyscall)));
+#endif
+
/*
* At this stage there can be no user pgd, and no page structure to
* attach it to, so make sure we just set kernel pgd.
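For reference, pin_pagetable_pfn() is the pre-existing helper in
mmu_pv.c (reproduced here from the tree, modulo possible annotation
differences); pinning makes Xen validate the table and lock it down, so
subsequent updates take the proper page-table paths instead of the
writable-page fallback described above:

static void pin_pagetable_pfn(unsigned int cmd, unsigned long pfn)
{
	struct mmuext_op op;

	op.cmd = cmd;
	op.arg1.mfn = pfn_to_mfn(pfn);
	if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
		BUG();
}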
When moving RAM pages away, the prior existence of a mapping of those
pages is not a proper indication that MMIO should be mapped there
instead. At this point in time this effectively covers the low megabyte
only; mapping that, however, is the job of init_mem_mapping(). Comparing
the two, one can also spot that we've been wrongly (or at least
inconsistently) using PAGE_KERNEL_IO here.
Simply zap any such mappings instead.
Signed-off-by: Jan Beulich <[email protected]>
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -425,13 +425,13 @@ static unsigned long __init xen_set_iden
}
/*
- * If the PFNs are currently mapped, the VA mapping also needs
- * to be updated to be 1:1.
+ * If the PFNs are currently mapped, their VA mappings need to be
+ * zapped.
*/
for (pfn = start_pfn; pfn <= max_pfn_mapped && pfn < end_pfn; pfn++)
(void)HYPERVISOR_update_va_mapping(
(unsigned long)__va(pfn << PAGE_SHIFT),
- mfn_pte(pfn, PAGE_KERNEL_IO), 0);
+ native_make_pte(0), 0);
return remap_pfn;
}
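To annotate the replacement call (no further change implied):
native_make_pte(0) builds the canonical empty, i.e. non-present, entry,
so the hypercall tears the VA->MFN translation down rather than
redirecting it; the low megabyte then gets established later by
init_mem_mapping() with regular attributes:

	/* Empty PTE: not present, so this removes the translation.
	 * Failure (VA wasn't mapped) is fine, hence the (void). */
	(void)HYPERVISOR_update_va_mapping(
		(unsigned long)__va(pfn << PAGE_SHIFT),
		native_make_pte(0), 0);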
In preparation for restoring xen_set_pte_init()'s original behavior of
avoiding hypercalls, make set_pte_mfn() no longer use the standard
set_pte() code path. That one is more complicated than the alternative
of simply using an available hypercall directly. This way we can avoid
introducing a fair number (2k on my test system) of cases where the
hypervisor would trap-and-emulate page table updates.
Signed-off-by: Jan Beulich <[email protected]>
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -241,9 +241,11 @@ static void xen_set_pmd(pmd_t *ptr, pmd_
* Associate a virtual page frame with a given physical page frame
* and protection flags for that frame.
*/
-void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
+void __init set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
{
- set_pte_vaddr(vaddr, mfn_pte(mfn, flags));
+ if (HYPERVISOR_update_va_mapping(vaddr, mfn_pte(mfn, flags),
+ UVMF_INVLPG))
+ BUG();
}
static bool xen_batched_set_pte(pte_t *ptep, pte_t pteval)
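To substantiate "more complicated", the two paths roughly compare as
follows (call chain paraphrased from the generic mm code, details
elided):

	/*
	 * Old path:
	 *   set_pte_vaddr(vaddr, pte)
	 *     -> walk pgd/p4d/pud/pmd of init_mm to locate ptep,
	 *        allocating intermediate tables if necessary
	 *     -> set_pte(ptep, pte)        [paravirt: xen_set_pte()]
	 *        -> plain PTE write; PV page tables are read-only,
	 *           so Xen traps and emulates the write (PTWR)
	 *
	 * New path:
	 *   HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG)
	 *     -> one hypercall naming the VA; Xen does the walk,
	 *        the update, and the TLB flush itself
	 */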
On 9/30/21 8:33 AM, Jan Beulich wrote:
> The observed (by the human eye) performance difference of early boot
> between native and PV-on-Xen was just too large to not look into. As
> it turns out, gaining performance back wasn't all that difficult.
>
> While the series (re)introduces a small number of PTWR emulations on
> the boot path (from phys_pte_init()), there has been a much larger
> number of them post-boot. Hence I think if this was of concern, the
> post-boot instances would want eliminating first.
>
> Some of the later changes aren't directly related to the main goal of
> the series; these address aspects noticed while doing the investigation.
>
> 1: streamline set_pte_mfn()
> 2: restore (fix) xen_set_pte_init() behavior
> 3: adjust xen_set_fixmap()
> 4: adjust handling of the L3 user vsyscall special page table
> 5: there's no highmem anymore in PV mode
> 6: restrict PV Dom0 identity mapping
For the series:
Reviewed-by: Boris Ostrovsky <[email protected]>
On 9/30/21 8:33 AM, Jan Beulich wrote:
> The observed (by the human eye) performance difference of early boot
> between native and PV-on-Xen was just too large to not look into. As
> it turns out, gaining performance back wasn't all that difficult.
>
> While the series (re)introduces a small number of PTWR emulations on
> the boot path (from phys_pte_init()), there has been a much larger
> number of them post-boot. Hence I think if this was of concern, the
> post-boot instances would want eliminating first.
>
> Some of the later changes aren't directly related to the main goal of
> the series; these address aspects noticed while doing the investigation.
>
> 1: streamline set_pte_mfn()
> 2: restore (fix) xen_set_pte_init() behavior
> 3: adjust xen_set_fixmap()
> 4: adjust handling of the L3 user vsyscall special page table
> 5: there's no highmem anymore in PV mode
> 6: restrict PV Dom0 identity mapping
Applied to for-linus-5.16
-boris