2024-04-04 14:49:28

by Ryan Roberts

Subject: [PATCH v2 0/4] Speed up boot with faster linear map creation

Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full. And most of the time is spent
waiting on superfluous TLB invalidation and memory barriers. This series reworks
the kernel pgtable generation code to significantly reduce the number of those
TLBIs, ISBs and DSBs. See each patch for details.

The table below shows the execution time of map_mem() across a few different
systems with different RAM configurations. We measure after applying each patch
and show the improvement relative to base (v6.9-rc2):

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)

This series applies on top of v6.9-rc2. All mm selftests pass. I've compile-
and boot-tested various PAGE_SIZE and VA size configs.

---

Changes since v1 [1]
====================

- Added Tested-by tags (thanks to Eric and Itaru)
- Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
- Reordered patches (biggest impact & least controversial first)
- Reordered alloc/map/unmap functions in mmu.c to aid reader
- pte_clear() -> __pte_clear() in clear_fixmap_nosync()
- Reverted the generic p4d_index(), which caused an x86 build error; replaced
  it with an unconditional p4d_index() define under arm64.


[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/

Thanks,
Ryan


Ryan Roberts (4):
arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
arm64: mm: Batch dsb and isb when populating pgtables
arm64: mm: Don't remap pgtables for allocate vs populate
arm64: mm: Lazily clear pte table mappings from fixmap

arch/arm64/include/asm/fixmap.h | 5 +-
arch/arm64/include/asm/mmu.h | 8 +
arch/arm64/include/asm/pgtable.h | 13 +-
arch/arm64/kernel/cpufeature.c | 10 +-
arch/arm64/mm/fixmap.c | 11 +
arch/arm64/mm/mmu.c | 377 +++++++++++++++++++++++--------
6 files changed, 319 insertions(+), 105 deletions(-)

--
2.25.1



2024-04-04 14:49:42

by Ryan Roberts

Subject: [PATCH v2 1/4] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block

A large part of the kernel boot time is spent creating the kernel linear map
page tables. When rodata=full, all memory is mapped by pte. And when there is
lots of physical RAM, there are lots of pte tables to populate. The primary
cost associated with this is mapping and unmapping the pte table memory in
the fixmap; at unmap time, the TLB entry must be invalidated, and this is
expensive.

Previously, each pmd and pte table was fixmapped/fixunmapped for each
cont(pte|pmd) block of mappings (16 entries with 4K granule). This meant we
ended up issuing 32 TLBIs per (pmd|pte) table during the population phase.

Let's fix that, and fixmap/fixunmap each page once per population, for a
saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
speedup.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
before         |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
after          |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/mm/mmu.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..fd91b5bdb514 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
return ((old ^ new) & ~mask) == 0;
}

-static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
- phys_addr_t phys, pgprot_t prot)
+static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
+ phys_addr_t phys, pgprot_t prot)
{
- pte_t *ptep;
-
- ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pte_t old_pte = __ptep_get(ptep);

@@ -193,7 +190,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
phys += PAGE_SIZE;
} while (ptep++, addr += PAGE_SIZE, addr != end);

- pte_clear_fixmap();
+ return ptep;
}

static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
@@ -204,6 +201,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
{
unsigned long next;
pmd_t pmd = READ_ONCE(*pmdp);
+ pte_t *ptep;

BUG_ON(pmd_sect(pmd));
if (pmd_none(pmd)) {
@@ -219,6 +217,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
}
BUG_ON(pmd_bad(pmd));

+ ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pgprot_t __prot = prot;

@@ -229,20 +228,20 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pte(pmdp, addr, next, phys, __prot);
+ ptep = init_pte(ptep, addr, next, phys, __prot);

phys += next - addr;
} while (addr = next, addr != end);
+
+ pte_clear_fixmap();
}

-static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
- phys_addr_t phys, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int), int flags)
+static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
+ phys_addr_t phys, pgprot_t prot,
+ phys_addr_t (*pgtable_alloc)(int), int flags)
{
unsigned long next;
- pmd_t *pmdp;

- pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pmd_t old_pmd = READ_ONCE(*pmdp);

@@ -269,7 +268,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
phys += next - addr;
} while (pmdp++, addr = next, addr != end);

- pmd_clear_fixmap();
+ return pmdp;
}

static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
@@ -279,6 +278,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
{
unsigned long next;
pud_t pud = READ_ONCE(*pudp);
+ pmd_t *pmdp;

/*
* Check for initial section mappings in the pgd/pud.
@@ -297,6 +297,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
}
BUG_ON(pud_bad(pud));

+ pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pgprot_t __prot = prot;

@@ -307,10 +308,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
+ pmdp = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc,
+ flags);

phys += next - addr;
} while (addr = next, addr != end);
+
+ pmd_clear_fixmap();
}

static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
--
2.25.1


2024-04-04 14:49:51

by Ryan Roberts

Subject: [PATCH v2 2/4] arm64: mm: Batch dsb and isb when populating pgtables

After removing unnecessary TLBIs, the next bottleneck when creating the page
tables for the linear map is the DSB and ISB, which were previously issued
per-pte in __set_pte(). Since we are writing multiple ptes in a given pte
table, we can elide these barriers and issue them once we have finished
writing to the table.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
before         |   77   (0%) |  431   (0%) | 1727   (0%) |  3796   (0%)
after          |   13 (-84%) |  162 (-62%) |  655 (-62%) |  1656 (-56%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 ++++++-
arch/arm64/mm/mmu.c | 13 ++++++++++++-
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..105a95a8845c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}

-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);
+}
+
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+ __set_pte_nosync(ptep, pte);

/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fd91b5bdb514..dc86dceb0efe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
do {
pte_t old_pte = __ptep_get(ptep);

- __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+ /*
+ * Required barriers to make this visible to the table walker
+ * are deferred to the end of alloc_init_cont_pte().
+ */
+ __set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
* After the PTE entry has been populated once, we
@@ -234,6 +238,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
} while (addr = next, addr != end);

pte_clear_fixmap();
+
+ /*
+ * Ensure all previous pgtable writes are visible to the table walker.
+ * See init_pte().
+ */
+ dsb(ishst);
+ isb();
}

static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
--
2.25.1


2024-04-04 14:50:11

by Ryan Roberts

Subject: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

During linear map pgtable creation, each pgtable is fixmapped/fixunmapped
twice: once during allocation to zero the memory, and again during population
to write the entries. This means each table has 2 TLB invalidations issued
against it. Let's fix this so that each table is only fixmapped/fixunmapped
once, halving the number of TLBIs, and improving performance.

Achieve this by abstracting pgtable allocate, map and unmap operations
out of the main pgtable population loop code and into a `struct
pgtable_ops` function pointer structure. This allows us to formalize the
semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
finished. So "map" is only performed (and also matched by "unmap") if
the pgtable has already been allocated.

As a side effect of this refactoring, we no longer need to use the
fixmap at all once pages have been mapped in the linear map because
their "map" operation can simply do a __va() translation. So with this
change, we are down to 1 TLBI per table when doing early pgtable
manipulations, and 0 TLBIs when doing late pgtable manipulations.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
before         |   13   (0%) |  162   (0%) |  655   (0%) |  1656   (0%)
after          |   11 (-15%) |  109 (-33%) |  449 (-31%) |  1257 (-24%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/include/asm/mmu.h | 8 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/kernel/cpufeature.c | 10 +-
arch/arm64/mm/mmu.c | 308 ++++++++++++++++++++++---------
4 files changed, 237 insertions(+), 91 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 65977c7783c5..ae44353010e8 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -109,6 +109,14 @@ static inline bool kaslr_requires_kpti(void)
return true;
}

+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+extern
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot,
+ void *(*pgtable_alloc)(int, phys_addr_t *),
+ int flags);
+#endif
+
#define INIT_MM_CONTEXT(name) \
.pgd = swapper_pg_dir,

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 105a95a8845c..92c9aed5e7af 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)

static inline bool pgtable_l5_enabled(void) { return false; }

+#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
+
/* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
#define p4d_set_fixmap(addr) NULL
#define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 56583677c1f2..9a70b1954706 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1866,17 +1866,13 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define KPTI_NG_TEMP_VA (-(1UL << PMD_SHIFT))

-extern
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
- phys_addr_t size, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int), int flags);
-
static phys_addr_t __initdata kpti_ng_temp_alloc;

-static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
+static void *__init kpti_ng_pgd_alloc(int type, phys_addr_t *pa)
{
kpti_ng_temp_alloc -= PAGE_SIZE;
- return kpti_ng_temp_alloc;
+ *pa = kpti_ng_temp_alloc;
+ return __va(kpti_ng_temp_alloc);
}

static int __init __kpti_install_ng_mappings(void *__unused)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dc86dceb0efe..90bf822859b8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -41,9 +41,42 @@
#include <asm/pgalloc.h>
#include <asm/kfence.h>

+enum pgtable_type {
+ TYPE_P4D = 0,
+ TYPE_PUD = 1,
+ TYPE_PMD = 2,
+ TYPE_PTE = 3,
+};
+
+/**
+ * struct pgtable_ops - Ops to allocate and access pgtable memory. Calls must be
+ * serialized by the caller.
+ * @alloc: Allocates 1 page of memory for use as pgtable `type` and maps it
+ * into va space. Returned memory is zeroed. Puts physical address
+ * of page in *pa, and returns virtual address of the mapping. User
+ * must explicitly unmap() before doing another alloc() or map() of
+ * the same `type`.
+ * @map: Determines the physical address of the pgtable of `type` by
+ * interpreting `parent` as the pgtable entry for the next level
+ * up. Maps the page and returns virtual address of the pgtable
+ * entry within the table that corresponds to `addr`. User must
+ * explicitly unmap() before doing another alloc() or map() of the
+ * same `type`.
+ * @unmap: Unmap the currently mapped page of `type`, which will have been
+ * mapped either as a result of a previous call to alloc() or
+ * map(). The page's virtual address must be considered invalid
+ * after this call returns.
+ */
+struct pgtable_ops {
+ void *(*alloc)(int type, phys_addr_t *pa);
+ void *(*map)(int type, void *parent, unsigned long addr);
+ void (*unmap)(int type);
+};
+
#define NO_BLOCK_MAPPINGS BIT(0)
#define NO_CONT_MAPPINGS BIT(1)
#define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
+#define NO_ALLOC BIT(3)

u64 kimage_voffset __ro_after_init;
EXPORT_SYMBOL(kimage_voffset);
@@ -106,34 +139,89 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
}
EXPORT_SYMBOL(phys_mem_access_prot);

-static phys_addr_t __init early_pgtable_alloc(int shift)
+static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
{
- phys_addr_t phys;
- void *ptr;
+ void *va;

- phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
- MEMBLOCK_ALLOC_NOLEAKTRACE);
- if (!phys)
+ *pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
+ MEMBLOCK_ALLOC_NOLEAKTRACE);
+ if (!*pa)
panic("Failed to allocate page table page\n");

- /*
- * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
- * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
- * any level of table.
- */
- ptr = pte_set_fixmap(phys);
+ switch (type) {
+ case TYPE_P4D:
+ va = p4d_set_fixmap(*pa);
+ break;
+ case TYPE_PUD:
+ va = pud_set_fixmap(*pa);
+ break;
+ case TYPE_PMD:
+ va = pmd_set_fixmap(*pa);
+ break;
+ case TYPE_PTE:
+ va = pte_set_fixmap(*pa);
+ break;
+ default:
+ BUG();
+ }
+ memset(va, 0, PAGE_SIZE);

- memset(ptr, 0, PAGE_SIZE);
+ /* Ensure the zeroed page is visible to the page table walker */
+ dsb(ishst);

- /*
- * Implicit barriers also ensure the zeroed page is visible to the page
- * table walker
- */
- pte_clear_fixmap();
+ return va;
+}
+
+static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
+{
+ void *entry;
+
+ switch (type) {
+ case TYPE_P4D:
+ entry = p4d_set_fixmap_offset((pgd_t *)parent, addr);
+ break;
+ case TYPE_PUD:
+ entry = pud_set_fixmap_offset((p4d_t *)parent, addr);
+ break;
+ case TYPE_PMD:
+ entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
+ break;
+ case TYPE_PTE:
+ entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
+ break;
+ default:
+ BUG();
+ }

- return phys;
+ return entry;
+}
+
+static void __init early_pgtable_unmap(int type)
+{
+ switch (type) {
+ case TYPE_P4D:
+ p4d_clear_fixmap();
+ break;
+ case TYPE_PUD:
+ pud_clear_fixmap();
+ break;
+ case TYPE_PMD:
+ pmd_clear_fixmap();
+ break;
+ case TYPE_PTE:
+ pte_clear_fixmap();
+ break;
+ default:
+ BUG();
+ }
}

+static struct pgtable_ops early_pgtable_ops __initdata = {
+ .alloc = early_pgtable_alloc,
+ .map = early_pgtable_map,
+ .unmap = early_pgtable_unmap,
+};
+
bool pgattr_change_is_safe(u64 old, u64 new)
{
/*
@@ -200,7 +288,7 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int),
+ struct pgtable_ops *ops,
int flags)
{
unsigned long next;
@@ -214,14 +302,15 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,

if (flags & NO_EXEC_MAPPINGS)
pmdval |= PMD_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- pte_phys = pgtable_alloc(PAGE_SHIFT);
+ BUG_ON(flags & NO_ALLOC);
+ ptep = ops->alloc(TYPE_PTE, &pte_phys);
+ ptep += pte_index(addr);
__pmd_populate(pmdp, pte_phys, pmdval);
- pmd = READ_ONCE(*pmdp);
+ } else {
+ BUG_ON(pmd_bad(pmd));
+ ptep = ops->map(TYPE_PTE, pmdp, addr);
}
- BUG_ON(pmd_bad(pmd));

- ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pgprot_t __prot = prot;

@@ -237,7 +326,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
phys += next - addr;
} while (addr = next, addr != end);

- pte_clear_fixmap();
+ ops->unmap(TYPE_PTE);

/*
* Ensure all previous pgtable writes are visible to the table walker.
@@ -249,7 +338,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,

static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int), int flags)
+ struct pgtable_ops *ops, int flags)
{
unsigned long next;

@@ -271,7 +360,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
READ_ONCE(pmd_val(*pmdp))));
} else {
alloc_init_cont_pte(pmdp, addr, next, phys, prot,
- pgtable_alloc, flags);
+ ops, flags);

BUG_ON(pmd_val(old_pmd) != 0 &&
pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
@@ -285,7 +374,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int), int flags)
+ struct pgtable_ops *ops, int flags)
{
unsigned long next;
pud_t pud = READ_ONCE(*pudp);
@@ -301,14 +390,15 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,

if (flags & NO_EXEC_MAPPINGS)
pudval |= PUD_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- pmd_phys = pgtable_alloc(PMD_SHIFT);
+ BUG_ON(flags & NO_ALLOC);
+ pmdp = ops->alloc(TYPE_PMD, &pmd_phys);
+ pmdp += pmd_index(addr);
__pud_populate(pudp, pmd_phys, pudval);
- pud = READ_ONCE(*pudp);
+ } else {
+ BUG_ON(pud_bad(pud));
+ pmdp = ops->map(TYPE_PMD, pudp, addr);
}
- BUG_ON(pud_bad(pud));

- pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pgprot_t __prot = prot;

@@ -319,18 +409,17 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- pmdp = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc,
- flags);
+ pmdp = init_pmd(pmdp, addr, next, phys, __prot, ops, flags);

phys += next - addr;
} while (addr = next, addr != end);

- pmd_clear_fixmap();
+ ops->unmap(TYPE_PMD);
}

static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int),
+ struct pgtable_ops *ops,
int flags)
{
unsigned long next;
@@ -343,14 +432,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,

if (flags & NO_EXEC_MAPPINGS)
p4dval |= P4D_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- pud_phys = pgtable_alloc(PUD_SHIFT);
+ BUG_ON(flags & NO_ALLOC);
+ pudp = ops->alloc(TYPE_PUD, &pud_phys);
+ pudp += pud_index(addr);
__p4d_populate(p4dp, pud_phys, p4dval);
- p4d = READ_ONCE(*p4dp);
+ } else {
+ BUG_ON(p4d_bad(p4d));
+ pudp = ops->map(TYPE_PUD, p4dp, addr);
}
- BUG_ON(p4d_bad(p4d));

- pudp = pud_set_fixmap_offset(p4dp, addr);
do {
pud_t old_pud = READ_ONCE(*pudp);

@@ -372,7 +462,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
READ_ONCE(pud_val(*pudp))));
} else {
alloc_init_cont_pmd(pudp, addr, next, phys, prot,
- pgtable_alloc, flags);
+ ops, flags);

BUG_ON(pud_val(old_pud) != 0 &&
pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
@@ -380,12 +470,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys += next - addr;
} while (pudp++, addr = next, addr != end);

- pud_clear_fixmap();
+ ops->unmap(TYPE_PUD);
}

static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int),
+ struct pgtable_ops *ops,
int flags)
{
unsigned long next;
@@ -398,21 +488,21 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,

if (flags & NO_EXEC_MAPPINGS)
pgdval |= PGD_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- p4d_phys = pgtable_alloc(P4D_SHIFT);
+ BUG_ON(flags & NO_ALLOC);
+ p4dp = ops->alloc(TYPE_P4D, &p4d_phys);
+ p4dp += p4d_index(addr);
__pgd_populate(pgdp, p4d_phys, pgdval);
- pgd = READ_ONCE(*pgdp);
+ } else {
+ BUG_ON(pgd_bad(pgd));
+ p4dp = ops->map(TYPE_P4D, pgdp, addr);
}
- BUG_ON(pgd_bad(pgd));

- p4dp = p4d_set_fixmap_offset(pgdp, addr);
do {
p4d_t old_p4d = READ_ONCE(*p4dp);

next = p4d_addr_end(addr, end);

- alloc_init_pud(p4dp, addr, next, phys, prot,
- pgtable_alloc, flags);
+ alloc_init_pud(p4dp, addr, next, phys, prot, ops, flags);

BUG_ON(p4d_val(old_p4d) != 0 &&
p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
@@ -420,13 +510,13 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
phys += next - addr;
} while (p4dp++, addr = next, addr != end);

- p4d_clear_fixmap();
+ ops->unmap(TYPE_P4D);
}

static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int),
+ struct pgtable_ops *ops,
int flags)
{
unsigned long addr, end, next;
@@ -445,8 +535,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,

do {
next = pgd_addr_end(addr, end);
- alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
- flags);
+ alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
phys += next - addr;
} while (pgdp++, addr = next, addr != end);
}
@@ -454,36 +543,31 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int),
+ struct pgtable_ops *ops,
int flags)
{
mutex_lock(&fixmap_lock);
__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
- pgtable_alloc, flags);
+ ops, flags);
mutex_unlock(&fixmap_lock);
}

-#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-extern __alias(__create_pgd_mapping_locked)
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
- phys_addr_t size, pgprot_t prot,
- phys_addr_t (*pgtable_alloc)(int), int flags);
-#endif
-
-static phys_addr_t __pgd_pgtable_alloc(int shift)
+static void *__pgd_pgtable_alloc(int type, phys_addr_t *pa)
{
- void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
- BUG_ON(!ptr);
+ void *va = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
+
+ BUG_ON(!va);

/* Ensure the zeroed page is visible to the page table walker */
dsb(ishst);
- return __pa(ptr);
+ *pa = __pa(va);
+ return va;
}

-static phys_addr_t pgd_pgtable_alloc(int shift)
+static void *pgd_pgtable_alloc(int type, phys_addr_t *pa)
{
- phys_addr_t pa = __pgd_pgtable_alloc(shift);
- struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
+ void *va = __pgd_pgtable_alloc(type, pa);
+ struct ptdesc *ptdesc = page_ptdesc(phys_to_page(*pa));

/*
* Call proper page table ctor in case later we need to
@@ -493,13 +577,69 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
* We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
* folded, and if so pagetable_pte_ctor() becomes nop.
*/
- if (shift == PAGE_SHIFT)
+ if (type == TYPE_PTE)
BUG_ON(!pagetable_pte_ctor(ptdesc));
- else if (shift == PMD_SHIFT)
+ else if (type == TYPE_PMD)
BUG_ON(!pagetable_pmd_ctor(ptdesc));

- return pa;
+ return va;
+}
+
+static void *pgd_pgtable_map(int type, void *parent, unsigned long addr)
+{
+ void *entry;
+
+ switch (type) {
+ case TYPE_P4D:
+ entry = p4d_offset((pgd_t *)parent, addr);
+ break;
+ case TYPE_PUD:
+ entry = pud_offset((p4d_t *)parent, addr);
+ break;
+ case TYPE_PMD:
+ entry = pmd_offset((pud_t *)parent, addr);
+ break;
+ case TYPE_PTE:
+ entry = pte_offset_kernel((pmd_t *)parent, addr);
+ break;
+ default:
+ BUG();
+ }
+
+ return entry;
+}
+
+static void pgd_pgtable_unmap(int type)
+{
+}
+
+static struct pgtable_ops pgd_pgtable_ops = {
+ .alloc = pgd_pgtable_alloc,
+ .map = pgd_pgtable_map,
+ .unmap = pgd_pgtable_unmap,
+};
+
+static struct pgtable_ops __pgd_pgtable_ops = {
+ .alloc = __pgd_pgtable_alloc,
+ .map = pgd_pgtable_map,
+ .unmap = pgd_pgtable_unmap,
+};
+
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot,
+ void *(*pgtable_alloc)(int, phys_addr_t *),
+ int flags)
+{
+ struct pgtable_ops ops = {
+ .alloc = pgtable_alloc,
+ .map = pgd_pgtable_map,
+ .unmap = pgd_pgtable_unmap,
+ };
+
+ __create_pgd_mapping_locked(pgdir, phys, virt, size, prot, &ops, flags);
}
+#endif

/*
* This function can only be used to modify existing table entries,
@@ -514,8 +654,8 @@ void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
&phys, virt);
return;
}
- __create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
- NO_CONT_MAPPINGS);
+ __create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+ &early_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);
}

void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
@@ -530,7 +670,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
- pgd_pgtable_alloc, flags);
+ &pgd_pgtable_ops, flags);
}

static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
@@ -542,8 +682,8 @@ static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
return;
}

- __create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
- NO_CONT_MAPPINGS);
+ __create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+ &pgd_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);

/* flush the TLBs after updating live kernel mappings */
flush_tlb_kernel_range(virt, virt + size);
@@ -553,7 +693,7 @@ static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start,
phys_addr_t end, pgprot_t prot, int flags)
{
__create_pgd_mapping(pgdp, start, __phys_to_virt(start), end - start,
- prot, early_pgtable_alloc, flags);
+ prot, &early_pgtable_ops, flags);
}

void __init mark_linear_text_alias_ro(void)
@@ -744,7 +884,7 @@ static int __init map_entry_trampoline(void)
memset(tramp_pg_dir, 0, PGD_SIZE);
__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
entry_tramp_text_size(), prot,
- __pgd_pgtable_alloc, NO_BLOCK_MAPPINGS);
+ &__pgd_pgtable_ops, NO_BLOCK_MAPPINGS);

/* Map both the text and data into the kernel page table */
for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++)
@@ -1346,7 +1486,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
- size, params->pgprot, __pgd_pgtable_alloc,
+ size, params->pgprot, &__pgd_pgtable_ops,
flags);

memblock_clear_nomap(start, size);
--
2.25.1


2024-04-04 14:50:14

by Ryan Roberts

Subject: [PATCH v2 4/4] arm64: mm: Lazily clear pte table mappings from fixmap

With the pgtable operations abstracted into `struct pgtable_ops`, the
early pgtable alloc, map and unmap operations are nicely centralized. So
let's enhance the implementation to speed up the clearing of pte table
mappings in the fixmap.

Extend the fixmap so that we now have 16 slots dedicated to pte tables. At
alloc/map time, we select the next slot in the series and map it. Or, if we
are at the end and no more slots are available, clear down all of the slots
and start at the beginning again. Batching the clear like this means we can
issue the TLBIs more efficiently.

Due to the batching, there may still be some slots mapped at the end, so
address this by adding an optional cleanup() function to `struct pgtable_ops`
to handle this for us.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
before         |   11   (0%) |  109   (0%) |  449   (0%) |  1257   (0%)
after          |    6 (-46%) |   61 (-44%) |  257 (-43%) |   838 (-33%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/include/asm/fixmap.h | 5 +++-
arch/arm64/include/asm/pgtable.h | 4 ---
arch/arm64/mm/fixmap.c | 11 ++++++++
arch/arm64/mm/mmu.c | 44 +++++++++++++++++++++++++++++---
4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 87e307804b99..91fcd7c5c513 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -84,7 +84,9 @@ enum fixed_addresses {
 	 * Used for kernel page table creation, so unmapped memory may be used
 	 * for tables.
 	 */
-	FIX_PTE,
+#define NR_PTE_SLOTS 16
+	FIX_PTE_END,
+	FIX_PTE_BEGIN = FIX_PTE_END + NR_PTE_SLOTS - 1,
 	FIX_PMD,
 	FIX_PUD,
 	FIX_P4D,
@@ -108,6 +110,7 @@ void __init early_fixmap_init(void);
 #define __late_clear_fixmap(idx) __set_fixmap((idx), 0, FIXMAP_PAGE_CLEAR)
 
 extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);
+void __init clear_fixmap_nosync(enum fixed_addresses idx);
 
 #include <asm-generic/fixmap.h>
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 92c9aed5e7af..4c7114d49697 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -691,10 +691,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 /* Find an entry in the third-level page table. */
 #define pte_offset_phys(dir,addr)	(pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))
 
-#define pte_set_fixmap(addr)		((pte_t *)set_fixmap_offset(FIX_PTE, addr))
-#define pte_set_fixmap_offset(pmd, addr)	pte_set_fixmap(pte_offset_phys(pmd, addr))
-#define pte_clear_fixmap()		clear_fixmap(FIX_PTE)
-
 #define pmd_page(pmd)			phys_to_page(__pmd_to_phys(pmd))
 
 /* use ONLY for statically allocated translation tables */
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index de1e09d986ad..0cb09bedeeec 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -131,6 +131,17 @@ void __set_fixmap(enum fixed_addresses idx,
 	}
 }
 
+void __init clear_fixmap_nosync(enum fixed_addresses idx)
+{
+	unsigned long addr = __fix_to_virt(idx);
+	pte_t *ptep;
+
+	BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
+
+	ptep = fixmap_pte(addr);
+	__pte_clear(&init_mm, addr, ptep);
+}
+
 void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
 {
 	const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 90bf822859b8..2e3b594aa23c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -66,11 +66,14 @@ enum pgtable_type {
  *		mapped either as a result of a previous call to alloc() or
  *		map(). The page's virtual address must be considered invalid
  *		after this call returns.
+ * @cleanup:	(Optional) Called at the end of a set of operations to cleanup
+ *		any lazy state.
  */
 struct pgtable_ops {
 	void *(*alloc)(int type, phys_addr_t *pa);
 	void *(*map)(int type, void *parent, unsigned long addr);
 	void (*unmap)(int type);
+	void (*cleanup)(void);
 };
 
 #define NO_BLOCK_MAPPINGS	BIT(0)
@@ -139,9 +142,33 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 }
 EXPORT_SYMBOL(phys_mem_access_prot);
 
+static int pte_slot_next __initdata = FIX_PTE_BEGIN;
+
+static void __init clear_pte_fixmap_slots(void)
+{
+	unsigned long start = __fix_to_virt(FIX_PTE_BEGIN);
+	unsigned long end = __fix_to_virt(pte_slot_next);
+	int i;
+
+	for (i = FIX_PTE_BEGIN; i > pte_slot_next; i--)
+		clear_fixmap_nosync(i);
+
+	flush_tlb_kernel_range(start, end);
+	pte_slot_next = FIX_PTE_BEGIN;
+}
+
+static int __init pte_fixmap_slot(void)
+{
+	if (pte_slot_next < FIX_PTE_END)
+		clear_pte_fixmap_slots();
+
+	return pte_slot_next--;
+}
+
 static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 {
 	void *va;
+	int slot;
 
 	*pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
 					MEMBLOCK_ALLOC_NOLEAKTRACE);
@@ -159,7 +186,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 		va = pmd_set_fixmap(*pa);
 		break;
 	case TYPE_PTE:
-		va = pte_set_fixmap(*pa);
+		slot = pte_fixmap_slot();
+		set_fixmap(slot, *pa);
+		va = (pte_t *)__fix_to_virt(slot);
 		break;
 	default:
 		BUG();
@@ -174,7 +203,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 
 static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
 {
+	phys_addr_t pa;
 	void *entry;
+	int slot;
 
 	switch (type) {
 	case TYPE_P4D:
@@ -187,7 +218,10 @@ static void *__init early_pgtable_map(int type, void *parent, unsigned long addr
 		entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
 		break;
 	case TYPE_PTE:
-		entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
+		slot = pte_fixmap_slot();
+		pa = pte_offset_phys((pmd_t *)parent, addr);
+		set_fixmap(slot, pa);
+		entry = (pte_t *)(__fix_to_virt(slot) + (pa & (PAGE_SIZE - 1)));
 		break;
 	default:
 		BUG();
@@ -209,7 +243,7 @@ static void __init early_pgtable_unmap(int type)
 		pmd_clear_fixmap();
 		break;
 	case TYPE_PTE:
-		pte_clear_fixmap();
+		// Unmap lazily: see clear_pte_fixmap_slots().
 		break;
 	default:
 		BUG();
@@ -220,6 +254,7 @@ static struct pgtable_ops early_pgtable_ops __initdata = {
 	.alloc = early_pgtable_alloc,
 	.map = early_pgtable_map,
 	.unmap = early_pgtable_unmap,
+	.cleanup = clear_pte_fixmap_slots,
 };
 
 bool pgattr_change_is_safe(u64 old, u64 new)
@@ -538,6 +573,9 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 		alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
 		phys += next - addr;
 	} while (pgdp++, addr = next, addr != end);
+
+	if (ops->cleanup)
+		ops->cleanup();
 }
 
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
-- 
2.25.1


2024-04-05 07:40:08

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
> [...]
>

I've built and boot tested the v2 on FVP; the base is taken from your
linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm tests "not ok", would you take a look at it? The mm kselftests used are from your linux-rr repo too.

Thanks,
Itaru.


Attachments:
(No filename) (3.05 kB)
output.log (32.63 kB)

2024-04-06 08:32:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

Hi Itaru,

On 05/04/2024 08:39, Itaru Kitayama wrote:
>> [...]
>
> I've built and boot tested the v2 on FVP; the base is taken from your
> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm tests "not ok", would you take a look at it? The mm kselftests used are from your linux-rr repo too.

Thanks for taking a look at this.

I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:

Config: arm64 defconfig + the following:

# Squashfs for snaps, xfs for large file folios.
./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS

# For general mm debug.
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK

# For mm selftests.
./scripts/config --enable CONFIG_USERFAULTFD
./scripts/config --enable CONFIG_TEST_VMALLOC
./scripts/config --enable CONFIG_GUP_TEST

Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
some mm selftests), with kernel command line to reserve hugetlbs and other
features required by some mm selftests:

"
transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
"

Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
git tree.


Although I don't think any of this config should make a difference to gup_longterm.

Looks like your errors are all "ftruncate() failed". I've seen this problem on
our CI system, where it is due to running the tests from an NFS file system. What
filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
might also be problematic.

Does this problem reproduce with v6.9-rc2, without my patches? I expect it
probably does?

Thanks,
Ryan



2024-04-06 10:31:59

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

Hi Ryan,

On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
> [...]
>
> Looks like your errors are all "ftruncate() failed". I've seen this problem on
> our CI system. There it is due to running the tests from NFS file system. What
> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
> might also be problematic.

That was it. This time I booted up the kernel including your series on
QEMU on my M1 and executed the gup_longterm program without the ftruncate
failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.

Thanks,
Itaru.


2024-04-08 07:30:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 06/04/2024 11:31, Itaru Kitayama wrote:
> Hi Ryan,
>
>> [...]
>
> That was it. This time I booted up the kernel including your series on
> QEMU on my M1 and executed the gup_longterm program without the ftruncate
> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.

I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
the disk? It might be worth enhancing the error log to provide the errno in
tools/testing/selftests/mm/gup_longterm.c.

Thanks,
Ryan



2024-04-09 00:11:33

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation


Hi Ryan,

> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
> the disk? It might be worth enhancing the error log to provide the errno in
> tools/testing/selftests/mm/gup_longterm.c.

Attached is the strace’d gup_longterm execution log on your
pgtable-boot-speedup-v2 kernel.

Thanks,
Itaru.



Attachments:
straced-gup_longterm.log (36.81 kB)

2024-04-09 10:08:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09/04/2024 01:10, Itaru Kitayama wrote:
> Hi Ryan,
>
>> [...]
>>
>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>> the disk? It might be worth enhancing the error log to provide the errno in
>> tools/testing/selftests/mm/gup_longterm.c.
>>
>
> Attached is the strace’d gup_longterm execution log on your
> pgtable-boot-speedup-v2 kernel.

Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
set applied? I thought we previously concluded that it was independent of that?
I was under the impression that it was filesystem related and not something that
I was planning to investigate.



2024-04-09 10:25:10

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation



> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>
>> [...]
>>>>>>> - Reordered patches (biggest impact & least controversial first)
>>>>>>> - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>> - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>> - Reverted generic p4d_index() which caused x86 build error. Replaced with
>>>>>>> unconditional p4d_index() define under arm64.
>>>>>>>
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>> Ryan Roberts (4):
>>>>>>> arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>> arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>> arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>> arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>
>>>>>>> arch/arm64/include/asm/fixmap.h | 5 +-
>>>>>>> arch/arm64/include/asm/mmu.h | 8 +
>>>>>>> arch/arm64/include/asm/pgtable.h | 13 +-
>>>>>>> arch/arm64/kernel/cpufeature.c | 10 +-
>>>>>>> arch/arm64/mm/fixmap.c | 11 +
>>>>>>> arch/arm64/mm/mmu.c | 377 +++++++++++++++++++++++--------
>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.25.1
>>>>>>>
>>>>>>
>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup_longterm tests reporting "not ok"; would you take a look at it? The mm kselftests used are from your linux-rr repo too.
>>>>>
>>>>> Thanks for taking a look at this.
>>>>>
>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple M2 VM:
>>>>>
>>>>> Config: arm64 defconfig + the following:
>>>>>
>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>
>>>>> # For general mm debug.
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>
>>>>> # For mm selftests.
>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>
>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes (needed by
>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>> features required by some mm selftests:
>>>>>
>>>>> "
>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>> "
>>>>>
>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests from same
>>>>> git tree.
>>>>>
>>>>>
>>>>> Although I don't think any of this config should make a difference to gup_longterm.
>>>>>
>>>>> Looks like your errors are all "ftruncate() failed". I've seen this problem on
>>>>> our CI system. There it is due to running the tests from NFS file system. What
>>>>> filesystem are you using? Perhaps you are sharing into the FVP using 9p? That
>>>>> might also be problematic.
>>>>
>>>> That was it. This time I booted up the kernel including your series on
>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>> failures. When testing your kernel on FVP, I was executing the script from the FVP's host filesystem using 9p.
>>>
>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough space on
>>> the disk? It might be worth enhancing the error log to provide the errno in
>>> tools/testing/selftests/mm/gup_longterm.c.
>>>
>>
>> Attached is the strace’d gup_longterm execution log on your
>> pgtable-boot-speedup-v2 kernel.
>
> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
> set applied? I thought we previously concluded that it was independent of that?
> I was under the impression that it was filesystem related and not something that
> I was planning to investigate.

No, irrespective of the kernel, the test program fails when using 9p on FVP.
It is indeed 9p filesystem related; once I switched to NFS, all the issues were gone.

Thanks,
Itaru.

>
>>
>> Thanks,
>> Itaru.
>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>> Thanks,
>>>> Itaru.
>>>>
>>>>>
>>>>> Does this problem reproduce with v6.9-rc2, without my patches? I expect it
>>>>> probably does?
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Itaru.



2024-04-09 11:30:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09.04.24 13:22, David Hildenbrand wrote:
> On 09.04.24 12:13, Itaru Kitayama wrote:
>>
>>
>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>
>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>> [...]
>>>>
>>>> Attached is the strace’d gup_longterm execution log on your
>>>> pgtable-boot-speedup-v2 kernel.
>>>
>>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>> set applied? I thought we previously concluded that it was independent of that?
>>> I was under the impression that it was filesystem related and not something that
>>> I was planning to investigate.
>>
>> No, irrespective of the kernel, the test program fails when using 9p on FVP.
>> It is indeed 9p filesystem related; once I switched to NFS, all the issues were gone.
>
> Did it never work on 9p? If so, we might have to SKIP that test.
>
> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
> fstatfs(3, 0xffffe505a840) = -1 EOPNOTSUPP (Operation not supported)
> ftruncate(3, 4096) = -1 ENOENT (No such file or directory)

Note: I'm wondering if the unlinkat here is the problem that makes
ftruncate() with 9p result in weird errors (e.g., the hypervisor
unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
which gives us weird errors here).

Then, we should lookup the fs type in run_with_local_tmpfile() before
the unlink() and simply skip the test if it is 9p.

--
Cheers,

David / dhildenb


2024-04-09 11:33:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09.04.24 12:13, Itaru Kitayama wrote:
>
>
>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>
>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>> [...]
>>>
>>> Attached is the strace’d gup_longterm execution log on your
>>> pgtable-boot-speedup-v2 kernel.
>>
>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>> set applied? I thought we previously concluded that it was independent of that?
>> I was under the impression that it was filesystem related and not something that
>> I was planning to investigate.
>
> No, irrespective of the kernel, the test program fails when using 9p on FVP.
> It is indeed 9p filesystem related; once I switched to NFS, all the issues were gone.

Did it never work on 9p? If so, we might have to SKIP that test.

openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
fstatfs(3, 0xffffe505a840) = -1 EOPNOTSUPP (Operation not supported)
ftruncate(3, 4096) = -1 ENOENT (No such file or directory)


fstatfs() fails and makes get_fs_type() simply say "0" -- IOW "I don't know",
which should be fine here, as it will make fs_is_unknown() trigger for relevant
cases where the type matters.

ftruncate() failing with ENOENT seems to be the problem.

But that error is a bit weird.

The man page says "ENOENT The named file does not exist.", which should only apply to
truncate() but not ftruncate().

Sounds weird, but maybe that's its way of saying "not supported" here?

--
Cheers,

David / dhildenb


2024-04-09 12:21:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09.04.24 13:29, David Hildenbrand wrote:
> On 09.04.24 13:22, David Hildenbrand wrote:
>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>
>>>
>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>> [...]
>>>>>
>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>> pgtable-boot-speedup-v2 kernel.
>>>>
>>>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>> set applied? I thought we previously concluded that it was independent of that?
>>>> I was under the impression that it was filesystem related and not something that
>>>> I was planning to investigate.
>>>
>>> No, irrespective of the kernel, the test program fails when using 9p on FVP.
>>> It is indeed 9p filesystem related; once I switched to NFS, all the issues were gone.
>>
>> Did it never work on 9p? If so, we might have to SKIP that test.
>>
>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>> fstatfs(3, 0xffffe505a840) = -1 EOPNOTSUPP (Operation not supported)
>> ftruncate(3, 4096) = -1 ENOENT (No such file or directory)
>
> Note: I'm wondering if the unlinkat here is the problem that makes
> ftruncate() with 9p result in weird errors (e.g., the hypervisor
> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
> which gives us weird errors here).
>
> Then, we should lookup the fs type in run_with_local_tmpfile() before
> the unlink() and simply skip the test if it is 9p.

The unlink with 9p most certainly was a known issue in the past:

https://gitlab.com/qemu-project/qemu/-/issues/103

Maybe it's still an issue with older hypervisors (QEMU?)? Or it was
never completely resolved?

According to https://bugs.launchpad.net/qemu/+bug/1336794, QEMU v5.2.0
should contain a fix that is supposed to work with newer kernels.

--
Cheers,

David / dhildenb


2024-04-09 14:13:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09/04/2024 12:51, David Hildenbrand wrote:
> On 09.04.24 13:29, David Hildenbrand wrote:
>> On 09.04.24 13:22, David Hildenbrand wrote:
>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>
>>>>
>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <[email protected]> wrote:
>>>>>>>
>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>> Hi Itaru,
>>>>>>>>>
>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>> proportion of
>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>> time is spent
>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>> series reworks
>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>> of those
>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>
>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>> different
>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>> each patch
>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>
>>>>>>>>>>>                   | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere
>>>>>>>>>>> Altra
>>>>>>>>>>>                   | VM, 16G     | VM, 64G     | VM, 256G    | Metal,
>>>>>>>>>>> 512G
>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>                   |   ms    (%) |   ms    (%) |   ms    (%) |   
>>>>>>>>>>> ms    (%)
>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>
>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>> compile and
>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>> ====================
>>>>>>>>>>>
>>>>>>>>>>>      - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>      - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>      - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>      - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>      - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>      - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>> Replaced with
>>>>>>>>>>>        unconditional p4d_index() define under arm64.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/[email protected]/<https://lore.kernel.org/linux-arm-kernel/[email protected]/>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>      arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>      arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>      arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>      arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>
>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> 2.25.1
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>> linux-rr repo too.
>>>>>>>>>
>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>
>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>> M2 VM:
>>>>>>>>>
>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>
>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>
>>>>>>>>> # For general mm debug.
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>
>>>>>>>>> # For mm selftests.
>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>
>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>> (needed by
>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>>>> features required by some mm selftests:
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>> from same
>>>>>>>>> git tree.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>> gup_longterm.
>>>>>>>>>
>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>> problem on
>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>> system. What
>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>> 9p? That
>>>>>>>>> might also be problematic.
>>>>>>>>
>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>
>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>> space on
>>>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>
>>>>>>
>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>
>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>>> set applied? I thought we previously concluded that it was independent of
>>>>> that?
>>>>> I was under the impression that it was filesystem related and not something
>>>>> that
>>>>> I was planning to investigate.
>>>>
>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>> issues are gone.
>>>
>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>
>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>> 0600) = 3
>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>> supported)
>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>
>> Note: I'm wondering if the unlinkat here is the problem that makes
>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>> which gives us weird errors here).
>>
>> Then, we should look up the fs type in run_with_local_tmpfile() before
>> the unlink() and simply skip the test if it is 9p.
>
> The unlink with 9p most certainly was a known issue in the past:
>
> https://gitlab.com/qemu-project/qemu/-/issues/103
>
> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
> completely resolved?

I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" - Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates a 9p device, so perhaps the bug is in there.

Note that I see lots of "fallocate() failed" failures in gup_longterm when running on our CI system. This is a completely different setup; Real HW with Linux running bare metal using an NFS rootfs. I'm not sure if this is related. Logs show it failing consistently for the "tmpfile" and "local tmpfile" test configs. I also see a couple of these fails in the cow tests.

Logs for reference:

# # ----------------------
# # running ./gup_longterm
# # ----------------------
# # # [INFO] detected hugetlb page size: 2048 KiB
# # # [INFO] detected hugetlb page size: 32768 KiB
# # # [INFO] detected hugetlb page size: 64 KiB
# # # [INFO] detected hugetlb page size: 1048576 KiB
# # TAP version 13
# # 1..56
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 1 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 2 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 3 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 4 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 5 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 6 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 7 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 8 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 9 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 10 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 11 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 12 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 13 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 14 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
# # ok 15 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 16 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 17 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 18 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 19 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 20 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 21 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
# # ok 22 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
# # not ok 23 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
# # not ok 24 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
# # ok 25 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
# # ok 26 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (64 kB)
# # ok 27 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
# # ok 28 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 29 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 30 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 31 fallocate() failed
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 32 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 33 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 34 Should have worked
# # # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 35 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 36 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 37 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 38 fallocate() failed
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 39 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 40 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 41 Should have worked
# # # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 42 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
# # ok 43 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 44 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 45 fallocate() failed
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 46 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 47 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 48 Should have worked
# # # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 49 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
# # ok 50 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
# # not ok 51 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
# # not ok 52 fallocate() failed
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
# # ok 53 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
# # ok 54 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (64 kB)
# # ok 55 Should have worked
# # # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
# # ok 56 Should have worked
# # Bail out! 16 out of 56 tests failed
# # # Totals: pass:40 fail:16 xfail:0 xpass:0 skip:0 error:0
# # [FAIL]
# not ok 13 gup_longterm # exit=1

2024-04-09 14:31:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09.04.24 16:13, Ryan Roberts wrote:
> On 09/04/2024 12:51, David Hildenbrand wrote:
>> On 09.04.24 13:29, David Hildenbrand wrote:
>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>
>>>>>
>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>> Hi Itaru,
>>>>>>>>>>
>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>> proportion of
>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>> time is spent
>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>> series reworks
>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>> of those
>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>
>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>> different
>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>> each patch
>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>
>>>>>>>>>>>>                   | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere
>>>>>>>>>>>> Altra
>>>>>>>>>>>>                   | VM, 16G     | VM, 64G     | VM, 256G    | Metal,
>>>>>>>>>>>> 512G
>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>                   |   ms    (%) |   ms    (%) |   ms    (%) |
>>>>>>>>>>>> ms    (%)
>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442   (0%)
>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796 (-78%)
>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656 (-91%)
>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257 (-93%)
>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838 (-95%)
>>>>>>>>>>>>
>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>> compile and
>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>
>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>> ====================
>>>>>>>>>>>>
>>>>>>>>>>>>      - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>      - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>      - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>      - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>      - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>      - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>        unconditional p4d_index() define under arm64.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/[email protected]/<https://lore.kernel.org/linux-arm-kernel/[email protected]/>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>      arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>      arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>      arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>      arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>
>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>
>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>> M2 VM:
>>>>>>>>>>
>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>
>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>
>>>>>>>>>> # For general mm debug.
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>
>>>>>>>>>> # For mm selftests.
>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>
>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>> (needed by
>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and other
>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>
>>>>>>>>>> "
>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2
>>>>>>>>>> "
>>>>>>>>>>
>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>> from same
>>>>>>>>>> git tree.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>> gup_longterm.
>>>>>>>>>>
>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>> problem on
>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>> system. What
>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>> 9p? That
>>>>>>>>>> might also be problematic.
>>>>>>>>>
>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>
>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>> space on
>>>>>>>> the disk? It might be worth enhancing the error log to provide the errno in
>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>
>>>>>>>
>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>
>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2 patch
>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>> that?
>>>>>> I was under the impression that it was filesystem related and not something
>>>>>> that
>>>>>> I was planning to investigate.
>>>>>
>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>> issues are gone.
>>>>
>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>
>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>> 0600) = 3
>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not
>>>> supported)
>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>>
>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>> which gives us weird errors here).
>>>
>>> Then, we should look up the fs type in run_with_local_tmpfile() before
>>> the unlink() and simply skip the test if it is 9p.
>>
>> The unlink with 9p most certainly was a known issue in the past:
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>
>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>> completely resolved?
>
> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" - Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates a 9p device, so perhaps the bug is in there.

Very likely.

>
> Note that I see lots of "fallocate() failed" failures in gup_longterm when running on our CI system. This is a completely different setup; Real HW with Linux running bare metal using an NFS rootfs. I'm not sure if this is related. Logs show it failing consistently for the "tmpfile" and "local tmpfile" test configs. I also see a couple of these fails in the cow tests.

What is the fallocate() errno you are getting? strace log would help (to
see if statfs also fails already)! Likely a similar NFS issue.


--
Cheers,

David / dhildenb


2024-04-09 14:43:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09/04/2024 15:29, David Hildenbrand wrote:
> On 09.04.24 16:13, Ryan Roberts wrote:
>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>
>>>>>>
>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>>>>
>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>
>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>> of those
>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>> different
>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>> each patch
>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>
>>>>>>>>>>>>>                    | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere
>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>                    | VM, 16G     | VM, 64G     | VM, 256G    | Metal,
>>>>>>>>>>>>> 512G
>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>                    |   ms    (%) |   ms    (%) |   ms    (%) |
>>>>>>>>>>>>> ms    (%)
>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>> base           |  153   (0%) | 2227   (0%) | 8798   (0%) | 17442  
>>>>>>>>>>>>> (0%)
>>>>>>>>>>>>> no-cont-remap  |   77 (-49%) |  431 (-81%) | 1727 (-80%) |  3796
>>>>>>>>>>>>> (-78%)
>>>>>>>>>>>>> batch-barriers |   13 (-92%) |  162 (-93%) |  655 (-93%) |  1656
>>>>>>>>>>>>> (-91%)
>>>>>>>>>>>>> no-alloc-remap |   11 (-93%) |  109 (-95%) |  449 (-95%) |  1257
>>>>>>>>>>>>> (-93%)
>>>>>>>>>>>>> lazy-unmap     |    6 (-96%) |   61 (-97%) |  257 (-97%) |   838
>>>>>>>>>>>>> (-95%)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>> compile and
>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>
>>>>>>>>>>>>>       - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>       - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>       - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>       - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>       - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>       - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>         unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/[email protected]/<https://lore.kernel.org/linux-arm-kernel/[email protected]/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>       arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>       arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>       arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>
>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h  |   5 +-
>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h     |   8 +
>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h |  13 +-
>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c   |  10 +-
>>>>>>>>>>>>> arch/arm64/mm/fixmap.c           |  11 +
>>>>>>>>>>>>> arch/arm64/mm/mmu.c              | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>> oks, would you take a look at it? The mm kselftests used is from your
>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>
>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>> M2 VM:
>>>>>>>>>>>
>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>
>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>
>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>
>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>
>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>> (needed by
>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>> other
>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>
>>>>>>>>>>> "
>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>> "
>>>>>>>>>>>
>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>> from same
>>>>>>>>>>> git tree.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>
>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>> problem on
>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>> system. What
>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>> 9p? That
>>>>>>>>>>> might also be problematic.
>>>>>>>>>>
>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>
>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>> space on
>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>> errno in
>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>
>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>> patch
>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>> that?
>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>> that
>>>>>>> I was planning to investigate.
>>>>>>
>>>>>> No, irrespective of the kernel, the test program fails when using 9p on FVP.
>>>>>> It is indeed 9p filesystem related; once I switched to NFS, all the
>>>>>> issues were gone.
>>>>>
>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>
>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>> 0600) = 3
>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>> fstatfs(3, 0xffffe505a840)              = -1 EOPNOTSUPP (Operation not supported)
>>>>> ftruncate(3, 4096)                      = -1 ENOENT (No such file or directory)
>>>>
>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>> which gives us weird errors here).
>>>>
>>>> Then, we should look up the fs type in run_with_local_tmpfile() before
>>>> the unlink() and simply skip the test if it is 9p.
>>>
>>> The unlink with 9p most certainly was a known issue in the past:
>>>
>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>
>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>> completely resolved?
>>
>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>> a 9p device, so perhaps the bug is in there.
>
> Very likely.
>
>>
>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>> running on our CI system. This is a completely different setup; Real HW with
>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>> configs. I also see a couple of these fails in the cow tests.
>
> What is the fallocate() errno you are getting? strace log would help (to see if
> statfs also fails already)! Likely a similar NFS issue.

Unfortunately this is a system I don't have access to. I've requested some of
this triage to be done, but it's fairly low priority.



2024-04-09 14:46:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 09.04.24 16:39, Ryan Roberts wrote:
> On 09/04/2024 15:29, David Hildenbrand wrote:
>> On 09.04.24 16:13, Ryan Roberts wrote:
>>> [...]
>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>> running on our CI system. This is a completely different setup; Real HW with
>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>> configs. I also see a couple of these fails in the cow tests.
>>
>> What is the fallocate() errno you are getting? strace log would help (to see if
>> statfs also fails already)! Likely a similar NFS issue.
>
> Unfortunately this is a system I don't have access to. I've requested some of
> this triage to be done, but its fairly low priority unfortunately.

To work around these BUGs (?) elsewhere, we could simply skip the test
if get_fs_type() is not able to detect the FS type. Likely that's an
early indicator that the unlink() messed something up.

... doesn't feel right, though.

--
Cheers,

David / dhildenb


2024-04-09 23:31:18

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

Hi David,

> On Apr 9, 2024, at 23:45, David Hildenbrand <[email protected]> wrote:
>
> On 09.04.24 16:39, Ryan Roberts wrote:
>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>> [...]
>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>> statfs also fails already)! Likely a similar NFS issue.
>> Unfortunately this is a system I don't have access to. I've requested some of
>> this triage to be done, but its fairly low priority unfortunately.
>
> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>
> ... doesn't feel right, though.

I think it’s a good idea, so that the mm kselftest results look reasonable. Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?

Thanks,
Itaru.

>
> --
> Cheers,
>
> David / dhildenb
>


2024-04-10 06:48:25

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation





Attachments:
straced-gup_longterm-nfs.log (39.59 kB)

2024-04-10 07:11:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 10.04.24 08:47, Itaru Kitayama wrote:
>
>
>> On Apr 10, 2024, at 8:30, Itaru Kitayama <[email protected]> wrote:
>>
>> Hi David,
>>
>>> On Apr 9, 2024, at 23:45, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 09.04.24 16:39, Ryan Roberts wrote:
>>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>> arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>> arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>> arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>> arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h | 5 +-
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h | 8 +
>>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h | 13 +-
>>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c | 10 +-
>>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c | 11 +
>>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've built and boot tested v2 on FVP; the base is taken from your
>>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>>>>>> oks; would you take a look at it? The mm kselftests used are from your
>>>>>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Running on a VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>>> (needed by some mm selftests), with a kernel command line that reserves
>>>>>>>>>>>>>>> hugetlbs and enables other features required by some mm selftests:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>>> problem on our CI system, where it is due to running the tests from an
>>>>>>>>>>>>>>> NFS file system. What filesystem are you using? Perhaps you are sharing
>>>>>>>>>>>>>>> into the FVP using 9p? That might also be problematic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>>> space on
>>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>>> errno in
>>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>>>
>>>>>>>>>>> Sorry, are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>>> patch set applied? I thought we previously concluded that it was
>>>>>>>>>>> independent of that? I was under the impression that it was filesystem
>>>>>>>>>>> related and not something that I was planning to investigate.
>>>>>>>>>>
>>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>>> It is indeed 9p filesystem related; once I switched to NFS, all the
>>>>>>>>>> issues were gone.
>>>>>>>>>
>>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>>>
>>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>>> fstatfs(3, 0xffffe505a840) = -1 EOPNOTSUPP (Operation not supported)
>>>>>>>>> ftruncate(3, 4096) = -1 ENOENT (No such file or directory)
>>>>>>>>
>>>>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>>>>> ftruncate() with 9p fail (e.g., the hypervisor unlinked the file and
>>>>>>>> cannot reopen it for the fstatfs/ftruncate, which gives us these weird
>>>>>>>> errors).
>>>>>>>>
>>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>>>
>>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>>>
>>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>>>
>>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>>> completely resolved?
>>>>>>
>>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>>> a 9p device, so perhaps the bug is in there.
>>>>>
>>>>> Very likely.
>>>>>
>>>>>>
>>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>>> configs. I also see a couple of these fails in the cow tests.
>>>>>
>>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>>> statfs also fails already)! Likely a similar NFS issue.
>>>> Unfortunately this is a system I don't have access to. I've requested some of
>>>> this triage to be done, but it's fairly low priority.
>>>
>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>
>>> ... doesn't feel right, though.
>>
>> I think it’s a good idea, so that the mm kselftest results look reasonable.

Yeah, but this will hide BUGs elsewhere. I suspect that in Ryan's NFS setup there
is also a BUG lurking somewhere in the NFS implementation. But that's just a guess
until we have more details.

>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?

While we could, I don't see much value in that for selftests. An strace log is much
more valuable for understanding what is actually happening (e.g., fstatfs failing),
and it is quite easy to obtain.

>> Thanks,
>> Itaru.
>
> David, attached is the straced execution log of the gup_longterm kselftest over the NFS case.
> I’m running the program on FVP, let me know if you need other logs or test results.

For your run, it all looks good:

openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
fcntl(3, F_GETFL) = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
ftruncate(3, 4096) = 0
fallocate(3, 0, 0, 4096) = 0

-> TMPFS/SHMEM, works as expected

openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
ftruncate(3, 4096) = 0
fallocate(3, 0, 0, 4096) = 0

-> NFS, works as expected

Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.

ok 1 # SKIP gup_test not available

--
Cheers,

David / dhildenb


2024-04-10 07:37:52

by Itaru Kitayama

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation



> On Apr 10, 2024, at 16:10, David Hildenbrand <[email protected]> wrote:
>
> On 10.04.24 08:47, Itaru Kitayama wrote:
>>> [...]
>>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>>
>>>> ... doesn't feel right, though.
>>>
>>> I think it’s a good idea, so that the mm kselftest results look reasonable.
>
> Yeah, but this will hide BUGs elsewhere. I suspect that in Ryan's NFS setup there
> is also a BUG lurking somewhere in the NFS implementation. But that's just a guess
> until we have more details.
>

Ok.

>>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?
>
> While we could, I don't see much value in that for selftests. An strace log is much
> more valuable for understanding what is actually happening (e.g., fstatfs failing),
> and it is quite easy to obtain.

Ok.

>
>>> Thanks,
>>> Itaru.
>> David, attached is the straced execution log of the gup_longterm kselftest over the NFS case.
>> I’m running the program on FVP, let me know if you need other logs or test results.
>
> For your run, it all looks good:
>
> openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
> fcntl(3, F_GETFL) = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
> fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
> ftruncate(3, 4096) = 0
> fallocate(3, 0, 0, 4096) = 0
>
> -> TMPFS/SHMEM, works as expected
>
> openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
> fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
> ftruncate(3, 4096) = 0
> fallocate(3, 0, 0, 4096) = 0
>
> -> NFS, works as expected
>
> Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.
>
> ok 1 # SKIP gup_test not available

I rebuilt the v6.9-rc3 kernel with that option enabled. This time the SKIPs are due to “need more free huge pages”; I’ll check whether reserving enough huge pages is possible even on a system with limited memory.

Thanks,
Itaru.

>
> --
> Cheers,
>
> David / dhildenb



2024-04-10 07:46:15

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Speed up boot with faster linear map creation

On 10.04.24 09:37, Itaru Kitayama wrote:
>
>
>> On Apr 10, 2024, at 16:10, David Hildenbrand <[email protected]> wrote:
>>
>> On 10.04.24 08:47, Itaru Kitayama wrote:
>>>> On Apr 10, 2024, at 8:30, Itaru Kitayama <[email protected]> wrote:
>>>>
>>>> Hi David,
>>>>
>>>>> On Apr 9, 2024, at 23:45, David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 09.04.24 16:39, Ryan Roberts wrote:
>>>>>> On 09/04/2024 15:29, David Hildenbrand wrote:
>>>>>>> On 09.04.24 16:13, Ryan Roberts wrote:
>>>>>>>> On 09/04/2024 12:51, David Hildenbrand wrote:
>>>>>>>>> On 09.04.24 13:29, David Hildenbrand wrote:
>>>>>>>>>> On 09.04.24 13:22, David Hildenbrand wrote:
>>>>>>>>>>> On 09.04.24 12:13, Itaru Kitayama wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 9, 2024, at 19:04, Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 09/04/2024 01:10, Itaru Kitayama wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 8, 2024, at 16:30, Ryan Roberts <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06/04/2024 11:31, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Apr 06, 2024 at 09:32:34AM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Hi Itaru,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 05/04/2024 08:39, Itaru Kitayama wrote:
>>>>>>>>>>>>>>>>>> On Thu, Apr 04, 2024 at 03:33:04PM +0100, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It turns out that creating the linear map can take a significant
>>>>>>>>>>>>>>>>>>> proportion of
>>>>>>>>>>>>>>>>>>> the total boot time, especially when rodata=full. And most of the
>>>>>>>>>>>>>>>>>>> time is spent
>>>>>>>>>>>>>>>>>>> waiting on superfluous tlb invalidation and memory barriers. This
>>>>>>>>>>>>>>>>>>> series reworks
>>>>>>>>>>>>>>>>>>> the kernel pgtable generation code to significantly reduce the number
>>>>>>>>>>>>>>>>>>> of those
>>>>>>>>>>>>>>>>>>> TLBIs, ISBs and DSBs. See each patch for details.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The below shows the execution time of map_mem() across a couple of
>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> systems with different RAM configurations. We measure after applying
>>>>>>>>>>>>>>>>>>> each patch
>>>>>>>>>>>>>>>>>>> and show the improvement relative to base (v6.9-rc2):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere
>>>>>>>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>>>>>>> | VM, 16G | VM, 64G | VM, 256G | Metal,
>>>>>>>>>>>>>>>>>>> 512G
>>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>>> | ms (%) | ms (%) | ms (%) |
>>>>>>>>>>>>>>>>>>> ms (%)
>>>>>>>>>>>>>>>>>>> ---------------|-------------|-------------|-------------|-------------
>>>>>>>>>>>>>>>>>>> base | 153 (0%) | 2227 (0%) | 8798 (0%) | 17442
>>>>>>>>>>>>>>>>>>> (0%)
>>>>>>>>>>>>>>>>>>> no-cont-remap | 77 (-49%) | 431 (-81%) | 1727 (-80%) | 3796
>>>>>>>>>>>>>>>>>>> (-78%)
>>>>>>>>>>>>>>>>>>> batch-barriers | 13 (-92%) | 162 (-93%) | 655 (-93%) | 1656
>>>>>>>>>>>>>>>>>>> (-91%)
>>>>>>>>>>>>>>>>>>> no-alloc-remap | 11 (-93%) | 109 (-95%) | 449 (-95%) | 1257
>>>>>>>>>>>>>>>>>>> (-93%)
>>>>>>>>>>>>>>>>>>> lazy-unmap | 6 (-96%) | 61 (-97%) | 257 (-97%) | 838
>>>>>>>>>>>>>>>>>>> (-95%)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This series applies on top of v6.9-rc2. All mm selftests pass. I've
>>>>>>>>>>>>>>>>>>> compile and
>>>>>>>>>>>>>>>>>>> boot tested various PAGE_SIZE and VA size configs.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Changes since v1 [1]
>>>>>>>>>>>>>>>>>>> ====================
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Added Tested-by tags (thanks to Eric and Itaru)
>>>>>>>>>>>>>>>>>>> - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
>>>>>>>>>>>>>>>>>>> - Reordered patches (biggest impact & least controversial first)
>>>>>>>>>>>>>>>>>>> - Reordered alloc/map/unmap functions in mmu.c to aid reader
>>>>>>>>>>>>>>>>>>> - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
>>>>>>>>>>>>>>>>>>> - Reverted generic p4d_index() which caused x86 build error.
>>>>>>>>>>>>>>>>>>> Replaced with
>>>>>>>>>>>>>>>>>>> unconditional p4d_index() define under arm64.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/[email protected]/<https://lore.kernel.org/linux-arm-kernel/[email protected]/>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ryan Roberts (4):
>>>>>>>>>>>>>>>>>>> arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>>>>>>>>>>>>>>>>>>> arm64: mm: Batch dsb and isb when populating pgtables
>>>>>>>>>>>>>>>>>>> arm64: mm: Don't remap pgtables for allocate vs populate
>>>>>>>>>>>>>>>>>>> arm64: mm: Lazily clear pte table mappings from fixmap
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/fixmap.h | 5 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h | 8 +
>>>>>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h | 13 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c | 10 +-
>>>>>>>>>>>>>>>>>>> arch/arm64/mm/fixmap.c | 11 +
>>>>>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c | 377 +++++++++++++++++++++++--------
>>>>>>>>>>>>>>>>>>> 6 files changed, 319 insertions(+), 105 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've build and boot tested the v2 on FVP, base is taken from your
>>>>>>>>>>>>>>>>>> linux-rr repo. Running run_vmtests.sh on v2 left some gup longterm not
>>>>>>>>>>>>>>>>>> oks, would you take a look at it? The mm ksefltests used is from your
>>>>>>>>>>>>>>>>>> linux-rr repo too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for taking a look at this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't reproduce your issue unfortunately; steps as follows on Apple
>>>>>>>>>>>>>>>>> M2 VM:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Config: arm64 defconfig + the following:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # Squashfs for snaps, xfs for large file folios.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_XFS_FS
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # For general mm debug.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # For mm selftests.
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_USERFAULTFD
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_TEST_VMALLOC
>>>>>>>>>>>>>>>>> ./scripts/config --enable CONFIG_GUP_TEST
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Running on VM with 12G memory, split across 2 (emulated) NUMA nodes
>>>>>>>>>>>>>>>>> (needed by
>>>>>>>>>>>>>>>>> some mm selftests), with kernel command line to reserve hugetlbs and
>>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>> features required by some mm selftests:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>> transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
>>>>>>>>>>>>>>>>> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
>>>>>>>>>>>>>>>>> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K
>>>>>>>>>>>>>>>>> hugepages=0:2,1:2
>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ubuntu userspace running off XFS rootfs. Build and run mm selftests
>>>>>>>>>>>>>>>>> from same
>>>>>>>>>>>>>>>>> git tree.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Although I don't think any of this config should make a difference to
>>>>>>>>>>>>>>>>> gup_longterm.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looks like your errors are all "ftruncate() failed". I've seen this
>>>>>>>>>>>>>>>>> problem on
>>>>>>>>>>>>>>>>> our CI system. There it is due to running the tests from NFS file
>>>>>>>>>>>>>>>>> system. What
>>>>>>>>>>>>>>>>> filesystem are you using? Perhaps you are sharing into the FVP using
>>>>>>>>>>>>>>>>> 9p? That
>>>>>>>>>>>>>>>>> might also be problematic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That was it. This time I booted up the kernel including your series on
>>>>>>>>>>>>>>>> QEMU on my M1 and executed the gup_longterm program without the ftruncate
>>>>>>>>>>>>>>>> failures. When testing your kernel on FVP, I was executing the script
>>>>>>>>>>>>>>>> from the FVP's host filesystem using 9p.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure exactly what the root cause is. Perhaps there isn't enough
>>>>>>>>>>>>>>> space on
>>>>>>>>>>>>>>> the disk? It might be worth enhancing the error log to provide the
>>>>>>>>>>>>>>> errno in
>>>>>>>>>>>>>>> tools/testing/selftests/mm/gup_longterm.c.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attached is the strace’d gup_longterm execution log on your
>>>>>>>>>>>>>> pgtable-boot-speedup-v2 kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry are you saying that it only fails with the pgtable-boot-speedup-v2
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> set applied? I thought we previously concluded that it was independent of
>>>>>>>>>>>>> that?
>>>>>>>>>>>>> I was under the impression that it was filesystem related and not something
>>>>>>>>>>>>> that
>>>>>>>>>>>>> I was planning to investigate.
>>>>>>>>>>>>
>>>>>>>>>>>> No, irrespective of the kernel, if using 9p on FVP the test program fails.
>>>>>>>>>>>> It is indeed 9p filesystem related, as I switched to using NFS all the
>>>>>>>>>>>> issues are gone.
>>>>>>>>>>>
>>>>>>>>>>> Did it never work on 9p? If so, we might have to SKIP that test.
>>>>>>>>>>>
>>>>>>>>>>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", O_RDWR|O_CREAT|O_EXCL,
>>>>>>>>>>> 0600) = 3
>>>>>>>>>>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_BLboOt", 0) = 0
>>>>>>>>>>> fstatfs(3, 0xffffe505a840) = -1 EOPNOTSUPP (Operation not
>>>>>>>>>>> supported)
>>>>>>>>>>> ftruncate(3, 4096) = -1 ENOENT (No such file or
>>>>>>>>>>> directory)
>>>>>>>>>>
>>>>>>>>>> Note: I'm wondering if the unlinkat here is the problem that makes
>>>>>>>>>> ftruncate() with 9p result in weird errors (e.g., the hypervisor
>>>>>>>>>> unlinked the file and cannot reopen it for the fstatfs/ftruncate. ...
>>>>>>>>>> which gives us weird errors here).
>>>>>>>>>>
>>>>>>>>>> Then, we should lookup the fs type in run_with_local_tmpfile() before
>>>>>>>>>> the unlink() and simply skip the test if it is 9p.
>>>>>>>>>
>>>>>>>>> The unlink with 9p most certainly was a known issue in the past:
>>>>>>>>>
>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/issues/103
>>>>>>>>>
>>>>>>>>> Maybe it's still an issue with older hypervisors (QEMU?)? Or it was never
>>>>>>>>> completely resolved?
>>>>>>>>
>>>>>>>> I believe Itaru is running on FVP (Fixed Virtual Platform - "fast model" -
>>>>>>>> Arm's architecture emulator). So QEMU won't be involved here. The FVP emulates
>>>>>>>> a 9p device, so perhaps the bug is in there.
>>>>>>>
>>>>>>> Very likely.
>>>>>>>
>>>>>>>>
>>>>>>>> Note that I see lots of "fallocate() failed" failures in gup_longterm when
>>>>>>>> running on our CI system. This is a completely different setup; Real HW with
>>>>>>>> Linux running bare metal using an NFS rootfs. I'm not sure if this is related.
>>>>>>>> Logs show it failing consistently for the "tmpfile" and "local tmpfile" test
>>>>>>>> configs. I also see a couple of these fails in the cow tests.
>>>>>>>
>>>>>>> What is the fallocate() errno you are getting? strace log would help (to see if
>>>>>>> statfs also fails already)! Likely a similar NFS issue.
>>>>>> Unfortunately this is a system I don't have access to. I've requested some of
>>>>>> this triage to be done, but its fairly low priority unfortunately.
>>>>>
>>>>> To work around these BUGs (?) elsewhere, we could simply skip the test if get_fs_type() is not able to detect the FS type. Likely that's an early indicator that the unlink() messed something up.
>>>>>
>>>>> ... doesn't feel right, though.
>>>>
>>>> I think it’s a good idea so that the mm kselftest results look reasonable.
>>
>> Yeah, but this will hide BUGs elsewhere. I suspect that in Ryan's NFS setup is
>> also a BUG lurking somewhere in the NFS implementation. But that's just a guess
>> until we have more details.
>>
>
> Ok.
>
>>>> Since you’re an expert on GUP-fast (or fast-GUP?), when you update the code, could you print out errno as well, like split_huge_page_test.c does?
>>
>> While we could, I don't see much value in that for selftests. strace log is of much
>> more valuable to understand what is actually happening (e.g., fstatfs failing), and
>> quite easy to obtain.
>
> Ok.
>
>>
>>>> Thanks,
>>>> Itaru.
>>> David, attached is the strace’d execution log of the gup_longterm kselftest for the NFS case.
>>> I’m running the program on FVP, let me know if you need other logs or test results.
>>
>> For your run, it all looks good:
>>
>> openat(AT_FDCWD, "/tmp", O_RDWR|O_EXCL|O_TMPFILE, 0600) = 3
>> fcntl(3, F_GETFL) = 0x424002 (flags O_RDWR|O_LARGEFILE|O_TMPFILE)
>> fstatfs(3, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=416015, f_bfree=415997, f_bavail=415997, f_files=416015, f_ffree=416009, f_fsid={val=[0x8e6b7ce6, 0xe1737440]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
>> ftruncate(3, 4096) = 0
>> fallocate(3, 0, 0, 4096) = 0
>>
>> -> TMPFS/SHMEM, works as expected
>>
>> openat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>> unlinkat(AT_FDCWD, "gup_longterm.c_tmpfile_WMLTNf", 0) = 0
>> fstatfs(3, {f_type=NFS_SUPER_MAGIC, f_bsize=1048576, f_blocks=112200, f_bfree=27954, f_bavail=23296, f_files=7307264, f_ffree=4724815, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=1048576, f_flags=ST_VALID|ST_RELATIME}) = 0
>> ftruncate(3, 4096) = 0
>> fallocate(3, 0, 0, 4096) = 0
>>
>> -> NFS, works as expected
>>
>> Note that you get all skips (not fails), because your kernel is not compiled with CONFIG_GUP_TEST.
>>
>> ok 1 # SKIP gup_test not available
>
> I rebuilt the v6.9-rc3 kernel with that option enabled. This time SKIPs are due to “need more free huge pages”; I’ll check whether preparing enough huge pages is possible even on a limited-memory system.

That's expected, you have to reserve hugetlb pages before running the
test. But the important thing is that tmpfs/nfs works for you.

--
Cheers,

David / dhildenb


2024-04-10 09:48:45

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block

On Thu, Apr 04, 2024 at 03:33:05PM +0100, Ryan Roberts wrote:
> A large part of the kernel boot time is creating the kernel linear map
> page tables. When rodata=full, all memory is mapped by pte. And when
> there is lots of physical ram, there are lots of pte tables to populate.
> The primary cost associated with this is mapping and unmapping the pte
> table memory in the fixmap; at unmap time, the TLB entry must be
> invalidated and this is expensive.
>
> Previously, each pmd and pte table was fixmapped/fixunmapped for each
> cont(pte|pmd) block of mappings (16 entries with 4K granule). This means
> we ended up issuing 32 TLBIs per (pmd|pte) table during the population
> phase.
>
> Let's fix that, and fixmap/fixunmap each page once per population, for a
> saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
> speedup.
>
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> before | 153 (0%) | 2227 (0%) | 8798 (0%) | 17442 (0%)
> after | 77 (-49%) | 431 (-81%) | 1727 (-80%) | 3796 (-78%)
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Tested-by: Itaru Kitayama <[email protected]>
> Tested-by: Eric Chanudet <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 495b732d5af3..fd91b5bdb514 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> -static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> - phys_addr_t phys, pgprot_t prot)
> +static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> + phys_addr_t phys, pgprot_t prot)
> {
> - pte_t *ptep;
> -
> - ptep = pte_set_fixmap_offset(pmdp, addr);
> do {
> pte_t old_pte = __ptep_get(ptep);
>
> @@ -193,7 +190,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> phys += PAGE_SIZE;
> } while (ptep++, addr += PAGE_SIZE, addr != end);
>
> - pte_clear_fixmap();
> + return ptep;
> }
>
> static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> @@ -204,6 +201,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> {
> unsigned long next;
> pmd_t pmd = READ_ONCE(*pmdp);
> + pte_t *ptep;
>
> BUG_ON(pmd_sect(pmd));
> if (pmd_none(pmd)) {
> @@ -219,6 +217,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> }
> BUG_ON(pmd_bad(pmd));
>
> + ptep = pte_set_fixmap_offset(pmdp, addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -229,20 +228,20 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pte(pmdp, addr, next, phys, __prot);
> + ptep = init_pte(ptep, addr, next, phys, __prot);
>
> phys += next - addr;
> } while (addr = next, addr != end);

I reckon it might be better to leave init_pte() returning void, and move the
ptep along here, e.g.

ptep = pte_set_fixmap_offset(pmdp, addr);
do {
...

init_pte(ptep, addr, next, phys, __prot);

ptep += pte_index(next) - pte_index(addr);
phys += next - addr;
} while (addr = next, addr != end);


... as that keeps the relationship between 'ptep' and 'phys' clear since
they're manipulated in the same way, adjacent to one another.

Regardless this looks good, so with that change or as-is:

Acked-by: Mark Rutland <[email protected]>

... though I would prefer with that change. ;)

Mark.

2024-04-10 10:09:02

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] arm64: mm: Batch dsb and isb when populating pgtables

On Thu, Apr 04, 2024 at 03:33:06PM +0100, Ryan Roberts wrote:
> After removing unnecessary TLBIs, the next bottleneck when creating the
> page tables for the linear map is DSB and ISB, which were previously
> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
> given pte table, we can elide these barriers and insert them once we
> have finished writing to the table.
>
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> before | 77 (0%) | 431 (0%) | 1727 (0%) | 3796 (0%)
> after | 13 (-84%) | 162 (-62%) | 655 (-62%) | 1656 (-56%)
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Tested-by: Itaru Kitayama <[email protected]>
> Tested-by: Eric Chanudet <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 7 ++++++-
> arch/arm64/mm/mmu.c | 13 ++++++++++++-
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index afdd56d26ad7..105a95a8845c 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
> {
> WRITE_ONCE(*ptep, pte);
> +}
> +
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
> +{
> + __set_pte_nosync(ptep, pte);
>
> /*
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index fd91b5bdb514..dc86dceb0efe 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -178,7 +178,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> do {
> pte_t old_pte = __ptep_get(ptep);
>
> - __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
> + /*
> + * Required barriers to make this visible to the table walker
> + * are deferred to the end of alloc_init_cont_pte().
> + */
> + __set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>
> /*
> * After the PTE entry has been populated once, we
> @@ -234,6 +238,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> } while (addr = next, addr != end);
>
> pte_clear_fixmap();
> +
> + /*
> + * Ensure all previous pgtable writes are visible to the table walker.
> + * See init_pte().
> + */
> + dsb(ishst);
> + isb();

Hmm... currently the call to pte_clear_fixmap() alone should be sufficient,
since that needs to update the PTE for the fixmap slot, then do maintenance for
that.

So we could avoid the addition of the dsb+isb here, and have a comment:

/*
* Note: barriers and maintenance necessary to clear the fixmap slot
* ensure that all previous pgtable writes are visible to the table
* walker.
*/
pte_clear_fixmap();

... which'd be fine as long as we keep this fixmap clearing rather than trying
to do that lazily as in patch 4.

Mark.

> }
>
> static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> --
> 2.25.1
>

2024-04-10 10:25:23

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] arm64: mm: Batch dsb and isb when populating pgtables

On 10/04/2024 11:06, Mark Rutland wrote:
> On Thu, Apr 04, 2024 at 03:33:06PM +0100, Ryan Roberts wrote:
>> After removing unnecessary TLBIs, the next bottleneck when creating the
>> page tables for the linear map is DSB and ISB, which were previously
>> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
>> given pte table, we can elide these barriers and insert them once we
>> have finished writing to the table.
>>
>> Execution time of map_mem(), which creates the kernel linear map page
>> tables, was measured on different machines with different RAM configs:
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> before | 77 (0%) | 431 (0%) | 1727 (0%) | 3796 (0%)
>> after | 13 (-84%) | 162 (-62%) | 655 (-62%) | 1656 (-56%)
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Tested-by: Itaru Kitayama <[email protected]>
>> Tested-by: Eric Chanudet <[email protected]>
>> ---
>> arch/arm64/include/asm/pgtable.h | 7 ++++++-
>> arch/arm64/mm/mmu.c | 13 ++++++++++++-
>> 2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index afdd56d26ad7..105a95a8845c 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> }
>>
>> -static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
>> {
>> WRITE_ONCE(*ptep, pte);
>> +}
>> +
>> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +{
>> + __set_pte_nosync(ptep, pte);
>>
>> /*
>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index fd91b5bdb514..dc86dceb0efe 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -178,7 +178,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>> do {
>> pte_t old_pte = __ptep_get(ptep);
>>
>> - __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>> + /*
>> + * Required barriers to make this visible to the table walker
>> + * are deferred to the end of alloc_init_cont_pte().
>> + */
>> + __set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>>
>> /*
>> * After the PTE entry has been populated once, we
>> @@ -234,6 +238,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> } while (addr = next, addr != end);
>>
>> pte_clear_fixmap();
>> +
>> + /*
>> + * Ensure all previous pgtable writes are visible to the table walker.
>> + * See init_pte().
>> + */
>> + dsb(ishst);
>> + isb();
>
> Hmm... currently the call to pte_clear_fixmap() alone should be sufficient,
> since that needs to update the PTE for the fixmap slot, then do maintenance for
> that.

Yes, true...

>
> So we could avoid the addition of the dsb+isb here, and have a comment:
>
> /*
> * Note: barriers and maintenance necessary to clear the fixmap slot
> * ensure that all previous pgtable writes are visible to the table
> * walker.
> */
> pte_clear_fixmap();
>
> ... which'd be fine as long as we keep this fixmap clearing rather than trying
> to do that lazily as in patch 4.

But it isn't patch 4 that breaks it, it's patch 3. Once we have abstracted
pte_clear_fixmap() into the ops->unmap() call, for the "late" ops, unmap is a
noop. I guess the best solution there would be to require that unmap() always
issues these barriers.

I'll do as you suggest for this patch. If we want to keep patch 3, then I'll add
the barriers for all unmap() impls.

>
> Mark.
>
>> }
>>
>> static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> --
>> 2.25.1
>>


2024-04-10 10:28:33

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block

On 10/04/2024 10:46, Mark Rutland wrote:
> On Thu, Apr 04, 2024 at 03:33:05PM +0100, Ryan Roberts wrote:
>> A large part of the kernel boot time is creating the kernel linear map
>> page tables. When rodata=full, all memory is mapped by pte. And when
>> there is lots of physical ram, there are lots of pte tables to populate.
>> The primary cost associated with this is mapping and unmapping the pte
>> table memory in the fixmap; at unmap time, the TLB entry must be
>> invalidated and this is expensive.
>>
>> Previously, each pmd and pte table was fixmapped/fixunmapped for each
>> cont(pte|pmd) block of mappings (16 entries with 4K granule). This means
>> we ended up issuing 32 TLBIs per (pmd|pte) table during the population
>> phase.
>>
>> Let's fix that, and fixmap/fixunmap each page once per population, for a
>> saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
>> speedup.
>>
>> Execution time of map_mem(), which creates the kernel linear map page
>> tables, was measured on different machines with different RAM configs:
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> before | 153 (0%) | 2227 (0%) | 8798 (0%) | 17442 (0%)
>> after | 77 (-49%) | 431 (-81%) | 1727 (-80%) | 3796 (-78%)
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Tested-by: Itaru Kitayama <[email protected]>
>> Tested-by: Eric Chanudet <[email protected]>
>> ---
>> arch/arm64/mm/mmu.c | 32 ++++++++++++++++++--------------
>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 495b732d5af3..fd91b5bdb514 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
>> return ((old ^ new) & ~mask) == 0;
>> }
>>
>> -static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> - phys_addr_t phys, pgprot_t prot)
>> +static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>> + phys_addr_t phys, pgprot_t prot)
>> {
>> - pte_t *ptep;
>> -
>> - ptep = pte_set_fixmap_offset(pmdp, addr);
>> do {
>> pte_t old_pte = __ptep_get(ptep);
>>
>> @@ -193,7 +190,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> phys += PAGE_SIZE;
>> } while (ptep++, addr += PAGE_SIZE, addr != end);
>>
>> - pte_clear_fixmap();
>> + return ptep;
>> }
>>
>> static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> @@ -204,6 +201,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> {
>> unsigned long next;
>> pmd_t pmd = READ_ONCE(*pmdp);
>> + pte_t *ptep;
>>
>> BUG_ON(pmd_sect(pmd));
>> if (pmd_none(pmd)) {
>> @@ -219,6 +217,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> }
>> BUG_ON(pmd_bad(pmd));
>>
>> + ptep = pte_set_fixmap_offset(pmdp, addr);
>> do {
>> pgprot_t __prot = prot;
>>
>> @@ -229,20 +228,20 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> (flags & NO_CONT_MAPPINGS) == 0)
>> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>
>> - init_pte(pmdp, addr, next, phys, __prot);
>> + ptep = init_pte(ptep, addr, next, phys, __prot);
>>
>> phys += next - addr;
>> } while (addr = next, addr != end);
>
> I reckon it might be better to leave init_pte() returning void, and move the
> ptep along here, e.g.
>
> ptep = pte_set_fixmap_offset(pmdp, addr);
> do {
> ...
>
> init_pte(ptep, addr, next, phys, __prot);
>
> ptep += pte_index(next) - pte_index(addr);
> phys += next - addr;
> } while (addr = next, addr != end);
>
>
> ... as that keeps the relationship between 'ptep' and 'phys' clear since
> they're manipulated in the same way, adjacent to one another.
>
> Regardless this looks good, so with that change or as-is:
>
> Acked-by: Mark Rutland <[email protected]>
>
> ... though I would prefer with that change. ;)

Yep, will change. And I'll do the same for pmd_init() too.

>
> Mark.


2024-04-10 11:06:19

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] arm64: mm: Batch dsb and isb when populating pgtables

On Wed, Apr 10, 2024 at 11:25:10AM +0100, Ryan Roberts wrote:
> On 10/04/2024 11:06, Mark Rutland wrote:
> > On Thu, Apr 04, 2024 at 03:33:06PM +0100, Ryan Roberts wrote:
> >> @@ -234,6 +238,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> >> } while (addr = next, addr != end);
> >>
> >> pte_clear_fixmap();
> >> +
> >> + /*
> >> + * Ensure all previous pgtable writes are visible to the table walker.
> >> + * See init_pte().
> >> + */
> >> + dsb(ishst);
> >> + isb();
> >
> > Hmm... currently the call to pte_clear_fixmap() alone should be sufficient,
> > since that needs to update the PTE for the fixmap slot, then do maintenance for
> > that.
>
> Yes, true...
>
> >
> > So we could avoid the addition of the dsb+isb here, and have a comment:
> >
> > /*
> > * Note: barriers and maintenance necessary to clear the fixmap slot
> > * ensure that all previous pgtable writes are visible to the table
> > * walker.
> > */
> > pte_clear_fixmap();
> >
> > ... which'd be fine as long as we keep this fixmap clearing rather than trying
> > to do that lazily as in patch 4.
>
> But it isn't patch 4 that breaks it, it's patch 3. Once we have abstracted
> pte_clear_fixmap() into the ops->unmap() call, for the "late" ops, unmap is a
> noop.

Ah, yep; I hadn't spotted that yet.

> I guess the best solution there would be to require that unmap() always
> issues these barriers.
>
> I'll do as you suggest for this patch. If we want to keep patch 3, then I'll add
> the barriers for all unmap() impls.

Thanks. It's going to take me a bit longer to chew through patches 3 and 4, but
I will try to get through those soon.

For now a slightly simpler option would be to have patch 3 introduce the
DSB+ISB as above rather than in each of the unmap() impls.

Mark.

2024-04-11 13:15:51

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On Thu, Apr 04, 2024 at 03:33:07PM +0100, Ryan Roberts wrote:
> During linear map pgtable creation, each pgtable is fixmapped /
> fixunmapped twice; once during allocation to zero the memory, and a
> again during population to write the entries. This means each table has
> 2 TLB invalidations issued against it. Let's fix this so that each table
> is only fixmapped/fixunmapped once, halving the number of TLBIs, and
> improving performance.
>
> Achieve this by abstracting pgtable allocate, map and unmap operations
> out of the main pgtable population loop code and into a `struct
> pgtable_ops` function pointer structure. This allows us to formalize the
> semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
> finished. So "map" is only performed (and also matched by "unmap") if
> the pgtable has already been allocated.
>
> As a side effect of this refactoring, we no longer need to use the
> fixmap at all once pages have been mapped in the linear map because
> their "map" operation can simply do a __va() translation. So with this
> change, we are down to 1 TLBI per table when doing early pgtable
> manipulations, and 0 TLBIs when doing late pgtable manipulations.
>
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> before | 13 (0%) | 162 (0%) | 655 (0%) | 1656 (0%)
> after | 11 (-15%) | 109 (-33%) | 449 (-31%) | 1257 (-24%)

Do we know how much of that gain is due to the early pgtable creation doing
fewer fixmap/fixunmap ops vs the later operations using the linear map?

I suspect that the bulk of that is down to the early pgtable creation, and if
so I think that we can get most of the benefit with a simpler change (see
below).

> Signed-off-by: Ryan Roberts <[email protected]>
> Tested-by: Itaru Kitayama <[email protected]>
> Tested-by: Eric Chanudet <[email protected]>
> ---
> arch/arm64/include/asm/mmu.h | 8 +
> arch/arm64/include/asm/pgtable.h | 2 +
> arch/arm64/kernel/cpufeature.c | 10 +-
> arch/arm64/mm/mmu.c | 308 ++++++++++++++++++++++---------
> 4 files changed, 237 insertions(+), 91 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 65977c7783c5..ae44353010e8 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -109,6 +109,14 @@ static inline bool kaslr_requires_kpti(void)
> return true;
> }
>
> +#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
> +extern
> +void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
> + phys_addr_t size, pgprot_t prot,
> + void *(*pgtable_alloc)(int, phys_addr_t *),
> + int flags);
> +#endif
> +
> #define INIT_MM_CONTEXT(name) \
> .pgd = swapper_pg_dir,
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 105a95a8845c..92c9aed5e7af 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
>
> static inline bool pgtable_l5_enabled(void) { return false; }
>
> +#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
> +
> /* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
> #define p4d_set_fixmap(addr) NULL
> #define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 56583677c1f2..9a70b1954706 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1866,17 +1866,13 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
> #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
> #define KPTI_NG_TEMP_VA (-(1UL << PMD_SHIFT))
>
> -extern
> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
> - phys_addr_t size, pgprot_t prot,
> - phys_addr_t (*pgtable_alloc)(int), int flags);
> -
> static phys_addr_t __initdata kpti_ng_temp_alloc;
>
> -static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
> +static void *__init kpti_ng_pgd_alloc(int type, phys_addr_t *pa)
> {
> kpti_ng_temp_alloc -= PAGE_SIZE;
> - return kpti_ng_temp_alloc;
> + *pa = kpti_ng_temp_alloc;
> + return __va(kpti_ng_temp_alloc);
> }
>
> static int __init __kpti_install_ng_mappings(void *__unused)
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index dc86dceb0efe..90bf822859b8 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -41,9 +41,42 @@
> #include <asm/pgalloc.h>
> #include <asm/kfence.h>
>
> +enum pgtable_type {
> + TYPE_P4D = 0,
> + TYPE_PUD = 1,
> + TYPE_PMD = 2,
> + TYPE_PTE = 3,
> +};
> +
> +/**
> + * struct pgtable_ops - Ops to allocate and access pgtable memory. Calls must be
> + * serialized by the caller.
> + * @alloc: Allocates 1 page of memory for use as pgtable `type` and maps it
> + * into va space. Returned memory is zeroed. Puts physical address
> + * of page in *pa, and returns virtual address of the mapping. User
> + * must explicitly unmap() before doing another alloc() or map() of
> + * the same `type`.
> + * @map: Determines the physical address of the pgtable of `type` by
> + * interpretting `parent` as the pgtable entry for the next level
> + * up. Maps the page and returns virtual address of the pgtable
> + * entry within the table that corresponds to `addr`. User must
> + * explicitly unmap() before doing another alloc() or map() of the
> + * same `type`.
> + * @unmap: Unmap the currently mapped page of `type`, which will have been
> + * mapped either as a result of a previous call to alloc() or
> + * map(). The page's virtual address must be considered invalid
> + * after this call returns.
> + */
> +struct pgtable_ops {
> + void *(*alloc)(int type, phys_addr_t *pa);
> + void *(*map)(int type, void *parent, unsigned long addr);
> + void (*unmap)(int type);
> +};

There's a lot of boilerplate that results from having the TYPE_Pxx enumeration
and needing to handle that in the callbacks, and it's somewhat unfortunate that
the callbacks can't use the enum type directly (because the KPTI allocator is
in another file).

I'm not too keen on all of that.

As above, I suspect that most of the benefit comes from minimizing the
map/unmap calls in the early table creation, and I think that we can do that
without needing all this infrastructure if we keep the fixmapping explicit
in the alloc_init_pXX() functions, but factor that out of
early_pgtable_alloc().

Does something like the below look ok to you? The trade-off performance-wise is
that late uses will still use the fixmap, and will redundantly zero the tables,
but the logic remains fairly simple, and I suspect the overhead for late
allocations might not matter since the bulk of late changes are non-allocating.

Mark

---->8-----
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 105a95a8845c5..1eecf87021bd0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)

static inline bool pgtable_l5_enabled(void) { return false; }

+#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
+
/* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
#define p4d_set_fixmap(addr) NULL
#define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dc86dceb0efe6..4b944ef8f618c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -109,28 +109,12 @@ EXPORT_SYMBOL(phys_mem_access_prot);
static phys_addr_t __init early_pgtable_alloc(int shift)
{
phys_addr_t phys;
- void *ptr;

phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
MEMBLOCK_ALLOC_NOLEAKTRACE);
if (!phys)
panic("Failed to allocate page table page\n");

- /*
- * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
- * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
- * any level of table.
- */
- ptr = pte_set_fixmap(phys);
-
- memset(ptr, 0, PAGE_SIZE);
-
- /*
- * Implicit barriers also ensure the zeroed page is visible to the page
- * table walker
- */
- pte_clear_fixmap();
-
return phys;
}

@@ -172,6 +156,14 @@ bool pgattr_change_is_safe(u64 old, u64 new)
return ((old ^ new) & ~mask) == 0;
}

+static void init_clear_pgtable(void *table)
+{
+ clear_page(table);
+
+ /* Ensure the zeroing is observed by page table walks. */
+ dsb(ishst);
+}
+
static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot)
{
@@ -216,12 +208,18 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
pmdval |= PMD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pte_phys = pgtable_alloc(PAGE_SHIFT);
+
+ ptep = pte_set_fixmap(pte_phys);
+ init_clear_pgtable(ptep);
+
__pmd_populate(pmdp, pte_phys, pmdval);
pmd = READ_ONCE(*pmdp);
+ } else {
+ ptep = pte_set_fixmap(pmd_page_paddr(pmd));
}
BUG_ON(pmd_bad(pmd));

- ptep = pte_set_fixmap_offset(pmdp, addr);
+ ptep += pte_index(addr);
do {
pgprot_t __prot = prot;

@@ -303,12 +301,18 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
pudval |= PUD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pmd_phys = pgtable_alloc(PMD_SHIFT);
+
+ pmdp = pmd_set_fixmap(pmd_phys);
+ init_clear_pgtable(pmdp);
+
__pud_populate(pudp, pmd_phys, pudval);
pud = READ_ONCE(*pudp);
+ } else {
+ pmdp = pmd_set_fixmap(pud_page_paddr(pud));
}
BUG_ON(pud_bad(pud));

- pmdp = pmd_set_fixmap_offset(pudp, addr);
+ pmdp += pmd_index(addr);
do {
pgprot_t __prot = prot;

@@ -345,12 +349,18 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
p4dval |= P4D_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pud_phys = pgtable_alloc(PUD_SHIFT);
+
+ pudp = pud_set_fixmap(pud_phys);
+ init_clear_pgtable(pudp);
+
__p4d_populate(p4dp, pud_phys, p4dval);
p4d = READ_ONCE(*p4dp);
+ } else {
+ pudp = pud_set_fixmap(p4d_page_paddr(p4d));
}
BUG_ON(p4d_bad(p4d));

- pudp = pud_set_fixmap_offset(p4dp, addr);
+ pudp += pud_index(addr);
do {
pud_t old_pud = READ_ONCE(*pudp);

@@ -400,12 +410,18 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
pgdval |= PGD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
p4d_phys = pgtable_alloc(P4D_SHIFT);
+
+ p4dp = p4d_set_fixmap(p4d_phys);
+ init_clear_pgtable(p4dp);
+
__pgd_populate(pgdp, p4d_phys, pgdval);
pgd = READ_ONCE(*pgdp);
+ } else {
+ p4dp = p4d_set_fixmap(pgd_page_paddr(pgd));
}
BUG_ON(pgd_bad(pgd));

- p4dp = p4d_set_fixmap_offset(pgdp, addr);
+ p4dp += p4d_index(addr);
do {
p4d_t old_p4d = READ_ONCE(*p4dp);

@@ -475,8 +491,6 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
BUG_ON(!ptr);

- /* Ensure the zeroed page is visible to the page table walker */
- dsb(ishst);
return __pa(ptr);
}


2024-04-11 13:24:51

by Mark Rutland

Subject: Re: [PATCH v2 4/4] arm64: mm: Lazily clear pte table mappings from fixmap

On Thu, Apr 04, 2024 at 03:33:08PM +0100, Ryan Roberts wrote:
> With the pgtable operations abstracted into `struct pgtable_ops`, the
> early pgtable alloc, map and unmap operations are nicely centralized. So
> let's enhance the implementation to speed up the clearing of pte table
> mappings in the fixmap.
>
> Extend FIX_MAP so that we now have 16 slots in the fixmap dedicated for
> pte tables. At alloc/map time, we select the next slot in the series and
> map it. Or if we are at the end and no more slots are available, clear
> down all of the slots and start at the beginning again. Batching the
> clear like this means we can issue tlbis more efficiently.
>
> Due to the batching, there may still be some slots mapped at the end, so
> address this by adding an optional cleanup() function to `struct
> pgtable_ops` to handle this for us.
>
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> before | 11 (0%) | 109 (0%) | 449 (0%) | 1257 (0%)
> after | 6 (-46%) | 61 (-44%) | 257 (-43%) | 838 (-33%)

I'd prefer to leave this as-is for now, as compared to the baseline this is the
last 2-3%, and (assuming my comments on patch 3 hold) that way we don't need
the pgtable_ops indirection, which'll keep the code fairly simple to understand.

So unless Catalin or Will feel otherwise, I'd suggest that we take patches 1
and 2, drop 3 and 4 for now, and maybe try the alternative approach I've
commented on patch 3.

Does that sound ok to you?

Mark.

> Signed-off-by: Ryan Roberts <[email protected]>
> Tested-by: Itaru Kitayama <[email protected]>
> Tested-by: Eric Chanudet <[email protected]>
> ---
> arch/arm64/include/asm/fixmap.h | 5 +++-
> arch/arm64/include/asm/pgtable.h | 4 ---
> arch/arm64/mm/fixmap.c | 11 ++++++++
> arch/arm64/mm/mmu.c | 44 +++++++++++++++++++++++++++++---
> 4 files changed, 56 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
> index 87e307804b99..91fcd7c5c513 100644
> --- a/arch/arm64/include/asm/fixmap.h
> +++ b/arch/arm64/include/asm/fixmap.h
> @@ -84,7 +84,9 @@ enum fixed_addresses {
> * Used for kernel page table creation, so unmapped memory may be used
> * for tables.
> */
> - FIX_PTE,
> +#define NR_PTE_SLOTS 16
> + FIX_PTE_END,
> + FIX_PTE_BEGIN = FIX_PTE_END + NR_PTE_SLOTS - 1,
> FIX_PMD,
> FIX_PUD,
> FIX_P4D,
> @@ -108,6 +110,7 @@ void __init early_fixmap_init(void);
> #define __late_clear_fixmap(idx) __set_fixmap((idx), 0, FIXMAP_PAGE_CLEAR)
>
> extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);
> +void __init clear_fixmap_nosync(enum fixed_addresses idx);
>
> #include <asm-generic/fixmap.h>
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 92c9aed5e7af..4c7114d49697 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -691,10 +691,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> /* Find an entry in the third-level page table. */
> #define pte_offset_phys(dir,addr) (pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))
>
> -#define pte_set_fixmap(addr) ((pte_t *)set_fixmap_offset(FIX_PTE, addr))
> -#define pte_set_fixmap_offset(pmd, addr) pte_set_fixmap(pte_offset_phys(pmd, addr))
> -#define pte_clear_fixmap() clear_fixmap(FIX_PTE)
> -
> #define pmd_page(pmd) phys_to_page(__pmd_to_phys(pmd))
>
> /* use ONLY for statically allocated translation tables */
> diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
> index de1e09d986ad..0cb09bedeeec 100644
> --- a/arch/arm64/mm/fixmap.c
> +++ b/arch/arm64/mm/fixmap.c
> @@ -131,6 +131,17 @@ void __set_fixmap(enum fixed_addresses idx,
> }
> }
>
> +void __init clear_fixmap_nosync(enum fixed_addresses idx)
> +{
> + unsigned long addr = __fix_to_virt(idx);
> + pte_t *ptep;
> +
> + BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
> +
> + ptep = fixmap_pte(addr);
> + __pte_clear(&init_mm, addr, ptep);
> +}
> +
> void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
> {
> const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 90bf822859b8..2e3b594aa23c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -66,11 +66,14 @@ enum pgtable_type {
> * mapped either as a result of a previous call to alloc() or
> * map(). The page's virtual address must be considered invalid
> * after this call returns.
> + * @cleanup: (Optional) Called at the end of a set of operations to cleanup
> + * any lazy state.
> */
> struct pgtable_ops {
> void *(*alloc)(int type, phys_addr_t *pa);
> void *(*map)(int type, void *parent, unsigned long addr);
> void (*unmap)(int type);
> + void (*cleanup)(void);
> };
>
> #define NO_BLOCK_MAPPINGS BIT(0)
> @@ -139,9 +142,33 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> }
> EXPORT_SYMBOL(phys_mem_access_prot);
>
> +static int pte_slot_next __initdata = FIX_PTE_BEGIN;
> +
> +static void __init clear_pte_fixmap_slots(void)
> +{
> + unsigned long start = __fix_to_virt(FIX_PTE_BEGIN);
> + unsigned long end = __fix_to_virt(pte_slot_next);
> + int i;
> +
> + for (i = FIX_PTE_BEGIN; i > pte_slot_next; i--)
> + clear_fixmap_nosync(i);
> +
> + flush_tlb_kernel_range(start, end);
> + pte_slot_next = FIX_PTE_BEGIN;
> +}
> +
> +static int __init pte_fixmap_slot(void)
> +{
> + if (pte_slot_next < FIX_PTE_END)
> + clear_pte_fixmap_slots();
> +
> + return pte_slot_next--;
> +}
> +
> static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
> {
> void *va;
> + int slot;
>
> *pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
> MEMBLOCK_ALLOC_NOLEAKTRACE);
> @@ -159,7 +186,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
> va = pmd_set_fixmap(*pa);
> break;
> case TYPE_PTE:
> - va = pte_set_fixmap(*pa);
> + slot = pte_fixmap_slot();
> + set_fixmap(slot, *pa);
> + va = (pte_t *)__fix_to_virt(slot);
> break;
> default:
> BUG();
> @@ -174,7 +203,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
>
> static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
> {
> + phys_addr_t pa;
> void *entry;
> + int slot;
>
> switch (type) {
> case TYPE_P4D:
> @@ -187,7 +218,10 @@ static void *__init early_pgtable_map(int type, void *parent, unsigned long addr
> entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
> break;
> case TYPE_PTE:
> - entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
> + slot = pte_fixmap_slot();
> + pa = pte_offset_phys((pmd_t *)parent, addr);
> + set_fixmap(slot, pa);
> + entry = (pte_t *)(__fix_to_virt(slot) + (pa & (PAGE_SIZE - 1)));
> break;
> default:
> BUG();
> @@ -209,7 +243,7 @@ static void __init early_pgtable_unmap(int type)
> pmd_clear_fixmap();
> break;
> case TYPE_PTE:
> - pte_clear_fixmap();
> + // Unmap lazily: see clear_pte_fixmap_slots().
> break;
> default:
> BUG();
> @@ -220,6 +254,7 @@ static struct pgtable_ops early_pgtable_ops __initdata = {
> .alloc = early_pgtable_alloc,
> .map = early_pgtable_map,
> .unmap = early_pgtable_unmap,
> + .cleanup = clear_pte_fixmap_slots,
> };
>
> bool pgattr_change_is_safe(u64 old, u64 new)
> @@ -538,6 +573,9 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
> alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
> phys += next - addr;
> } while (pgdp++, addr = next, addr != end);
> +
> + if (ops->cleanup)
> + ops->cleanup();
> }
>
> static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> --
> 2.25.1
>

2024-04-11 13:38:08

by Ryan Roberts

Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On 11/04/2024 14:02, Mark Rutland wrote:
> On Thu, Apr 04, 2024 at 03:33:07PM +0100, Ryan Roberts wrote:
>> During linear map pgtable creation, each pgtable is fixmapped /
>> fixunmapped twice; once during allocation to zero the memory, and
>> again during population to write the entries. This means each table has
>> 2 TLB invalidations issued against it. Let's fix this so that each table
>> is only fixmapped/fixunmapped once, halving the number of TLBIs, and
>> improving performance.
>>
>> Achieve this by abstracting pgtable allocate, map and unmap operations
>> out of the main pgtable population loop code and into a `struct
>> pgtable_ops` function pointer structure. This allows us to formalize the
>> semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
>> finished. So "map" is only performed (and also matched by "unmap") if
>> the pgtable has already been allocated.
>>
>> As a side effect of this refactoring, we no longer need to use the
>> fixmap at all once pages have been mapped in the linear map because
>> their "map" operation can simply do a __va() translation. So with this
>> change, we are down to 1 TLBI per table when doing early pgtable
>> manipulations, and 0 TLBIs when doing late pgtable manipulations.
>>
>> Execution time of map_mem(), which creates the kernel linear map page
>> tables, was measured on different machines with different RAM configs:
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> before | 13 (0%) | 162 (0%) | 655 (0%) | 1656 (0%)
>> after | 11 (-15%) | 109 (-33%) | 449 (-31%) | 1257 (-24%)
>
> Do we know how much of that gain is due to the early pgtable creation doing
> fewer fixmap/fixunmap ops vs the later operations using the linear map?
>
> I suspect that the bulk of that is down to the early pgtable creation, and if
> so I think that we can get most of the benefit with a simpler change (see
> below).

All of this improvement is due to early pgtable creation doing fewer
fixmap/fixunmaps; I'm only measuring the execution time of map_mem(), which only
uses the early ops.

I haven't even looked to see if there are any hot paths where the late ops
benefit. I just saw it as a happy side-effect.

>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Tested-by: Itaru Kitayama <[email protected]>
>> Tested-by: Eric Chanudet <[email protected]>
>> ---
>> arch/arm64/include/asm/mmu.h | 8 +
>> arch/arm64/include/asm/pgtable.h | 2 +
>> arch/arm64/kernel/cpufeature.c | 10 +-
>> arch/arm64/mm/mmu.c | 308 ++++++++++++++++++++++---------
>> 4 files changed, 237 insertions(+), 91 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 65977c7783c5..ae44353010e8 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -109,6 +109,14 @@ static inline bool kaslr_requires_kpti(void)
>> return true;
>> }
>>
>> +#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>> +extern
>> +void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>> + phys_addr_t size, pgprot_t prot,
>> + void *(*pgtable_alloc)(int, phys_addr_t *),
>> + int flags);
>> +#endif
>> +
>> #define INIT_MM_CONTEXT(name) \
>> .pgd = swapper_pg_dir,
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 105a95a8845c..92c9aed5e7af 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
>>
>> static inline bool pgtable_l5_enabled(void) { return false; }
>>
>> +#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
>> +
>> /* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
>> #define p4d_set_fixmap(addr) NULL
>> #define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index 56583677c1f2..9a70b1954706 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -1866,17 +1866,13 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
>> #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>> #define KPTI_NG_TEMP_VA (-(1UL << PMD_SHIFT))
>>
>> -extern
>> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>> - phys_addr_t size, pgprot_t prot,
>> - phys_addr_t (*pgtable_alloc)(int), int flags);
>> -
>> static phys_addr_t __initdata kpti_ng_temp_alloc;
>>
>> -static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
>> +static void *__init kpti_ng_pgd_alloc(int type, phys_addr_t *pa)
>> {
>> kpti_ng_temp_alloc -= PAGE_SIZE;
>> - return kpti_ng_temp_alloc;
>> + *pa = kpti_ng_temp_alloc;
>> + return __va(kpti_ng_temp_alloc);
>> }
>>
>> static int __init __kpti_install_ng_mappings(void *__unused)
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index dc86dceb0efe..90bf822859b8 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -41,9 +41,42 @@
>> #include <asm/pgalloc.h>
>> #include <asm/kfence.h>
>>
>> +enum pgtable_type {
>> + TYPE_P4D = 0,
>> + TYPE_PUD = 1,
>> + TYPE_PMD = 2,
>> + TYPE_PTE = 3,
>> +};
>> +
>> +/**
>> + * struct pgtable_ops - Ops to allocate and access pgtable memory. Calls must be
>> + * serialized by the caller.
>> + * @alloc: Allocates 1 page of memory for use as pgtable `type` and maps it
>> + * into va space. Returned memory is zeroed. Puts physical address
>> + * of page in *pa, and returns virtual address of the mapping. User
>> + * must explicitly unmap() before doing another alloc() or map() of
>> + * the same `type`.
>> + * @map: Determines the physical address of the pgtable of `type` by
>> + * interpreting `parent` as the pgtable entry for the next level
>> + * up. Maps the page and returns virtual address of the pgtable
>> + * entry within the table that corresponds to `addr`. User must
>> + * explicitly unmap() before doing another alloc() or map() of the
>> + * same `type`.
>> + * @unmap: Unmap the currently mapped page of `type`, which will have been
>> + * mapped either as a result of a previous call to alloc() or
>> + * map(). The page's virtual address must be considered invalid
>> + * after this call returns.
>> + */
>> +struct pgtable_ops {
>> + void *(*alloc)(int type, phys_addr_t *pa);
>> + void *(*map)(int type, void *parent, unsigned long addr);
>> + void (*unmap)(int type);
>> +};
>
> There's a lot of boilerplate that results from having the TYPE_Pxx enumeration
> and needing to handle that in the callbacks, and it's somewhat unfortunate that
> the callbacks can't use the enum type directly (because the KPTI allocator is
> in another file).
>
> I'm not too keen on all of that.

Yes, I agree it's quite a big change. And all the switches are naff. But I
couldn't see a way to avoid it and still get all the "benefits".

>
> As above, I suspect that most of the benefit comes from minimizing the
> map/unmap calls in the early table creation, and I think that we can do that
> without needing all this infrastructure if we keep the fixmapping explicit
> in the alloc_init_pXX() functions, but factor that out of
> early_pgtable_alloc().
>
> Does something like the below look ok to you?

Yes, this is actually quite similar to my first attempt, but then I realised I
could get rid of the redundancies too.

> The trade-off performance-wise is
> that late uses will still use the fixmap, and will redundantly zero the tables,

I think we can mitigate the redundant zeroing for most kernel configs; tell the
allocator we don't need it to be zeroed. There are some obscure configs where
pages are zeroed on free instead of on alloc IIRC, so those would still have a
redundant clear, but they are not widely used AIUI. (see below).

> but the logic remains fairly simple, and I suspect the overhead for late
> allocations might not matter since the bulk of late changes are non-allocating.

It's just the fixmap overhead that remains...

I'll benchmark with your below change, and also have a deeper look to check if
there are real places where fixmap might cause slowness for late ops.

>
> Mark
>
> ---->8-----
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 105a95a8845c5..1eecf87021bd0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
>
> static inline bool pgtable_l5_enabled(void) { return false; }
>
> +#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
> +
> /* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
> #define p4d_set_fixmap(addr) NULL
> #define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index dc86dceb0efe6..4b944ef8f618c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -109,28 +109,12 @@ EXPORT_SYMBOL(phys_mem_access_prot);
> static phys_addr_t __init early_pgtable_alloc(int shift)
> {
> phys_addr_t phys;
> - void *ptr;
>
> phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
> MEMBLOCK_ALLOC_NOLEAKTRACE);
> if (!phys)
> panic("Failed to allocate page table page\n");
>
> - /*
> - * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
> - * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
> - * any level of table.
> - */
> - ptr = pte_set_fixmap(phys);
> -
> - memset(ptr, 0, PAGE_SIZE);
> -
> - /*
> - * Implicit barriers also ensure the zeroed page is visible to the page
> - * table walker
> - */
> - pte_clear_fixmap();
> -
> return phys;
> }
>
> @@ -172,6 +156,14 @@ bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> +static void init_clear_pgtable(void *table)
> +{
> + clear_page(table);
> +
> + /* Ensure the zeroing is observed by page table walks. */
> + dsb(ishst);
> +}
> +
> static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot)
> {
> @@ -216,12 +208,18 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> pmdval |= PMD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pte_phys = pgtable_alloc(PAGE_SHIFT);
> +
> + ptep = pte_set_fixmap(pte_phys);
> + init_clear_pgtable(ptep);
> +
> __pmd_populate(pmdp, pte_phys, pmdval);
> pmd = READ_ONCE(*pmdp);
> + } else {
> + ptep = pte_set_fixmap(pmd_page_paddr(pmd));
> }
> BUG_ON(pmd_bad(pmd));
>
> - ptep = pte_set_fixmap_offset(pmdp, addr);
> + ptep += pte_index(addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -303,12 +301,18 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> pudval |= PUD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pmd_phys = pgtable_alloc(PMD_SHIFT);
> +
> + pmdp = pmd_set_fixmap(pmd_phys);
> + init_clear_pgtable(pmdp);
> +
> __pud_populate(pudp, pmd_phys, pudval);
> pud = READ_ONCE(*pudp);
> + } else {
> + pmdp = pmd_set_fixmap(pud_page_paddr(pud));
> }
> BUG_ON(pud_bad(pud));
>
> - pmdp = pmd_set_fixmap_offset(pudp, addr);
> + pmdp += pmd_index(addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -345,12 +349,18 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> p4dval |= P4D_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pud_phys = pgtable_alloc(PUD_SHIFT);
> +
> + pudp = pud_set_fixmap(pud_phys);
> + init_clear_pgtable(pudp);
> +
> __p4d_populate(p4dp, pud_phys, p4dval);
> p4d = READ_ONCE(*p4dp);
> + } else {
> + pudp = pud_set_fixmap(p4d_page_paddr(p4d));
> }
> BUG_ON(p4d_bad(p4d));
>
> - pudp = pud_set_fixmap_offset(p4dp, addr);
> + pudp += pud_index(addr);
> do {
> pud_t old_pud = READ_ONCE(*pudp);
>
> @@ -400,12 +410,18 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
> pgdval |= PGD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> p4d_phys = pgtable_alloc(P4D_SHIFT);
> +
> + p4dp = p4d_set_fixmap(p4d_phys);
> + init_clear_pgtable(p4dp);
> +
> __pgd_populate(pgdp, p4d_phys, pgdval);
> pgd = READ_ONCE(*pgdp);
> + } else {
> + p4dp = p4d_set_fixmap(pgd_page_paddr(pgd));
> }
> BUG_ON(pgd_bad(pgd));
>
> - p4dp = p4d_set_fixmap_offset(pgdp, addr);
> + p4dp += p4d_index(addr);
> do {
> p4d_t old_p4d = READ_ONCE(*p4dp);
>
> @@ -475,8 +491,6 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
> void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);

How about:

void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);


> BUG_ON(!ptr);
>
> - /* Ensure the zeroed page is visible to the page table walker */
> - dsb(ishst);
> return __pa(ptr);
> }
>


2024-04-11 13:45:43

by Ryan Roberts

Subject: Re: [PATCH v2 4/4] arm64: mm: Lazily clear pte table mappings from fixmap

On 11/04/2024 14:24, Mark Rutland wrote:
> On Thu, Apr 04, 2024 at 03:33:08PM +0100, Ryan Roberts wrote:
>> With the pgtable operations abstracted into `struct pgtable_ops`, the
>> early pgtable alloc, map and unmap operations are nicely centralized. So
>> let's enhance the implementation to speed up the clearing of pte table
>> mappings in the fixmap.
>>
>> Extend FIX_MAP so that we now have 16 slots in the fixmap dedicated for
>> pte tables. At alloc/map time, we select the next slot in the series and
>> map it. Or if we are at the end and no more slots are available, clear
>> down all of the slots and start at the beginning again. Batching the
>> clear like this means we can issue tlbis more efficiently.
>>
>> Due to the batching, there may still be some slots mapped at the end, so
>> address this by adding an optional cleanup() function to `struct
>> pgtable_ops` to handle this for us.
>>
>> Execution time of map_mem(), which creates the kernel linear map page
>> tables, was measured on different machines with different RAM configs:
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> before | 11 (0%) | 109 (0%) | 449 (0%) | 1257 (0%)
>> after | 6 (-46%) | 61 (-44%) | 257 (-43%) | 838 (-33%)
>
> I'd prefer to leave this as-is for now, as compared to the baseline this is the
> last 2-3%, and (assuming my comments on patch 3 hold) that way we don't need
> the pgtable_ops indirection, which'll keep the code fairly simple to understand.
>
> So unless Catalin or Will feel otherwise, I'd suggest that we take patches 1
> and 2, drop 3 and 4 for now, and maybe try the alternative approach I've
> commented on patch 3.
>
> Does that sound ok to you?

In principle, yes. Let me do some benchmarking with your proposed set of patches
to confirm :)

>
> Mark.
>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Tested-by: Itaru Kitayama <[email protected]>
>> Tested-by: Eric Chanudet <[email protected]>
>> ---
>> arch/arm64/include/asm/fixmap.h | 5 +++-
>> arch/arm64/include/asm/pgtable.h | 4 ---
>> arch/arm64/mm/fixmap.c | 11 ++++++++
>> arch/arm64/mm/mmu.c | 44 +++++++++++++++++++++++++++++---
>> 4 files changed, 56 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
>> index 87e307804b99..91fcd7c5c513 100644
>> --- a/arch/arm64/include/asm/fixmap.h
>> +++ b/arch/arm64/include/asm/fixmap.h
>> @@ -84,7 +84,9 @@ enum fixed_addresses {
>> * Used for kernel page table creation, so unmapped memory may be used
>> * for tables.
>> */
>> - FIX_PTE,
>> +#define NR_PTE_SLOTS 16
>> + FIX_PTE_END,
>> + FIX_PTE_BEGIN = FIX_PTE_END + NR_PTE_SLOTS - 1,
>> FIX_PMD,
>> FIX_PUD,
>> FIX_P4D,
>> @@ -108,6 +110,7 @@ void __init early_fixmap_init(void);
>> #define __late_clear_fixmap(idx) __set_fixmap((idx), 0, FIXMAP_PAGE_CLEAR)
>>
>> extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);
>> +void __init clear_fixmap_nosync(enum fixed_addresses idx);
>>
>> #include <asm-generic/fixmap.h>
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 92c9aed5e7af..4c7114d49697 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -691,10 +691,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>> /* Find an entry in the third-level page table. */
>> #define pte_offset_phys(dir,addr) (pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))
>>
>> -#define pte_set_fixmap(addr) ((pte_t *)set_fixmap_offset(FIX_PTE, addr))
>> -#define pte_set_fixmap_offset(pmd, addr) pte_set_fixmap(pte_offset_phys(pmd, addr))
>> -#define pte_clear_fixmap() clear_fixmap(FIX_PTE)
>> -
>> #define pmd_page(pmd) phys_to_page(__pmd_to_phys(pmd))
>>
>> /* use ONLY for statically allocated translation tables */
>> diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
>> index de1e09d986ad..0cb09bedeeec 100644
>> --- a/arch/arm64/mm/fixmap.c
>> +++ b/arch/arm64/mm/fixmap.c
>> @@ -131,6 +131,17 @@ void __set_fixmap(enum fixed_addresses idx,
>> }
>> }
>>
>> +void __init clear_fixmap_nosync(enum fixed_addresses idx)
>> +{
>> + unsigned long addr = __fix_to_virt(idx);
>> + pte_t *ptep;
>> +
>> + BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
>> +
>> + ptep = fixmap_pte(addr);
>> + __pte_clear(&init_mm, addr, ptep);
>> +}
>> +
>> void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
>> {
>> const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 90bf822859b8..2e3b594aa23c 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -66,11 +66,14 @@ enum pgtable_type {
>> * mapped either as a result of a previous call to alloc() or
>> * map(). The page's virtual address must be considered invalid
>> * after this call returns.
>> + * @cleanup: (Optional) Called at the end of a set of operations to cleanup
>> + * any lazy state.
>> */
>> struct pgtable_ops {
>> void *(*alloc)(int type, phys_addr_t *pa);
>> void *(*map)(int type, void *parent, unsigned long addr);
>> void (*unmap)(int type);
>> + void (*cleanup)(void);
>> };
>>
>> #define NO_BLOCK_MAPPINGS BIT(0)
>> @@ -139,9 +142,33 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
>> }
>> EXPORT_SYMBOL(phys_mem_access_prot);
>>
>> +static int pte_slot_next __initdata = FIX_PTE_BEGIN;
>> +
>> +static void __init clear_pte_fixmap_slots(void)
>> +{
>> + unsigned long start = __fix_to_virt(FIX_PTE_BEGIN);
>> + unsigned long end = __fix_to_virt(pte_slot_next);
>> + int i;
>> +
>> + for (i = FIX_PTE_BEGIN; i > pte_slot_next; i--)
>> + clear_fixmap_nosync(i);
>> +
>> + flush_tlb_kernel_range(start, end);
>> + pte_slot_next = FIX_PTE_BEGIN;
>> +}
>> +
>> +static int __init pte_fixmap_slot(void)
>> +{
>> + if (pte_slot_next < FIX_PTE_END)
>> + clear_pte_fixmap_slots();
>> +
>> + return pte_slot_next--;
>> +}
>> +
>> static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
>> {
>> void *va;
>> + int slot;
>>
>> *pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
>> MEMBLOCK_ALLOC_NOLEAKTRACE);
>> @@ -159,7 +186,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
>> va = pmd_set_fixmap(*pa);
>> break;
>> case TYPE_PTE:
>> - va = pte_set_fixmap(*pa);
>> + slot = pte_fixmap_slot();
>> + set_fixmap(slot, *pa);
>> + va = (pte_t *)__fix_to_virt(slot);
>> break;
>> default:
>> BUG();
>> @@ -174,7 +203,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
>>
>> static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
>> {
>> + phys_addr_t pa;
>> void *entry;
>> + int slot;
>>
>> switch (type) {
>> case TYPE_P4D:
>> @@ -187,7 +218,10 @@ static void *__init early_pgtable_map(int type, void *parent, unsigned long addr
>> entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
>> break;
>> case TYPE_PTE:
>> - entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
>> + slot = pte_fixmap_slot();
>> + pa = pte_offset_phys((pmd_t *)parent, addr);
>> + set_fixmap(slot, pa);
>> + entry = (pte_t *)(__fix_to_virt(slot) + (pa & (PAGE_SIZE - 1)));
>> break;
>> default:
>> BUG();
>> @@ -209,7 +243,7 @@ static void __init early_pgtable_unmap(int type)
>> pmd_clear_fixmap();
>> break;
>> case TYPE_PTE:
>> - pte_clear_fixmap();
>> + // Unmap lazily: see clear_pte_fixmap_slots().
>> break;
>> default:
>> BUG();
>> @@ -220,6 +254,7 @@ static struct pgtable_ops early_pgtable_ops __initdata = {
>> .alloc = early_pgtable_alloc,
>> .map = early_pgtable_map,
>> .unmap = early_pgtable_unmap,
>> + .cleanup = clear_pte_fixmap_slots,
>> };
>>
>> bool pgattr_change_is_safe(u64 old, u64 new)
>> @@ -538,6 +573,9 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>> alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
>> phys += next - addr;
>> } while (pgdp++, addr = next, addr != end);
>> +
>> + if (ops->cleanup)
>> + ops->cleanup();
>> }
>>
>> static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>> --
>> 2.25.1
>>


2024-04-11 15:12:37

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On Thu, Apr 11, 2024 at 02:37:49PM +0100, Ryan Roberts wrote:
> On 11/04/2024 14:02, Mark Rutland wrote:
> > On Thu, Apr 04, 2024 at 03:33:07PM +0100, Ryan Roberts wrote:
> >> During linear map pgtable creation, each pgtable is fixmapped /
> >> fixunmapped twice; once during allocation to zero the memory, and
> >> again during population to write the entries. This means each table has
> >> 2 TLB invalidations issued against it. Let's fix this so that each table
> >> is only fixmapped/fixunmapped once, halving the number of TLBIs, and
> >> improving performance.
> >>
> >> Achieve this by abstracting pgtable allocate, map and unmap operations
> >> out of the main pgtable population loop code and into a `struct
> >> pgtable_ops` function pointer structure. This allows us to formalize the
> >> semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
> >> finished. So "map" is only performed (and also matched by "unmap") if
> >> the pgtable has already been allocated.
> >>
> >> As a side effect of this refactoring, we no longer need to use the
> >> fixmap at all once pages have been mapped in the linear map because
> >> their "map" operation can simply do a __va() translation. So with this
> >> change, we are down to 1 TLBI per table when doing early pgtable
> >> manipulations, and 0 TLBIs when doing late pgtable manipulations.
> >>
> >> Execution time of map_mem(), which creates the kernel linear map page
> >> tables, was measured on different machines with different RAM configs:
> >>
> >> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> >> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> >> ---------------|-------------|-------------|-------------|-------------
> >> | ms (%) | ms (%) | ms (%) | ms (%)
> >> ---------------|-------------|-------------|-------------|-------------
> >> before | 13 (0%) | 162 (0%) | 655 (0%) | 1656 (0%)
> >> after | 11 (-15%) | 109 (-33%) | 449 (-31%) | 1257 (-24%)
> >
> > Do we know how much of that gain is due to the early pgtable creation doing
> > fewer fixmap/fixunmap ops vs the later operations using the linear map?
> >
> > I suspect that the bulk of that is down to the early pgtable creation, and if
> > so I think that we can get most of the benefit with a simpler change (see
> > below).
>
> All of this improvement is due to early pgtable creation doing fewer
> fixmap/fixunmaps; I'm only measuring the execution time of map_mem(), which only
> uses the early ops.
>
> I haven't even looked to see if there are any hot paths where the late ops
> benefit. I just saw it as a happy side-effect.

Ah, of course. I skimmed this and forgot this was just timing map_mem().

[...]

> > There's a lot of boilerplate that results from having the TYPE_Pxx enumeration
> > and needing to handle that in the callbacks, and it's somewhat unfortunate that
> > the callbacks can't use the enum type directly (because the KPTI allocator is
> > in another file).
> >
> > I'm not too keen on all of that.
>
> Yes, I agree it's quite a big change. And all the switches are naff. But I
> couldn't see a way to avoid it and still get all the "benefits".
>
> > As above, I suspect that most of the benefit comes from minimizing the
> > map/unmap calls in the early table creation, and I think that we can do that
> > without needing all this infrastructure if we keep the fixmapping explicit
> > in the alloc_init_pXX() functions, but factor that out of
> > early_pgtable_alloc().
> >
> > Does something like the below look ok to you?
>
> Yes, this is actually quite similar to my first attempt, but then I realised I
> could get rid of the redundancies too.
>
> > The trade-off performance-wise is
> > that late uses will still use the fixmap, and will redundantly zero the tables,
>
> I think we can mitigate the redundant zeroing for most kernel configs; tell the
> allocator we don't need it to be zeroed. There are some obscure configs where
> pages are zeroed on free instead of on alloc IIRC, so those would still have a
> redundant clear but they are not widely used AIUI. (see below).

That sounds fine to me; minor comment below.

> > but the logic remains fairly simple, and I suspect the overhead for late
> > allocations might not matter since the bulk of late changes are non-allocating.
>
> It's just the fixmap overhead that remains...

True; my thinking there is that almost all of the later changes are for smaller
ranges than the linear map (~10s of MB vs GBs in your test data), so I'd expect
the overhead of those to be dominated by the cost of mapping the linear map.

The only big exception is arch_add_memory(), but memory hotplug is incredibly
rare, and we're not making it massively slower than it already was...

> I'll benchmark with your below change, and also have a deeper look to check if
> there are real places where fixmap might cause slowness for late ops.

Thanks!

[...]

> > @@ -475,8 +491,6 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
> > void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
>
> How about:
>
> void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

Looks good to me, assuming we add a comment to say it'll be zeroed in
init_clear_pgtable().

Mark.

2024-04-11 15:28:23

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On 11/04/2024 15:48, Mark Rutland wrote:
> On Thu, Apr 11, 2024 at 02:37:49PM +0100, Ryan Roberts wrote:
>> On 11/04/2024 14:02, Mark Rutland wrote:
>>> On Thu, Apr 04, 2024 at 03:33:07PM +0100, Ryan Roberts wrote:
>>>> During linear map pgtable creation, each pgtable is fixmapped /
>>>> fixunmapped twice; once during allocation to zero the memory, and
>>>> again during population to write the entries. This means each table has
>>>> 2 TLB invalidations issued against it. Let's fix this so that each table
>>>> is only fixmapped/fixunmapped once, halving the number of TLBIs, and
>>>> improving performance.
>>>>
>>>> Achieve this by abstracting pgtable allocate, map and unmap operations
>>>> out of the main pgtable population loop code and into a `struct
>>>> pgtable_ops` function pointer structure. This allows us to formalize the
>>>> semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
>>>> finished. So "map" is only performed (and also matched by "unmap") if
>>>> the pgtable has already been allocated.
>>>>
>>>> As a side effect of this refactoring, we no longer need to use the
>>>> fixmap at all once pages have been mapped in the linear map because
>>>> their "map" operation can simply do a __va() translation. So with this
>>>> change, we are down to 1 TLBI per table when doing early pgtable
>>>> manipulations, and 0 TLBIs when doing late pgtable manipulations.
>>>>
>>>> Execution time of map_mem(), which creates the kernel linear map page
>>>> tables, was measured on different machines with different RAM configs:
>>>>
>>>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>>>> ---------------|-------------|-------------|-------------|-------------
>>>> | ms (%) | ms (%) | ms (%) | ms (%)
>>>> ---------------|-------------|-------------|-------------|-------------
>>>> before | 13 (0%) | 162 (0%) | 655 (0%) | 1656 (0%)
>>>> after | 11 (-15%) | 109 (-33%) | 449 (-31%) | 1257 (-24%)
>>>
>>> Do we know how much of that gain is due to the early pgtable creation doing
>>> fewer fixmap/fixunmap ops vs the later operations using the linear map?
>>>
>>> I suspect that the bulk of that is down to the early pgtable creation, and if
>>> so I think that we can get most of the benefit with a simpler change (see
>>> below).
>>
>> All of this improvement is due to early pgtable creation doing fewer
>> fixmap/fixunmaps; I'm only measuring the execution time of map_mem(), which only
>> uses the early ops.
>>
>> I haven't even looked to see if there are any hot paths where the late ops
>> benefit. I just saw it as a happy side-effect.
>
> Ah, of course. I skimmed this and forgot this was just timing map_mem().
>
> [...]
>
>>> There's a lot of boilerplate that results from having the TYPE_Pxx enumeration
>>> and needing to handle that in the callbacks, and it's somewhat unfortunate that
>>> the callbacks can't use the enum type directly (because the KPTI allocator is
>>> in another file).
>>>
>>> I'm not too keen on all of that.
>>
>> Yes, I agree it's quite a big change. And all the switches are naff. But I
>> couldn't see a way to avoid it and still get all the "benefits".
>>
>>> As above, I suspect that most of the benefit comes from minimizing the
>>> map/unmap calls in the early table creation, and I think that we can do that
>>> without needing all this infrastructure if we keep the fixmapping explicit
>>> in the alloc_init_pXX() functions, but factor that out of
>>> early_pgtable_alloc().
>>>
>>> Does something like the below look ok to you?
>>
>> Yes, this is actually quite similar to my first attempt, but then I realised I
>> could get rid of the redundancies too.
>>
>>> The trade-off performance-wise is
>>> that late uses will still use the fixmap, and will redundantly zero the tables,
>>
>> I think we can mitigate the redundant zeroing for most kernel configs; tell the
>> allocator we don't need it to be zeroed. There are some obscure configs where
>> pages are zeroed on free instead of on alloc IIRC, so those would still have a
>> redundant clear but they are not widely used AIUI. (see below).
>
> That sounds fine to me; minor comment below.
>
>>> but the logic remains fairly simple, and I suspect the overhead for late
>>> allocations might not matter since the bulk of late changes are non-allocating.
>>
>> It's just the fixmap overhead that remains...
>
> True; my thinking there is that almost all of the later changes are for smaller
> ranges than the linear map (~10s of MB vs GBs in your test data), so I'd expect
> the overhead of those to be dominated by the cost of mapping the linear map.
>
> The only big exception is arch_add_memory(), but memory hotplug is incredibly
> rare, and we're not making it massively slower than it already was...

What about something like coco guest mem (or whatever it's called)? Isn't that
scrubbed out of the linear map? So if a coco VM is started with GBs of memory,
could that be a real case we want to optimize?

>
>> I'll benchmark with your below change, and also have a deeper look to check if
>> there are real places where fixmap might cause slowness for late ops.
>
> Thanks!
>
> [...]
>
>>> @@ -475,8 +491,6 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
>>> void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
>>
>> How about:
>>
>> void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
>
> Looks good to me, assuming we add a comment to say it'll be zeroed in
> init_clear_pgtable().

Sure.

>
> Mark.


2024-04-11 16:16:40

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On Thu, Apr 11, 2024 at 03:57:04PM +0100, Ryan Roberts wrote:
> On 11/04/2024 15:48, Mark Rutland wrote:
> > On Thu, Apr 11, 2024 at 02:37:49PM +0100, Ryan Roberts wrote:
> >> On 11/04/2024 14:02, Mark Rutland wrote:
> >>> but the logic remains fairly simple, and I suspect the overhead for late
> >>> allocations might not matter since the bulk of late changes are non-allocating.
> >>
> >> It's just the fixmap overhead that remains...
> >
> > True; my thinking there is that almost all of the later changes are for smaller
> > ranges than the linear map (~10s of MB vs GBs in your test data), so I'd expect
> > the overhead of those to be dominated by the cost of mapping the linear map.
> >
> > The only big exception is arch_add_memory(), but memory hotplug is incredibly
> > rare, and we're not making it massively slower than it already was...
>
> What about something like coco guest mem (or whatever it's called)? Isn't that
> scrubbed out of the linear map? So if a coco VM is started with GBs of memory,
> could that be a real case we want to optimize?

I think that's already handled -- the functions we have to carve portions out
of the linear map use apply_to_page_range(), which doesn't use the fixmap. See
set_memory_*() and set_direct_map_*() in arch/arm64/mm/pageattr.c.

Note that apply_to_page_range() does what its name implies and *only* handles
mappings at page granularity. Hence not using that for
mark_linear_text_alias_ro() and mark_rodata_ro() which need to be able to
handle blocks.

Mark.

2024-04-11 16:30:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On 11/04/2024 16:25, Mark Rutland wrote:
> On Thu, Apr 11, 2024 at 03:57:04PM +0100, Ryan Roberts wrote:
>> On 11/04/2024 15:48, Mark Rutland wrote:
>>> On Thu, Apr 11, 2024 at 02:37:49PM +0100, Ryan Roberts wrote:
>>>> On 11/04/2024 14:02, Mark Rutland wrote:
>>>>> but the logic remains fairly simple, and I suspect the overhead for late
>>>>> allocations might not matter since the bulk of late changes are non-allocating.
>>>>
>>>> It's just the fixmap overhead that remains...
>>>
>>> True; my thinking there is that almost all of the later changes are for smaller
>>> ranges than the linear map (~10s of MB vs GBs in your test data), so I'd expect
>>> the overhead of those to be dominated by the cost of mapping the linear map.
>>>
>>> The only big exception is arch_add_memory(), but memory hotplug is incredibly
>>> rare, and we're not making it massively slower than it already was...
>>
>> What about something like coco guest mem (or whatever it's called)? Isn't that
>> scrubbed out of the linear map? So if a coco VM is started with GBs of memory,
>> could that be a real case we want to optimize?
>
> I think that's already handled -- the functions we have to carve portions out
> of the linear map use apply_to_page_range(), which doesn't use the fixmap. See
> set_memory_*() and set_direct_map_*() in arch/arm64/mm/pageattr.c.

Ahh, gotcha. Yet another table walker :)

>
> Note that apply_to_page_range() does what its name implies and *only* handles
> mappings at page granularity. Hence not using that for
> mark_linear_text_alias_ro() and mark_rodata_ro() which need to be able to
> handle blocks.
>
> Mark.


2024-04-12 07:53:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

Hi Mark,

[...]

> Does something like the below look ok to you? The trade-off performance-wise is
> that late uses will still use the fixmap, and will redundantly zero the tables,
> but the logic remains fairly simple, and I suspect the overhead for late
> allocations might not matter since the bulk of late changes are non-allocating.
>
> Mark
>
> ---->8-----
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 105a95a8845c5..1eecf87021bd0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
>
> static inline bool pgtable_l5_enabled(void) { return false; }
>
> +#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
> +
> /* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
> #define p4d_set_fixmap(addr) NULL
> #define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index dc86dceb0efe6..4b944ef8f618c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -109,28 +109,12 @@ EXPORT_SYMBOL(phys_mem_access_prot);
> static phys_addr_t __init early_pgtable_alloc(int shift)
> {
> phys_addr_t phys;
> - void *ptr;
>
> phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
> MEMBLOCK_ALLOC_NOLEAKTRACE);
> if (!phys)
> panic("Failed to allocate page table page\n");
>
> - /*
> - * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
> - * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
> - * any level of table.
> - */
> - ptr = pte_set_fixmap(phys);
> -
> - memset(ptr, 0, PAGE_SIZE);
> -
> - /*
> - * Implicit barriers also ensure the zeroed page is visible to the page
> - * table walker
> - */
> - pte_clear_fixmap();
> -
> return phys;
> }
>
> @@ -172,6 +156,14 @@ bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> +static void init_clear_pgtable(void *table)
> +{
> + clear_page(table);
> +
> + /* Ensure the zeroing is observed by page table walks. */
> + dsb(ishst);
> +}
> +
> static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot)
> {
> @@ -216,12 +208,18 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> pmdval |= PMD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pte_phys = pgtable_alloc(PAGE_SHIFT);
> +
> + ptep = pte_set_fixmap(pte_phys);
> + init_clear_pgtable(ptep);
> +
> __pmd_populate(pmdp, pte_phys, pmdval);
> pmd = READ_ONCE(*pmdp);
> + } else {
> + ptep = pte_set_fixmap(pmd_page_paddr(pmd));
> }
> BUG_ON(pmd_bad(pmd));
>
> - ptep = pte_set_fixmap_offset(pmdp, addr);
> + ptep += pte_index(addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -303,12 +301,18 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> pudval |= PUD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pmd_phys = pgtable_alloc(PMD_SHIFT);
> +
> + pmdp = pmd_set_fixmap(pmd_phys);
> + init_clear_pgtable(pmdp);
> +
> __pud_populate(pudp, pmd_phys, pudval);
> pud = READ_ONCE(*pudp);
> + } else {
> + pmdp = pmd_set_fixmap(pud_page_paddr(pud));
> }
> BUG_ON(pud_bad(pud));
>
> - pmdp = pmd_set_fixmap_offset(pudp, addr);
> + pmdp += pmd_index(addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -345,12 +349,18 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> p4dval |= P4D_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pud_phys = pgtable_alloc(PUD_SHIFT);
> +
> + pudp = pud_set_fixmap(pud_phys);
> + init_clear_pgtable(pudp);
> +
> __p4d_populate(p4dp, pud_phys, p4dval);
> p4d = READ_ONCE(*p4dp);
> + } else {
> + pudp = pud_set_fixmap(p4d_page_paddr(p4d));

With this change I end up in pgtable folding hell. pXX_set_fixmap() is defined
as NULL when the level is folded (and pXX_page_paddr() is not defined at all).
So it all compiles, but doesn't boot.

I think the simplest approach is to follow this pattern:

----8<----
@@ -340,12 +338,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
addr, unsigned long end,
p4dval |= P4D_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pud_phys = pgtable_alloc(PUD_SHIFT);
+ pudp = pud_set_fixmap(pud_phys);
+ init_clear_pgtable(pudp);
+ pudp += pud_index(addr);
__p4d_populate(p4dp, pud_phys, p4dval);
- p4d = READ_ONCE(*p4dp);
+ } else {
+ BUG_ON(p4d_bad(p4d));
+ pudp = pud_set_fixmap_offset(p4dp, addr);
}
- BUG_ON(p4d_bad(p4d));

- pudp = pud_set_fixmap_offset(p4dp, addr);
do {
pud_t old_pud = READ_ONCE(*pudp);
----8<----

For the map case, we continue to use pud_set_fixmap_offset() which is always
defined (and always works correctly).

Note also that the previously unconditional BUG_ON needs to be prior to the
fixmap call to be useful, and it's really only valuable in the map case because
for the alloc case we are the ones setting the p4d so we already know it's not
bad. This means we don't need the READ_ONCE() in the alloc case.

Shout if you disagree.

Thanks,
Ryan

> }
> BUG_ON(p4d_bad(p4d));
>
> - pudp = pud_set_fixmap_offset(p4dp, addr);
> + pudp += pud_index(addr);
> do {
> pud_t old_pud = READ_ONCE(*pudp);
>
> @@ -400,12 +410,18 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
> pgdval |= PGD_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> p4d_phys = pgtable_alloc(P4D_SHIFT);
> +
> + p4dp = p4d_set_fixmap(p4d_phys);
> + init_clear_pgtable(p4dp);
> +
> __pgd_populate(pgdp, p4d_phys, pgdval);
> pgd = READ_ONCE(*pgdp);
> + } else {
> + p4dp = p4d_set_fixmap(pgd_page_paddr(pgd));
> }
> BUG_ON(pgd_bad(pgd));
>
> - p4dp = p4d_set_fixmap_offset(pgdp, addr);
> + p4dp += p4d_index(addr);
> do {
> p4d_t old_p4d = READ_ONCE(*p4dp);
>
> @@ -475,8 +491,6 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
> void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
> BUG_ON(!ptr);
>
> - /* Ensure the zeroed page is visible to the page table walker */
> - dsb(ishst);
> return __pa(ptr);
> }
>


2024-04-12 09:31:53

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] arm64: mm: Don't remap pgtables for allocate vs populate

On Fri, Apr 12, 2024 at 08:53:18AM +0100, Ryan Roberts wrote:
> Hi Mark,
>
> [...]
>
> > Does something like the below look ok to you? The trade-off performance-wise is
> > that late uses will still use the fixmap, and will redundantly zero the tables,
> > but the logic remains fairly simple, and I suspect the overhead for late
> > allocations might not matter since the bulk of late changes are non-allocating.

> > @@ -303,12 +301,18 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> > pudval |= PUD_TABLE_PXN;
> > BUG_ON(!pgtable_alloc);
> > pmd_phys = pgtable_alloc(PMD_SHIFT);
> > +
> > + pmdp = pmd_set_fixmap(pmd_phys);
> > + init_clear_pgtable(pmdp);
> > +
> > __pud_populate(pudp, pmd_phys, pudval);
> > pud = READ_ONCE(*pudp);
> > + } else {
> > + pmdp = pmd_set_fixmap(pud_page_paddr(pud));
> > }
> > BUG_ON(pud_bad(pud));
> >
> > - pmdp = pmd_set_fixmap_offset(pudp, addr);
> > + pmdp += pmd_index(addr);
> > do {
> > pgprot_t __prot = prot;
> >
> > @@ -345,12 +349,18 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> > p4dval |= P4D_TABLE_PXN;
> > BUG_ON(!pgtable_alloc);
> > pud_phys = pgtable_alloc(PUD_SHIFT);
> > +
> > + pudp = pud_set_fixmap(pud_phys);
> > + init_clear_pgtable(pudp);
> > +
> > __p4d_populate(p4dp, pud_phys, p4dval);
> > p4d = READ_ONCE(*p4dp);
> > + } else {
> > + pudp = pud_set_fixmap(p4d_page_paddr(p4d));
>
> With this change I end up in pgtable folding hell. pXX_set_fixmap() is defined
> as NULL when the level is folded (and pXX_page_paddr() is not defined at all).
> So it all compiles, but doesn't boot.

Sorry about that; I had not thought to check the folding logic when hacking
that up.

> I think the simplest approach is to follow this pattern:
>
> ----8<----
> @@ -340,12 +338,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
> addr, unsigned long end,
> p4dval |= P4D_TABLE_PXN;
> BUG_ON(!pgtable_alloc);
> pud_phys = pgtable_alloc(PUD_SHIFT);
> + pudp = pud_set_fixmap(pud_phys);
> + init_clear_pgtable(pudp);
> + pudp += pud_index(addr);
> __p4d_populate(p4dp, pud_phys, p4dval);
> - p4d = READ_ONCE(*p4dp);
> + } else {
> + BUG_ON(p4d_bad(p4d));
> + pudp = pud_set_fixmap_offset(p4dp, addr);
> }
> - BUG_ON(p4d_bad(p4d));
>
> - pudp = pud_set_fixmap_offset(p4dp, addr);
> do {
> pud_t old_pud = READ_ONCE(*pudp);
> ----8<----
>
> For the map case, we continue to use pud_set_fixmap_offset() which is always
> defined (and always works correctly).
>
> Note also that the previously unconditional BUG_ON needs to be prior to the
> fixmap call to be useful, and it's really only valuable in the map case because
> for the alloc case we are the ones setting the p4d so we already know it's not
> bad. This means we don't need the READ_ONCE() in the alloc case.
>
> Shout if you disagree.

That looks good, and I agree with the reasoning here.

Thanks for working on this!

Mark.