2024-04-12 13:19:59

by Ryan Roberts

Subject: [PATCH v3 0/3] Speed up boot with faster linear map creation

Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full. And most of the time is spent
waiting on superfluous TLB invalidation and memory barriers. This series reworks
the kernel pgtable generation code to significantly reduce the number of those
TLBIs, ISBs and DSBs. See each patch for details.

The below shows the execution time of map_mem() across a couple of different
systems with different RAM configurations. We measure after applying each patch
and show the improvement relative to base (v6.9-rc2):

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
---------------|-------------|-------------|-------------|-------------
base           |   168   (0%)|  2198   (0%)|  8644   (0%)| 17447   (0%)
no-cont-remap  |    78 (-53%)|   435 (-80%)|  1723 (-80%)|  3779 (-78%)
batch-barriers |    11 (-93%)|   161 (-93%)|   656 (-92%)|  1654 (-91%)
no-alloc-remap |    10 (-94%)|   104 (-95%)|   438 (-95%)|  1223 (-93%)

This series applies on top of v6.9-rc2. All mm selftests pass. I've compile-
and boot-tested various PAGE_SIZE and VA size configs.

---

Changes since v2 [2]
====================

- Independently increment ptep/pmdp in alloc_init_cont_[pte|pmd]() rather than
return the incremented value from init_[pte|pmd]() (per Mark)
- Removed explicit barriers from alloc_init_cont_pte() and instead rely on the
barriers in pte_clear_fixmap() (per Mark)
- Significantly simplify approach to avoiding fixmap during alloc (patch 3
reworked) (per Mark)
- Dropped patch 4 - not possible with the simplified patch 3, and the
improvement (~2%) didn't warrant the complexity (per Mark)


Changes since v1 [1]
====================

- Added Tested-by tags (thanks to Eric and Itaru)
- Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
- Reordered patches (biggest impact & least controversial first)
- Reordered alloc/map/unmap functions in mmu.c to aid reader
- pte_clear() -> __pte_clear() in clear_fixmap_nosync()
- Reverted the generic p4d_index(), which caused an x86 build error. Replaced
with an unconditional p4d_index() define under arm64.


[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/

Thanks,
Ryan

Ryan Roberts (3):
arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
arm64: mm: Batch dsb and isb when populating pgtables
arm64: mm: Don't remap pgtables for allocate vs populate

arch/arm64/include/asm/pgtable.h | 9 ++-
arch/arm64/mm/mmu.c | 101 +++++++++++++++++--------------
2 files changed, 65 insertions(+), 45 deletions(-)

--
2.25.1



2024-04-12 13:20:24

by Ryan Roberts

Subject: [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block

A large part of the kernel boot time is creating the kernel linear map
page tables. When rodata=full, all memory is mapped by pte. And when
there is lots of physical RAM, there are lots of pte tables to populate.
The primary cost associated with this is mapping and unmapping the pte
table memory in the fixmap; at unmap time, the TLB entry must be
invalidated and this is expensive.

Previously, each pmd and pte table was fixmapped/fixunmapped for each
cont(pte|pmd) block of mappings (16 entries with 4K granule). With 512
entries per table (4K granule), that adds up to 32 fixmap/fixunmap
cycles, so we ended up issuing 32 TLBIs per (pmd|pte) table during the
population phase.

Let's fix that, and fixmap/fixunmap each page once per population, for a
saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
speedup.
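
As a rough sketch of the reworked flow (simplified; the cont-bit and
prot handling is elided, see the diff below for the real code), the pte
table is mapped in the fixmap once per alloc_init_cont_pte() call,
init_pte() walks the already-mapped ptep, and the single
pte_clear_fixmap() at the end is the only point that needs a TLBI:

	ptep = pte_set_fixmap_offset(pmdp, addr);	/* one fixmap per table */
	do {
		...					/* cont-bit handling elided */
		init_pte(ptep, addr, next, phys, __prot);

		ptep += pte_index(next) - pte_index(addr);
		phys += next - addr;
	} while (addr = next, addr != end);

	pte_clear_fixmap();				/* one TLBI per table */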

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   168   (0%)|  2198   (0%)|  8644   (0%)| 17447   (0%)
after          |    78 (-53%)|   435 (-80%)|  1723 (-80%)|  3779 (-78%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/mm/mmu.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..9f1d69b7b494 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
return ((old ^ new) & ~mask) == 0;
}

-static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
+static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot)
{
- pte_t *ptep;
-
- ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pte_t old_pte = __ptep_get(ptep);

@@ -192,8 +189,6 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,

phys += PAGE_SIZE;
} while (ptep++, addr += PAGE_SIZE, addr != end);
-
- pte_clear_fixmap();
}

static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
@@ -204,6 +199,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
{
unsigned long next;
pmd_t pmd = READ_ONCE(*pmdp);
+ pte_t *ptep;

BUG_ON(pmd_sect(pmd));
if (pmd_none(pmd)) {
@@ -219,6 +215,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
}
BUG_ON(pmd_bad(pmd));

+ ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pgprot_t __prot = prot;

@@ -229,20 +226,21 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pte(pmdp, addr, next, phys, __prot);
+ init_pte(ptep, addr, next, phys, __prot);

+ ptep += pte_index(next) - pte_index(addr);
phys += next - addr;
} while (addr = next, addr != end);
+
+ pte_clear_fixmap();
}

-static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
+static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(int), int flags)
{
unsigned long next;
- pmd_t *pmdp;

- pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pmd_t old_pmd = READ_ONCE(*pmdp);

@@ -268,8 +266,6 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
}
phys += next - addr;
} while (pmdp++, addr = next, addr != end);
-
- pmd_clear_fixmap();
}

static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
@@ -279,6 +275,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
{
unsigned long next;
pud_t pud = READ_ONCE(*pudp);
+ pmd_t *pmdp;

/*
* Check for initial section mappings in the pgd/pud.
@@ -297,6 +294,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
}
BUG_ON(pud_bad(pud));

+ pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pgprot_t __prot = prot;

@@ -307,10 +305,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
+ init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);

+ pmdp += pmd_index(next) - pmd_index(addr);
phys += next - addr;
} while (addr = next, addr != end);
+
+ pmd_clear_fixmap();
}

static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
--
2.25.1


2024-04-12 13:20:41

by Ryan Roberts

Subject: [PATCH v3 2/3] arm64: mm: Batch dsb and isb when populating pgtables

After removing unnecessary TLBIs, the next bottleneck when creating the
page tables for the linear map is the DSBs and ISBs, which were previously
issued per-pte in __set_pte(). Since we are writing multiple ptes in a
given pte table, we can elide these barriers and insert them once we
have finished writing to the table.
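
The idea, as a rough sketch (fill_ptes() is a made-up name purely for
illustration, not an API added by this patch), is to write all the
entries with plain stores and publish them with a single set of
barriers afterwards:

	/* Sketch only: not the literal kernel code. */
	static void fill_ptes(pte_t *ptep, int nr, phys_addr_t phys, pgprot_t prot)
	{
		int i;

		for (i = 0; i < nr; i++, phys += PAGE_SIZE)
			__set_pte_nosync(ptep + i, pfn_pte(__phys_to_pfn(phys), prot));

		dsb(ishst);	/* make the new entries visible to the walker */
		isb();		/* in this series, provided via pte_clear_fixmap() */
	}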

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |    78   (0%)|   435   (0%)|  1723   (0%)|  3779   (0%)
after          |    11 (-86%)|   161 (-63%)|   656 (-62%)|  1654 (-56%)

Signed-off-by: Ryan Roberts <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 ++++++-
arch/arm64/mm/mmu.c | 11 ++++++++++-
2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..105a95a8845c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}

-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);
+}
+
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+ __set_pte_nosync(ptep, pte);

/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9f1d69b7b494..ac88b89770a6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,11 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
do {
pte_t old_pte = __ptep_get(ptep);

- __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+ /*
+ * Required barriers to make this visible to the table walker
+ * are deferred to the end of alloc_init_cont_pte().
+ */
+ __set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
* After the PTE entry has been populated once, we
@@ -232,6 +236,11 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
phys += next - addr;
} while (addr = next, addr != end);

+ /*
+ * Note: barriers and maintenance necessary to clear the fixmap slot
+ * ensure that all previous pgtable writes are visible to the table
+ * walker.
+ */
pte_clear_fixmap();
}

--
2.25.1


2024-04-12 13:22:16

by Ryan Roberts

Subject: [PATCH v3 3/3] arm64: mm: Don't remap pgtables for allocate vs populate

During linear map pgtable creation, each pgtable is fixmapped/
fixunmapped twice; once during allocation to zero the memory, and again
during population to write the entries. This means each table has
2 TLB invalidations issued against it. Let's fix this so that each table
is only fixmapped/fixunmapped once, halving the number of TLBIs, and
improving performance.

Achieve this by separating allocation and initialization (zeroing) of
the page. The allocated page is now fixmapped directly by the walker and
initialized, before being populated and finally fixunmapped.

This approach keeps the change small, but has the side effect that late
allocations (using __get_free_page()) must also go through the generic
memory clearing routine. So let's tell __get_free_page() not to zero the
memory to avoid duplication.

Additionally, this approach means that fixmap/fixunmap is still used for
late pgtable modifications. That's not technically needed, since the
memory is all mapped in the linear map by that point; removing it is
left as a possible future optimization if found worthwhile.
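
At the pte level (the same shape is repeated at each level in the diff
below), the reworked allocation path looks roughly like this simplified
sketch, with the PXN handling elided:

	if (pmd_none(pmd)) {
		...
		pte_phys = pgtable_alloc(PAGE_SHIFT);	/* no longer zeroes the page */
		ptep = pte_set_fixmap(pte_phys);
		init_clear_pgtable(ptep);		/* clear_page() + dsb(ishst) */
		ptep += pte_index(addr);
		__pmd_populate(pmdp, pte_phys, pmdval);
	} else {
		BUG_ON(pmd_bad(pmd));
		ptep = pte_set_fixmap_offset(pmdp, addr);
	}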

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |    11   (0%)|   161   (0%)|   656   (0%)|  1654   (0%)
after          |    10 (-11%)|   104 (-35%)|   438 (-33%)|  1223 (-26%)

Signed-off-by: Ryan Roberts <[email protected]>
Suggested-by: Mark Rutland <[email protected]>
Tested-by: Itaru Kitayama <[email protected]>
Tested-by: Eric Chanudet <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/mm/mmu.c | 67 +++++++++++++++++---------------
2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 105a95a8845c..92c9aed5e7af 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)

static inline bool pgtable_l5_enabled(void) { return false; }

+#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
+
/* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
#define p4d_set_fixmap(addr) NULL
#define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ac88b89770a6..c927e9312f10 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -109,28 +109,12 @@ EXPORT_SYMBOL(phys_mem_access_prot);
static phys_addr_t __init early_pgtable_alloc(int shift)
{
phys_addr_t phys;
- void *ptr;

phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
MEMBLOCK_ALLOC_NOLEAKTRACE);
if (!phys)
panic("Failed to allocate page table page\n");

- /*
- * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
- * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
- * any level of table.
- */
- ptr = pte_set_fixmap(phys);
-
- memset(ptr, 0, PAGE_SIZE);
-
- /*
- * Implicit barriers also ensure the zeroed page is visible to the page
- * table walker
- */
- pte_clear_fixmap();
-
return phys;
}

@@ -172,6 +156,14 @@ bool pgattr_change_is_safe(u64 old, u64 new)
return ((old ^ new) & ~mask) == 0;
}

+static void init_clear_pgtable(void *table)
+{
+ clear_page(table);
+
+ /* Ensure the zeroing is observed by page table walks. */
+ dsb(ishst);
+}
+
static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot)
{
@@ -214,12 +206,15 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
pmdval |= PMD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pte_phys = pgtable_alloc(PAGE_SHIFT);
+ ptep = pte_set_fixmap(pte_phys);
+ init_clear_pgtable(ptep);
+ ptep += pte_index(addr);
__pmd_populate(pmdp, pte_phys, pmdval);
- pmd = READ_ONCE(*pmdp);
+ } else {
+ BUG_ON(pmd_bad(pmd));
+ ptep = pte_set_fixmap_offset(pmdp, addr);
}
- BUG_ON(pmd_bad(pmd));

- ptep = pte_set_fixmap_offset(pmdp, addr);
do {
pgprot_t __prot = prot;

@@ -298,12 +293,15 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
pudval |= PUD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pmd_phys = pgtable_alloc(PMD_SHIFT);
+ pmdp = pmd_set_fixmap(pmd_phys);
+ init_clear_pgtable(pmdp);
+ pmdp += pmd_index(addr);
__pud_populate(pudp, pmd_phys, pudval);
- pud = READ_ONCE(*pudp);
+ } else {
+ BUG_ON(pud_bad(pud));
+ pmdp = pmd_set_fixmap_offset(pudp, addr);
}
- BUG_ON(pud_bad(pud));

- pmdp = pmd_set_fixmap_offset(pudp, addr);
do {
pgprot_t __prot = prot;

@@ -340,12 +338,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
p4dval |= P4D_TABLE_PXN;
BUG_ON(!pgtable_alloc);
pud_phys = pgtable_alloc(PUD_SHIFT);
+ pudp = pud_set_fixmap(pud_phys);
+ init_clear_pgtable(pudp);
+ pudp += pud_index(addr);
__p4d_populate(p4dp, pud_phys, p4dval);
- p4d = READ_ONCE(*p4dp);
+ } else {
+ BUG_ON(p4d_bad(p4d));
+ pudp = pud_set_fixmap_offset(p4dp, addr);
}
- BUG_ON(p4d_bad(p4d));

- pudp = pud_set_fixmap_offset(p4dp, addr);
do {
pud_t old_pud = READ_ONCE(*pudp);

@@ -395,12 +396,15 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
pgdval |= PGD_TABLE_PXN;
BUG_ON(!pgtable_alloc);
p4d_phys = pgtable_alloc(P4D_SHIFT);
+ p4dp = p4d_set_fixmap(p4d_phys);
+ init_clear_pgtable(p4dp);
+ p4dp += p4d_index(addr);
__pgd_populate(pgdp, p4d_phys, pgdval);
- pgd = READ_ONCE(*pgdp);
+ } else {
+ BUG_ON(pgd_bad(pgd));
+ p4dp = p4d_set_fixmap_offset(pgdp, addr);
}
- BUG_ON(pgd_bad(pgd));

- p4dp = p4d_set_fixmap_offset(pgdp, addr);
do {
p4d_t old_p4d = READ_ONCE(*p4dp);

@@ -467,11 +471,10 @@ void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,

static phys_addr_t __pgd_pgtable_alloc(int shift)
{
- void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
- BUG_ON(!ptr);
+ /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
+ void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

- /* Ensure the zeroed page is visible to the page table walker */
- dsb(ishst);
+ BUG_ON(!ptr);
return __pa(ptr);
}

--
2.25.1


2024-04-12 14:56:42

by Mark Rutland

Subject: Re: [PATCH v3 0/3] Speed up boot with faster linear map creation

On Fri, Apr 12, 2024 at 02:19:05PM +0100, Ryan Roberts wrote:
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |   168   (0%)|  2198   (0%)|  8644   (0%)| 17447   (0%)
> no-cont-remap  |    78 (-53%)|   435 (-80%)|  1723 (-80%)|  3779 (-78%)
> batch-barriers |    11 (-93%)|   161 (-93%)|   656 (-92%)|  1654 (-91%)
> no-alloc-remap |    10 (-94%)|   104 (-95%)|   438 (-95%)|  1223 (-93%)
>
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.

Nice!

> Ryan Roberts (3):
> arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
> arm64: mm: Batch dsb and isb when populating pgtables
> arm64: mm: Don't remap pgtables for allocate vs populate

For the series:

Reviewed-by: Mark Rutland <[email protected]>

Catalin, Will, are you happy to pick this up?

Mark.

2024-04-12 15:01:01

by Ard Biesheuvel

Subject: Re: [PATCH v3 0/3] Speed up boot with faster linear map creation

On Fri, 12 Apr 2024 at 15:19, Ryan Roberts <[email protected]> wrote:
>
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |   168   (0%)|  2198   (0%)|  8644   (0%)| 17447   (0%)
> no-cont-remap  |    78 (-53%)|   435 (-80%)|  1723 (-80%)|  3779 (-78%)
> batch-barriers |    11 (-93%)|   161 (-93%)|   656 (-92%)|  1654 (-91%)
> no-alloc-remap |    10 (-94%)|   104 (-95%)|   438 (-95%)|  1223 (-93%)
>
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.
..
>
> Ryan Roberts (3):
> arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
> arm64: mm: Batch dsb and isb when populating pgtables
> arm64: mm: Don't remap pgtables for allocate vs populate
>

For the series,

Reviewed-by: Ard Biesheuvel <[email protected]>

2024-04-12 16:07:30

by Will Deacon

Subject: Re: [PATCH v3 0/3] Speed up boot with faster linear map creation

On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
> [...]

Applied to arm64 (for-next/mm), thanks!

[1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
https://git.kernel.org/arm64/c/5c63db59c5f8
[2/3] arm64: mm: Batch dsb and isb when populating pgtables
https://git.kernel.org/arm64/c/1fcb7cea8a5f
[3/3] arm64: mm: Don't remap pgtables for allocate vs populate
https://git.kernel.org/arm64/c/0e9df1c905d8

Cheers,
--
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev

2024-04-12 22:44:53

by Itaru Kitayama

Subject: Re: [PATCH v3 0/3] Speed up boot with faster linear map creation

On Fri, Apr 12, 2024 at 05:06:41PM +0100, Will Deacon wrote:
> On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> > It turns out that creating the linear map can take a significant proportion of
> > the total boot time, especially when rodata=full. And most of the time is spent
> > waiting on superfluous tlb invalidation and memory barriers. This series reworks
> > the kernel pgtable generation code to significantly reduce the number of those
> > TLBIs, ISBs and DSBs. See each patch for details.
> >
> > The below shows the execution time of map_mem() across a couple of different
> > systems with different RAM configurations. We measure after applying each patch
> > and show the improvement relative to base (v6.9-rc2):
> >
> > [...]
>
> Applied to arm64 (for-next/mm), thanks!
>
> [1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
> https://git.kernel.org/arm64/c/5c63db59c5f8
> [2/3] arm64: mm: Batch dsb and isb when populating pgtables
> https://git.kernel.org/arm64/c/1fcb7cea8a5f
> [3/3] arm64: mm: Don't remap pgtables for allocate vs populate
> https://git.kernel.org/arm64/c/0e9df1c905d8

I confirm this series boots the system on FVP (with my .config and my
buildroot rootfs using Shrinkwrap).

Tested-by: Itaru Kitayama <[email protected]>

Thanks,
Itaru.

>
> Cheers,
> --
> Will
>
> https://fixes.arm64.dev
> https://next.arm64.dev
> https://will.arm64.dev

2024-04-14 23:25:04

by Itaru Kitayama

Subject: Re: [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block

Hi Ryan,

On Fri, Apr 12, 2024 at 02:19:06PM +0100, Ryan Roberts wrote:
> A large part of the kernel boot time is creating the kernel linear map
> page tables. When rodata=full, all memory is mapped by pte. And when
> there is lots of physical ram, there are lots of pte tables to populate.
> The primary cost associated with this is mapping and unmapping the pte
> table memory in the fixmap; at unmap time, the TLB entry must be
> invalidated and this is expensive.
>
> Previously, each pmd and pte table was fixmapped/fixunmapped for each
> cont(pte|pmd) block of mappings (16 entries with 4K granule). This means
> we ended up issuing 32 TLBIs per (pmd|pte) table during the population
> phase.
>
> Let's fix that, and fixmap/fixunmap each page once per population, for a
> saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
> speedup.
>
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> before         |   168   (0%)|  2198   (0%)|  8644   (0%)| 17447   (0%)
> after          |    78 (-53%)|   435 (-80%)|  1723 (-80%)|  3779 (-78%)
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Tested-by: Itaru Kitayama <[email protected]>
> Tested-by: Eric Chanudet <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 27 ++++++++++++++-------------
> 1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 495b732d5af3..9f1d69b7b494 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> -static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> +static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot)
> {
> - pte_t *ptep;
> -
> - ptep = pte_set_fixmap_offset(pmdp, addr);
> do {
> pte_t old_pte = __ptep_get(ptep);
>
> @@ -192,8 +189,6 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
>
> phys += PAGE_SIZE;
> } while (ptep++, addr += PAGE_SIZE, addr != end);
> -
> - pte_clear_fixmap();
> }
>
> static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> @@ -204,6 +199,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> {
> unsigned long next;
> pmd_t pmd = READ_ONCE(*pmdp);
> + pte_t *ptep;
>
> BUG_ON(pmd_sect(pmd));
> if (pmd_none(pmd)) {
> @@ -219,6 +215,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> }
> BUG_ON(pmd_bad(pmd));
>
> + ptep = pte_set_fixmap_offset(pmdp, addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -229,20 +226,21 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pte(pmdp, addr, next, phys, __prot);
> + init_pte(ptep, addr, next, phys, __prot);
>
> + ptep += pte_index(next) - pte_index(addr);
> phys += next - addr;
> } while (addr = next, addr != end);
> +
> + pte_clear_fixmap();
> }
>
> -static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> +static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(int), int flags)
> {
> unsigned long next;
> - pmd_t *pmdp;
>
> - pmdp = pmd_set_fixmap_offset(pudp, addr);
> do {
> pmd_t old_pmd = READ_ONCE(*pmdp);
>
> @@ -268,8 +266,6 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> }
> phys += next - addr;
> } while (pmdp++, addr = next, addr != end);
> -
> - pmd_clear_fixmap();
> }
>
> static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> @@ -279,6 +275,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> {
> unsigned long next;
> pud_t pud = READ_ONCE(*pudp);
> + pmd_t *pmdp;
>
> /*
> * Check for initial section mappings in the pgd/pud.
> @@ -297,6 +294,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> }
> BUG_ON(pud_bad(pud));
>
> + pmdp = pmd_set_fixmap_offset(pudp, addr);
> do {
> pgprot_t __prot = prot;
>
> @@ -307,10 +305,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
> + init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>
> + pmdp += pmd_index(next) - pmd_index(addr);
> phys += next - addr;
> } while (addr = next, addr != end);
> +
> + pmd_clear_fixmap();
> }
>
> static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,

I looked at this specific patch 1/3 for a while and the code changes now
make sense to me: they reduce the number of TLBIs while respecting the
contiguous bit where available at both the PMD and PTE levels. I can't
review the other two patches (2/3 and 3/3) in a timely manner, but I'd
like to give my

Reviewed-by: Itaru Kitayama <[email protected]>

on 1/3.

Thanks,
Itaru.

> --
> 2.25.1
>