2024-03-26 10:24:52

by Ryan Roberts

Subject: [PATCH v1 0/3] Speed up boot with faster linear map creation

Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full. And a large portion of the
time it takes to create the linear map is issuing TLBIs. This series reworks the
kernel pgtable generation code to significantly reduce the number of TLBIs. See
each patch for details.
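
To illustrate where those TLBIs come from: in v6.9-rc1 every visit to a pte
table is made through the fixmap, and tearing the fixmap slot down again
implies TLB maintenance. Roughly (a simplified sketch of the mmu.c code, with
the sanity checks elided):

static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
                     phys_addr_t phys, pgprot_t prot)
{
        /* Map the pte table through the fixmap to get a usable VA. */
        pte_t *ptep = pte_set_fixmap_offset(pmdp, addr);

        do {
                __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
                phys += PAGE_SIZE;
        } while (ptep++, addr += PAGE_SIZE, addr != end);

        /* Unmapping the fixmap slot issues TLB maintenance every time. */
        pte_clear_fixmap();
}

Today this map/unmap happens for every cont(pte|pmd) block visited rather than
once per table, which is what the first patch addresses.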

The below shows the execution time of map_mem() across a couple of different
systems with different RAM configurations. We measure after applying each patch
and show the improvement relative to base (v6.9-rc1):

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)

This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
tested all VA size configs (although I don't anticipate any issues); I'll do
this as part of followup.

Thanks,
Ryan


Ryan Roberts (3):
arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
arm64: mm: Don't remap pgtables for allocate vs populate
arm64: mm: Lazily clear pte table mappings from fixmap

 arch/arm64/include/asm/fixmap.h  |   5 +-
 arch/arm64/include/asm/mmu.h     |   8 +
 arch/arm64/include/asm/pgtable.h |   4 -
 arch/arm64/kernel/cpufeature.c   |  10 +-
 arch/arm64/mm/fixmap.c           |  11 +
 arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
 include/linux/pgtable.h          |   8 +
 7 files changed, 307 insertions(+), 103 deletions(-)

--
2.25.1



2024-03-27 10:10:06

by Ard Biesheuvel

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

Hi Ryan,

On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <[email protected]> wrote:
>
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.
>

These are very nice results!

Before digging into the details: do we still have a strong case for
supporting contiguous PTEs and PMDs in these routines?

2024-03-27 10:45:57

by Ryan Roberts

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On 27/03/2024 10:09, Ard Biesheuvel wrote:
> Hi Ryan,
>
> On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <[email protected]> wrote:
>>
>> Hi All,
>>
>> It turns out that creating the linear map can take a significant proportion of
>> the total boot time, especially when rodata=full. And a large portion of the
>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>> each patch for details.
>>
>> The below shows the execution time of map_mem() across a couple of different
>> systems with different RAM configurations. We measure after applying each patch
>> and show the improvement relative to base (v6.9-rc1):
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
>> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
>> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
>> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>>
>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>> tested all VA size configs (although I don't anticipate any issues); I'll do
>> this as part of followup.
>>
>
> These are very nice results!
>
> Before digging into the details: do we still have a strong case for
> supporting contiguous PTEs and PMDs in these routines?

We are currently using contptes and pmds for the linear map when rodata=[on|off]
IIRC? I don't see a need to remove the capability personally.

Also I was talking with Mark R yesterday and he suggested that an even better
solution might be to create a temp pgtable that maps the linear map with pmds,
switch to it, then create the real pgtable that maps the linear map with ptes,
then switch to that. The benefit being that we can avoid the fixmap entirely
when creating the second pgtable - we think this would likely be significantly
faster still.

My second patch adds the infrastructure to make this possible. But your LPA2
changes make it significantly more effort; since that change, swapper is already
the live pgtable (with the kernel already mapped) by the time we populate the
linear map into it - that mapping isn't done in paging_init() anymore. So I'm
not quite sure how we can easily make that work at the moment.

Thanks,
Ryan


2024-03-27 11:18:41

by Ryan Roberts

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On 27/03/2024 11:06, Itaru Kitayama wrote:
> On Tue, Mar 26, 2024 at 10:14:45AM +0000, Ryan Roberts wrote:
>> Hi All,
>>
>> It turns out that creating the linear map can take a significant proportion of
>> the total boot time, especially when rodata=full. And a large portion of the
>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>> each patch for details.
>>
>> The below shows the execution time of map_mem() across a couple of different
>> systems with different RAM configurations. We measure after applying each patch
>> and show the improvement relative to base (v6.9-rc1):
>>
>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>> | ms (%) | ms (%) | ms (%) | ms (%)
>> ---------------|-------------|-------------|-------------|-------------
>> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
>> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
>> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
>> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>>
>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>> tested all VA size configs (although I don't anticipate any issues); I'll do
>> this as part of followup.
>
> The series was applied cleanly on top of v6.9-rc1+ of Linus's master
> branch, and boots fine on M1 VM with 14GB of memory.
>
> Just out of curiosity, how did you measure the boot time and obtain the
> breakdown of the execution times of each phase?

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..8a9d47115784 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -792,7 +792,14 @@ static void __init create_idmap(void)
 
 void __init paging_init(void)
 {
+        u64 start, end;
+
+        start = __arch_counter_get_cntvct();
         map_mem(swapper_pg_dir);
+        end = __arch_counter_get_cntvct();
+
+        pr_err("map_mem: time=%llu us\n",
+               ((end - start) * 1000000) / arch_timer_get_cntfrq());
 
         memblock_allow_resize();

>
> Tested-by: Itaru Kitayama <[email protected]>

Thanks!

>
> Thanks,
> Itaru.
>
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (3):
>> arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
>> arm64: mm: Don't remap pgtables for allocate vs populate
>> arm64: mm: Lazily clear pte table mappings from fixmap
>>
>> arch/arm64/include/asm/fixmap.h | 5 +-
>> arch/arm64/include/asm/mmu.h | 8 +
>> arch/arm64/include/asm/pgtable.h | 4 -
>> arch/arm64/kernel/cpufeature.c | 10 +-
>> arch/arm64/mm/fixmap.c | 11 +
>> arch/arm64/mm/mmu.c | 364 +++++++++++++++++++++++--------
>> include/linux/pgtable.h | 8 +
>> 7 files changed, 307 insertions(+), 103 deletions(-)
>>
>> --
>> 2.25.1
>>


2024-03-27 11:19:49

by Itaru Kitayama

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On Tue, Mar 26, 2024 at 10:14:45AM +0000, Ryan Roberts wrote:
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.

The series applied cleanly on top of v6.9-rc1+ of Linus's master
branch, and boots fine on an M1 VM with 14GB of memory.

Just out of curiosity, how did you measure the boot time and obtain the
breakdown of the execution times of each phase?

Tested-by: Itaru Kitayama <[email protected]>

Thanks,
Itaru.

>
> Thanks,
> Ryan
>
>
> Ryan Roberts (3):
> arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
> arm64: mm: Don't remap pgtables for allocate vs populate
> arm64: mm: Lazily clear pte table mappings from fixmap
>
> arch/arm64/include/asm/fixmap.h | 5 +-
> arch/arm64/include/asm/mmu.h | 8 +
> arch/arm64/include/asm/pgtable.h | 4 -
> arch/arm64/kernel/cpufeature.c | 10 +-
> arch/arm64/mm/fixmap.c | 11 +
> arch/arm64/mm/mmu.c | 364 +++++++++++++++++++++++--------
> include/linux/pgtable.h | 8 +
> 7 files changed, 307 insertions(+), 103 deletions(-)
>
> --
> 2.25.1
>

2024-03-27 14:47:39

by Ard Biesheuvel

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <[email protected]> wrote:
>
> On 27/03/2024 10:09, Ard Biesheuvel wrote:
> > Hi Ryan,
> >
> > On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <[email protected]> wrote:
> >>
> >> Hi All,
> >>
> >> It turns out that creating the linear map can take a significant proportion of
> >> the total boot time, especially when rodata=full. And a large portion of the
> >> time it takes to create the linear map is issuing TLBIs. This series reworks the
> >> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> >> each patch for details.
> >>
> >> The below shows the execution time of map_mem() across a couple of different
> >> systems with different RAM configurations. We measure after applying each patch
> >> and show the improvement relative to base (v6.9-rc1):
> >>
> >> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> >> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> >> ---------------|-------------|-------------|-------------|-------------
> >> | ms (%) | ms (%) | ms (%) | ms (%)
> >> ---------------|-------------|-------------|-------------|-------------
> >> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> >> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> >> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> >> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
> >>
> >> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> >> tested all VA size configs (although I don't anticipate any issues); I'll do
> >> this as part of followup.
> >>
> >
> > These are very nice results!
> >
> > Before digging into the details: do we still have a strong case for
> > supporting contiguous PTEs and PMDs in these routines?
>
> We are currently using contptes and pmds for the linear map when rodata=[on|off]
> IIRC?

In principle, yes. In practice?

> I don't see a need to remove the capability personally.
>

Since we are making changes here, it is a relevant question to ask imho.

> Also I was talking with Mark R yesterday and he suggested that an even better
> solution might be to create a temp pgtable that maps the linear map with pmds,
> switch to it, then create the real pgtable that maps the linear map with ptes,
> then switch to that. The benefit being that we can avoid the fixmap entirely
> when creating the second pgtable - we think this would likely be significantly
> faster still.
>

If this is going to be a temporary mapping for the duration of the
initial population of the linear map page tables, we might just as
well use a 1:1 TTBR0 mapping here, which would be completely disjoint
from swapper. And we'd only need to map memory that is being used for
page tables, so on those large systems we'd need to map only a small
slice. Maybe it's time to bring back the memblock alloc limit so we
can manage this more easily?

> My second patch adds the infrastructure to make this possible. But your changes
> for LPA2 make it significantly more effort; since that change we are now using
> the swapper pgtable when we populate the linear map into it - the kernel is
> already mapped and that isn't done in paging_init() anymore. So I'm not quite
> sure how we can easily make that work at the moment.
>

I think a mix of the fixmap approach with a 1:1 map could work here:
- use TTBR0 to create a temp 1:1 map of DRAM
- map page tables lazily as they are allocated but using a coarse mapping
- avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.

2024-03-27 15:20:09

by Ryan Roberts

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On 27/03/2024 13:36, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <[email protected]> wrote:
>>
>> On 27/03/2024 10:09, Ard Biesheuvel wrote:
>>> Hi Ryan,
>>>
>>> On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> It turns out that creating the linear map can take a significant proportion of
>>>> the total boot time, especially when rodata=full. And a large portion of the
>>>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>>>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>>>> each patch for details.
>>>>
>>>> The below shows the execution time of map_mem() across a couple of different
>>>> systems with different RAM configurations. We measure after applying each patch
>>>> and show the improvement relative to base (v6.9-rc1):
>>>>
>>>> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
>>>> ---------------|-------------|-------------|-------------|-------------
>>>> | ms (%) | ms (%) | ms (%) | ms (%)
>>>> ---------------|-------------|-------------|-------------|-------------
>>>> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
>>>> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
>>>> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
>>>> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>>>>
>>>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>>>> tested all VA size configs (although I don't anticipate any issues); I'll do
>>>> this as part of followup.
>>>>
>>>
>>> These are very nice results!
>>>
>>> Before digging into the details: do we still have a strong case for
>>> supporting contiguous PTEs and PMDs in these routines?
>>
>> We are currently using contptes and pmds for the linear map when rodata=[on|off]
>> IIRC?
>
> In principle, yes. In practice?
>
>> I don't see a need to remove the capability personally.
>>
>
> Since we are making changes here, it is a relevant question to ask imho.
>
>> Also I was talking with Mark R yesterday and he suggested that an even better
>> solution might be to create a temp pgtable that maps the linear map with pmds,
>> switch to it, then create the real pgtable that maps the linear map with ptes,
>> then switch to that. The benefit being that we can avoid the fixmap entirely
>> when creating the second pgtable - we think this would likely be significantly
>> faster still.
>>
>
> If this is going to be a temporary mapping for the duration of the
> initial population of the linear map page tables, we might just as
> well use a 1:1 TTBR0 mapping here, which would be completely disjoint
> from swapper. And we'd only need to map memory that is being used for
> page tables, so on those large systems we'd need to map only a small
> slice. Maybe it's time to bring back the memblock alloc limit so we
> can manage this more easily?
>
>> My second patch adds the infrastructure to make this possible. But your changes
>> for LPA2 make it significantly more effort; since that change we are now using
>> the swapper pgtable when we populate the linear map into it - the kernel is
>> already mapped and that isn't done in paging_init() anymore. So I'm not quite
>> sure how we can easily make that work at the moment.
>>
>
> I think a mix of the fixmap approach with a 1:1 map could work here:
> - use TTBR0 to create a temp 1:1 map of DRAM
> - map page tables lazily as they are allocated but using a coarse mapping
> - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.

Yes that could work I think. So to make sure I've understood:

- create a 1:1 map for all of DRAM using block and cont mappings where possible
- use memblock_phys_alloc_*() to allocate pgtable memory
- access via fixmap (should be minimal due to block mappings)
- install it in TTBR0
- create all the swapper mappings as normal (no block or cont mappings)
- use memblock_phys_alloc_*() to alloc pgtable memory
- phys address is also virtual address due to installed 1:1 map
- Remove 1:1 map from TTBR0
- memblock_phys_free() all the memory associated with 1:1 map

That sounds doable on top of the first 2 patches in this series - I'll have a
crack. The only missing piece is depth-first 1:1 map traversal to free the
tables. I'm guessing something already exists that I can repurpose?
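
For concreteness, the overall flow I'm imagining is roughly the sketch below
(create_dram_idmap() and free_dram_idmap() are made-up placeholder names for
code that doesn't exist yet):

void __init paging_init(void)
{
        /* Placeholder: build a 1:1 map of DRAM with block/cont mappings. */
        phys_addr_t dram_idmap = create_dram_idmap();

        cpu_install_ttbr0(dram_idmap, TCR_T0SZ(vabits_actual));

        /* Pgtable pages allocated here are reachable at their phys address. */
        map_mem(swapper_pg_dir);

        cpu_set_reserved_ttbr0();
        local_flush_tlb_all();

        /* Placeholder: depth-first walk, handing each table back to memblock. */
        free_dram_idmap(dram_idmap);

        /* ... rest of paging_init() as today ... */
}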

Thanks,
Ryan


2024-03-27 16:02:20

by Ard Biesheuvel

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On Wed, 27 Mar 2024 at 17:01, Ryan Roberts <[email protected]> wrote:
>
> On 27/03/2024 13:36, Ard Biesheuvel wrote:
> > On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <[email protected]> wrote:
> >>
> >> On 27/03/2024 10:09, Ard Biesheuvel wrote:
..
> >
> > I think a mix of the fixmap approach with a 1:1 map could work here:
> > - use TTBR0 to create a temp 1:1 map of DRAM
> > - map page tables lazily as they are allocated but using a coarse mapping
> > - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.
>
> Yes that could work I think. So to make sure I've understood:
>
> - create a 1:1 map for all of DRAM using block and cont mappings where possible
> - use memblock_phys_alloc_*() to allocate pgtable memory
> - access via fixmap (should be minimal due to block mappings)

Yes but you'd only need the fixmap for pages that are not in the 1:1
map yet, so after an initial ramp up you wouldn't need it at all,
assuming locality of memblock allocations and the use of PMD mappings.
The only tricky thing here is ensuring that we are not mapping memory
that we shouldn't be touching.

> - install it in TTBR0
> - create all the swapper mappings as normal (no block or cont mappings)
> - use memblock_phys_alloc_*() to alloc pgtable memory
> - phys address is also virtual address due to installed 1:1 map
> - Remove 1:1 map from TTBR0
> - memblock_phys_free() all the memory associated with 1:1 map
>

Indeed.

> That sounds doable on top of the first 2 patches in this series - I'll have a
> crack. The only missing piece is depth-first 1:1 map traversal to free the
> tables. I'm guessing something already exists that I can repurpose?
>

Not that I am aware of, but that doesn't sound too complicated.
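
Per level it would just be something like this sketch (free_idmap_pte_tables()
is a made-up name; the real thing would repeat the pattern at each level and
finally free the table page itself):

static void __init free_idmap_pte_tables(pmd_t *pmd_table)
{
        pmd_t *pmdp = pmd_table;
        int i;

        for (i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
                pmd_t pmd = READ_ONCE(*pmdp);

                /* Only table entries have a child pte table to free. */
                if (pmd_none(pmd) || pmd_sect(pmd))
                        continue;

                memblock_phys_free(pmd_page_paddr(pmd), PAGE_SIZE);
        }
}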

2024-03-27 16:14:53

by Ryan Roberts

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On 27/03/2024 15:57, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 17:01, Ryan Roberts <[email protected]> wrote:
>>
>> On 27/03/2024 13:36, Ard Biesheuvel wrote:
>>> On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/03/2024 10:09, Ard Biesheuvel wrote:
> ...
>>>
>>> I think a mix of the fixmap approach with a 1:1 map could work here:
>>> - use TTBR0 to create a temp 1:1 map of DRAM
>>> - map page tables lazily as they are allocated but using a coarse mapping
>>> - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.
>>
>> Yes that could work I think. So to make sure I've understood:
>>
>> - create a 1:1 map for all of DRAM using block and cont mappings where possible
>> - use memblock_phys_alloc_*() to allocate pgtable memory
>> - access via fixmap (should be minimal due to block mappings)
>
> Yes but you'd only need the fixmap for pages that are not in the 1:1
> map yet, so after an initial ramp up you wouldn't need it at all,
> assuming locality of memblock allocations and the use of PMD mappings.
> The only tricky thing here is ensuring that we are not mapping memory
> that we shouldn't be touching.

That sounds a bit nasty though. I think it would be simpler to just reuse the
machinery we have, doing the 1:1 map using blocks and fixmap; it should be a
factor of 512 better than what we have (one 2M block covers 512 4K ptes), so
probably not a problem at that point. That way, we can rely on memblock to tell
us what to map. If it's still problematic I can add a layer to support 1G
mappings too.

>
>> - install it in TTBR0
>> - create all the swapper mappings as normal (no block or cont mappings)
>> - use memblock_phys_alloc_*() to alloc pgtable memory
>> - phys address is also virtual address due to installed 1:1 map
>> - Remove 1:1 map from TTBR0
>> - memblock_phys_free() all the memory associated with 1:1 map
>>
>
> Indeed.

One question on the state of TTBR0 on entry to paging_init(): what is it? I
need to know so I can restore it afterwards.

Currently I'm thinking I can do:

cpu_install_ttbr0(my_dram_idmap, TCR_T0SZ(vabits_actual));
<create swapper>
cpu_set_reserved_ttbr0();
local_flush_tlb_all();

But is it OK to leave the reserved pgd in TTBR0, or is something else expected there?

>
>> That sounds doable on top of the first 2 patches in this series - I'll have a
>> crack. The only missing piece is depth-first 1:1 map traversal to free the
>> tables. I'm guessing something already exists that I can repurpose?
>>
>
> Not that I am aware of, but that doesn't sound too complicated.


2024-03-27 19:07:49

by Ryan Roberts

Subject: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables

After removing unnecessary TLBIs, the next bottleneck when creating the
page tables for the linear map is DSB and ISB, which were previously
issued per-pte in __set_pte(). Since we are writing multiple ptes in a
given pte table, we can elide these barriers and insert them once we
have finished writing to the table.

Signed-off-by: Ryan Roberts <[email protected]>
---
 arch/arm64/include/asm/pgtable.h |  7 ++++++-
 arch/arm64/mm/mmu.c              | 13 ++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bd5d02f3f0a3..81e427b23b3f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
         return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void ___set_pte(pte_t *ptep, pte_t pte)
 {
         WRITE_ONCE(*ptep, pte);
+}
+
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+        ___set_pte(ptep, pte);
 
         /*
          * Only if the new pte is valid and kernel, otherwise TLB maintenance
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1b2a2a2d09b7..c6d5a76732d4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
         do {
                 pte_t old_pte = __ptep_get(ptep);
 
-                __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+                /*
+                 * Required barriers to make this visible to the table walker
+                 * are deferred to the end of alloc_init_cont_pte().
+                 */
+                ___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
                 /*
                  * After the PTE entry has been populated once, we
@@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
         } while (addr = next, addr != end);
 
         ops->unmap(TYPE_PTE);
+
+        /*
+         * Ensure all previous pgtable writes are visible to the table walker.
+         * See init_pte().
+         */
+        dsb(ishst);
+        isb();
 }
 
 static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
--
2.25.1


2024-03-27 19:22:37

by Ryan Roberts

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On 26/03/2024 10:14, Ryan Roberts wrote:
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)

I've just appended an additional patch to this series. This takes us to a ~95%
reduction overall:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               | ms    (%)   | ms    (%)   | ms    (%)   | ms     (%)
---------------|-------------|-------------|-------------|-------------
base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
batch-barriers |   11 (-93%) |   61 (-97%) |  261 (-97%) |   837 (-95%)

I don't believe the intermediate block-based pgtable idea will be necessary now,
so I don't intend to pursue it. It might be that we choose to drop the middle
two patches; I'm keen to hear opinions.

Thanks,
Ryan


>
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (3):
> arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
> arm64: mm: Don't remap pgtables for allocate vs populate
> arm64: mm: Lazily clear pte table mappings from fixmap
>
> arch/arm64/include/asm/fixmap.h | 5 +-
> arch/arm64/include/asm/mmu.h | 8 +
> arch/arm64/include/asm/pgtable.h | 4 -
> arch/arm64/kernel/cpufeature.c | 10 +-
> arch/arm64/mm/fixmap.c | 11 +
> arch/arm64/mm/mmu.c | 364 +++++++++++++++++++++++--------
> include/linux/pgtable.h | 8 +
> 7 files changed, 307 insertions(+), 103 deletions(-)
>
> --
> 2.25.1
>


2024-03-28 07:23:59

by Ard Biesheuvel

Subject: Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables

On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <[email protected]> wrote:
>
> After removing uneccessary TLBIs, the next bottleneck when creating the
> page tables for the linear map is DSB and ISB, which were previously
> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
> given pte table, we can elide these barriers and insert them once we
> have finished writing to the table.
>

Nice!

> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 7 ++++++-
> arch/arm64/mm/mmu.c | 13 ++++++++++++-
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index bd5d02f3f0a3..81e427b23b3f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void ___set_pte(pte_t *ptep, pte_t pte)

IMHO, we should either use WRITE_ONCE() directly in the caller, or
find a better name.

> {
> WRITE_ONCE(*ptep, pte);
> +}
> +
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
> +{
> + ___set_pte(ptep, pte);
>
> /*
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 1b2a2a2d09b7..c6d5a76732d4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> do {
> pte_t old_pte = __ptep_get(ptep);
>
> - __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
> + /*
> + * Required barriers to make this visible to the table walker
> + * are deferred to the end of alloc_init_cont_pte().
> + */
> + ___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>
> /*
> * After the PTE entry has been populated once, we
> @@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> } while (addr = next, addr != end);
>
> ops->unmap(TYPE_PTE);
> +
> + /*
> + * Ensure all previous pgtable writes are visible to the table walker.
> + * See init_pte().
> + */
> + dsb(ishst);
> + isb();
> }
>
> static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> --
> 2.25.1
>

2024-03-28 08:45:27

by Ryan Roberts

Subject: Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables

On 28/03/2024 07:23, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <[email protected]> wrote:
>>
>> After removing uneccessary TLBIs, the next bottleneck when creating the
>> page tables for the linear map is DSB and ISB, which were previously
>> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
>> given pte table, we can elide these barriers and insert them once we
>> have finished writing to the table.
>>
>
> Nice!
>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> arch/arm64/include/asm/pgtable.h | 7 ++++++-
>> arch/arm64/mm/mmu.c | 13 ++++++++++++-
>> 2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index bd5d02f3f0a3..81e427b23b3f 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> }
>>
>> -static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +static inline void ___set_pte(pte_t *ptep, pte_t pte)
>
> IMHO, we should either use WRITE_ONCE() directly in the caller, or
> find a better name.

How about __set_pte_nosync()?
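
i.e. keep the same body from the patch and just give it the clearer name:

static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
{
        WRITE_ONCE(*ptep, pte);
}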

>
>> {
>> WRITE_ONCE(*ptep, pte);
>> +}
>> +
>> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +{
>> + ___set_pte(ptep, pte);
>>
>> /*
>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 1b2a2a2d09b7..c6d5a76732d4 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>> do {
>> pte_t old_pte = __ptep_get(ptep);
>>
>> - __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>> + /*
>> + * Required barriers to make this visible to the table walker
>> + * are deferred to the end of alloc_init_cont_pte().
>> + */
>> + ___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>>
>> /*
>> * After the PTE entry has been populated once, we
>> @@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> } while (addr = next, addr != end);
>>
>> ops->unmap(TYPE_PTE);
>> +
>> + /*
>> + * Ensure all previous pgtable writes are visible to the table walker.
>> + * See init_pte().
>> + */
>> + dsb(ishst);
>> + isb();
>> }
>>
>> static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> --
>> 2.25.1
>>


2024-03-28 08:56:42

by Ard Biesheuvel

Subject: Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables

On Thu, 28 Mar 2024 at 10:45, Ryan Roberts <[email protected]> wrote:
>
> On 28/03/2024 07:23, Ard Biesheuvel wrote:
> > On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <[email protected]> wrote:
> >>
> >> After removing uneccessary TLBIs, the next bottleneck when creating the
> >> page tables for the linear map is DSB and ISB, which were previously
> >> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
> >> given pte table, we can elide these barriers and insert them once we
> >> have finished writing to the table.
> >>
> >
> > Nice!
> >
> >> Signed-off-by: Ryan Roberts <[email protected]>
> >> ---
> >> arch/arm64/include/asm/pgtable.h | 7 ++++++-
> >> arch/arm64/mm/mmu.c | 13 ++++++++++++-
> >> 2 files changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index bd5d02f3f0a3..81e427b23b3f 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> >> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> >> }
> >>
> >> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> >> +static inline void ___set_pte(pte_t *ptep, pte_t pte)
> >
> > IMHO, we should either use WRITE_ONCE() directly in the caller, or
> > find a better name.
>
> How about __set_pte_nosync() ?
>

Works for me.

2024-03-28 23:09:58

by Eric Chanudet

Subject: Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

On Wed, Mar 27, 2024 at 07:12:06PM +0000, Ryan Roberts wrote:
> On 26/03/2024 10:14, Ryan Roberts wrote:
> > Hi All,
> >
> > It turns out that creating the linear map can take a significant proportion of
> > the total boot time, especially when rodata=full. And a large portion of the
> > time it takes to create the linear map is issuing TLBIs. This series reworks the
> > kernel pgtable generation code to significantly reduce the number of TLBIs. See
> > each patch for details.
> >
> > The below shows the execution time of map_mem() across a couple of different
> > systems with different RAM configurations. We measure after applying each patch
> > and show the improvement relative to base (v6.9-rc1):
> >
> > | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> > | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> > ---------------|-------------|-------------|-------------|-------------
> > | ms (%) | ms (%) | ms (%) | ms (%)
> > ---------------|-------------|-------------|-------------|-------------
> > base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> > no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> > no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> > lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
>
> I've just appended an additional patch to this series. This takes us to a ~95%
> reduction overall:
>
> | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> | VM, 16G | VM, 64G | VM, 256G | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
> | ms (%) | ms (%) | ms (%) | ms (%)
> ---------------|-------------|-------------|-------------|-------------
> base | 151 (0%) | 2191 (0%) | 8990 (0%) | 17443 (0%)
> no-cont-remap | 77 (-49%) | 429 (-80%) | 1753 (-80%) | 3796 (-78%)
> no-alloc-remap | 77 (-49%) | 375 (-83%) | 1532 (-83%) | 3366 (-81%)
> lazy-unmap | 63 (-58%) | 330 (-85%) | 1312 (-85%) | 2929 (-83%)
> batch-barriers | 11 (-93%) | 61 (-97%) | 261 (-97%) | 837 (-95%)
>
> Don't believe the intermediate block-based pgtable idea will now be neccessary
> so I don't intend to persue that. It might be that we choose to drop the middle
> two patchs; I'm keen to hear opinions.
>

Applied on v6.9-rc1, I see a much shorter base timing on a similar
machine (Ampere HR350A). no-alloc-remap didn't show much difference
here either.

               | SA8775p-ride | Ampere HR350A|
               | VM, 36G      | Metal, 256G  |
---------------|--------------|--------------|
               | ms (%)       | ms (%)       |
---------------|--------------|--------------|
base           |  358   (0%)  | 2213   (0%)  |
no-cont-remap  |  232 (-35%)  | 1283 (-42%)  |
no-alloc-remap |  228 (-36%)  | 1282 (-42%)  |
lazy-unmap     |  231 (-35%)  | 1248 (-44%)  |
batch-barriers |   25 (-93%)  |  204 (-91%)  |

Tested-By: Eric Chanudet <[email protected]>


> > This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> > tested all VA size configs (although I don't anticipate any issues); I'll do
> > this as part of followup.
> >
> > Thanks,
> > Ryan
> >
> >
> > Ryan Roberts (3):
> > arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
> > arm64: mm: Don't remap pgtables for allocate vs populate
> > arm64: mm: Lazily clear pte table mappings from fixmap
> >
> > arch/arm64/include/asm/fixmap.h | 5 +-
> > arch/arm64/include/asm/mmu.h | 8 +
> > arch/arm64/include/asm/pgtable.h | 4 -
> > arch/arm64/kernel/cpufeature.c | 10 +-
> > arch/arm64/mm/fixmap.c | 11 +
> > arch/arm64/mm/mmu.c | 364 +++++++++++++++++++++++--------
> > include/linux/pgtable.h | 8 +
> > 7 files changed, 307 insertions(+), 103 deletions(-)
> >
> > --
> > 2.25.1
> >
>

--
Eric Chanudet