LinuxLists.cc - [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/4 18:35, Will Deacon 写道:
> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>> use non block/section mapping(for crashkernel requires to shrink the region
>> in page granularity). But it will degrade performance when doing larging
>> continuous mem access in kernel(memcpy/memmove, etc).
>
> Hmm. It seems a bit silly to me that we take special care to unmap the
> crashkernel from the linear map even when can_set_direct_map() is false, as
> we won't be protecting the main kernel at all!
>
> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> and then this problem just goes away?
>
> Will

This question had been asked lask week.

1.Quoted messages from arch/arm64/mm/init.c

"Memory reservation for crash kernel either done early or deferred
depending on DMA memory zones configs (ZONE_DMA) --

In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
here instead of max_zone_phys(). This lets early reservation of
crash kernel memory which has a dependency on arm64_dma_phys_limit.
Reserving memory early for crash kernel allows linear creation of block
mappings (greater than page-granularity) for all the memory bank rangs.
In this scheme a comparatively quicker boot is observed.

If ZONE_DMA configs are defined, crash kernel memory reservation
is delayed until DMA zone memory range size initialization performed in
zone_sizes_init(). The defer is necessary to steer clear of DMA zone
memory range to avoid overlap allocation.

[[[
So crash kernel memory boundaries are not known when mapping all bank
memory ranges, which otherwise means not possible to exclude crash
kernel range from creating block mappings so page-granularity mappings
are created for the entire memory range.
]]]"

Namely, the init order: memblock init--->linear mem mapping(4k mapping
for crashkernel, requirinig page-granularity changing))--->zone dma
limit--->reserve crashkernel.
So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
mapping.

2.As mentioned above, when linear mem use 4k mapping simply, there is
high dtlb miss(degrade performance).
This patch use block/section mapping as far as possible with performance
improvement.

3.This patch reserve crashkernel as same as the history(ZONE DMA &
crashkernel reserving order), and only change the linear mem mapping to
block/section mapping.
Init order: memblock init--->linear mem mapping(block/section mapping
for linear mem mapping))--->zone dma limit--->reserve
crashkernel--->[[[only]]] rebuild 4k pagesize mapping for crashkernel mem

With this method, there will use block/section mapping as far as possible.

2022-07-04 11:37:09

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> use non block/section mapping(for crashkernel requires to shrink the region
> in page granularity). But it will degrade performance when doing larging
> continuous mem access in kernel(memcpy/memmove, etc).

Hmm. It seems a bit silly to me that we take special care to unmap the
crashkernel from the linear map even when can_set_direct_map() is false, as
we won't be protecting the main kernel at all!

Why don't we just leave the crashkernel mapped if !can_set_direct_map()
and then this problem just goes away?

Will

2022-07-04 11:38:36

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>
>
> 在 2022/7/4 18:35, Will Deacon 写道:
> > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> > > use non block/section mapping(for crashkernel requires to shrink the region
> > > in page granularity). But it will degrade performance when doing larging
> > > continuous mem access in kernel(memcpy/memmove, etc).
> >
> > Hmm. It seems a bit silly to me that we take special care to unmap the
> > crashkernel from the linear map even when can_set_direct_map() is false, as
> > we won't be protecting the main kernel at all!
> >
> > Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> > and then this problem just goes away?
> >
> > Will
>
> This question had been asked lask week.

Sorry, I didn't spot that. Please could you link me to the conversation, as
I'm still unable to find it in my inbox?

> 1.Quoted messages from arch/arm64/mm/init.c
>
> "Memory reservation for crash kernel either done early or deferred
> depending on DMA memory zones configs (ZONE_DMA) --
>
> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> here instead of max_zone_phys(). This lets early reservation of
> crash kernel memory which has a dependency on arm64_dma_phys_limit.
> Reserving memory early for crash kernel allows linear creation of block
> mappings (greater than page-granularity) for all the memory bank rangs.
> In this scheme a comparatively quicker boot is observed.
>
> If ZONE_DMA configs are defined, crash kernel memory reservation
> is delayed until DMA zone memory range size initialization performed in
> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
> memory range to avoid overlap allocation.
>
> [[[
> So crash kernel memory boundaries are not known when mapping all bank memory
> ranges, which otherwise means not possible to exclude crash kernel range
> from creating block mappings so page-granularity mappings are created for
> the entire memory range.
> ]]]"
>
> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> crashkernel, requirinig page-granularity changing))--->zone dma
> limit--->reserve crashkernel.
> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> mapping.

Yes, I understand that is how things work today but I'm saying that we may
as well leave the crashkernel mapped (at block granularity) if
!can_set_direct_map() and then I think your patch becomes a lot simpler.

Will

2022-07-04 12:27:06

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/4 19:14, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 18:35, Will Deacon 写道:
>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>> in page granularity). But it will degrade performance when doing larging
>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>
>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>> we won't be protecting the main kernel at all!
>>>
>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>> and then this problem just goes away?
>>>
>>> Will
>>
>> This question had been asked lask week.
>
> Sorry, I didn't spot that. Please could you link me to the conversation, as
> I'm still unable to find it in my inbox?

Please access this link:
https://lore.kernel.org/linux-arm-kernel/[email protected]/T/

>
>> 1.Quoted messages from arch/arm64/mm/init.c
>>
>> "Memory reservation for crash kernel either done early or deferred
>> depending on DMA memory zones configs (ZONE_DMA) --
>>
>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>> here instead of max_zone_phys(). This lets early reservation of
>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>> Reserving memory early for crash kernel allows linear creation of block
>> mappings (greater than page-granularity) for all the memory bank rangs.
>> In this scheme a comparatively quicker boot is observed.
>>
>> If ZONE_DMA configs are defined, crash kernel memory reservation
>> is delayed until DMA zone memory range size initialization performed in
>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
>> memory range to avoid overlap allocation.
>>
>> [[[
>> So crash kernel memory boundaries are not known when mapping all bank memory
>> ranges, which otherwise means not possible to exclude crash kernel range
>> from creating block mappings so page-granularity mappings are created for
>> the entire memory range.
>> ]]]"
>>
>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>> crashkernel, requirinig page-granularity changing))--->zone dma
>> limit--->reserve crashkernel.
>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>> mapping.
>
> Yes, I understand that is how things work today but I'm saying that we may
> as well leave the crashkernel mapped (at block granularity) if
> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>
> Will

But Page-granularity mapppings are necessary for crash kernel memory
range for shrinking its size via /sys/kernel/kexec_crash_size
interfac(Quoted from arch/arm64/mm/init.c).
So this patch split block/section mapping to 4k page-granularity mapping
for crashkernel mem.

Thanks.

2022-07-04 13:25:37

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>
>
> 在 2022/7/4 19:14, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
> > >
> > >
> > > 在 2022/7/4 18:35, Will Deacon 写道:
> > > > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
> > > > > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> > > > > (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> > > > > use non block/section mapping(for crashkernel requires to shrink the region
> > > > > in page granularity). But it will degrade performance when doing larging
> > > > > continuous mem access in kernel(memcpy/memmove, etc).
> > > >
> > > > Hmm. It seems a bit silly to me that we take special care to unmap the
> > > > crashkernel from the linear map even when can_set_direct_map() is false, as
> > > > we won't be protecting the main kernel at all!
> > > >
> > > > Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> > > > and then this problem just goes away?
> > >
> > > This question had been asked lask week.
> >
> > Sorry, I didn't spot that. Please could you link me to the conversation, as
> > I'm still unable to find it in my inbox?
>
> Please access this link:
> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/

Sorry, but I read through the thread and I still can't find where the
possibility of leaving the crashkernel mapped was discussed.

> > > 1.Quoted messages from arch/arm64/mm/init.c
> > >
> > > "Memory reservation for crash kernel either done early or deferred
> > > depending on DMA memory zones configs (ZONE_DMA) --
> > >
> > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > here instead of max_zone_phys(). This lets early reservation of
> > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > Reserving memory early for crash kernel allows linear creation of block
> > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > In this scheme a comparatively quicker boot is observed.
> > >
> > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > is delayed until DMA zone memory range size initialization performed in
> > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone
> > > memory range to avoid overlap allocation.
> > >
> > > [[[
> > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > ranges, which otherwise means not possible to exclude crash kernel range
> > > from creating block mappings so page-granularity mappings are created for
> > > the entire memory range.
> > > ]]]"
> > >
> > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > limit--->reserve crashkernel.
> > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > mapping.
> >
> > Yes, I understand that is how things work today but I'm saying that we may
> > as well leave the crashkernel mapped (at block granularity) if
> > !can_set_direct_map() and then I think your patch becomes a lot simpler.
>
> But Page-granularity mapppings are necessary for crash kernel memory range
> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> arch/arm64/mm/init.c).
> So this patch split block/section mapping to 4k page-granularity mapping for
> crashkernel mem.

Why? I don't see why the mapping granularity is relevant at all if we
always leave the whole thing mapped.

Will

2022-07-04 14:24:02

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

Thanks.

在 2022/7/4 21:15, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 19:14, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>>>
>>>>
>>>> 在 2022/7/4 18:35, Will Deacon 写道:
>>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>>>> in page granularity). But it will degrade performance when doing larging
>>>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>>>
>>>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>>>> we won't be protecting the main kernel at all!
>>>>>
>>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>>>> and then this problem just goes away?
>>>>
>>>> This question had been asked lask week.
>>>
>>> Sorry, I didn't spot that. Please could you link me to the conversation, as
>>> I'm still unable to find it in my inbox?
>>
>> Please access this link:
>> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
>
> Sorry, but I read through the thread and I still can't find where the
> possibility of leaving the crashkernel mapped was discussed >
>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>
>>>> "Memory reservation for crash kernel either done early or deferred
>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>
>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>> here instead of max_zone_phys(). This lets early reservation of
>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>> Reserving memory early for crash kernel allows linear creation of block
>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>> In this scheme a comparatively quicker boot is observed.
>>>>
>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>> is delayed until DMA zone memory range size initialization performed in
>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
>>>> memory range to avoid overlap allocation.
>>>>
>>>> [[[
>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>> from creating block mappings so page-granularity mappings are created for
>>>> the entire memory range.
>>>> ]]]"
>>>>
>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>> limit--->reserve crashkernel.
>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>> mapping.
>>>
>>> Yes, I understand that is how things work today but I'm saying that we may
>>> as well leave the crashkernel mapped (at block granularity) if
>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>
>> But Page-granularity mapppings are necessary for crash kernel memory range
>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>> arch/arm64/mm/init.c).
>> So this patch split block/section mapping to 4k page-granularity mapping for
>> crashkernel mem.
>
> Why? I don't see why the mapping granularity is relevant at all if we
> always leave the whole thing mapped.
>
> Will

I have find the commit 06a7f711246b081afc21fff859f1003f1f2a0fbc adding
/sys/kernel/kexec_crash_size for changing crashkernel mem size.

"Implement shrinking the reserved memory for crash kernel, if it is more
than enough."

Maybe we could use block/section mapping for crashkernle mem, and split
a part of crashkernel mem block/section mapping when shringking(by
writing to /sys/kernel/kexec_crash_size(handled by crash_shrink_memory,
crash_free_reserved_phys_range)).
(Maybe there is no need to split all crashkernle mem block/section
mapping at boot time).

Thanks.

2022-07-04 14:42:28

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

Thanks.

在 2022/7/4 22:23, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>> 在 2022/7/4 21:15, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>>>
>>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>>
>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>>> here instead of max_zone_phys(). This lets early reservation of
>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>>
>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
>>>>>> memory range to avoid overlap allocation.
>>>>>>
>>>>>> [[[
>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>>>> from creating block mappings so page-granularity mappings are created for
>>>>>> the entire memory range.
>>>>>> ]]]"
>>>>>>
>>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>>>> limit--->reserve crashkernel.
>>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>>>> mapping.
>>>>>
>>>>> Yes, I understand that is how things work today but I'm saying that we may
>>>>> as well leave the crashkernel mapped (at block granularity) if
>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>>>
>>>> But Page-granularity mapppings are necessary for crash kernel memory range
>>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>>>> arch/arm64/mm/init.c).
>>>> So this patch split block/section mapping to 4k page-granularity mapping for
>>>> crashkernel mem.
>>>
>>> Why? I don't see why the mapping granularity is relevant at all if we
>>> always leave the whole thing mapped.
>>>
>> There is another reason.
>>
>> When loading crashkernel finish, the do_kexec_load will use
>> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
>> mem(protect crashkernel mem from access).
>>
>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
>>
>> In the apply_to_pmd_range, there is a judement： BUG_ON(pud_huge(*pud)). And
>> if the crashkernel use block/section mapping, there will be some error.
>>
>> Namely, it's need to use non block/section mapping for crashkernel mem
>> before shringking.
>
> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> that if we're leaving the thing mapped, no?
>
> Will

I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.

Because when invalid crashkernel mem pagetable, there is no chance to
rd/wr the crashkernel mem by mistake.

If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel
mem pagetable, there maybe some write operations to these mem by mistake
which may cause crashkernel boot error and vmcore saving error.

Can we change the arch_kexec_[un]protect_crashkres to support
block/section mapping?(But we also need to remap when shrinking)

Thanks.

2022-07-04 14:43:40

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> 在 2022/7/4 21:15, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> > > > > 1.Quoted messages from arch/arm64/mm/init.c
> > > > >
> > > > > "Memory reservation for crash kernel either done early or deferred
> > > > > depending on DMA memory zones configs (ZONE_DMA) --
> > > > >
> > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > > > here instead of max_zone_phys(). This lets early reservation of
> > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > > > Reserving memory early for crash kernel allows linear creation of block
> > > > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > > > In this scheme a comparatively quicker boot is observed.
> > > > >
> > > > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > > > is delayed until DMA zone memory range size initialization performed in
> > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone
> > > > > memory range to avoid overlap allocation.
> > > > >
> > > > > [[[
> > > > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > > > ranges, which otherwise means not possible to exclude crash kernel range
> > > > > from creating block mappings so page-granularity mappings are created for
> > > > > the entire memory range.
> > > > > ]]]"
> > > > >
> > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > > > limit--->reserve crashkernel.
> > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > > > mapping.
> > > >
> > > > Yes, I understand that is how things work today but I'm saying that we may
> > > > as well leave the crashkernel mapped (at block granularity) if
> > > > !can_set_direct_map() and then I think your patch becomes a lot simpler.
> > >
> > > But Page-granularity mapppings are necessary for crash kernel memory range
> > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> > > arch/arm64/mm/init.c).
> > > So this patch split block/section mapping to 4k page-granularity mapping for
> > > crashkernel mem.
> >
> > Why? I don't see why the mapping granularity is relevant at all if we
> > always leave the whole thing mapped.
> >
> There is another reason.
>
> When loading crashkernel finish, the do_kexec_load will use
> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
> mem(protect crashkernel mem from access).
>
> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
>
> In the apply_to_pmd_range, there is a judement： BUG_ON(pud_huge(*pud)). And
> if the crashkernel use block/section mapping, there will be some error.
>
> Namely, it's need to use non block/section mapping for crashkernel mem
> before shringking.

Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
that if we're leaving the thing mapped, no?

Will

2022-07-04 15:06:29

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/4 21:15, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/4 19:14, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
>>>>
>>>>
>>>> 在 2022/7/4 18:35, Will Deacon 写道:
>>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote:
>>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
>>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
>>>>>> use non block/section mapping(for crashkernel requires to shrink the region
>>>>>> in page granularity). But it will degrade performance when doing larging
>>>>>> continuous mem access in kernel(memcpy/memmove, etc).
>>>>>
>>>>> Hmm. It seems a bit silly to me that we take special care to unmap the
>>>>> crashkernel from the linear map even when can_set_direct_map() is false, as
>>>>> we won't be protecting the main kernel at all!
>>>>>
>>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map()
>>>>> and then this problem just goes away?
>>>>
>>>> This question had been asked lask week.
>>>
>>> Sorry, I didn't spot that. Please could you link me to the conversation, as
>>> I'm still unable to find it in my inbox?
>>
>> Please access this link:
>> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
>
> Sorry, but I read through the thread and I still can't find where the
> possibility of leaving the crashkernel mapped was discussed.
>
>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>
>>>> "Memory reservation for crash kernel either done early or deferred
>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>
>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>> here instead of max_zone_phys(). This lets early reservation of
>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>> Reserving memory early for crash kernel allows linear creation of block
>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>> In this scheme a comparatively quicker boot is observed.
>>>>
>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>> is delayed until DMA zone memory range size initialization performed in
>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
>>>> memory range to avoid overlap allocation.
>>>>
>>>> [[[
>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>> from creating block mappings so page-granularity mappings are created for
>>>> the entire memory range.
>>>> ]]]"
>>>>
>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>> limit--->reserve crashkernel.
>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>> mapping.
>>>
>>> Yes, I understand that is how things work today but I'm saying that we may
>>> as well leave the crashkernel mapped (at block granularity) if
>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>
>> But Page-granularity mapppings are necessary for crash kernel memory range
>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>> arch/arm64/mm/init.c).
>> So this patch split block/section mapping to 4k page-granularity mapping for
>> crashkernel mem.
>
> Why? I don't see why the mapping granularity is relevant at all if we
> always leave the whole thing mapped.
>
> Will

There is another reason.

When loading crashkernel finish, the do_kexec_load will use
arch_kexec_protect_crashkres to invalid all the pagetable for
crashkernel mem(protect crashkernel mem from access).

arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range

In the apply_to_pmd_range, there is a judement： BUG_ON(pud_huge(*pud)).
And if the crashkernel use block/section mapping, there will be some error.

Namely, it's need to use non block/section mapping for crashkernel mem
before shringking.

Thanks.

2022-07-04 17:01:30

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> Thanks.
>
> 在 2022/7/4 22:23, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> > > 在 2022/7/4 21:15, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> > > > > > > 1.Quoted messages from arch/arm64/mm/init.c
> > > > > > >
> > > > > > > "Memory reservation for crash kernel either done early or deferred
> > > > > > > depending on DMA memory zones configs (ZONE_DMA) --
> > > > > > >
> > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
> > > > > > > here instead of max_zone_phys(). This lets early reservation of
> > > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit.
> > > > > > > Reserving memory early for crash kernel allows linear creation of block
> > > > > > > mappings (greater than page-granularity) for all the memory bank rangs.
> > > > > > > In this scheme a comparatively quicker boot is observed.
> > > > > > >
> > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation
> > > > > > > is delayed until DMA zone memory range size initialization performed in
> > > > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone
> > > > > > > memory range to avoid overlap allocation.
> > > > > > >
> > > > > > > [[[
> > > > > > > So crash kernel memory boundaries are not known when mapping all bank memory
> > > > > > > ranges, which otherwise means not possible to exclude crash kernel range
> > > > > > > from creating block mappings so page-granularity mappings are created for
> > > > > > > the entire memory range.
> > > > > > > ]]]"
> > > > > > >
> > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for
> > > > > > > crashkernel, requirinig page-granularity changing))--->zone dma
> > > > > > > limit--->reserve crashkernel.
> > > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
> > > > > > > mapping.
> > > > > >
> > > > > > Yes, I understand that is how things work today but I'm saying that we may
> > > > > > as well leave the crashkernel mapped (at block granularity) if
> > > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler.
> > > > >
> > > > > But Page-granularity mapppings are necessary for crash kernel memory range
> > > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
> > > > > arch/arm64/mm/init.c).
> > > > > So this patch split block/section mapping to 4k page-granularity mapping for
> > > > > crashkernel mem.
> > > >
> > > > Why? I don't see why the mapping granularity is relevant at all if we
> > > > always leave the whole thing mapped.
> > > >
> > > There is another reason.
> > >
> > > When loading crashkernel finish, the do_kexec_load will use
> > > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
> > > mem(protect crashkernel mem from access).
> > >
> > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
> > >
> > > In the apply_to_pmd_range, there is a judement： BUG_ON(pud_huge(*pud)). And
> > > if the crashkernel use block/section mapping, there will be some error.
> > >
> > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > before shringking.
> >
> > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > that if we're leaving the thing mapped, no?
> >
> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>
> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> the crashkernel mem by mistake.
>
> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> pagetable, there maybe some write operations to these mem by mistake which
> may cause crashkernel boot error and vmcore saving error.

I don't really buy this line of reasoning. The entire main kernel is
writable, so why do we care about protecting the crashkernel so much? The
_code_ to launch the crash kernel is writable! If you care about preventing
writes to memory which should not be writable, then you should use
rodata=full.

Will

2022-07-04 17:19:36

by Ard Biesheuvel

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
>
> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > Thanks.
> >
> > 在 2022/7/4 22:23, Will Deacon 写道:
> > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
...
> > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > before shringking.
> > >
> > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > that if we're leaving the thing mapped, no?
> > >
> > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> >
> > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > the crashkernel mem by mistake.
> >
> > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > pagetable, there maybe some write operations to these mem by mistake which
> > may cause crashkernel boot error and vmcore saving error.
>
> I don't really buy this line of reasoning. The entire main kernel is
> writable, so why do we care about protecting the crashkernel so much? The
> _code_ to launch the crash kernel is writable! If you care about preventing
> writes to memory which should not be writable, then you should use
> rodata=full.
>

This is not entirely true - the core kernel text and rodata are
remapped r/o in the linear map, whereas all module code and rodata are
left writable when rodata != full.

But the conclusion is the same, imo: if you can't be bothered to
protect a good chunk of the code and rodata that the kernel relies on,
why should the crashkernel be treated any differently?

2022-07-05 02:53:48

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/5 0:38, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>> Thanks.
>>
>> 在 2022/7/4 22:23, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>>>> 在 2022/7/4 21:15, Will Deacon 写道:
>>>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
>>>>>>>> 1.Quoted messages from arch/arm64/mm/init.c
>>>>>>>>
>>>>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>>>>
>>>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>>>>> here instead of max_zone_phys(). This lets early reservation of
>>>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>>>>> mappings (greater than page-granularity) for all the memory bank rangs.
>>>>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>>>>
>>>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone
>>>>>>>> memory range to avoid overlap allocation.
>>>>>>>>
>>>>>>>> [[[
>>>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory
>>>>>>>> ranges, which otherwise means not possible to exclude crash kernel range
>>>>>>>> from creating block mappings so page-granularity mappings are created for
>>>>>>>> the entire memory range.
>>>>>>>> ]]]"
>>>>>>>>
>>>>>>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for
>>>>>>>> crashkernel, requirinig page-granularity changing))--->zone dma
>>>>>>>> limit--->reserve crashkernel.
>>>>>>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k
>>>>>>>> mapping.
>>>>>>>
>>>>>>> Yes, I understand that is how things work today but I'm saying that we may
>>>>>>> as well leave the crashkernel mapped (at block granularity) if
>>>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler.
>>>>>>
>>>>>> But Page-granularity mapppings are necessary for crash kernel memory range
>>>>>> for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from
>>>>>> arch/arm64/mm/init.c).
>>>>>> So this patch split block/section mapping to 4k page-granularity mapping for
>>>>>> crashkernel mem.
>>>>>
>>>>> Why? I don't see why the mapping granularity is relevant at all if we
>>>>> always leave the whole thing mapped.
>>>>>
>>>> There is another reason.
>>>>
>>>> When loading crashkernel finish, the do_kexec_load will use
>>>> arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel
>>>> mem(protect crashkernel mem from access).
>>>>
>>>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range
>>>>
>>>> In the apply_to_pmd_range, there is a judement： BUG_ON(pud_huge(*pud)). And
>>>> if the crashkernel use block/section mapping, there will be some error.
>>>>
>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>> before shringking.
>>>
>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>> that if we're leaving the thing mapped, no?
>>>
>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>
>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>> the crashkernel mem by mistake.
>>
>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>> pagetable, there maybe some write operations to these mem by mistake which
>> may cause crashkernel boot error and vmcore saving error.
>
> I don't really buy this line of reasoning. The entire main kernel is
> writable, so why do we care about protecting the crashkernel so much? The
> _code_ to launch the crash kernel is writable! If you care about preventing
> writes to memory which should not be writable, then you should use
> rodata=full.
>
> Will
Thanks.

1.I think the normal kernel mem can be writeable or readonly

2.But the normal kernel should't access(read/write) crashkernel mem
after the crashkernel is loaded to the mem.(despite the normal kernel
write/read access protect)
So invalid pagetable for crashkernel mem is needed.

With this method, it can guarantee the usability of the crashkernel when
occuring emergency and generating vmcore.

2022-07-05 08:41:09

by Baoquan He

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On 07/04/22 at 07:09pm, Ard Biesheuvel wrote:
> On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
> >
> > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > Thanks.
> > >
> > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> ...
> > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > before shringking.
> > > >
> > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > that if we're leaving the thing mapped, no?
> > > >
> > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > >
> > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > the crashkernel mem by mistake.
> > >
> > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > pagetable, there maybe some write operations to these mem by mistake which
> > > may cause crashkernel boot error and vmcore saving error.
> >
> > I don't really buy this line of reasoning. The entire main kernel is
> > writable, so why do we care about protecting the crashkernel so much? The
> > _code_ to launch the crash kernel is writable! If you care about preventing
> > writes to memory which should not be writable, then you should use
> > rodata=full.
> >
>
> This is not entirely true - the core kernel text and rodata are
> remapped r/o in the linear map, whereas all module code and rodata are
> left writable when rodata != full.
>
> But the conclusion is the same, imo: if you can't be bothered to
> protect a good chunk of the code and rodata that the kernel relies on,
> why should the crashkernel be treated any differently?

Kernel text and rodata are remapped r/o in linear map, whereas
module code and rodata are left writable, it's different concept
than crashkernel region being mapped r/o.

If it's doable in technology to remap module code and rodata r/o, and
stamping into those regions will corrupt the entire system, we should do
it too. However, kdump is a system error diagonosing mechanism which is
very important and helpful on server, or some application scenarios, e.g
cloud. Stamping into crashkernel region will make it useless.

I am not against removing the arch_kexec_[un]protect_crashkres on arm64.
It is a balance:

Protecting the crashkernel region, causeing severe performance
degradation. This is always felt since we usually don't specify rodata
and enable kfence.

Taking off the protecting of crashkernel region, performance improved
very much, while wrong code may stamp into crashkernel region and fail
kdump. That could happen one in a million. Once happen, it's a nightmare
of kernel dev.

2022-07-05 10:00:37

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
> On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
> >
> > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > Thanks.
> > >
> > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> ...
> > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > before shringking.
> > > >
> > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > that if we're leaving the thing mapped, no?
> > > >
> > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > >
> > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > the crashkernel mem by mistake.
> > >
> > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > pagetable, there maybe some write operations to these mem by mistake which
> > > may cause crashkernel boot error and vmcore saving error.
> >
> > I don't really buy this line of reasoning. The entire main kernel is
> > writable, so why do we care about protecting the crashkernel so much? The
> > _code_ to launch the crash kernel is writable! If you care about preventing
> > writes to memory which should not be writable, then you should use
> > rodata=full.
> >
>
> This is not entirely true - the core kernel text and rodata are
> remapped r/o in the linear map, whereas all module code and rodata are
> left writable when rodata != full.

Yes, sorry, you're quite right. The kernel text is only writable if
rodata=off.

But I still think it makes sense to protect the crashkernel only if
rodata=full (which is the default on arm64) as this allows us to rely
on page mappings and I think fits well with what we do for modules.

> But the conclusion is the same, imo: if you can't be bothered to
> protect a good chunk of the code and rodata that the kernel relies on,
> why should the crashkernel be treated any differently?

Thanks.

Will

2022-07-05 13:10:18

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
>
>
> 在 2022/7/5 17:52, Will Deacon 写道:
> > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
> > > >
> > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
> > > > > Thanks.
> > > > >
> > > > > 在 2022/7/4 22:23, Will Deacon 写道:
> > > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
> > > ...
> > > > > > > Namely, it's need to use non block/section mapping for crashkernel mem
> > > > > > > before shringking.
> > > > > >
> > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
> > > > > > that if we're leaving the thing mapped, no?
> > > > > >
> > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
> > > > >
> > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
> > > > > the crashkernel mem by mistake.
> > > > >
> > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
> > > > > pagetable, there maybe some write operations to these mem by mistake which
> > > > > may cause crashkernel boot error and vmcore saving error.
> > > >
> > > > I don't really buy this line of reasoning. The entire main kernel is
> > > > writable, so why do we care about protecting the crashkernel so much? The
> > > > _code_ to launch the crash kernel is writable! If you care about preventing
> > > > writes to memory which should not be writable, then you should use
> > > > rodata=full.
> > > >
> > >
> > > This is not entirely true - the core kernel text and rodata are
> > > remapped r/o in the linear map, whereas all module code and rodata are
> > > left writable when rodata != full.
> >
> > Yes, sorry, you're quite right. The kernel text is only writable if
> > rodata=off.
> >
> > But I still think it makes sense to protect the crashkernel only if
> > rodata=full (which is the default on arm64) as this allows us to rely
> > on page mappings and I think fits well with what we do for modules.
> >
> > > But the conclusion is the same, imo: if you can't be bothered to
> > > protect a good chunk of the code and rodata that the kernel relies on,
> > > why should the crashkernel be treated any differently?
> >
> > Thanks.
> >
> > Will
> Thanks.
>
> 1.The rodata full is harm to the performance and has been disabled in-house.
>
> 2.When using crashkernel with rodata non full, the kernel also will use non
> block/section mapping which cause high d-TLB miss and degrade performance
> greatly.
> This patch fix it to use block/section mapping as far as possible.
>
> bool can_set_direct_map(void)
> {
> return rodata_full || debug_pagealloc_enabled();
> }
>
> map_mem:
> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> 3.When rodata full is disabled, crashkernel also need protect(keep
> arch_kexec_[un]protect_crashkres using).
> I think crashkernel should't depend on radata full(Maybe other architecture
> don't support radata full now).

I think this is going round in circles :/

As a first step, can we please leave the crashkernel mapped unless
rodata=full? It should be a much simpler patch to write, review and maintain
and it gives you the performance you want when crashkernel is being used.

Will

2022-07-05 13:29:47

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/5 17:52, Will Deacon 写道:
> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
>>>
>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>>>> Thanks.
>>>>
>>>> 在 2022/7/4 22:23, Will Deacon 写道:
>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>> ...
>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>>>> before shringking.
>>>>>
>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>>>> that if we're leaving the thing mapped, no?
>>>>>
>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>>>
>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>>>> the crashkernel mem by mistake.
>>>>
>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>>>> pagetable, there maybe some write operations to these mem by mistake which
>>>> may cause crashkernel boot error and vmcore saving error.
>>>
>>> I don't really buy this line of reasoning. The entire main kernel is
>>> writable, so why do we care about protecting the crashkernel so much? The
>>> _code_ to launch the crash kernel is writable! If you care about preventing
>>> writes to memory which should not be writable, then you should use
>>> rodata=full.
>>>
>>
>> This is not entirely true - the core kernel text and rodata are
>> remapped r/o in the linear map, whereas all module code and rodata are
>> left writable when rodata != full.
>
> Yes, sorry, you're quite right. The kernel text is only writable if
> rodata=off.
>
> But I still think it makes sense to protect the crashkernel only if
> rodata=full (which is the default on arm64) as this allows us to rely
> on page mappings and I think fits well with what we do for modules.
>
>> But the conclusion is the same, imo: if you can't be bothered to
>> protect a good chunk of the code and rodata that the kernel relies on,
>> why should the crashkernel be treated any differently?
>
> Thanks.
>
> Will
Thanks.

1.The rodata full is harm to the performance and has been disabled in-house.

2.When using crashkernel with rodata non full, the kernel also will use
non block/section mapping which cause high d-TLB miss and degrade
performance greatly.
This patch fix it to use block/section mapping as far as possible.

bool can_set_direct_map(void)
{
return rodata_full || debug_pagealloc_enabled();
}

map_mem:
if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

3.When rodata full is disabled, crashkernel also need protect(keep
arch_kexec_[un]protect_crashkres using).
I think crashkernel should't depend on radata full(Maybe other
architecture don't support radata full now).

Thanks.

2022-07-05 14:36:40

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
> >
> > 1.The rodata full is harm to the performance and has been disabled in-house.
> >
> > 2.When using crashkernel with rodata non full, the kernel also will use non
> > block/section mapping which cause high d-TLB miss and degrade performance
> > greatly.
> > This patch fix it to use block/section mapping as far as possible.
> >
> > bool can_set_direct_map(void)
> > {
> > return rodata_full || debug_pagealloc_enabled();
> > }
> >
> > map_mem:
> > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >
> > 3.When rodata full is disabled, crashkernel also need protect(keep
> > arch_kexec_[un]protect_crashkres using).
> > I think crashkernel should't depend on radata full(Maybe other architecture
> > don't support radata full now).
>
> I think this is going round in circles :/
>
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.

Since we are talking about large systems, what do you think about letting
them set CRASH_ALIGN to PUD_SIZE, then

unmap(crashkernel);
__create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS);

should be enough to make crash kernel mapped with base pages.

> Will

--
Sincerely yours,
Mike.

2022-07-05 14:40:31

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/5 20:56, Mike Rapoport 写道:
> On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
>> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
>>>
>>> 1.The rodata full is harm to the performance and has been disabled in-house.
>>>
>>> 2.When using crashkernel with rodata non full, the kernel also will use non
>>> block/section mapping which cause high d-TLB miss and degrade performance
>>> greatly.
>>> This patch fix it to use block/section mapping as far as possible.
>>>
>>> bool can_set_direct_map(void)
>>> {
>>> return rodata_full || debug_pagealloc_enabled();
>>> }
>>>
>>> map_mem:
>>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>>
>>> 3.When rodata full is disabled, crashkernel also need protect(keep
>>> arch_kexec_[un]protect_crashkres using).
>>> I think crashkernel should't depend on radata full(Maybe other architecture
>>> don't support radata full now).
>>
>> I think this is going round in circles :/
>>
>> As a first step, can we please leave the crashkernel mapped unless
>> rodata=full? It should be a much simpler patch to write, review and maintain
>> and it gives you the performance you want when crashkernel is being used.
>
> Since we are talking about large systems, what do you think about letting
> them set CRASH_ALIGN to PUD_SIZE, then
>
> unmap(crashkernel);
> __create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS);
>
> should be enough to make crash kernel mapped with base pages.
>
>> Will
>
Thanks.

1.When kernel boots with crashkernel, the crashkernel parameters format is:
0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M which is self-adaption
to multiple system.

2.As mentioned above, the crashkernel mem size maybe less than
PUD_SIZE(Not multiple time to PUD_SIZE).
So, maybe we also need use some non block/section mappings.

2022-07-05 14:40:48

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/5 20:11, Will Deacon 写道:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
>>
>>
>> 在 2022/7/5 17:52, Will Deacon 写道:
>>> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote:
>>>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <[email protected]> wrote:
>>>>>
>>>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote:
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2022/7/4 22:23, Will Deacon 写道:
>>>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote:
>>>> ...
>>>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem
>>>>>>>> before shringking.
>>>>>>>
>>>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do
>>>>>>> that if we're leaving the thing mapped, no?
>>>>>>>
>>>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem.
>>>>>>
>>>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr
>>>>>> the crashkernel mem by mistake.
>>>>>>
>>>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem
>>>>>> pagetable, there maybe some write operations to these mem by mistake which
>>>>>> may cause crashkernel boot error and vmcore saving error.
>>>>>
>>>>> I don't really buy this line of reasoning. The entire main kernel is
>>>>> writable, so why do we care about protecting the crashkernel so much? The
>>>>> _code_ to launch the crash kernel is writable! If you care about preventing
>>>>> writes to memory which should not be writable, then you should use
>>>>> rodata=full.
>>>>>
>>>>
>>>> This is not entirely true - the core kernel text and rodata are
>>>> remapped r/o in the linear map, whereas all module code and rodata are
>>>> left writable when rodata != full.
>>>
>>> Yes, sorry, you're quite right. The kernel text is only writable if
>>> rodata=off.
>>>
>>> But I still think it makes sense to protect the crashkernel only if
>>> rodata=full (which is the default on arm64) as this allows us to rely
>>> on page mappings and I think fits well with what we do for modules.
>>>
>>>> But the conclusion is the same, imo: if you can't be bothered to
>>>> protect a good chunk of the code and rodata that the kernel relies on,
>>>> why should the crashkernel be treated any differently?
>>>
>>> Thanks.
>>>
>>> Will
>> Thanks.
>>
>> 1.The rodata full is harm to the performance and has been disabled in-house.
>>
>> 2.When using crashkernel with rodata non full, the kernel also will use non
>> block/section mapping which cause high d-TLB miss and degrade performance
>> greatly.
>> This patch fix it to use block/section mapping as far as possible.
>>
>> bool can_set_direct_map(void)
>> {
>> return rodata_full || debug_pagealloc_enabled();
>> }
>>
>> map_mem:
>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> 3.When rodata full is disabled, crashkernel also need protect(keep
>> arch_kexec_[un]protect_crashkres using).
>> I think crashkernel should't depend on radata full(Maybe other architecture
>> don't support radata full now).
>
> I think this is going round in circles :/
>
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.
>
> Will

Thanks.

There is a circle.

1.When the rodata is non full, there will be some error when calling
arch_kexec_[un]protect_crashkres(BUG_ON(pud_huge(*pud))) now.
It's also need non-block/section mapping for crashkernel mem.

2.In other words, maybe we should change
arch_kexec_[un]protect_crashkres to support block/section mapping which
can leave crashkernel block/section mapping.

But when we shrink the crashkernel mem, we also need to split some
block/section mapping(part mem for crashkernel, the left for the normal
kernel).
As a result, maybe we build crashkernel mem with non-block/section
mapping is appropriate(as this patch doing).

2022-07-05 15:10:33

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote:
> >
> > 3.When rodata full is disabled, crashkernel also need protect(keep
> > arch_kexec_[un]protect_crashkres using).
> > I think crashkernel should't depend on radata full(Maybe other architecture
> > don't support radata full now).
>
> I think this is going round in circles :/
>
> As a first step, can we please leave the crashkernel mapped unless
> rodata=full? It should be a much simpler patch to write, review and maintain
> and it gives you the performance you want when crashkernel is being used.

As it seems I failed to communicate my thoughts about reusing the existing
unmap_hotplug_range() to remap the crash kernel, let's try a more formal
approach ;-)

This is what I came up with and it does not look too complex. There are a
couple of extra #ifdefs that can be removed if we toss some code around in
a preparation patch.

From 5adbfcbe370da0f09cd917e73aaac7ba8c6b45df Mon Sep 17 00:00:00 2001
From: Mike Rapoport <[email protected]>
Date: Sat, 2 Jul 2022 23:57:53 +0800
Subject: [PATCH] arm64/mm: remap crash kernel with base pages even if
rodata_full disabled

For server systems it is important to protect crash kernel memory for
post-mortem analysis. In order to protect this memory it should be mapped
at PTE level.

When CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled, usage of crash kernel
essentially forces mapping of the entire linear map with base pages even if
rodata_full is not set (commit 2687275a5843 ("arm64: Force
NO_BLOCK_MAPPINGS if crashkernel reservation is required")) and this causes
performance degradation.

To reduce the performance degradation, postpone reservation of the crash
kernel memory to bootmem_init() regardless of CONFIG_ZONE_DMA or
CONFIG_ZONE_DMA32 and enable remapping of the crash kernel memory at PTE
level.

Co-developed-by: Guanghui Feng <[email protected]>
Signed-off-by: Guanghui Feng <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/arm64/include/asm/mmu.h | 1 +
arch/arm64/mm/init.c | 8 +---
arch/arm64/mm/mmu.c | 91 +++++++++++++++++++-----------------
3 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466a4be9..f4eb2f61dd0d 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
extern void mark_linear_text_alias_ro(void);
extern bool kaslr_requires_kpti(void);
+extern void remap_crashkernel(void);

#define INIT_MM_CONTEXT(name) \
.pgd = init_pg_dir,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84e5a61..51f8329931f8 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
insert_resource(&iomem_resource, &crashk_res);
+ remap_crashkernel();
}

/*
@@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
}

early_init_fdt_scan_reserved_mem();
-
- if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
- reserve_crashkernel();
-
high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
}

@@ -438,8 +435,7 @@ void __init bootmem_init(void)
* request_standard_resources() depends on crashkernel's memory being
* reserved, so do it here.
*/
- if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
- reserve_crashkernel();
+ reserve_crashkernel();

memblock_dump_all();
}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32873c6..e0b5769bfc9f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -483,21 +483,6 @@ void __init mark_linear_text_alias_ro(void)
PAGE_KERNEL_RO);
}

-static bool crash_mem_map __initdata;
-
-static int __init enable_crash_mem_map(char *arg)
-{
- /*
- * Proper parameter parsing is done by reserve_crashkernel(). We only
- * need to know if the linear map has to avoid block mappings so that
- * the crashkernel reservations can be unmapped later.
- */
- crash_mem_map = true;
-
- return 0;
-}
-early_param("crashkernel", enable_crash_mem_map);
-
static void __init map_mem(pgd_t *pgdp)
{
static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
@@ -527,17 +512,6 @@ static void __init map_mem(pgd_t *pgdp)
*/
memblock_mark_nomap(kernel_start, kernel_end - kernel_start);

-#ifdef CONFIG_KEXEC_CORE
- if (crash_mem_map) {
- if (IS_ENABLED(CONFIG_ZONE_DMA) ||
- IS_ENABLED(CONFIG_ZONE_DMA32))
- flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
- else if (crashk_res.end)
- memblock_mark_nomap(crashk_res.start,
- resource_size(&crashk_res));
- }
-#endif
-
/* map all the memory banks */
for_each_mem_range(i, &start, &end) {
if (start >= end)
@@ -570,19 +544,6 @@ static void __init map_mem(pgd_t *pgdp)
* in page granularity and put back unused memory to buddy system
* through /sys/kernel/kexec_crash_size interface.
*/
-#ifdef CONFIG_KEXEC_CORE
- if (crash_mem_map &&
- !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
- if (crashk_res.end) {
- __map_memblock(pgdp, crashk_res.start,
- crashk_res.end + 1,
- PAGE_KERNEL,
- NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
- memblock_clear_nomap(crashk_res.start,
- resource_size(&crashk_res));
- }
- }
-#endif
}

void mark_rodata_ro(void)
@@ -827,7 +788,7 @@ int kern_addr_valid(unsigned long addr)
return pfn_valid(pte_pfn(pte));
}

-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_KEXEC_CORE)
static void free_hotplug_page_range(struct page *page, size_t size,
struct vmem_altmap *altmap)
{
@@ -839,6 +800,7 @@ static void free_hotplug_page_range(struct page *page, size_t size,
}
}

+#ifdef CONFIG_MEMORY_HOTPLUG
static void free_hotplug_pgtable_page(struct page *page)
{
free_hotplug_page_range(page, PAGE_SIZE, NULL);
@@ -862,6 +824,7 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end,
return false;
return true;
}
+#endif /* CONFIG_MEMORY_HOTPLUG */

static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
unsigned long end, bool free_mapped,
@@ -994,7 +957,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap);
} while (addr = next, addr < end);
}
+#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */

+#ifdef CONFIG_MEMORY_HOTPLUG
static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
unsigned long end, unsigned long floor,
unsigned long ceiling)
@@ -1148,7 +1113,7 @@ static void free_empty_tables(unsigned long addr, unsigned long end,
free_empty_p4d_table(pgdp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
}
-#endif
+#endif /* CONFIG_MEMORY_HOTPLUG */

#if !ARM64_KERNEL_USES_PMD_MAPS
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1213,7 +1178,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
unmap_hotplug_range(start, end, true, altmap);
free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
}
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */

static inline pud_t *fixmap_pud(unsigned long addr)
{
@@ -1677,3 +1642,45 @@ static int __init prevent_bootmem_remove_init(void)
}
early_initcall(prevent_bootmem_remove_init);
#endif
+
+void __init remap_crashkernel(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+ phys_addr_t start, end, size;
+ phys_addr_t aligned_start, aligned_end;
+
+ if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
+ return;
+
+ if (!crashk_res.end)
+ return;
+
+ start = crashk_res.start & PAGE_MASK;
+ end = PAGE_ALIGN(crashk_res.end);
+
+ aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
+ aligned_end = ALIGN(end, PUD_SIZE);
+
+ /* Clear PUDs containing crash kernel memory */
+ unmap_hotplug_range(__phys_to_virt(aligned_start),
+ __phys_to_virt(aligned_end), false, NULL);
+
+ /* map area from PUD start to start of crash kernel with large pages */
+ size = start - aligned_start;
+ __create_pgd_mapping(swapper_pg_dir, aligned_start,
+ __phys_to_virt(aligned_start),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);
+
+ /* map crash kernel memory with base pages */
+ size = end - start;
+ __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
+ size, PAGE_KERNEL, early_pgtable_alloc,
+ NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS |
+ NO_CONT_MAPPINGS);
+
+ /* map area from end of crash kernel to PUD end with large pages */
+ size = aligned_end - end;
+ __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);
+#endif
+}
--
2.35.3

> Will

--
Sincerely yours,
Mike.

2022-07-05 16:09:06

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> +void __init remap_crashkernel(void)
> +{
> +#ifdef CONFIG_KEXEC_CORE
> + phys_addr_t start, end, size;
> + phys_addr_t aligned_start, aligned_end;
> +
> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> + return;
> +
> + if (!crashk_res.end)
> + return;
> +
> + start = crashk_res.start & PAGE_MASK;
> + end = PAGE_ALIGN(crashk_res.end);
> +
> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> + aligned_end = ALIGN(end, PUD_SIZE);
> +
> + /* Clear PUDs containing crash kernel memory */
> + unmap_hotplug_range(__phys_to_virt(aligned_start),
> + __phys_to_virt(aligned_end), false, NULL);

What I don't understand is what happens if there's valid kernel data
between aligned_start and crashk_res.start (or the other end of the
range).

--
Catalin

2022-07-05 16:14:33

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > +void __init remap_crashkernel(void)
> > +{
> > +#ifdef CONFIG_KEXEC_CORE
> > + phys_addr_t start, end, size;
> > + phys_addr_t aligned_start, aligned_end;
> > +
> > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > + return;
> > +
> > + if (!crashk_res.end)
> > + return;
> > +
> > + start = crashk_res.start & PAGE_MASK;
> > + end = PAGE_ALIGN(crashk_res.end);
> > +
> > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > + aligned_end = ALIGN(end, PUD_SIZE);
> > +
> > + /* Clear PUDs containing crash kernel memory */
> > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > + __phys_to_virt(aligned_end), false, NULL);
>
> What I don't understand is what happens if there's valid kernel data
> between aligned_start and crashk_res.start (or the other end of the
> range).

Data shouldn't go anywhere :)

There is

+ /* map area from PUD start to start of crash kernel with large pages */
+ size = start - aligned_start;
+ __create_pgd_mapping(swapper_pg_dir, aligned_start,
+ __phys_to_virt(aligned_start),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);

and

+ /* map area from end of crash kernel to PUD end with large pages */
+ size = aligned_end - end;
+ __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);

after the unmap, so after we tear down a part of a linear map we
immediately recreate it, just with a different page size.

This all happens before SMP, so there is no concurrency at that point.

> --
> Catalin

--
Sincerely yours,
Mike.

2022-07-05 17:56:10

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > +void __init remap_crashkernel(void)
> > > +{
> > > +#ifdef CONFIG_KEXEC_CORE
> > > + phys_addr_t start, end, size;
> > > + phys_addr_t aligned_start, aligned_end;
> > > +
> > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > + return;
> > > +
> > > + if (!crashk_res.end)
> > > + return;
> > > +
> > > + start = crashk_res.start & PAGE_MASK;
> > > + end = PAGE_ALIGN(crashk_res.end);
> > > +
> > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > + aligned_end = ALIGN(end, PUD_SIZE);
> > > +
> > > + /* Clear PUDs containing crash kernel memory */
> > > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > + __phys_to_virt(aligned_end), false, NULL);
> >
> > What I don't understand is what happens if there's valid kernel data
> > between aligned_start and crashk_res.start (or the other end of the
> > range).
>
> Data shouldn't go anywhere :)
>
> There is
>
> + /* map area from PUD start to start of crash kernel with large pages */
> + size = start - aligned_start;
> + __create_pgd_mapping(swapper_pg_dir, aligned_start,
> + __phys_to_virt(aligned_start),
> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>
> and
>
> + /* map area from end of crash kernel to PUD end with large pages */
> + size = aligned_end - end;
> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>
> after the unmap, so after we tear down a part of a linear map we
> immediately recreate it, just with a different page size.
>
> This all happens before SMP, so there is no concurrency at that point.

That brief period of unmap worries me. The kernel text, data and stack
are all in the vmalloc space but any other (memblock) allocation to this
point may be in the unmapped range before and after the crashkernel
reservation. The interrupts are off, so I think the only allocation and
potential access that may go in this range is the page table itself. But
it looks fragile to me.

--
Catalin

2022-07-05 20:49:51

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > +void __init remap_crashkernel(void)
> > > > +{
> > > > +#ifdef CONFIG_KEXEC_CORE
> > > > + phys_addr_t start, end, size;
> > > > + phys_addr_t aligned_start, aligned_end;
> > > > +
> > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > + return;
> > > > +
> > > > + if (!crashk_res.end)
> > > > + return;
> > > > +
> > > > + start = crashk_res.start & PAGE_MASK;
> > > > + end = PAGE_ALIGN(crashk_res.end);
> > > > +
> > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > + aligned_end = ALIGN(end, PUD_SIZE);
> > > > +
> > > > + /* Clear PUDs containing crash kernel memory */
> > > > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > + __phys_to_virt(aligned_end), false, NULL);
> > >
> > > What I don't understand is what happens if there's valid kernel data
> > > between aligned_start and crashk_res.start (or the other end of the
> > > range).
> >
> > Data shouldn't go anywhere :)
> >
> > There is
> >
> > + /* map area from PUD start to start of crash kernel with large pages */
> > + size = start - aligned_start;
> > + __create_pgd_mapping(swapper_pg_dir, aligned_start,
> > + __phys_to_virt(aligned_start),
> > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> >
> > and
> >
> > + /* map area from end of crash kernel to PUD end with large pages */
> > + size = aligned_end - end;
> > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> >
> > after the unmap, so after we tear down a part of a linear map we
> > immediately recreate it, just with a different page size.
> >
> > This all happens before SMP, so there is no concurrency at that point.
>
> That brief period of unmap worries me. The kernel text, data and stack
> are all in the vmalloc space but any other (memblock) allocation to this
> point may be in the unmapped range before and after the crashkernel
> reservation. The interrupts are off, so I think the only allocation and
> potential access that may go in this range is the page table itself. But
> it looks fragile to me.

I agree there are chances there will be an allocation from the unmapped
range.

We can make sure this won't happen, though. We can cap the memblock
allocations with memblock_set_current_limit(aligned_end) or
memblock_reserve(algined_start, aligned_end) until the mappings are
restored.

> --
> Catalin

--
Sincerely yours,
Mike.

2022-07-06 02:55:17

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

Thanks.

在 2022/7/6 4:45, Mike Rapoport 写道:
> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>> +void __init remap_crashkernel(void)
>>>>> +{
>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>> + phys_addr_t start, end, size;
>>>>> + phys_addr_t aligned_start, aligned_end;
>>>>> +
>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>> + return;
>>>>> +
>>>>> + if (!crashk_res.end)
>>>>> + return;
>>>>> +
>>>>> + start = crashk_res.start & PAGE_MASK;
>>>>> + end = PAGE_ALIGN(crashk_res.end);
>>>>> +
>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>> + aligned_end = ALIGN(end, PUD_SIZE);
>>>>> +
>>>>> + /* Clear PUDs containing crash kernel memory */
>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>> + __phys_to_virt(aligned_end), false, NULL);
>>>>
>>>> What I don't understand is what happens if there's valid kernel data
>>>> between aligned_start and crashk_res.start (or the other end of the
>>>> range).
>>>
>>> Data shouldn't go anywhere :)
>>>
>>> There is
>>>
>>> + /* map area from PUD start to start of crash kernel with large pages */
>>> + size = start - aligned_start;
>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start,
>>> + __phys_to_virt(aligned_start),
>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>
>>> and
>>>
>>> + /* map area from end of crash kernel to PUD end with large pages */
>>> + size = aligned_end - end;
>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>
>>> after the unmap, so after we tear down a part of a linear map we
>>> immediately recreate it, just with a different page size.
>>>
>>> This all happens before SMP, so there is no concurrency at that point.
>>
>> That brief period of unmap worries me. The kernel text, data and stack
>> are all in the vmalloc space but any other (memblock) allocation to this
>> point may be in the unmapped range before and after the crashkernel
>> reservation. The interrupts are off, so I think the only allocation and
>> potential access that may go in this range is the page table itself. But
>> it looks fragile to me.
>
> I agree there are chances there will be an allocation from the unmapped
> range.
>
> We can make sure this won't happen, though. We can cap the memblock
> allocations with memblock_set_current_limit(aligned_end) or
> memblock_reserve(algined_start, aligned_end) until the mappings are
> restored.
>
>> --
>> Catalin
>
I think there is no need to worry about vmalloc mem.

1.As mentioned above,
When reserving crashkernel and remapping linear mem mapping, there is
only one boot cpu running. There is no other cpu/thread running at the
same time.

2.Although vmalloc may alloc mem from the ummaped area, but we will
rebuid remapping using pte level mapping which keeps virtual address to
the same physical address
(At the same time, no other cpu/thread is access vmalloc mem).

As a result, it has no effect to vmalloc mem.

2022-07-06 07:55:33

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Wed, Jul 06, 2022 at 10:49:43AM +0800, guanghui.fgh wrote:
> 在 2022/7/6 4:45, Mike Rapoport 写道:
> > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > > +void __init remap_crashkernel(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > > + phys_addr_t start, end, size;
> > > > > > + phys_addr_t aligned_start, aligned_end;
> > > > > > +
> > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > > + return;
> > > > > > +
> > > > > > + if (!crashk_res.end)
> > > > > > + return;
> > > > > > +
> > > > > > + start = crashk_res.start & PAGE_MASK;
> > > > > > + end = PAGE_ALIGN(crashk_res.end);
> > > > > > +
> > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > > + aligned_end = ALIGN(end, PUD_SIZE);
> > > > > > +
> > > > > > + /* Clear PUDs containing crash kernel memory */
> > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > > + __phys_to_virt(aligned_end), false, NULL);
> > > > >
> > > > > What I don't understand is what happens if there's valid kernel data
> > > > > between aligned_start and crashk_res.start (or the other end of the
> > > > > range).
> > > >
> > > > Data shouldn't go anywhere :)
> > > >
> > > > There is
> > > >
> > > > + /* map area from PUD start to start of crash kernel with large pages */
> > > > + size = start - aligned_start;
> > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > > + __phys_to_virt(aligned_start),
> > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > >
> > > > and
> > > >
> > > > + /* map area from end of crash kernel to PUD end with large pages */
> > > > + size = aligned_end - end;
> > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > >
> > > > after the unmap, so after we tear down a part of a linear map we
> > > > immediately recreate it, just with a different page size.
> > > >
> > > > This all happens before SMP, so there is no concurrency at that point.
> > >
> > > That brief period of unmap worries me. The kernel text, data and stack
> > > are all in the vmalloc space but any other (memblock) allocation to this
> > > point may be in the unmapped range before and after the crashkernel
> > > reservation. The interrupts are off, so I think the only allocation and
> > > potential access that may go in this range is the page table itself. But
> > > it looks fragile to me.
> >
> > I agree there are chances there will be an allocation from the unmapped
> > range.
> >
> > We can make sure this won't happen, though. We can cap the memblock
> > allocations with memblock_set_current_limit(aligned_end) or
> > memblock_reserve(algined_start, aligned_end) until the mappings are
> > restored.
>
> I think there is no need to worry about vmalloc mem.

That's not what I'm worried about. It's about memblock allocations that
are accessed through the linear map.

--
Catalin

2022-07-06 10:16:57

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > +void __init remap_crashkernel(void)
> > > > > +{
> > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > + phys_addr_t start, end, size;
> > > > > + phys_addr_t aligned_start, aligned_end;
> > > > > +
> > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > + return;
> > > > > +
> > > > > + if (!crashk_res.end)
> > > > > + return;
> > > > > +
> > > > > + start = crashk_res.start & PAGE_MASK;
> > > > > + end = PAGE_ALIGN(crashk_res.end);
> > > > > +
> > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > + aligned_end = ALIGN(end, PUD_SIZE);
> > > > > +
> > > > > + /* Clear PUDs containing crash kernel memory */
> > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > + __phys_to_virt(aligned_end), false, NULL);
> > > >
> > > > What I don't understand is what happens if there's valid kernel data
> > > > between aligned_start and crashk_res.start (or the other end of the
> > > > range).
> > >
> > > Data shouldn't go anywhere :)
> > >
> > > There is
> > >
> > > + /* map area from PUD start to start of crash kernel with large pages */
> > > + size = start - aligned_start;
> > > + __create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > + __phys_to_virt(aligned_start),
> > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > >
> > > and
> > >
> > > + /* map area from end of crash kernel to PUD end with large pages */
> > > + size = aligned_end - end;
> > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > >
> > > after the unmap, so after we tear down a part of a linear map we
> > > immediately recreate it, just with a different page size.
> > >
> > > This all happens before SMP, so there is no concurrency at that point.
> >
> > That brief period of unmap worries me. The kernel text, data and stack
> > are all in the vmalloc space but any other (memblock) allocation to this
> > point may be in the unmapped range before and after the crashkernel
> > reservation. The interrupts are off, so I think the only allocation and
> > potential access that may go in this range is the page table itself. But
> > it looks fragile to me.
>
> I agree there are chances there will be an allocation from the unmapped
> range.
>
> We can make sure this won't happen, though. We can cap the memblock
> allocations with memblock_set_current_limit(aligned_end) or
> memblock_reserve(algined_start, aligned_end) until the mappings are
> restored.

We can reserve the region just before unmapping to avoid new allocations
for the page tables but we can't do much about pages already allocated
prior to calling remap_crashkernel().

--
Catalin

2022-07-06 14:00:34

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
> > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
> > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
> > > > > > +void __init remap_crashkernel(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_KEXEC_CORE
> > > > > > + phys_addr_t start, end, size;
> > > > > > + phys_addr_t aligned_start, aligned_end;
> > > > > > +
> > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> > > > > > + return;
> > > > > > +
> > > > > > + if (!crashk_res.end)
> > > > > > + return;
> > > > > > +
> > > > > > + start = crashk_res.start & PAGE_MASK;
> > > > > > + end = PAGE_ALIGN(crashk_res.end);
> > > > > > +
> > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
> > > > > > + aligned_end = ALIGN(end, PUD_SIZE);
> > > > > > +
> > > > > > + /* Clear PUDs containing crash kernel memory */
> > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start),
> > > > > > + __phys_to_virt(aligned_end), false, NULL);
> > > > >
> > > > > What I don't understand is what happens if there's valid kernel data
> > > > > between aligned_start and crashk_res.start (or the other end of the
> > > > > range).
> > > >
> > > > Data shouldn't go anywhere :)
> > > >
> > > > There is
> > > >
> > > > + /* map area from PUD start to start of crash kernel with large pages */
> > > > + size = start - aligned_start;
> > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start,
> > > > + __phys_to_virt(aligned_start),
> > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > >
> > > > and
> > > >
> > > > + /* map area from end of crash kernel to PUD end with large pages */
> > > > + size = aligned_end - end;
> > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
> > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0);
> > > >
> > > > after the unmap, so after we tear down a part of a linear map we
> > > > immediately recreate it, just with a different page size.
> > > >
> > > > This all happens before SMP, so there is no concurrency at that point.
> > >
> > > That brief period of unmap worries me. The kernel text, data and stack
> > > are all in the vmalloc space but any other (memblock) allocation to this
> > > point may be in the unmapped range before and after the crashkernel
> > > reservation. The interrupts are off, so I think the only allocation and
> > > potential access that may go in this range is the page table itself. But
> > > it looks fragile to me.
> >
> > I agree there are chances there will be an allocation from the unmapped
> > range.
> >
> > We can make sure this won't happen, though. We can cap the memblock
> > allocations with memblock_set_current_limit(aligned_end) or
> > memblock_reserve(algined_start, aligned_end) until the mappings are
> > restored.
>
> We can reserve the region just before unmapping to avoid new allocations
> for the page tables but we can't do much about pages already allocated
> prior to calling remap_crashkernel().

Right, this was bothering me too after I re-read you previous email.

One thing I can think of is to only remap the crash kernel memory if it is
a part of an allocation that exactly fits into one ore more PUDs.

Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
succeeds, we remap the entire area that now contains only memory allocated
in reserve_crashkernel() and free the extra memory after remapping is done.
If the large allocation fails, we fall back to the original size and
alignment and don't allow unmapping crash kernel memory in
arch_kexec_protect_crashkres().

> --
> Catalin

--
Sincerely yours,
Mike.

2022-07-06 15:24:51

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

Thanks.

在 2022/7/6 21:54, Mike Rapoport 写道:
> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>>>> +void __init remap_crashkernel(void)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>>>> + phys_addr_t start, end, size;
>>>>>>> + phys_addr_t aligned_start, aligned_end;
>>>>>>> +
>>>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>>>> + return;
>>>>>>> +
>>>>>>> + if (!crashk_res.end)
>>>>>>> + return;
>>>>>>> +
>>>>>>> + start = crashk_res.start & PAGE_MASK;
>>>>>>> + end = PAGE_ALIGN(crashk_res.end);
>>>>>>> +
>>>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>>>> + aligned_end = ALIGN(end, PUD_SIZE);
>>>>>>> +
>>>>>>> + /* Clear PUDs containing crash kernel memory */
>>>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>>>> + __phys_to_virt(aligned_end), false, NULL);
>>>>>>
>>>>>> What I don't understand is what happens if there's valid kernel data
>>>>>> between aligned_start and crashk_res.start (or the other end of the
>>>>>> range).
>>>>>
>>>>> Data shouldn't go anywhere :)
>>>>>
>>>>> There is
>>>>>
>>>>> + /* map area from PUD start to start of crash kernel with large pages */
>>>>> + size = start - aligned_start;
>>>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start,
>>>>> + __phys_to_virt(aligned_start),
>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>
>>>>> and
>>>>>
>>>>> + /* map area from end of crash kernel to PUD end with large pages */
>>>>> + size = aligned_end - end;
>>>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>
>>>>> after the unmap, so after we tear down a part of a linear map we
>>>>> immediately recreate it, just with a different page size.
>>>>>
>>>>> This all happens before SMP, so there is no concurrency at that point.
>>>>
>>>> That brief period of unmap worries me. The kernel text, data and stack
>>>> are all in the vmalloc space but any other (memblock) allocation to this
>>>> point may be in the unmapped range before and after the crashkernel
>>>> reservation. The interrupts are off, so I think the only allocation and
>>>> potential access that may go in this range is the page table itself. But
>>>> it looks fragile to me.
>>>
>>> I agree there are chances there will be an allocation from the unmapped
>>> range.
>>>
>>> We can make sure this won't happen, though. We can cap the memblock
>>> allocations with memblock_set_current_limit(aligned_end) or
>>> memblock_reserve(algined_start, aligned_end) until the mappings are
>>> restored.
>>
>> We can reserve the region just before unmapping to avoid new allocations
>> for the page tables but we can't do much about pages already allocated
>> prior to calling remap_crashkernel().
>
> Right, this was bothering me too after I re-read you previous email.
>
> One thing I can think of is to only remap the crash kernel memory if it is
> a part of an allocation that exactly fits into one ore more PUDs.
>
> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
> succeeds, we remap the entire area that now contains only memory allocated
> in reserve_crashkernel() and free the extra memory after remapping is done.
> If the large allocation fails, we fall back to the original size and
> alignment and don't allow unmapping crash kernel memory in
> arch_kexec_protect_crashkres().
>
>> --
>> Catalin
>
Thanks.

There is a new method.
I think we should use the patch v3(similar but need add some changes)

1.We can walk crashkernle block/section pagetable,
[[[(keep the origin block/section mapping valid]]]
rebuild the pte level page mapping for the crashkernel mem
rebuild left & right margin mem(which is in same block/section mapping
but out of crashkernel mem) with block/section mapping

2.'replace' the origin block/section mapping by new builded mapping
iterately

With this method, all the mem mapping keep valid all the time.

3.the patch v3 link:
https://lore.kernel.org/linux-mm/[email protected]/T/
(Need some changes)

2022-07-06 15:50:52

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

在 2022/7/6 23:18, guanghui.fgh 写道:
> Thanks.
>
> 在 2022/7/6 21:54, Mike Rapoport 写道:
>> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
>>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
>>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
>>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
>>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
>>>>>>>> +void __init remap_crashkernel(void)
>>>>>>>> +{
>>>>>>>> +#ifdef CONFIG_KEXEC_CORE
>>>>>>>> +    phys_addr_t start, end, size;
>>>>>>>> +    phys_addr_t aligned_start, aligned_end;
>>>>>>>> +
>>>>>>>> +    if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    if (!crashk_res.end)
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    start = crashk_res.start & PAGE_MASK;
>>>>>>>> +    end = PAGE_ALIGN(crashk_res.end);
>>>>>>>> +
>>>>>>>> +    aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
>>>>>>>> +    aligned_end = ALIGN(end, PUD_SIZE);
>>>>>>>> +
>>>>>>>> +    /* Clear PUDs containing crash kernel memory */
>>>>>>>> +    unmap_hotplug_range(__phys_to_virt(aligned_start),
>>>>>>>> +                __phys_to_virt(aligned_end), false, NULL);
>>>>>>>
>>>>>>> What I don't understand is what happens if there's valid kernel data
>>>>>>> between aligned_start and crashk_res.start (or the other end of the
>>>>>>> range).
>>>>>>
>>>>>> Data shouldn't go anywhere :)
>>>>>>
>>>>>> There is
>>>>>>
>>>>>> +    /* map area from PUD start to start of crash kernel with
>>>>>> large pages */
>>>>>> +    size = start - aligned_start;
>>>>>> +    __create_pgd_mapping(swapper_pg_dir, aligned_start,
>>>>>> +                 __phys_to_virt(aligned_start),
>>>>>> +                 size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>>
>>>>>> and
>>>>>>
>>>>>> +    /* map area from end of crash kernel to PUD end with large
>>>>>> pages */
>>>>>> +    size = aligned_end - end;
>>>>>> +    __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
>>>>>> +                 size, PAGE_KERNEL, early_pgtable_alloc, 0);
>>>>>>
>>>>>> after the unmap, so after we tear down a part of a linear map we
>>>>>> immediately recreate it, just with a different page size.
>>>>>>
>>>>>> This all happens before SMP, so there is no concurrency at that
>>>>>> point.
>>>>>
>>>>> That brief period of unmap worries me. The kernel text, data and stack
>>>>> are all in the vmalloc space but any other (memblock) allocation to
>>>>> this
>>>>> point may be in the unmapped range before and after the crashkernel
>>>>> reservation. The interrupts are off, so I think the only allocation
>>>>> and
>>>>> potential access that may go in this range is the page table
>>>>> itself. But
>>>>> it looks fragile to me.
>>>>
>>>> I agree there are chances there will be an allocation from the unmapped
>>>> range.
>>>>
>>>> We can make sure this won't happen, though. We can cap the memblock
>>>> allocations with memblock_set_current_limit(aligned_end) or
>>>> memblock_reserve(algined_start, aligned_end) until the mappings are
>>>> restored.
>>>
>>> We can reserve the region just before unmapping to avoid new allocations
>>> for the page tables but we can't do much about pages already allocated
>>> prior to calling remap_crashkernel().
>>
>> Right, this was bothering me too after I re-read you previous email.
>>
>> One thing I can think of is to only remap the crash kernel memory if
>> it is
>> a part of an allocation that exactly fits into one ore more PUDs.
>>
>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>> succeeds, we remap the entire area that now contains only memory
>> allocated
>> in reserve_crashkernel() and free the extra memory after remapping is
>> done.
>> If the large allocation fails, we fall back to the original size and
>> alignment and don't allow unmapping crash kernel memory in
>> arch_kexec_protect_crashkres().
>>> --
>>> Catalin
>>
> Thanks.
>
> There is a new method.
> I think we should use the patch v3(similar but need add some changes)
>
> 1.We can walk crashkernle block/section pagetable,
> [[[(keep the origin block/section mapping valid]]]
> rebuild the pte level page mapping for the crashkernel mem
> rebuild left & right margin mem(which is in same block/section mapping
> but out of crashkernel mem) with block/section mapping
>
> 2.'replace' the origin block/section mapping by new builded mapping
> iterately
>
> With this method, all the mem mapping keep valid all the time.
>
> 3.the patch v3 link:
> https://lore.kernel.org/linux-mm/[email protected]/T/
>
> (Need some changes)

Namely,
When rebuilding for crashkernel mem pagemapping, there is no change to
the origin mapping.

When the new mapping is ready, we replace old mapping with the new
builded mapping.

With this method, keep all mem mapping valid all the time.

2022-07-06 16:00:27

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
> 在 2022/7/6 21:54, Mike Rapoport 写道:
> > One thing I can think of is to only remap the crash kernel memory if it is
> > a part of an allocation that exactly fits into one ore more PUDs.
> >
> > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
> > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
> > succeeds, we remap the entire area that now contains only memory allocated
> > in reserve_crashkernel() and free the extra memory after remapping is done.
> > If the large allocation fails, we fall back to the original size and
> > alignment and don't allow unmapping crash kernel memory in
> > arch_kexec_protect_crashkres().
>
> There is a new method.
> I think we should use the patch v3(similar but need add some changes)
>
> 1.We can walk crashkernle block/section pagetable,
> [[[(keep the origin block/section mapping valid]]]
> rebuild the pte level page mapping for the crashkernel mem
> rebuild left & right margin mem(which is in same block/section mapping but
> out of crashkernel mem) with block/section mapping
>
> 2.'replace' the origin block/section mapping by new builded mapping
> iterately
>
> With this method, all the mem mapping keep valid all the time.

As I already commented on one of your previous patches, this is not
allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
think), the worst that can happen is a TLB conflict abort and the
handler should invalidate the TLBs and restart the faulting instruction,
assuming the handler won't try to access the same conflicting virtual
address. Prior to FEAT_BBM, that's not possible as the architecture does
not describe a precise behaviour of conflicting TLB entries (you might
as well get the TLB output of multiple entries being or'ed together).

--
Catalin

2022-07-07 17:24:04

[permalink] [raw]

Subject: Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation

2022-07-08 12:32:09

[permalink] [raw]

Subject: Re: [PATCH RESEND v4] arm64: mm: fix linear mem mapping access performance degradation

Thanks.

在 2022/7/6 23:40, Catalin Marinas 写道:
> On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
>> 在 2022/7/6 21:54, Mike Rapoport 写道:
>>> One thing I can think of is to only remap the crash kernel memory if it is
>>> a part of an allocation that exactly fits into one ore more PUDs.
>>>
>>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>>> succeeds, we remap the entire area that now contains only memory allocated
>>> in reserve_crashkernel() and free the extra memory after remapping is done.
>>> If the large allocation fails, we fall back to the original size and
>>> alignment and don't allow unmapping crash kernel memory in
>>> arch_kexec_protect_crashkres().
>>
>> There is a new method.
>> I think we should use the patch v3(similar but need add some changes)
>>
>> 1.We can walk crashkernle block/section pagetable,
>> [[[(keep the origin block/section mapping valid]]]
>> rebuild the pte level page mapping for the crashkernel mem
>> rebuild left & right margin mem(which is in same block/section mapping but
>> out of crashkernel mem) with block/section mapping
>>
>> 2.'replace' the origin block/section mapping by new builded mapping
>> iterately
>>
>> With this method, all the mem mapping keep valid all the time.
>
> As I already commented on one of your previous patches, this is not
> allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
> think), the worst that can happen is a TLB conflict abort and the
> handler should invalidate the TLBs and restart the faulting instruction,
> assuming the handler won't try to access the same conflicting virtual
> address. Prior to FEAT_BBM, that's not possible as the architecture does
> not describe a precise behaviour of conflicting TLB entries (you might
> as well get the TLB output of multiple entries being or'ed together).
>

The cpu can generate a TLB conflict abort if it detects that the address
being looked up in the TLB hits multiple entries.

(1).I think when gathering small page to block/section mapping, there
maybe tlb conflict if no complying with BBM.

Namely:
a.Map a 4KB page (address X)
Touch that page, in order to get the translation cached in the TLB

b.Modify the translation tables
replacing the mapping for address X with a 2MB mapping - DO NOT
INVALIDATE the TLB

c.Touch "X + 4KB"
This will/should miss in the TLB, causing a new walk returning the
2MB mapping

d.Touch X
Assuming they've not been evicted, you'll hit both on the 4KB and 2MB
mapping - as both cover address X.

There is tlb conflict.
(link:
https://community.arm.com/support-forums/f/dev-platforms-forum/13583/tlb-conflict-abort)

(2).But when spliting large block/section mapping to small granularity,
there maybe no tlb conflict.

Namely:
a.rebuild the pte level mapping without any change to orgin pagetable
(the relation between virtual address and physicall address keep same)

b.modify 1G mappting to use the new pte level mapping in the [[[mem]]]
without tlb flush

c.When the cpu access the 1G mem(anywhere),
If 1G tlb entry already cached in tlb, all the 1G mem will access
success(without any tlb loaded, no confilict)

If 1G tlb entry has been evicted, then the tlb will access pagetable
in mem(despite the cpu "catch" the old(1G) or new(4k) mapped pagetale in
the mem, all the 1G mem can access sucess)(load new tlb entry, no conflict)

d.Afterward, we flush the tlb and force cpu use the new pagetable.(no
conflict)

It seems that there are no two tlb entries for a same virtual address in
the tlb cache When spliting large block/section mapping.

(3).At the same time, I think we can use another way.
As the system linear maping is builded with init_pg_dir, we can also
resue the init_pg_dir to split the block/setion mapping sometime.
As init_pg_dir contain all kernel text/data access and we can comply
with the BBM requirement.

a.rebuild new pte level mapping without any change to the old
mapping(the cpu can't walk access the new page mapping, it's isolated)

b.change to use init_pg_dir

c.clear the old 1G block mapping and flush tlb

d.modify the linear mapping to use new pte level page mapping with
init_pg_dir(TLB BBM)

e.switch to swapper_pg_dir

Could you give me some advice?

Thanks.

2022-07-10 14:23:23

[permalink] [raw]

Subject: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation

The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
(enable crashkernel, disable rodata full, disable kfence), the mem_map will
use non block/section mapping(for crashkernel requires to shrink the region
in page granularity). But it will degrade performance when doing larging
continuous mem access in kernel(memcpy/memmove, etc).

There are many changes and discussions:
commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
platforms with no DMA memory zones")
commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into
mem_init()")
commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel
reservation is required")

This patch changes mem_map to use block/section mapping with crashkernel.
Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at
mem_map, reserve crashkernel memory. And then walking pagetable to split
block/section mapping to non block/section mapping(normally 4K) [[[only]]]
for crashkernel mem. So the linear mem mapping use block/section mapping
as more as possible. We will reduce the cpu dTLB miss conspicuously, and
accelerate mem access about 10-20% performance improvement.

I have tested it with pft(Page Fault Test) and fio, obtained great
performace improvement.

For fio test:
1.prepare ramdisk
modprobe -r brd
modprobe brd rd_nr=1 rd_size=67108864
dmsetup remove_all
wipefs -a --force /dev/ram0
mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0
mkdir -p /fs/ram0
mount -t ext4 /dev/ram0 /fs/ram0

2.prepare fio paremeter in x.fio file:
[global]
bs=4k
ioengine=psync
iodepth=128
size=32G
direct=1
invalidate=1
group_reporting
thread=1
rw=read
directory=/fs/ram0
numjobs=1

[task_0]
cpus_allowed=16
stonewall=1

3.run testcase:
perf stat -e dTLB-load-misses fio x.fio

4.contrast
------------------------
without patch with patch
fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s
dTLB-load-misses 1,818,320,693 438,729,774
time elapsed(s) 70.500326434 62.877316408
user(s) 15.926332000 15.684721000
sys(s) 54.211939000 47.046165000

5.conclusion
Using this patch will reduce dTLB misses and improve performace greatly.

Signed-off-by: Guanghui Feng <[email protected]>
---
arch/arm64/include/asm/mmu.h | 1 +
arch/arm64/include/asm/mmu_context.h | 28 ++++
arch/arm64/mm/init.c | 8 +-
arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++----------
arch/arm64/mm/proc.S | 19 +++
5 files changed, 267 insertions(+), 94 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466..1a46b81 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
extern void arm64_memblock_init(void);
extern void paging_init(void);
extern void bootmem_init(void);
+extern void map_crashkernel(void);
extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
extern void init_mem_pgprot(void);
extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 6770667..10dffd4 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
isb();
}

+static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1)
+{
+ typedef void (ttbr_replace_func)(phys_addr_t);
+ extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb;
+ ttbr_replace_func *replace_phys;
+
+ /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
+ ttbr1 = phys_to_ttbr(ttbr1);
+
+ if (system_supports_cnp()) {
+ /*
+ * cpu_replace_ttbr1() is used when there's a boot CPU
+ * up (i.e. cpufeature framework is not up yet) and
+ * latter only when we enable CNP via cpufeature's
+ * enable() callback.
+ * Also we rely on the cpu_hwcap bit being set before
+ * calling the enable() function.
+ */
+ ttbr1 |= TTBR_CNP_BIT;
+ }
+
+ replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb));
+
+ cpu_install_idmap();
+ replace_phys(ttbr1);
+ cpu_uninstall_idmap();
+}
+
/*
* Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
* avoiding the possibility of conflicting TLB entries being allocated.
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 339ee84..241d27e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
insert_resource(&iomem_resource, &crashk_res);
+ map_crashkernel();
}

/*
@@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
}

early_init_fdt_scan_reserved_mem();
-
- if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
- reserve_crashkernel();
-
high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
}

@@ -438,8 +435,7 @@ void __init bootmem_init(void)
* request_standard_resources() depends on crashkernel's memory being
* reserved, so do it here.
*/
- if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
- reserve_crashkernel();
+ reserve_crashkernel();

memblock_dump_all();
}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 626ec32..3e30c82 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -42,6 +42,7 @@
#define NO_BLOCK_MAPPINGS BIT(0)
#define NO_CONT_MAPPINGS BIT(1)
#define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
+#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/

u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
@@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
return ((old ^ new) & ~mask) == 0;
}

+static void pte_clear_cont(pte_t *ptep)
+{
+ int i = 0;
+ pte_t pte = READ_ONCE(*ptep);
+ if (pte_none(pte) || !pte_cont(pte))
+ return;
+ ptep -= ((u64)ptep / sizeof(pte_t)) &
+ ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1);
+ do {
+ pte = pte_mknoncont(READ_ONCE(*ptep));
+ set_pte(ptep, pte);
+ } while (++ptep, ++i < CONT_PTES);
+}
+
static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
- phys_addr_t phys, pgprot_t prot)
+ phys_addr_t phys, pgprot_t prot, int flags)
{
pte_t *ptep;

@@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
do {
pte_t old_pte = READ_ONCE(*ptep);

+ if (flags & NO_SEC_REMAPPINGS)
+ pte_clear_cont(ptep);
+
set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
@@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
pte_clear_fixmap();
}

+static inline void allock_pte_page(pmd_t *pmdp,
+ phys_addr_t (*pgtable_alloc)(int),
+ int flags)
+{
+ pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
+ phys_addr_t pte_phys;
+ if (!pmd_none(READ_ONCE(*pmdp)))
+ return;
+
+ if (flags & NO_EXEC_MAPPINGS)
+ pmdval |= PMD_TABLE_PXN;
+ BUG_ON(!pgtable_alloc);
+ pte_phys = pgtable_alloc(PAGE_SHIFT);
+ __pmd_populate(pmdp, pte_phys, pmdval);
+}
+
static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
@@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
pmd_t pmd = READ_ONCE(*pmdp);

BUG_ON(pmd_sect(pmd));
- if (pmd_none(pmd)) {
- pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
- phys_addr_t pte_phys;
-
- if (flags & NO_EXEC_MAPPINGS)
- pmdval |= PMD_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- pte_phys = pgtable_alloc(PAGE_SHIFT);
- __pmd_populate(pmdp, pte_phys, pmdval);
- pmd = READ_ONCE(*pmdp);
- }
+ allock_pte_page(pmdp, pgtable_alloc, flags);
+ pmd = READ_ONCE(*pmdp);
BUG_ON(pmd_bad(pmd));

do {
@@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
next = pte_cont_addr_end(addr, end);

/* use a contiguous mapping if the range is suitably aligned */
- if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
- (flags & NO_CONT_MAPPINGS) == 0)
+ if (!(flags & NO_SEC_REMAPPINGS) &&
+ (((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
+ (flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pte(pmdp, addr, next, phys, __prot);
+ init_pte(pmdp, addr, next, phys, __prot, flags);

phys += next - addr;
} while (addr = next, addr != end);
}

+static void pmd_clear_cont(pmd_t *pmdp)
+{
+ int i = 0;
+ pmd_t pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd))
+ return;
+ pmdp -= ((u64)pmdp / sizeof(pmd_t)) &
+ ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1);
+ do {
+ pmd = READ_ONCE(*pmdp);
+ pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd)));
+ set_pmd(pmdp, pmd);
+ } while (++pmdp, ++i < CONT_PMDS);
+}
+
+static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end,
+ phys_addr_t phys, pgprot_t prot,
+ phys_addr_t (*pgtable_alloc)(int), int flags)
+{
+ unsigned long next;
+ pmd_t *pmdp;
+ phys_addr_t map_offset;
+
+ pmdp = pmd_set_fixmap_offset(pudp, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+
+ if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
+ pmd_clear_cont(pmdp);
+ pmd_clear(pmdp);
+ allock_pte_page(pmdp, pgtable_alloc, flags);
+
+ map_offset = addr - (addr & PMD_MASK);
+ if (map_offset)
+ alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
+ phys - map_offset, prot,
+ pgtable_alloc,
+ flags & (~NO_SEC_REMAPPINGS));
+
+ if (next < (addr & PMD_MASK) + PMD_SIZE)
+ alloc_init_cont_pte(pmdp, next,
+ (addr & PUD_MASK) + PUD_SIZE,
+ next - addr + phys,
+ prot, pgtable_alloc,
+ flags & (~NO_SEC_REMAPPINGS));
+ }
+ alloc_init_cont_pte(pmdp, addr, next, phys, prot,
+ pgtable_alloc, flags);
+ phys += next - addr;
+ } while (pmdp++, addr = next, addr != end);
+
+ pmd_clear_fixmap();
+}
+
static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(int), int flags)
@@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
pmd_clear_fixmap();
}

+static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags)
+{
+
+ pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
+ phys_addr_t pmd_phys;
+
+ if (!pud_none(READ_ONCE(*pudp)))
+ return;
+
+ if (flags & NO_EXEC_MAPPINGS)
+ pudval |= PUD_TABLE_PXN;
+ BUG_ON(!pgtable_alloc);
+ pmd_phys = pgtable_alloc(PMD_SHIFT);
+ __pud_populate(pudp, pmd_phys, pudval);
+}
+
static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
unsigned long end, phys_addr_t phys,
pgprot_t prot,
@@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
* Check for initial section mappings in the pgd/pud.
*/
BUG_ON(pud_sect(pud));
- if (pud_none(pud)) {
- pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
- phys_addr_t pmd_phys;
-
- if (flags & NO_EXEC_MAPPINGS)
- pudval |= PUD_TABLE_PXN;
- BUG_ON(!pgtable_alloc);
- pmd_phys = pgtable_alloc(PMD_SHIFT);
- __pud_populate(pudp, pmd_phys, pudval);
- pud = READ_ONCE(*pudp);
- }
+ alloc_pmd_page(pudp, pgtable_alloc, flags);
+ pud = READ_ONCE(*pudp);
BUG_ON(pud_bad(pud));

do {
@@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
next = pmd_cont_addr_end(addr, end);

/* use a contiguous mapping if the range is suitably aligned */
- if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
+ if (!(flags & NO_SEC_REMAPPINGS) &&
+ (((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
(flags & NO_CONT_MAPPINGS) == 0)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);

- init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
+ if (flags & NO_SEC_REMAPPINGS)
+ init_pmd_remap(pudp, addr, next, phys, __prot,
+ pgtable_alloc, flags);
+ else
+ init_pmd(pudp, addr, next, phys, __prot,
+ pgtable_alloc, flags);

phys += next - addr;
} while (addr = next, addr != end);
}

+static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next,
+ phys_addr_t phys, pgprot_t prot,
+ phys_addr_t (*pgtable_alloc)(int),
+ int flags)
+{
+ phys_addr_t map_offset;
+
+ if (!pud_none(*pudp) && pud_sect(*pudp)) {
+ pud_clear(pudp);
+ alloc_pmd_page(pudp, pgtable_alloc, flags);
+
+ map_offset = addr - (addr & PUD_MASK);
+ if (map_offset)
+ alloc_init_cont_pmd(pudp, addr & PUD_MASK,
+ addr, phys - map_offset,
+ prot, pgtable_alloc,
+ flags & (~NO_SEC_REMAPPINGS));
+
+ if (next < (addr & PUD_MASK) + PUD_SIZE)
+ alloc_init_cont_pmd(pudp, next,
+ (addr & PUD_MASK) + PUD_SIZE,
+ next - addr + phys,
+ prot, pgtable_alloc,
+ flags & (~NO_SEC_REMAPPINGS));
+ }
+ alloc_init_cont_pmd(pudp, addr, next, phys, prot,
+ pgtable_alloc, flags);
+}
+
+static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next,
+ phys_addr_t phys, pgprot_t prot,
+ phys_addr_t (*pgtable_alloc)(int),
+ int flags)
+{
+ pud_t old_pud = READ_ONCE(*pudp);
+ /*
+ * For 4K granule only, attempt to put down a 1GB block
+ */
+ if (pud_sect_supported() &&
+ ((addr | next | phys) & ~PUD_MASK) == 0 &&
+ (flags & NO_BLOCK_MAPPINGS) == 0) {
+ pud_set_huge(pudp, phys, prot);
+
+ /*
+ * After the PUD entry has been populated once, we
+ * only allow updates to the permission attributes.
+ */
+ BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
+ READ_ONCE(pud_val(*pudp))));
+ } else {
+ alloc_init_cont_pmd(pudp, addr, next, phys, prot,
+ pgtable_alloc, flags);
+
+ BUG_ON(pud_val(old_pud) != 0 &&
+ pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
+ }
+}
+
static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(int),
@@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
*/
if (system_state != SYSTEM_BOOTING)
mutex_lock(&fixmap_lock);
+
pudp = pud_set_fixmap_offset(p4dp, addr);
do {
- pud_t old_pud = READ_ONCE(*pudp);
-
next = pud_addr_end(addr, end);

- /*
- * For 4K granule only, attempt to put down a 1GB block
- */
- if (pud_sect_supported() &&
- ((addr | next | phys) & ~PUD_MASK) == 0 &&
- (flags & NO_BLOCK_MAPPINGS) == 0) {
- pud_set_huge(pudp, phys, prot);
-
- /*
- * After the PUD entry has been populated once, we
- * only allow updates to the permission attributes.
- */
- BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
- READ_ONCE(pud_val(*pudp))));
- } else {
- alloc_init_cont_pmd(pudp, addr, next, phys, prot,
- pgtable_alloc, flags);
-
- BUG_ON(pud_val(old_pud) != 0 &&
- pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
- }
+ if (flags & NO_SEC_REMAPPINGS)
+ init_pud_remap(pudp, addr, next, phys, prot,
+ pgtable_alloc, flags);
+ else
+ init_pud(pudp, addr, next, phys, prot, pgtable_alloc,
+ flags);
phys += next - addr;
} while (pudp++, addr = next, addr != end);

pud_clear_fixmap();
+
if (system_state != SYSTEM_BOOTING)
mutex_unlock(&fixmap_lock);
}
@@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void)
PAGE_KERNEL_RO);
}

-static bool crash_mem_map __initdata;
-
-static int __init enable_crash_mem_map(char *arg)
+#ifdef CONFIG_KEXEC_CORE
+void __init map_crashkernel(void)
{
- /*
- * Proper parameter parsing is done by reserve_crashkernel(). We only
- * need to know if the linear map has to avoid block mappings so that
- * the crashkernel reservations can be unmapped later.
- */
- crash_mem_map = true;
+ if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
+ return;

- return 0;
+ if (!crashk_res.end)
+ return;
+
+ cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir));
+ init_mm.pgd = init_pg_dir;
+
+ __create_pgd_mapping(swapper_pg_dir, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL,
+ early_pgtable_alloc,
+ NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS);
+
+ cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+ init_mm.pgd = swapper_pg_dir;
+
+ memblock_phys_free(__pa_symbol(init_pg_dir),
+ __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
}
-early_param("crashkernel", enable_crash_mem_map);
+#else
+void __init map_crashkernel(void)
+{
+ memblock_phys_free(__pa_symbol(init_pg_dir),
+ __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
+}
+#endif

static void __init map_mem(pgd_t *pgdp)
{
@@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp)
*/
memblock_mark_nomap(kernel_start, kernel_end - kernel_start);

-#ifdef CONFIG_KEXEC_CORE
- if (crash_mem_map) {
- if (IS_ENABLED(CONFIG_ZONE_DMA) ||
- IS_ENABLED(CONFIG_ZONE_DMA32))
- flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
- else if (crashk_res.end)
- memblock_mark_nomap(crashk_res.start,
- resource_size(&crashk_res));
- }
-#endif
-
/* map all the memory banks */
for_each_mem_range(i, &start, &end) {
if (start >= end)
@@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp)
* in page granularity and put back unused memory to buddy system
* through /sys/kernel/kexec_crash_size interface.
*/
-#ifdef CONFIG_KEXEC_CORE
- if (crash_mem_map &&
- !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
- if (crashk_res.end) {
- __map_memblock(pgdp, crashk_res.start,
- crashk_res.end + 1,
- PAGE_KERNEL,
- NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
- memblock_clear_nomap(crashk_res.start,
- resource_size(&crashk_res));
- }
- }
-#endif
}

void mark_rodata_ro(void)
@@ -774,8 +903,8 @@ void __init paging_init(void)
cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
init_mm.pgd = swapper_pg_dir;

- memblock_phys_free(__pa_symbol(init_pg_dir),
- __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
+ //memblock_phys_free(__pa_symbol(init_pg_dir),
+ // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));

memblock_allow_resize();
}
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 50bbed9..161bae6 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume)
isb
.endm

+SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb)
+ save_and_disable_daif flags=x2
+
+ __idmap_cpu_set_reserved_ttbr1 x1, x3
+
+ offset_ttbr1 x0, x3
+ msr ttbr1_el1, x0
+ isb
+
+ restore_daif x2
+
+ dsb nshst
+ tlbi vmalle1
+ dsb nsh
+ isb
+
+ ret
+SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb)
+
/*
* void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1)
*
--
1.8.3.1

2022-07-10 15:04:32

[permalink] [raw]

Subject: Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation

Hi, Mike Rapoport, Catalin Marinas, I have rewrited the patch to handle
tlb conflict and Circular dependency.
Namely,

(1)When doing work for rebuiling crashkernel mem, the pgd is
swapper_pg_dir in use.

(2)Firstly, I change the swapper_pg_dir pgd to idmap_pg_dir.
As idmap_pg_dir use ttbr0_el1 and without any tlb confict in ttbr1_el1.
Afterward I flush all the tlb(all the swapper_pg_dir tlb cache flushed).

Later, I change idmap_pg_dir pgd to use init_pg_dir and flush tlb again.
Therefor, there is no tlb conflict when rebuiding with
init_pg_dir(without any tlb cache for swapper_pg_dir).

(3)When switch to init_pg_dir, I can saftely modify swapper_pg_dir
pagetable withou any [[[Circular dependency]]] and tlb conflict.
Namely, When unmapping or modifying any swapper_pg_dir pagetable, I will
use fix map to access the mem.

I have test it and work well. Could you give me some advice?

Thanks.

在 2022/7/10 21:44, Guanghui Feng 写道:
> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> use non block/section mapping(for crashkernel requires to shrink the region
> in page granularity). But it will degrade performance when doing larging
> continuous mem access in kernel(memcpy/memmove, etc).
>
> There are many changes and discussions:
> commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
> platforms with no DMA memory zones")
> commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into
> mem_init()")
> commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel
> reservation is required")
>
> This patch changes mem_map to use block/section mapping with crashkernel.
> Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at
> mem_map, reserve crashkernel memory. And then walking pagetable to split
> block/section mapping to non block/section mapping(normally 4K) [[[only]]]
> for crashkernel mem. So the linear mem mapping use block/section mapping
> as more as possible. We will reduce the cpu dTLB miss conspicuously, and
> accelerate mem access about 10-20% performance improvement.
>
> I have tested it with pft(Page Fault Test) and fio, obtained great
> performace improvement.
>
> For fio test:
> 1.prepare ramdisk
> modprobe -r brd
> modprobe brd rd_nr=1 rd_size=67108864
> dmsetup remove_all
> wipefs -a --force /dev/ram0
> mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0
> mkdir -p /fs/ram0
> mount -t ext4 /dev/ram0 /fs/ram0
>
> 2.prepare fio paremeter in x.fio file:
> [global]
> bs=4k
> ioengine=psync
> iodepth=128
> size=32G
> direct=1
> invalidate=1
> group_reporting
> thread=1
> rw=read
> directory=/fs/ram0
> numjobs=1
>
> [task_0]
> cpus_allowed=16
> stonewall=1
>
> 3.run testcase:
> perf stat -e dTLB-load-misses fio x.fio
>
> 4.contrast
> ------------------------
> without patch with patch
> fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s
> dTLB-load-misses 1,818,320,693 438,729,774
> time elapsed(s) 70.500326434 62.877316408
> user(s) 15.926332000 15.684721000
> sys(s) 54.211939000 47.046165000
>
> 5.conclusion
> Using this patch will reduce dTLB misses and improve performace greatly.
>
> Signed-off-by: Guanghui Feng <[email protected]>
> ---
> arch/arm64/include/asm/mmu.h | 1 +
> arch/arm64/include/asm/mmu_context.h | 28 ++++
> arch/arm64/mm/init.c | 8 +-
> arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++----------
> arch/arm64/mm/proc.S | 19 +++
> 5 files changed, 267 insertions(+), 94 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 48f8466..1a46b81 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
> extern void arm64_memblock_init(void);
> extern void paging_init(void);
> extern void bootmem_init(void);
> +extern void map_crashkernel(void);
> extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
> extern void init_mem_pgprot(void);
> extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 6770667..10dffd4 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
> isb();
> }
>
> +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1)
> +{
> + typedef void (ttbr_replace_func)(phys_addr_t);
> + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb;
> + ttbr_replace_func *replace_phys;
> +
> + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
> + ttbr1 = phys_to_ttbr(ttbr1);
> +
> + if (system_supports_cnp()) {
> + /*
> + * cpu_replace_ttbr1() is used when there's a boot CPU
> + * up (i.e. cpufeature framework is not up yet) and
> + * latter only when we enable CNP via cpufeature's
> + * enable() callback.
> + * Also we rely on the cpu_hwcap bit being set before
> + * calling the enable() function.
> + */
> + ttbr1 |= TTBR_CNP_BIT;
> + }
> +
> + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb));
> +
> + cpu_install_idmap();
> + replace_phys(ttbr1);
> + cpu_uninstall_idmap();
cpu_install_idmap will Swtich to use idmap_pg_dir with use [[[ttbr0_el1]]].
swapper_gp_dir use the [[[ttbr1_el0]]], so there is no tlb conflit to
idmap_pg_dir(the different address space range).

When switching to idmap_pg_dir, there also will be a tlb flush all.

Afterward, the idmap_cpu_replace_ttbr1_with_flush_tlb will run in
idmap_pg_dir pgd and change the pgd to another pgd complying with the
parameter ttbr1.

So, there is no tlb conflict when switching pgd.

Namely, I can change ttbr1 pgd safely without any tlb conflict.

> +}
> +
> /*
> * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
> * avoiding the possibility of conflicting TLB entries being allocated.
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 339ee84..241d27e 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
> crashk_res.start = crash_base;
> crashk_res.end = crash_base + crash_size - 1;
> insert_resource(&iomem_resource, &crashk_res);
> + map_crashkernel();
> }
>
> /*
> @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
> }
>
> early_init_fdt_scan_reserved_mem();
> -
> - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> - reserve_crashkernel();
> -
> high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> }
>
> @@ -438,8 +435,7 @@ void __init bootmem_init(void)
> * request_standard_resources() depends on crashkernel's memory being
> * reserved, so do it here.
> */
> - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
> - reserve_crashkernel();
> + reserve_crashkernel();
>
> memblock_dump_all();
> }
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 626ec32..3e30c82 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -42,6 +42,7 @@
> #define NO_BLOCK_MAPPINGS BIT(0)
> #define NO_CONT_MAPPINGS BIT(1)
> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
> +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/
>
> u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
> @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> +static void pte_clear_cont(pte_t *ptep)
> +{
> + int i = 0;
> + pte_t pte = READ_ONCE(*ptep);
> + if (pte_none(pte) || !pte_cont(pte))
> + return;
> + ptep -= ((u64)ptep / sizeof(pte_t)) &
> + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1);
> + do {
> + pte = pte_mknoncont(READ_ONCE(*ptep));
> + set_pte(ptep, pte);
> + } while (++ptep, ++i < CONT_PTES);
> +}
> +
> static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> - phys_addr_t phys, pgprot_t prot)
> + phys_addr_t phys, pgprot_t prot, int flags)
> {
> pte_t *ptep;
>
> @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> do {
> pte_t old_pte = READ_ONCE(*ptep);
>
> + if (flags & NO_SEC_REMAPPINGS)
> + pte_clear_cont(ptep);
> +
> set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>
> /*
> @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> pte_clear_fixmap();
> }
>
> +static inline void allock_pte_page(pmd_t *pmdp,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
> + phys_addr_t pte_phys;
> + if (!pmd_none(READ_ONCE(*pmdp)))
> + return;
> +
> + if (flags & NO_EXEC_MAPPINGS)
> + pmdval |= PMD_TABLE_PXN;
> + BUG_ON(!pgtable_alloc);
> + pte_phys = pgtable_alloc(PAGE_SHIFT);
> + __pmd_populate(pmdp, pte_phys, pmdval);
> +}
> +
> static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> unsigned long end, phys_addr_t phys,
> pgprot_t prot,
> @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> BUG_ON(pmd_sect(pmd));
> - if (pmd_none(pmd)) {
> - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
> - phys_addr_t pte_phys;
> -
> - if (flags & NO_EXEC_MAPPINGS)
> - pmdval |= PMD_TABLE_PXN;
> - BUG_ON(!pgtable_alloc);
> - pte_phys = pgtable_alloc(PAGE_SHIFT);
> - __pmd_populate(pmdp, pte_phys, pmdval);
> - pmd = READ_ONCE(*pmdp);
> - }
> + allock_pte_page(pmdp, pgtable_alloc, flags);
> + pmd = READ_ONCE(*pmdp);
> BUG_ON(pmd_bad(pmd));
>
> do {
> @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> next = pte_cont_addr_end(addr, end);
>
> /* use a contiguous mapping if the range is suitably aligned */
> - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
> - (flags & NO_CONT_MAPPINGS) == 0)
> + if (!(flags & NO_SEC_REMAPPINGS) &&
> + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
> + (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pte(pmdp, addr, next, phys, __prot);
> + init_pte(pmdp, addr, next, phys, __prot, flags);
>
> phys += next - addr;
> } while (addr = next, addr != end);
> }
>
> +static void pmd_clear_cont(pmd_t *pmdp)
> +{
> + int i = 0;
> + pmd_t pmd = READ_ONCE(*pmdp);
> + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd))
> + return;
> + pmdp -= ((u64)pmdp / sizeof(pmd_t)) &
> + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1);
> + do {
> + pmd = READ_ONCE(*pmdp);
> + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd)));
> + set_pmd(pmdp, pmd);
> + } while (++pmdp, ++i < CONT_PMDS);
> +}
> +
> +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int), int flags)
> +{
> + unsigned long next;
> + pmd_t *pmdp;
> + phys_addr_t map_offset;
> +
> + pmdp = pmd_set_fixmap_offset(pudp, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> +
> + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
> + pmd_clear_cont(pmdp);
> + pmd_clear(pmdp);
> + allock_pte_page(pmdp, pgtable_alloc, flags);
> +
> + map_offset = addr - (addr & PMD_MASK);
> + if (map_offset)
> + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
> + phys - map_offset, prot,
> + pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> +
> + if (next < (addr & PMD_MASK) + PMD_SIZE)
> + alloc_init_cont_pte(pmdp, next,
> + (addr & PUD_MASK) + PUD_SIZE,
> + next - addr + phys,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> + }
> + alloc_init_cont_pte(pmdp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> + phys += next - addr;
> + } while (pmdp++, addr = next, addr != end);
> +
> + pmd_clear_fixmap();
> +}
> +
> static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(int), int flags)
> @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> pmd_clear_fixmap();
> }
>
> +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags)
> +{
> +
> + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
> + phys_addr_t pmd_phys;
> +
> + if (!pud_none(READ_ONCE(*pudp)))
> + return;
> +
> + if (flags & NO_EXEC_MAPPINGS)
> + pudval |= PUD_TABLE_PXN;
> + BUG_ON(!pgtable_alloc);
> + pmd_phys = pgtable_alloc(PMD_SHIFT);
> + __pud_populate(pudp, pmd_phys, pudval);
> +}
> +
> static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> unsigned long end, phys_addr_t phys,
> pgprot_t prot,
> @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> * Check for initial section mappings in the pgd/pud.
> */
> BUG_ON(pud_sect(pud));
> - if (pud_none(pud)) {
> - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
> - phys_addr_t pmd_phys;
> -
> - if (flags & NO_EXEC_MAPPINGS)
> - pudval |= PUD_TABLE_PXN;
> - BUG_ON(!pgtable_alloc);
> - pmd_phys = pgtable_alloc(PMD_SHIFT);
> - __pud_populate(pudp, pmd_phys, pudval);
> - pud = READ_ONCE(*pudp);
> - }
> + alloc_pmd_page(pudp, pgtable_alloc, flags);
> + pud = READ_ONCE(*pudp);
> BUG_ON(pud_bad(pud));
>
> do {
> @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> next = pmd_cont_addr_end(addr, end);
>
> /* use a contiguous mapping if the range is suitably aligned */
> - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
> + if (!(flags & NO_SEC_REMAPPINGS) &&
> + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
> (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
> + if (flags & NO_SEC_REMAPPINGS)
> + init_pmd_remap(pudp, addr, next, phys, __prot,
> + pgtable_alloc, flags);
> + else
> + init_pmd(pudp, addr, next, phys, __prot,
> + pgtable_alloc, flags);
>
> phys += next - addr;
> } while (addr = next, addr != end);
> }
>
> +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + phys_addr_t map_offset;
> +
> + if (!pud_none(*pudp) && pud_sect(*pudp)) {
> + pud_clear(pudp);
> + alloc_pmd_page(pudp, pgtable_alloc, flags);
> +
> + map_offset = addr - (addr & PUD_MASK);
> + if (map_offset)
> + alloc_init_cont_pmd(pudp, addr & PUD_MASK,
> + addr, phys - map_offset,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> +
> + if (next < (addr & PUD_MASK) + PUD_SIZE)
> + alloc_init_cont_pmd(pudp, next,
> + (addr & PUD_MASK) + PUD_SIZE,
> + next - addr + phys,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> + }
> + alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> +}
> +
> +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + pud_t old_pud = READ_ONCE(*pudp);
> + /*
> + * For 4K granule only, attempt to put down a 1GB block
> + */
> + if (pud_sect_supported() &&
> + ((addr | next | phys) & ~PUD_MASK) == 0 &&
> + (flags & NO_BLOCK_MAPPINGS) == 0) {
> + pud_set_huge(pudp, phys, prot);
> +
> + /*
> + * After the PUD entry has been populated once, we
> + * only allow updates to the permission attributes.
> + */
> + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
> + READ_ONCE(pud_val(*pudp))));
> + } else {
> + alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> +
> + BUG_ON(pud_val(old_pud) != 0 &&
> + pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
> + }
> +}
> +
> static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(int),
> @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
> */
> if (system_state != SYSTEM_BOOTING)
> mutex_lock(&fixmap_lock);
> +
> pudp = pud_set_fixmap_offset(p4dp, addr);
> do {
> - pud_t old_pud = READ_ONCE(*pudp);
> -
> next = pud_addr_end(addr, end);
>
> - /*
> - * For 4K granule only, attempt to put down a 1GB block
> - */
> - if (pud_sect_supported() &&
> - ((addr | next | phys) & ~PUD_MASK) == 0 &&
> - (flags & NO_BLOCK_MAPPINGS) == 0) {
> - pud_set_huge(pudp, phys, prot);
> -
> - /*
> - * After the PUD entry has been populated once, we
> - * only allow updates to the permission attributes.
> - */
> - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
> - READ_ONCE(pud_val(*pudp))));
> - } else {
> - alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> - pgtable_alloc, flags);
> -
> - BUG_ON(pud_val(old_pud) != 0 &&
> - pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
> - }
> + if (flags & NO_SEC_REMAPPINGS)
> + init_pud_remap(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> + else
> + init_pud(pudp, addr, next, phys, prot, pgtable_alloc,
> + flags);
> phys += next - addr;
> } while (pudp++, addr = next, addr != end);
>
> pud_clear_fixmap();
> +
> if (system_state != SYSTEM_BOOTING)
> mutex_unlock(&fixmap_lock);
> }
> @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void)
> PAGE_KERNEL_RO);
> }
>
> -static bool crash_mem_map __initdata;
> -
> -static int __init enable_crash_mem_map(char *arg)
> +#ifdef CONFIG_KEXEC_CORE
> +void __init map_crashkernel(void)
> {
> - /*
> - * Proper parameter parsing is done by reserve_crashkernel(). We only
> - * need to know if the linear map has to avoid block mappings so that
> - * the crashkernel reservations can be unmapped later.
> - */
> - crash_mem_map = true;
> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> + return;
>
> - return 0;
> + if (!crashk_res.end)
> + return;
> +
> + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir));
> + init_mm.pgd = init_pg_dir;
> +
> + __create_pgd_mapping(swapper_pg_dir, crashk_res.start,
> + __phys_to_virt(crashk_res.start),
> + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL,
> + early_pgtable_alloc,
> + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS);
> +
> + cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
> + init_mm.pgd = swapper_pg_dir;
> +
> + memblock_phys_free(__pa_symbol(init_pg_dir),
> + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> }

Because unmap_hotplug_range work with swapper_gp_dir pgd. There are many
changes needed for working with init_pg_dir pgd.
So that I use the init_pg_dir and __create_pgd_mapping to direct
walk/modify swapper_gp_dir pagetable.

When finishing it, free the init_pg_dir memblock.

> -early_param("crashkernel", enable_crash_mem_map);
> +#else
> +void __init map_crashkernel(void)
> +{
> + memblock_phys_free(__pa_symbol(init_pg_dir),
> + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> +}
> +#endif
>
> static void __init map_mem(pgd_t *pgdp)
> {
> @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp)
> */
> memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
>
> -#ifdef CONFIG_KEXEC_CORE
> - if (crash_mem_map) {
> - if (IS_ENABLED(CONFIG_ZONE_DMA) ||
> - IS_ENABLED(CONFIG_ZONE_DMA32))
> - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> - else if (crashk_res.end)
> - memblock_mark_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> - }
> -#endif
> -
> /* map all the memory banks */
> for_each_mem_range(i, &start, &end) {
> if (start >= end)
> @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp)
> * in page granularity and put back unused memory to buddy system
> * through /sys/kernel/kexec_crash_size interface.
> */
> -#ifdef CONFIG_KEXEC_CORE
> - if (crash_mem_map &&
> - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
> - if (crashk_res.end) {
> - __map_memblock(pgdp, crashk_res.start,
> - crashk_res.end + 1,
> - PAGE_KERNEL,
> - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> - memblock_clear_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> - }
> - }
> -#endif
> }
>
> void mark_rodata_ro(void)
> @@ -774,8 +903,8 @@ void __init paging_init(void)
> cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
> init_mm.pgd = swapper_pg_dir;
>
> - memblock_phys_free(__pa_symbol(init_pg_dir),
> - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> + //memblock_phys_free(__pa_symbol(init_pg_dir),
> + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
>
> memblock_allow_resize();
> }
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 50bbed9..161bae6 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume)
> isb
> .endm
>
> +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb)
> + save_and_disable_daif flags=x2
> +
> + __idmap_cpu_set_reserved_ttbr1 x1, x3
> +
> + offset_ttbr1 x0, x3
> + msr ttbr1_el1, x0
> + isb
> +
> + restore_daif x2
> +
> + dsb nshst
> + tlbi vmalle1
> + dsb nsh
> + isb
> +
> + ret
> +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb)
> +
Switch ttbr1 to pgd(refer to x0), and flush tlb all.

> /*
> * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1)
> *

2022-07-10 15:54:13

[permalink] [raw]

Subject: Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation

In short, this path work:

1.Before doing work for rebuiling crashkernel mem, the pgd is
swapper_pg_dir in [[[ttbr1]]]

2.Change the [[[ttbr0]]]to use idmap_pg_dir pgd

3.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] are mapped [[[only]]]
with idmap_pg_dir mapping in [[[ttbr0]]]

4.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] will flush tlb all,
switch [[[ttbr1]]] to use init_pg_dir pgd(and flush tlb all again).
There is no tlb conflict to swapper_pg_dir.
There is no tlb cache for swapper_pg_dir.

5.Woring with init_pg_dir pgd to access swapper_pg_dir pagetable with
fix mapping. And modify crashkernel mapping in the swapper_pg_dir
without any tlb conflict and flush.

6.When finishing the work, switch ttbr1 pgd to the origin swapper_pg_dir
with cpu_replace_ttbr1 function(similar to the above).

在 2022/7/10 21:44, Guanghui Feng 写道:
> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone
> (enable crashkernel, disable rodata full, disable kfence), the mem_map will
> use non block/section mapping(for crashkernel requires to shrink the region
> in page granularity). But it will degrade performance when doing larging
> continuous mem access in kernel(memcpy/memmove, etc).
>
> There are many changes and discussions:
> commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for
> platforms with no DMA memory zones")
> commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into
> mem_init()")
> commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel
> reservation is required")
>
> This patch changes mem_map to use block/section mapping with crashkernel.
> Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at
> mem_map, reserve crashkernel memory. And then walking pagetable to split
> block/section mapping to non block/section mapping(normally 4K) [[[only]]]
> for crashkernel mem. So the linear mem mapping use block/section mapping
> as more as possible. We will reduce the cpu dTLB miss conspicuously, and
> accelerate mem access about 10-20% performance improvement.
>
> I have tested it with pft(Page Fault Test) and fio, obtained great
> performace improvement.
>
> For fio test:
> 1.prepare ramdisk
> modprobe -r brd
> modprobe brd rd_nr=1 rd_size=67108864
> dmsetup remove_all
> wipefs -a --force /dev/ram0
> mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0
> mkdir -p /fs/ram0
> mount -t ext4 /dev/ram0 /fs/ram0
>
> 2.prepare fio paremeter in x.fio file:
> [global]
> bs=4k
> ioengine=psync
> iodepth=128
> size=32G
> direct=1
> invalidate=1
> group_reporting
> thread=1
> rw=read
> directory=/fs/ram0
> numjobs=1
>
> [task_0]
> cpus_allowed=16
> stonewall=1
>
> 3.run testcase:
> perf stat -e dTLB-load-misses fio x.fio
>
> 4.contrast
> ------------------------
> without patch with patch
> fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s
> dTLB-load-misses 1,818,320,693 438,729,774
> time elapsed(s) 70.500326434 62.877316408
> user(s) 15.926332000 15.684721000
> sys(s) 54.211939000 47.046165000
>
> 5.conclusion
> Using this patch will reduce dTLB misses and improve performace greatly.
>
> Signed-off-by: Guanghui Feng <[email protected]>
> ---
> arch/arm64/include/asm/mmu.h | 1 +
> arch/arm64/include/asm/mmu_context.h | 28 ++++
> arch/arm64/mm/init.c | 8 +-
> arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++----------
> arch/arm64/mm/proc.S | 19 +++
> 5 files changed, 267 insertions(+), 94 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 48f8466..1a46b81 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
> extern void arm64_memblock_init(void);
> extern void paging_init(void);
> extern void bootmem_init(void);
> +extern void map_crashkernel(void);
> extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt);
> extern void init_mem_pgprot(void);
> extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 6770667..10dffd4 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
> isb();
> }
>
> +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1)
> +{
> + typedef void (ttbr_replace_func)(phys_addr_t);
> + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb;
> + ttbr_replace_func *replace_phys;
> +
> + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
> + ttbr1 = phys_to_ttbr(ttbr1);
> +
> + if (system_supports_cnp()) {
> + /*
> + * cpu_replace_ttbr1() is used when there's a boot CPU
> + * up (i.e. cpufeature framework is not up yet) and
> + * latter only when we enable CNP via cpufeature's
> + * enable() callback.
> + * Also we rely on the cpu_hwcap bit being set before
> + * calling the enable() function.
> + */
> + ttbr1 |= TTBR_CNP_BIT;
> + }
> +
> + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb));
> +
> + cpu_install_idmap();
> + replace_phys(ttbr1);
> + cpu_uninstall_idmap();
> +}
> +
> /*
> * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
> * avoiding the possibility of conflicting TLB entries being allocated.
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 339ee84..241d27e 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void)
> crashk_res.start = crash_base;
> crashk_res.end = crash_base + crash_size - 1;
> insert_resource(&iomem_resource, &crashk_res);
> + map_crashkernel();
> }
>
> /*
> @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void)
> }
>
> early_init_fdt_scan_reserved_mem();
> -
> - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
> - reserve_crashkernel();
> -
> high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> }
>
> @@ -438,8 +435,7 @@ void __init bootmem_init(void)
> * request_standard_resources() depends on crashkernel's memory being
> * reserved, so do it here.
> */
> - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
> - reserve_crashkernel();
> + reserve_crashkernel();
>
> memblock_dump_all();
> }
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 626ec32..3e30c82 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -42,6 +42,7 @@
> #define NO_BLOCK_MAPPINGS BIT(0)
> #define NO_CONT_MAPPINGS BIT(1)
> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
> +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/
>
> u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
> @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
> return ((old ^ new) & ~mask) == 0;
> }
>
> +static void pte_clear_cont(pte_t *ptep)
> +{
> + int i = 0;
> + pte_t pte = READ_ONCE(*ptep);
> + if (pte_none(pte) || !pte_cont(pte))
> + return;
> + ptep -= ((u64)ptep / sizeof(pte_t)) &
> + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1);
> + do {
> + pte = pte_mknoncont(READ_ONCE(*ptep));
> + set_pte(ptep, pte);
> + } while (++ptep, ++i < CONT_PTES);
> +}
> +
> static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> - phys_addr_t phys, pgprot_t prot)
> + phys_addr_t phys, pgprot_t prot, int flags)
> {
> pte_t *ptep;
>
> @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> do {
> pte_t old_pte = READ_ONCE(*ptep);
>
> + if (flags & NO_SEC_REMAPPINGS)
> + pte_clear_cont(ptep);
> +
> set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>
> /*
> @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> pte_clear_fixmap();
> }
>
> +static inline void allock_pte_page(pmd_t *pmdp,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
> + phys_addr_t pte_phys;
> + if (!pmd_none(READ_ONCE(*pmdp)))
> + return;
> +
> + if (flags & NO_EXEC_MAPPINGS)
> + pmdval |= PMD_TABLE_PXN;
> + BUG_ON(!pgtable_alloc);
> + pte_phys = pgtable_alloc(PAGE_SHIFT);
> + __pmd_populate(pmdp, pte_phys, pmdval);
> +}
> +
> static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> unsigned long end, phys_addr_t phys,
> pgprot_t prot,
> @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> BUG_ON(pmd_sect(pmd));
> - if (pmd_none(pmd)) {
> - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
> - phys_addr_t pte_phys;
> -
> - if (flags & NO_EXEC_MAPPINGS)
> - pmdval |= PMD_TABLE_PXN;
> - BUG_ON(!pgtable_alloc);
> - pte_phys = pgtable_alloc(PAGE_SHIFT);
> - __pmd_populate(pmdp, pte_phys, pmdval);
> - pmd = READ_ONCE(*pmdp);
> - }
> + allock_pte_page(pmdp, pgtable_alloc, flags);
> + pmd = READ_ONCE(*pmdp);
> BUG_ON(pmd_bad(pmd));
>
> do {
> @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> next = pte_cont_addr_end(addr, end);
>
> /* use a contiguous mapping if the range is suitably aligned */
> - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
> - (flags & NO_CONT_MAPPINGS) == 0)
> + if (!(flags & NO_SEC_REMAPPINGS) &&
> + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
> + (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pte(pmdp, addr, next, phys, __prot);
> + init_pte(pmdp, addr, next, phys, __prot, flags);
>
> phys += next - addr;
> } while (addr = next, addr != end);
> }
>
> +static void pmd_clear_cont(pmd_t *pmdp)
> +{
> + int i = 0;
> + pmd_t pmd = READ_ONCE(*pmdp);
> + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd))
> + return;
> + pmdp -= ((u64)pmdp / sizeof(pmd_t)) &
> + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1);
> + do {
> + pmd = READ_ONCE(*pmdp);
> + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd)));
> + set_pmd(pmdp, pmd);
> + } while (++pmdp, ++i < CONT_PMDS);
> +}
> +
> +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int), int flags)
> +{
> + unsigned long next;
> + pmd_t *pmdp;
> + phys_addr_t map_offset;
> +
> + pmdp = pmd_set_fixmap_offset(pudp, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> +
> + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
> + pmd_clear_cont(pmdp);
> + pmd_clear(pmdp);
> + allock_pte_page(pmdp, pgtable_alloc, flags);
> +
> + map_offset = addr - (addr & PMD_MASK);
> + if (map_offset)
> + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
> + phys - map_offset, prot,
> + pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> +
> + if (next < (addr & PMD_MASK) + PMD_SIZE)
> + alloc_init_cont_pte(pmdp, next,
> + (addr & PUD_MASK) + PUD_SIZE,
> + next - addr + phys,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> + }
> + alloc_init_cont_pte(pmdp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> + phys += next - addr;
> + } while (pmdp++, addr = next, addr != end);
> +
> + pmd_clear_fixmap();
> +}
> +
> static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(int), int flags)
> @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> pmd_clear_fixmap();
> }
>
> +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags)
> +{
> +
> + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
> + phys_addr_t pmd_phys;
> +
> + if (!pud_none(READ_ONCE(*pudp)))
> + return;
> +
> + if (flags & NO_EXEC_MAPPINGS)
> + pudval |= PUD_TABLE_PXN;
> + BUG_ON(!pgtable_alloc);
> + pmd_phys = pgtable_alloc(PMD_SHIFT);
> + __pud_populate(pudp, pmd_phys, pudval);
> +}
> +
> static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> unsigned long end, phys_addr_t phys,
> pgprot_t prot,
> @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> * Check for initial section mappings in the pgd/pud.
> */
> BUG_ON(pud_sect(pud));
> - if (pud_none(pud)) {
> - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN;
> - phys_addr_t pmd_phys;
> -
> - if (flags & NO_EXEC_MAPPINGS)
> - pudval |= PUD_TABLE_PXN;
> - BUG_ON(!pgtable_alloc);
> - pmd_phys = pgtable_alloc(PMD_SHIFT);
> - __pud_populate(pudp, pmd_phys, pudval);
> - pud = READ_ONCE(*pudp);
> - }
> + alloc_pmd_page(pudp, pgtable_alloc, flags);
> + pud = READ_ONCE(*pudp);
> BUG_ON(pud_bad(pud));
>
> do {
> @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> next = pmd_cont_addr_end(addr, end);
>
> /* use a contiguous mapping if the range is suitably aligned */
> - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
> + if (!(flags & NO_SEC_REMAPPINGS) &&
> + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
> (flags & NO_CONT_MAPPINGS) == 0)
> __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>
> - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
> + if (flags & NO_SEC_REMAPPINGS)
> + init_pmd_remap(pudp, addr, next, phys, __prot,
> + pgtable_alloc, flags);
> + else
> + init_pmd(pudp, addr, next, phys, __prot,
> + pgtable_alloc, flags);
>
> phys += next - addr;
> } while (addr = next, addr != end);
> }
>
> +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + phys_addr_t map_offset;
> +
> + if (!pud_none(*pudp) && pud_sect(*pudp)) {
> + pud_clear(pudp);
> + alloc_pmd_page(pudp, pgtable_alloc, flags);
> +
> + map_offset = addr - (addr & PUD_MASK);
> + if (map_offset)
> + alloc_init_cont_pmd(pudp, addr & PUD_MASK,
> + addr, phys - map_offset,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> +
> + if (next < (addr & PUD_MASK) + PUD_SIZE)
> + alloc_init_cont_pmd(pudp, next,
> + (addr & PUD_MASK) + PUD_SIZE,
> + next - addr + phys,
> + prot, pgtable_alloc,
> + flags & (~NO_SEC_REMAPPINGS));
> + }
> + alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> +}
> +
> +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next,
> + phys_addr_t phys, pgprot_t prot,
> + phys_addr_t (*pgtable_alloc)(int),
> + int flags)
> +{
> + pud_t old_pud = READ_ONCE(*pudp);
> + /*
> + * For 4K granule only, attempt to put down a 1GB block
> + */
> + if (pud_sect_supported() &&
> + ((addr | next | phys) & ~PUD_MASK) == 0 &&
> + (flags & NO_BLOCK_MAPPINGS) == 0) {
> + pud_set_huge(pudp, phys, prot);
> +
> + /*
> + * After the PUD entry has been populated once, we
> + * only allow updates to the permission attributes.
> + */
> + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
> + READ_ONCE(pud_val(*pudp))));
> + } else {
> + alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> +
> + BUG_ON(pud_val(old_pud) != 0 &&
> + pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
> + }
> +}
> +
> static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
> phys_addr_t phys, pgprot_t prot,
> phys_addr_t (*pgtable_alloc)(int),
> @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
> */
> if (system_state != SYSTEM_BOOTING)
> mutex_lock(&fixmap_lock);
> +
> pudp = pud_set_fixmap_offset(p4dp, addr);
> do {
> - pud_t old_pud = READ_ONCE(*pudp);
> -
> next = pud_addr_end(addr, end);
>
> - /*
> - * For 4K granule only, attempt to put down a 1GB block
> - */
> - if (pud_sect_supported() &&
> - ((addr | next | phys) & ~PUD_MASK) == 0 &&
> - (flags & NO_BLOCK_MAPPINGS) == 0) {
> - pud_set_huge(pudp, phys, prot);
> -
> - /*
> - * After the PUD entry has been populated once, we
> - * only allow updates to the permission attributes.
> - */
> - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
> - READ_ONCE(pud_val(*pudp))));
> - } else {
> - alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> - pgtable_alloc, flags);
> -
> - BUG_ON(pud_val(old_pud) != 0 &&
> - pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
> - }
> + if (flags & NO_SEC_REMAPPINGS)
> + init_pud_remap(pudp, addr, next, phys, prot,
> + pgtable_alloc, flags);
> + else
> + init_pud(pudp, addr, next, phys, prot, pgtable_alloc,
> + flags);
> phys += next - addr;
> } while (pudp++, addr = next, addr != end);
>
> pud_clear_fixmap();
> +
> if (system_state != SYSTEM_BOOTING)
> mutex_unlock(&fixmap_lock);
> }
> @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void)
> PAGE_KERNEL_RO);
> }
>
> -static bool crash_mem_map __initdata;
> -
> -static int __init enable_crash_mem_map(char *arg)
> +#ifdef CONFIG_KEXEC_CORE
> +void __init map_crashkernel(void)
> {
> - /*
> - * Proper parameter parsing is done by reserve_crashkernel(). We only
> - * need to know if the linear map has to avoid block mappings so that
> - * the crashkernel reservations can be unmapped later.
> - */
> - crash_mem_map = true;
> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
> + return;
>
> - return 0;
> + if (!crashk_res.end)
> + return;
> +
> + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir));
> + init_mm.pgd = init_pg_dir;
> +
> + __create_pgd_mapping(swapper_pg_dir, crashk_res.start,
> + __phys_to_virt(crashk_res.start),
> + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL,
> + early_pgtable_alloc,
> + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS);
> +
> + cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
> + init_mm.pgd = swapper_pg_dir;
> +
> + memblock_phys_free(__pa_symbol(init_pg_dir),
> + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> }
> -early_param("crashkernel", enable_crash_mem_map);
> +#else
> +void __init map_crashkernel(void)
> +{
> + memblock_phys_free(__pa_symbol(init_pg_dir),
> + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> +}
> +#endif
>
> static void __init map_mem(pgd_t *pgdp)
> {
> @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp)
> */
> memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
>
> -#ifdef CONFIG_KEXEC_CORE
> - if (crash_mem_map) {
> - if (IS_ENABLED(CONFIG_ZONE_DMA) ||
> - IS_ENABLED(CONFIG_ZONE_DMA32))
> - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> - else if (crashk_res.end)
> - memblock_mark_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> - }
> -#endif
> -
> /* map all the memory banks */
> for_each_mem_range(i, &start, &end) {
> if (start >= end)
> @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp)
> * in page granularity and put back unused memory to buddy system
> * through /sys/kernel/kexec_crash_size interface.
> */
> -#ifdef CONFIG_KEXEC_CORE
> - if (crash_mem_map &&
> - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) {
> - if (crashk_res.end) {
> - __map_memblock(pgdp, crashk_res.start,
> - crashk_res.end + 1,
> - PAGE_KERNEL,
> - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> - memblock_clear_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> - }
> - }
> -#endif
> }
>
> void mark_rodata_ro(void)
> @@ -774,8 +903,8 @@ void __init paging_init(void)
> cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
> init_mm.pgd = swapper_pg_dir;
>
> - memblock_phys_free(__pa_symbol(init_pg_dir),
> - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
> + //memblock_phys_free(__pa_symbol(init_pg_dir),
> + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
>
> memblock_allow_resize();
> }
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 50bbed9..161bae6 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume)
> isb
> .endm
>
> +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb)
> + save_and_disable_daif flags=x2
> +
> + __idmap_cpu_set_reserved_ttbr1 x1, x3
> +
> + offset_ttbr1 x0, x3
> + msr ttbr1_el1, x0
> + isb
> +
> + restore_daif x2
> +
> + dsb nshst
> + tlbi vmalle1
> + dsb nsh
> + isb
> +
> + ret
> +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb)
> +
> /*
> * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1)
> *

2022-07-18 13:20:04

[permalink] [raw]

Subject: Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation

On Sun, Jul 10, 2022 at 11:33:02PM +0800, guanghui.fgh wrote:
> In short, this path work:
>
> 1.Before doing work for rebuiling crashkernel mem, the pgd is swapper_pg_dir
> in [[[ttbr1]]]
>
> 2.Change the [[[ttbr0]]]to use idmap_pg_dir pgd
>
> 3.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] are mapped [[[only]]]
> with idmap_pg_dir mapping in [[[ttbr0]]]
>
> 4.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] will flush tlb all,
> switch [[[ttbr1]]] to use init_pg_dir pgd(and flush tlb all again).
> There is no tlb conflict to swapper_pg_dir.
> There is no tlb cache for swapper_pg_dir.
>
> 5.Woring with init_pg_dir pgd to access swapper_pg_dir pagetable with fix
> mapping. And modify crashkernel mapping in the swapper_pg_dir without any
> tlb conflict and flush.
>
> 6.When finishing the work, switch ttbr1 pgd to the origin swapper_pg_dir
> with cpu_replace_ttbr1 function(similar to the above).

I do not think that this complexity is justified. As I have stated on
numerous occasions already, I would prefer that we leave the crashkernel
mapped when rodata is not "full". That fixes your performance issue and
matches what we do for module code, so I do not see a security argument
against it.

I do not plan to merge this patch as-is.

Thanks,

Will

2022-07-25 07:24:17