On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> On 2022/6/21 13:33, Baoquan He wrote:
> > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > If the crashkernel has both high memory above DMA zones and low memory
> > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > high memory instead of the low memory. This means that only high memory
> > > requires write protection based on page-level mapping. The allocation of
> > > high memory does not depend on the DMA boundary. So we can reserve the
> > > high memory first even if the crashkernel reservation is deferred.
> > >
> > > This means that the block mapping can still be performed on other kernel
> > > linear address spaces, the TLB miss rate can be reduced and the system
> > > performance will be improved.
> >
> > Ugh, this looks a little ugly, honestly.
> >
> > If it is certain that arm64 can't split large page mappings of the
> > linear region, this patch is one way to optimize the linear mapping.
> > Given that a kdump setup is necessary on arm64 servers, boot speed is
> > indeed heavily impacted.
>
> Is there some conclusion or discussion that arm64 can't split large page
> mappings?
>
> Could the crashkernel reservation (and the KFENCE pool) be split dynamically?
>
> I found Mark's reply to "arm64: remove page granularity limitation from
> KFENCE"[1]:
>
>   "We also avoid live changes from block<->table mappings, since the
>   architecture gives us very weak guarantees there and generally requires
>   a Break-Before-Make sequence (though IIRC this was tightened up
>   somewhat, so maybe going one way is supposed to work). Unless it's
>   really necessary, I'd rather not split these block mappings while
>   they're live."

The problem with splitting is that you can end up with two entries in
the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
abort (but can be worse like loss of coherency).

Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
all, the software would have to unmap the range, TLBI, remap. With
FEAT_BBM (level 2), we can do this without tearing the mapping down but
we still need to handle the potential TLB conflict abort. The handler
only needs a TLBI but if it touches the memory range being changed it
risks faulting again. With vmap stacks and the kernel image mapped in
the vmalloc space, we have a small window where this could be handled
but we probably can't go into the C part of the exception handling
(tracing etc. may access a kmalloc'ed object for example).

Another option is to do a stop_machine() (if multi-processor at that
point), disable the MMUs, modify the page tables, re-enable the MMU but
it's also complicated.
--
Catalin

Hi Catalin,

On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > On 2022/6/21 13:33, Baoquan He wrote:
> > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > high memory instead of the low memory. This means that only high memory
> > > > requires write protection based on page-level mapping. The allocation of
> > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > high memory first even if the crashkernel reservation is deferred.
> > > >
> > > > This means that the block mapping can still be performed on other kernel
> > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > performance will be improved.
> > >
> > > Ugh, this looks a little ugly, honestly.
> > >
> > > If it is certain that arm64 can't split large page mappings of the
> > > linear region, this patch is one way to optimize the linear mapping.
> > > Given that a kdump setup is necessary on arm64 servers, boot speed is
> > > indeed heavily impacted.
> >
> > Is there some conclusion or discussion that arm64 can't split large page
> > mappings?
> >
> > Could the crashkernel reservation (and the KFENCE pool) be split dynamically?
> >
> > I found Mark's reply to "arm64: remove page granularity limitation from
> > KFENCE"[1]:
> >
> >   "We also avoid live changes from block<->table mappings, since the
> >   architecture gives us very weak guarantees there and generally requires
> >   a Break-Before-Make sequence (though IIRC this was tightened up
> >   somewhat, so maybe going one way is supposed to work). Unless it's
> >   really necessary, I'd rather not split these block mappings while
> >   they're live."
>
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).

Thanks for the explanation. Is this a drawback of the arm64 design? x86
code does the same thing without issue. Is there a way to overcome this
on arm64, from either the hardware or the software side?

I once had an arm64 server with huge memory, and its boot time differed
with and without the crashkernel setting. More frequent TLB misses and
flushes also carry a performance cost. It is really a pity if we have a
very powerful arm64 CPU and system capacity, but are bottlenecked by
this drawback.
>
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).
>
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
>
> --
> Catalin
>

On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but can be worse like loss of coherency).
>
> Thanks for the explanation. Is this a drawback of the arm64 design? x86
> code does the same thing without issue. Is there a way to overcome this
> on arm64, from either the hardware or the software side?

It is a drawback of the arm64 implementations. Having multiple TLB
entries for the same VA would need additional logic in hardware to
detect, so the microarchitects have pushed back. In ARMv8.4, some
balance was reached with FEAT_BBM so that the only visible side-effect
is a potential TLB conflict abort that could be resolved by software.

> I once had an arm64 server with huge memory, and its boot time differed
> with and without the crashkernel setting. More frequent TLB misses and
> flushes also carry a performance cost. It is really a pity if we have a
> very powerful arm64 CPU and system capacity, but are bottlenecked by
> this drawback.

Is it only the boot time affected or the runtime performance as well?

--
Catalin

On 06/23/22 at 03:07pm, Catalin Marinas wrote:
> On Wed, Jun 22, 2022 at 04:35:16PM +0800, Baoquan He wrote:
> > On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> > > The problem with splitting is that you can end up with two entries in
> > > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > > abort (but can be worse like loss of coherency).
> >
> > Thanks for the explanation. Is this a drawback of the arm64 design? x86
> > code does the same thing without issue. Is there a way to overcome this
> > on arm64, from either the hardware or the software side?
>
> It is a drawback of the arm64 implementations. Having multiple TLB
> entries for the same VA would need additional logic in hardware to
> detect, so the microarchitects have pushed back. In ARMv8.4, some
> balance was reached with FEAT_BBM so that the only visible side-effect
> is a potential TLB conflict abort that could be resolved by software.

I see, thanks.

>
> > I once had an arm64 server with huge memory, and its boot time differed
> > with and without the crashkernel setting. More frequent TLB misses and
> > flushes also carry a performance cost. It is really a pity if we have a
> > very powerful arm64 CPU and system capacity, but are bottlenecked by
> > this drawback.
>
> Is it only the boot time affected or the runtime performance as well?

Sorry for the late reply. What I observed is serious boot-time latency
with huge memory. Since timestamps are not available at that point in
boot, I can't give a number. I didn't pay attention to the runtime
performance.