2023-04-27 03:52:48

by Gang Li

Subject: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi all,

I have encountered a performance issue on our ARM64 machine, which seems
to be caused by flush_tlb_kernel_range().

Here is the stack on the ARM64 machine:

# ARM64:
```
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.

# AMD64:
```
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.
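
For reference, here is my rough understanding of what the two primitives
boil down to (heavily simplified from arch/arm64/include/asm/tlbflush.h
and arch/x86/mm/tlb.c; the large-range fallback, PTI handling, etc. are
elided):

```
/*
 * arm64, heavily simplified from arch/arm64/include/asm/tlbflush.h.
 * The "is" (inner-shareable) TLBI instructions are broadcast in
 * hardware to every CPU in the inner-shareable domain, so this
 * invalidates the range on all cores. (The real code falls back to
 * flush_tlb_all() for very large ranges.)
 */
static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	unsigned long addr;

	dsb(ishst);			/* order prior PTE writes before the TLBI */
	for (addr = start >> 12; addr < end >> 12; addr++)
		__tlbi(vaale1is, addr);	/* broadcast invalidate, all CPUs */
	dsb(ish);			/* wait for completion everywhere */
	isb();
}

/*
 * x86, heavily simplified from arch/x86/mm/tlb.c.
 * INVLPG invalidates only the TLB of the CPU that executes it,
 * so no other core is disturbed.
 */
void flush_tlb_one_kernel(unsigned long addr)
{
	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");	/* local CPU only */
}
```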

This arm64 patch said:
https://lore.kernel.org/all/[email protected]/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
* Despite its name, this function must still broadcast the TLB
* invalidation in order to ensure other CPUs don't end up with junk
* entries as a result of speculation. Unusually, its also called in
* IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
* TLB broadcasting, then we're in trouble here.
*/
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible to let ARM64 flush the TLB on just one core, similar
to AMD64?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li


2023-04-27 07:36:51

by Mark Rutland

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi,

On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
> Hi all,
>
> I have encountered a performance issue on our ARM64 machine, which seems
> to be caused by flush_tlb_kernel_range().

Can you please provide a few more details on what you're seeing?

What does your performance issue look like?

Are you sure that the performance issue is caused by flush_tlb_kernel_range()
specifically?

> Here is the stack on the ARM64 machine:
>
> # ARM64:
> ```
> ghes_unmap
> clear_fixmap
> __set_fixmap
> flush_tlb_kernel_range
> ```
>
> As we can see, the ARM64 implementation eventually calls
> flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
> AMD64, the implementation calls flush_tlb_one_kernel instead.
>
> # AMD64:
> ```
> ghes_unmap
> clear_fixmap
> __set_fixmap
> mmu.set_fixmap
> native_set_fixmap
> __native_set_fixmap
> set_pte_vaddr
> set_pte_vaddr_p4d
> __set_pte_vaddr
> flush_tlb_one_kernel
> ```
>
> On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
> performance degradation.

As above, could you please provide more details on this?

> This arm64 patch said:
> https://lore.kernel.org/all/[email protected]/
> (commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)
>
> ```
> /*
> * Despite its name, this function must still broadcast the TLB
> * invalidation in order to ensure other CPUs don't end up with junk
> * entries as a result of speculation. Unusually, its also called in
> * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
> * TLB broadcasting, then we're in trouble here.
> */
> static inline void arch_apei_flush_tlb_one(unsigned long addr)
> {
> flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> }
> ```
>
> 1. I am curious to know the reason behind the design choice of flushing
> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
> the TLB on a single core. Are there any TLB design details that make a
> difference here?

I don't know why arm64 only clears this on a single CPU.

On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
shared by all CPUs, and the architectural Break-Before-Make rules require
the TLB to be invalidated between two valid (but distinct) entries.
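
To illustrate, updating a live kernel mapping from one valid entry to
another looks roughly like this (an illustrative sketch only, not the
literal kernel code; the helper name is made up):

```
static void bbm_remap_kernel_page(pte_t *ptep, pte_t new_pte, unsigned long addr)
{
	WRITE_ONCE(*ptep, __pte(0));	/* 1. Break: make the entry invalid */
	dsb(ishst);			/* order the PTE write before the TLBI */
	__tlbi(vaale1is, addr >> 12);	/* 2. Invalidate: broadcast to all CPUs */
	dsb(ish);			/* wait until every CPU has dropped it */
	WRITE_ONCE(*ptep, new_pte);	/* 3. Make: install the new valid entry */
	dsb(ishst);
	isb();
}
```

Skipping step 2, or performing it on only the local CPU, is what opens
the window for the problems described below.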

> 2. Is it possible to let ARM64 flush the TLB on just one core, similar
> to AMD64?

No. If we omitted the broadcast TLB invalidation, then a different CPU may
fetch the old value into a TLB, then fetch the new value. When this happens,
the architecture permits "amalgamation", with UNPREDICTABLE results, which
could result in memory corruption, taking SErrors, etc.

> 3. If so, would there be any potential drawbacks or limitations to
> making such a change?

As above, we must use broadcast TLB invalidation here.

Thanks,
Mark.

2023-05-05 10:04:37

by Gang Li

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

This series accidentally lost its Ccs. I am now forwarding the lost
emails to the mailing list.

On 2023/4/28 17:27, Mark Rutland wrote:
>
>
> Hi,
>
> Just to check -- did you mean to drop the other Ccs? It would be good to keep
> this discussion on-list if possible.
>
> On Fri, Apr 28, 2023 at 01:49:46PM +0800, Gang Li wrote:
>> On 2023/4/27 15:30, Mark Rutland wrote:
>>> On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
>>>> 1. I am curious to know the reason behind the design choice of flushing
>>>> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
>>>> the TLB on a single core. Are there any TLB design details that make a
>>>> difference here?
>>>
>>> I don't know why arm64 only clears this on a single CPU.
>>
>> Sorry, I'm a bit confused.
>>
>> Did you mean you don't know why *amd64* only clears this on a single
>> CPU?
>
> Yes, sorry; I meant to say "amd64" rather than "arm64" here.
>
>> Looks like I should ask the amd64 folks :)
>
> :)
>
>>> On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
>>> shared by all CPUs, and the architectural Break-Before-Make rules require
>>> the TLB to be invalidated between two valid (but distinct) entries.
>>
>> ghes_unmap is protected by a spin_lock, so only one core can access this
>> memory area at a time. I understand that there would then be no TLB
>> entries for this memory area on other cores.
>>
>> Is it because arm64 has speculative execution? Even if a core does not
>> hold the spin_lock, can its TLB still cache entries for the critical
>> section?
>
> The architecture allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
>
> Thanks,
> Mark.
>

2023-05-05 12:35:21

by Gang Li

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi,

I found that in `ghes_unmap`, which is protected by a spinlock, arm64 and
x86 have different strategies for flushing the TLB.

# arm64 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```

# x86 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```

As we can see, ghes_unmap in arm64 eventually calls
flush_tlb_kernel_range to broadcast TLB invalidation. However, on
x86, ghes_unmap calls flush_tlb_one_kernel.

Why does arm64 need to broadcast TLB invalidation in ghes_unmap when only
one CPU has accessed this memory area?

Mark Rutland said in
https://lore.kernel.org/lkml/[email protected]/

> The architecture (arm64) allows a CPU to allocate TLB entries at any
> time for any reason, for any valid translation table entries reachable
> from the root in TTBR{0,1}_ELx. That can be due to speculation,
> prefetching, and/or other reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses
> a memory location -- TLB entries can be allocated regardless.
> Consequently, the spinlock doesn't make any difference.
>

So arm64 broadcasts TLB invalidation in ghes_unmap, because TLB entries
can be allocated regardless of whether a CPU explicitly accesses the memory.

Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
difference between x86 and arm64 in TLB allocation and invalidation
strategy?

Thanks,
Gang Li

2023-05-06 03:09:15

by Gang Li

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi,

On 2023/4/28 17:27, Mark Rutland wrote:
> The architecture allows a CPU to allocate TLB entries at any time for
> any reason, for any valid translation table entries reachable from the
> root in TTBR{0,1}_ELx. That can be due to speculation, prefetching,
> and/or other reasons.
>

TLB entries can be allocated due to prefetching or branch prediction. Will
they be invalidated when the prediction fails?

> Due to that, it doesn't matter whether or not a CPU explicitly accesses
> a memory location -- TLB entries can be allocated regardless.
> Consequently, the spinlock doesn't make any difference.
>

And is there any kind of ARM manual or guide that explains these details
to help us program better?

Thanks a lot for your help.
Gang Li

2023-05-09 15:16:17

by Mark Rutland

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

On Sat, May 06, 2023 at 10:51:23AM +0800, Gang Li wrote:
> Hi,
>
> On 2023/4/28 17:27, Mark Rutland wrote:
> > The architecture allows a CPU to allocate TLB entries at any time for
> > any reason, for any valid translation table entries reachable from the
> > root in TTBR{0,1}_ELx. That can be due to speculation, prefetching,
> > and/or other reasons.
>
> TLB entries can be allocated due to prefetching or branch prediction. Will
> they be invalidated when the prediction fails?

No; once allocated they're allowed to remain until explicitly invalidated.

See below for more detail.

> > Due to that, it doesn't matter whether or not a CPU explicitly accesses
> > a memory location -- TLB entries can be allocated regardless.
> > Consequently, the spinlock doesn't make any difference.
>
> And is there any kind of ARM manual or guide that explains these details to
> help us program better?

There's no guide that I am aware of, but this is described in the ARM ARM. The
current release (ARM DDI 0487J.a) can be found at:

https://developer.arm.com/documentation/ddi0487/ja

... and in future, the latest version should be available at:

https://developer.arm.com/documentation/ddi0487/latest

In the latest release (ARM DDI 0487J.a) relevant information can be found in
section D8 "The AArch64 Virtual Memory System Architecture", with key
information in D8.13 "Translation Lookaside Buffers" and D8.14 "TLB
maintenance".

For example, early in D8.13 we have the rule:

| R_SQBCS
|
| When address translation is enabled, a translation table entry for an
| in-context translation regime that does not cause a Translation fault, an
| Address size fault, or an Access flag fault is permitted to be cached in a
| TLB or intermediate TLB caching structure as the result of an explicit or
| speculative access.

Thanks,
Mark.

2023-05-16 03:47:39

by Gang Li

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi all!

On 2023/5/5 20:28, Gang Li wrote:
> Hi,
>
> I found that in `ghes_unmap` protected by spinlock, arm64 and x86 have
> different strategies for flushing tlb.
>
> # arm64 call trace:
> ```
> holding a spin lock
> ghes_unmap
>  clear_fixmap
>   __set_fixmap
>    flush_tlb_kernel_range
> ```
>
> # x86 call trace:
> ```
> holding a spin lock
> ghes_unmap
>  clear_fixmap
>   __set_fixmap
>    mmu.set_fixmap
>     native_set_fixmap
>      __native_set_fixmap
>       set_pte_vaddr
>        set_pte_vaddr_p4d
>         __set_pte_vaddr
>          flush_tlb_one_kernel
> ```
>
> arm64 broadcast TLB invalidation in ghes_unmap, because TLB entry can be
> allocated regardless of whether the CPU explicitly accesses memory.
>
> Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
> difference between x86 and arm64 in TLB allocation and invalidation
> strategy?
>

I found this in Intel® 64 and IA-32 Architectures Software Developer
Manuals:

> 4.10.2.3 Details of TLB Use
> Subject to the limitations given in the previous paragraph, the
> processor may cache a translation for any linear address, even if that
> address is not used to access memory. For example, the processor may
> cache translations required for prefetches and for accesses that result
> from speculative execution that would never actually occur in the
> executed code path.

Both x86 and arm64 can cache TLB entries for prefetches and speculative
execution, so why are their flush policies different?

Thanks,
Gang Li

2023-05-16 07:56:32

by Gang Li

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Hi,

On 2023/5/9 22:30, Mark Rutland wrote:
> For example, early in D8.13 we have the rule:
>
> | R_SQBCS
> |
> | When address translation is enabled, a translation table entry for an
> | in-context translation regime that does not cause a Translation fault, an
> | Address size fault, or an Access flag fault is permitted to be cached in a
> | TLB or intermediate TLB caching structure as the result of an explicit or
> | speculative access.
>

Thanks a lot!

I looked up the x86 manual and found that the x86 TLB caching mechanism
is similar to arm64's (though the x86 folks haven't replied to me yet):

Intel® 64 and IA-32 Architectures Software Developer Manuals:
> 4.10.2.3 Details of TLB Use
> Subject to the limitations given in the previous paragraph, the
> processor may cache a translation for any linear address, even if that
> address is not used to access memory. For example, the processor may
> cache translations required for prefetches and for accesses that result
> from speculative execution that would never actually occur in the
> executed code path.

Both architectures have similar TLB caching policies, so why does arm64
flush on all CPUs while x86 flushes only locally in ghes_map and
ghes_unmap?

I think the flush-all may be unnecessary:

1. Before accessing GHES data, each CPU needs to call ghes_map, which
creates the mapping and flushes that CPU's own TLB to make sure it is
using the latest mapping.

2. There is then no need to flush all CPUs in ghes_unmap, because other
CPUs' ghes_map calls will flush their own TLBs before they access the
memory. (A rough sketch of this scheme follows below.)
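
Concretely, something like this is what I have in mind (a sketch only;
the __set_fixmap_no_flush(), clear_fixmap_no_flush() and
local_flush_tlb_kernel_page() helpers are hypothetical and do not exist
in the kernel today):

```
static void __iomem *ghes_map_local(u64 pfn, enum fixed_addresses idx)
{
	phys_addr_t paddr = PFN_PHYS(pfn);

	/* install the mapping without any broadcast invalidation */
	__set_fixmap_no_flush(idx, paddr, arch_apei_get_mem_attribute(paddr));

	/* the current CPU flushes only its own stale entry before use */
	local_flush_tlb_kernel_page(__fix_to_virt(idx));

	return (void __iomem *)(__fix_to_virt(idx) + (paddr & ~PAGE_MASK));
}

static void ghes_unmap_local(void __iomem *vaddr, enum fixed_addresses idx)
{
	/* tear down locally; rely on the next ghes_map_local() to flush */
	clear_fixmap_no_flush(idx);
	local_flush_tlb_kernel_page((unsigned long)vaddr & PAGE_MASK);
}
```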

What do you think?

Thanks,
Gang Li.

2023-05-16 12:20:15

by Mark Rutland

Subject: Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

On Tue, May 16, 2023 at 03:47:16PM +0800, Gang Li wrote:
> Hi,
>
> On 2023/5/9 22:30, Mark Rutland wrote:
> > For example, early in D8.13 we have the rule:
> >
> > | R_SQBCS
> > |
> > | When address translation is enabled, a translation table entry for an
> > | in-context translation regime that does not cause a Translation fault, an
> > | Address size fault, or an Access flag fault is permitted to be cached in a
> > | TLB or intermediate TLB caching structure as the result of an explicit or
> > | speculative access.
> >
>
> Thanks a lot!
>
> I looked up the x86 manual and found that the x86 TLB caching mechanism
> is similar to arm64's (though the x86 folks haven't replied to me yet):
>
> Intel® 64 and IA-32 Architectures Software Developer Manuals:
> > 4.10.2.3 Details of TLB Use
> > Subject to the limitations given in the previous paragraph, the
> > processor may cache a translation for any linear address, even if that
> > address is not used to access memory. For example, the processor may
> > cache translations required for prefetches and for accesses that result
> > from speculative execution that would never actually occur in the
> > executed code path.
>
> Both architectures have similar TLB caching policies, so why does arm64
> flush on all CPUs while x86 flushes only locally in ghes_map and
> ghes_unmap?
>
> I think the flush-all may be unnecessary:
>
> 1. Before accessing GHES data, each CPU needs to call ghes_map, which
> creates the mapping and flushes that CPU's own TLB to make sure it is
> using the latest mapping.
>
> 2. There is then no need to flush all CPUs in ghes_unmap, because other
> CPUs' ghes_map calls will flush their own TLBs before they access the
> memory.

This is not sufficient. Regardless of whether CPUs *explicitly* access the VA
range, any CPU which can reach the live translation table entry is allowed to
fetch that and allocate it into a TLB at any time.

When a Break-Before-Make sequence isn't followed, the architecture permits a
number of resulting behaviours, including "amalgamation", where the TLB entries
are combined in some arbitrary IMPLEMENTATION DEFINED way. The architecture
isn't very clear here, but doesn't rule out two entries being combined such
that it generates an arbitrary physical address and/or such that the MMU thinks
the entry is from an intermediate walk. In either of those cases, the CPU might
speculatively access device memory (which could change the state of the system,
or cause fatal SErrors), and/or allocate further junk into TLBs.

So per the architecture, broadcast maintenance is necessary on arm64. The only
way to avoid it would be to have a local set of translation tables which are
not shared with other CPUs.

I suspect x86 might not have the same issue with amalgamation.

Thanks,
Mark.