2022-03-23 09:29:55

by Nadav Amit

Subject: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

From: Nadav Amit <[email protected]>

On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
contended and reading it should (arguably) be avoided as much as
possible.

Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
even when it is not necessary (e.g., the mm was already switched).
This is wasteful.

Moreover, one of the existing optimizations is to read mm's tlb_gen to
see if there are additional in-flight TLB invalidations and flush the
entire TLB in such a case. However, if the request's tlb_gen was already
flushed, the benefit of checking the mm's tlb_gen is likely to be offset
by the overhead of the check itself.
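
As a simplified illustration of the intended ordering (a userspace sketch, not
the kernel code; every name in it is made up for the example), the request's
generation is compared against the locally flushed one first, and the shared,
contended counter is only read when a flush may actually be needed:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Shared, highly contended counter (stands in for mm->context.tlb_gen). */
static _Atomic uint64_t shared_tlb_gen = 3;

/* Per-CPU view of how far this CPU has flushed (stands in for cpu_tlbstate). */
static uint64_t local_tlb_gen = 2;

/* Returns true if the shared counter had to be read. */
static bool handle_flush_request(uint64_t request_gen)
{
	/*
	 * The request is already covered by a previous flush: return
	 * before touching the shared counter, so its cacheline is not
	 * bounced for nothing.
	 */
	if (request_gen <= local_tlb_gen)
		return false;

	/* Only now read the shared generation, as late as possible. */
	uint64_t current_gen = atomic_load(&shared_tlb_gen);

	/* "Flush" everything up to current_gen in this toy model. */
	local_tlb_gen = current_gen;
	return true;
}

int main(void)
{
	printf("gen 2: read shared counter? %d\n", handle_flush_request(2));
	printf("gen 3: read shared counter? %d\n", handle_flush_request(3));
	printf("gen 3 again: read shared counter? %d\n", handle_flush_request(3));
	return 0;
}

The patch makes flush_tlb_func() follow the same ordering: the f->new_tlb_gen
check comes first, and the atomic64_read() of mm_tlb_gen is deferred until it
is known to be needed.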

Running will-it-scale with tlb_flush1_threads shows a considerable
benefit on a 56-core Skylake (up to +24%):

threads   Baseline (v5.17+)    +Patch
      1              159960    160202
      5              310808    308378  (-0.7%)
     10              479110    490728
     15              526771    562528
     20              534495    587316
     25              547462    628296
     30              579616    666313
     35              594134    701814
     40              612288    732967
     45              617517    749727
     50              637476    735497
     55              614363    778913  (+24%)
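
For reference, the benchmark stresses madvise()-induced TLB shootdowns across
threads; the loop below is only a rough sketch of that kind of per-thread
workload (mmap a region, fault it in, MADV_DONTNEED it), not the actual
will-it-scale source:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE	(16UL * 1024 * 1024)
#define ITERATIONS	1000

int main(void)
{
	char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (int i = 0; i < ITERATIONS; i++) {
		/* Fault the pages in so there is something to tear down. */
		memset(buf, 1, BUF_SIZE);

		/* Zap the range; in a threaded run this triggers shootdown IPIs. */
		if (madvise(buf, BUF_SIZE, MADV_DONTNEED)) {
			perror("madvise");
			return 1;
		}
	}

	munmap(buf, BUF_SIZE);
	return 0;
}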

Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>

--

Note: The benchmarked kernels include Dave's revert of commit
6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for
tlb_is_not_lazy()").
---
arch/x86/mm/tlb.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 300b11e45792..6d7c69526051 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,10 +733,10 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
 	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
+	u64 mm_tlb_gen;

 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
@@ -770,6 +770,22 @@ static void flush_tlb_func(void *info)
 		return;
 	}

+	if (f->new_tlb_gen <= local_tlb_gen) {
+		/*
+		 * We are already up to date in respect to f->new_tlb_gen.
+		 * While the core might be still behind mm_tlb_gen, checking
+		 * mm_tlb_gen unnecessarily would have negative caching effects
+		 * so avoid it.
+		 */
+		return;
+	}
+
+	/*
+	 * Defer mm_tlb_gen reading as long as possible to avoid cache
+	 * contention.
+	 */
+	mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+
 	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
 		/*
 		 * There's nothing to do: we're already up to date. This can
--
2.25.1


2022-03-28 22:37:14

by Peter Zijlstra

Subject: Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Tue, Mar 22, 2022 at 10:07:57PM +0000, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
> contended and reading it should (arguably) be avoided as much as
> possible.
>
> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
> even when it is not necessary (e.g., the mm was already switched).
> This is wasteful.
>
> Moreover, one of the existing optimizations is to read mm's tlb_gen to
> see if there are additional in-flight TLB invalidations and flush the
> entire TLB in such a case. However, if the request's tlb_gen was already
> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
> by the overhead of the check itself.
>
> Running will-it-scale with tlb_flush1_threads shows a considerable
> benefit on a 56-core Skylake (up to +24%):
>
> threads   Baseline (v5.17+)    +Patch
>       1              159960    160202
>       5              310808    308378  (-0.7%)
>      10              479110    490728
>      15              526771    562528
>      20              534495    587316
>      25              547462    628296
>      30              579616    666313
>      35              594134    701814
>      40              612288    732967
>      45              617517    749727
>      50              637476    735497
>      55              614363    778913  (+24%)
>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2022-06-06 14:40:10

by Nadav Amit

Subject: Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Mar 28, 2022, at 3:35 AM, Peter Zijlstra <[email protected]> wrote:

> On Tue, Mar 22, 2022 at 10:07:57PM +0000, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>> contended and reading it should (arguably) be avoided as much as
>> possible.
>>
>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>> even when it is not necessary (e.g., the mm was already switched).
>> This is wasteful.
>>
>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>> see if there are additional in-flight TLB invalidations and flush the
>> entire TLB in such a case. However, if the request's tlb_gen was already
>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>> by the overhead of the check itself.
>>
>> Running will-it-scale with tlb_flush1_threads shows a considerable
>> benefit on a 56-core Skylake (up to +24%):
>>
>> threads   Baseline (v5.17+)    +Patch
>>       1              159960    160202
>>       5              310808    308378  (-0.7%)
>>      10              479110    490728
>>      15              526771    562528
>>      20              534495    587316
>>      25              547462    628296
>>      30              579616    666313
>>      35              594134    701814
>>      40              612288    732967
>>      45              617517    749727
>>      50              637476    735497
>>      55              614363    778913  (+24%)
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Ping?

2022-06-06 15:52:30

by Dave Hansen

Subject: Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On 3/22/22 15:07, Nadav Amit wrote:
> +	if (f->new_tlb_gen <= local_tlb_gen) {
> +		/*
> +		 * We are already up to date in respect to f->new_tlb_gen.
> +		 * While the core might be still behind mm_tlb_gen, checking
> +		 * mm_tlb_gen unnecessarily would have negative caching effects
> +		 * so avoid it.
> +		 */
> +		return;
> +	}
> +

Nit: There's at least one "we" in here that needs to get fixed up. I'll
plan to do that when I apply it, but a v2 with that fixed and Peter's
ack added might save me five minutes.
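
For example, something along these lines would do (just an illustration of
dropping the first person, not necessarily the final wording):

		/*
		 * The CPU is already up to date with respect to f->new_tlb_gen.
		 * While it might still be behind mm_tlb_gen, checking mm_tlb_gen
		 * unnecessarily would have negative caching effects, so avoid it.
		 */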

2022-06-06 16:47:17

by Nadav Amit

Subject: Re: [PATCH] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Jun 6, 2022, at 8:29 AM, Dave Hansen <[email protected]> wrote:

> On 3/22/22 15:07, Nadav Amit wrote:
>> +	if (f->new_tlb_gen <= local_tlb_gen) {
>> +		/*
>> +		 * We are already up to date in respect to f->new_tlb_gen.
>> +		 * While the core might be still behind mm_tlb_gen, checking
>> +		 * mm_tlb_gen unnecessarily would have negative caching effects
>> +		 * so avoid it.
>> +		 */
>> +		return;
>> +	}
>> +
>
> Nit: There's at least one "we" in here that needs to get fixed up. I'll
> plan to do that when I apply it, but a v2 with that fixed and Peter's
> ack added might save me five minutes.

No good deed goes unpunished.

I’ll send v2 later today.