2022-06-07 08:18:24

by Nadav Amit

Subject: Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Jun 6, 2022, at 1:48 PM, Andy Lutomirski <[email protected]> wrote:

> On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>> contended and reading it should (arguably) be avoided as much as
>> possible.
>>
>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>> even when it is not necessary (e.g., the mm was already switched).
>> This is wasteful.
>>
>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>> see if there are additional in-flight TLB invalidations and flush the
>> entire TLB in such a case. However, if the request's tlb_gen was already
>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>> by the overhead of the check itself.
>>
>> Running will-it-scale with tlb_flush1_threads shows a considerable
>> benefit on 56-core Skylake (up to +24%):
>
> Acked-by: Andy Lutomirski <[email protected]>
>
> But...
>
> I'm suspicious that the analysis is missing something. Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel. Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU). And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen. So you're doing (at least!) an E->S and S->E transition per flush. Your patch doesn't change this.
>
> But your patch does add a whole new case in which the IPI handler simply doesn't flush! I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.
>
> Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional? I'm curious if there's more to the story than you're seeing.
>
> You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen. That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition. (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)

Thanks, Andy. I still think that the performance benefit comes from saving
cache accesses, which are skipped in certain cases in this workload. I would
note that this patch came out of profiling will-it-scale after Dave complained
that I had ruined the performance in some other patch. So this is not a random
“I tried something and it’s better”.

I vaguely remember profiling the number of cache-[something] and seeing an
effect, and I cannot explain such a performance improvement by just skipping
a flush. But...

Having said all of that, I will run at least the first experiment that you
asked for. I had considered skipping the read of mm_tlb_gen completely, but
for the reasons that you mentioned I was concerned that it might introduce a
performance regression for workloads that are more important than
will-it-scale.
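
For illustration, a minimal user-space sketch of what skipping the
mm_tlb_gen read entirely would look like, with stand-in structures and a
printf in place of the real flush rather than actual kernel code: the
handler catches up only to f->new_tlb_gen, so the shared mm_tlb_gen
cacheline is never read on this path.

#include <stdio.h>

struct flush_request {
        unsigned long new_tlb_gen;      /* generation published by the initiator */
};

struct cpu_tlb_state {
        unsigned long local_tlb_gen;    /* how far this CPU has flushed */
};

static void flush_using_request_gen_only(struct cpu_tlb_state *cpu,
                                         const struct flush_request *f)
{
        if (f->new_tlb_gen <= cpu->local_tlb_gen)
                return;                 /* already covered, nothing to do */

        /*
         * "Flush" (modeled by the printf) and advance only to the requested
         * generation; later in-flight invalidations may need another IPI,
         * which is the "bit of extra flushing" Andy mentioned.
         */
        printf("flushing up to gen %lu\n", f->new_tlb_gen);
        cpu->local_tlb_gen = f->new_tlb_gen;
}

int main(void)
{
        struct cpu_tlb_state cpu = { .local_tlb_gen = 1 };
        struct flush_request f = { .new_tlb_gen = 2 };

        flush_using_request_gen_only(&cpu, &f);
        return 0;
}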

I would also admit that I am not sure how to completely prevent a
speculative read of mm->tlb_gen. I guess a serializing instruction is out of
the question, so this optimization is best-effort.


2022-06-08 04:38:20

by Nadav Amit

Subject: Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Jun 6, 2022, at 2:06 PM, Nadav Amit <[email protected]> wrote:

> On Jun 6, 2022, at 1:48 PM, Andy Lutomirski <[email protected]> wrote:
>
>> On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>>
>>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>>> contended and reading it should (arguably) be avoided as much as
>>> possible.
>>>
>>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>>> even when it is not necessary (e.g., the mm was already switched).
>>> This is wasteful.
>>>
>>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>>> see if there are additional in-flight TLB invalidations and flush the
>>> entire TLB in such a case. However, if the request's tlb_gen was already
>>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>>> by the overhead of the check itself.
>>>
>>> Running will-it-scale with tlb_flush1_threads shows a considerable
>>> benefit on 56-core Skylake (up to +24%):
>>
>> Acked-by: Andy Lutomirski <[email protected]>
>>
>> But...
>>
>> I'm suspicious that the analysis is missing something. Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel. Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU). And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen. So you're doing (at least!) an E->S and S->E transition per flush. Your patch doesn't change this.
>>
>> But your patch does add a whole new case in which the IPI handler simply doesn't flush! I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.
>>
>> Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional? I'm curious if there's more to the story than you're seeing.
>>
>> You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen. That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition. (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)
>
> Thanks, Andy. I still think that the performance benefit comes from saving
> cache accesses, which are skipped in certain cases in this workload. I would
> note that this patch came out of profiling will-it-scale after Dave complained
> that I had ruined the performance in some other patch. So this is not a random
> “I tried something and it’s better”.
>
> I vaguely remember profiling the number of cache-[something] and seeing an
> effect, and I cannot explain such a performance improvement by just skipping
> a flush. But...
>
> Having said all of that, I will run at least the first experiment that you
> asked for. I had considered skipping the read of mm_tlb_gen completely, but
> for the reasons that you mentioned I was concerned that it might introduce a
> performance regression for workloads that are more important than
> will-it-scale.
>
> I would also admit that I am not sure how to completely prevent a
> speculative read of mm->tlb_gen. I guess a serializing instruction is out of
> the question, so this optimization is best-effort.

Here are the results of my runs. Note that these results are not comparable
to the earlier ones. This time I used an older machine (Haswell) and the
configuration is slightly different (IIRC the previous run had PTI disabled
and this one has it enabled; arguably this is less favorable for this patch,
since the cache-effect part of the overall TLB shootdown cost is smaller).

As you noted, Andy, there are two things (related, in my mind) that the patch
does: it returns early with no flush if f->new_tlb_gen <= local_tlb_gen, and
it tries to avoid touching mm->tlb_gen in order to minimize cache effects.
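
To make the two decisions concrete, here is a minimal user-space sketch of
the logic; the struct names and the printf-based "flush" are stand-ins for
illustration, not the actual flush_tlb_func() code or kernel types.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct mm_model {
        _Atomic unsigned long tlb_gen;  /* shared, contended cacheline */
};

struct flush_request {
        unsigned long new_tlb_gen;      /* generation the initiator asks for */
};

struct cpu_tlb_state {
        unsigned long local_tlb_gen;    /* generation this CPU has flushed to */
};

/* Returns true if this "CPU" performed a flush. */
static bool flush_tlb_func_model(struct mm_model *mm, struct cpu_tlb_state *cpu,
                                 const struct flush_request *f)
{
        /*
         * (1) Early return: the request is already covered, no flush is
         * needed, and mm->tlb_gen is not read at all, so the shared
         * cacheline is left alone.
         */
        if (f->new_tlb_gen <= cpu->local_tlb_gen)
                return false;

        /*
         * (2) Only when a flush is actually needed, read the shared
         * generation to fold in any further in-flight invalidations.
         */
        unsigned long mm_tlb_gen = atomic_load(&mm->tlb_gen);

        cpu->local_tlb_gen = mm_tlb_gen > f->new_tlb_gen ? mm_tlb_gen
                                                         : f->new_tlb_gen;
        printf("flushed, local generation is now %lu\n", cpu->local_tlb_gen);
        return true;
}

int main(void)
{
        struct mm_model mm = { .tlb_gen = 3 };
        struct cpu_tlb_state cpu = { .local_tlb_gen = 2 };
        struct flush_request stale = { .new_tlb_gen = 2 };
        struct flush_request fresh = { .new_tlb_gen = 3 };

        flush_tlb_func_model(&mm, &cpu, &stale);  /* early return, no flush */
        flush_tlb_func_model(&mm, &cpu, &fresh);  /* flush, reads mm->tlb_gen */
        return 0;
}

The configurations in the table below differ in whether the early return in
(1) is present and in how (or whether) the read in (2) is done.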

You asked me to run experiments that separate these two effects.

                   no early return   <---------------- early return ---------------->
 threads           5.18.1            +patch    touch mm->tlb_gen    no mm->tlb_gen [*]

  1                159504            159705    159284               159499
  5                326725            320440    323629               303195
 10                489841            497678    498601               442976
 15                552282            576148    570337               503709
 20                588333            628960    619342               551789
 25                637319            675633    659985               591575
 30                643372            691581    670599               613017
 35                677259            706157    689624               646873
 40                659309            728078    716718               655364
 45                670985            735346    696558               665500

[*] mm->tlb_gen is completely removed from flush_tlb_func() in this setup.

Now, clearly this experiment is limited, and I did not measure the number of
TLB shootdowns, the number of cache misses, etc.

Having said that, the conclusions I draw are:
1. The performance benefit appears to come from both the early return and
from avoiding touching mm->tlb_gen.
2. Even in this setup, your optimization (Andy) of checking mm->tlb_gen pays
off. Removing mm->tlb_gen completely from flush_tlb_func() is bad.

Now, if you ask me how this whole thing could be further improved, I would
say that on remote cores it is perhaps best to do the actual TLB flush only
after going over all the queued TLB shootdown entries and figuring out
whether there are multiple outstanding shootdowns and what the target
generation is (using the maximum of the entries' f->new_tlb_gen, without
touching mm->tlb_gen). This could be done by adding a stage in
flush_smp_call_function_queue(), but it might slightly break the abstraction
layers.
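
A rough user-space sketch of that idea, assuming the handler could see all
pending requests at once; the request array and the printf stand in for the
real call-function queue and flush, so this only illustrates the
aggregation, not an actual flush_smp_call_function_queue() change.

#include <stdio.h>

struct flush_request {
        unsigned long new_tlb_gen;
};

/*
 * Walk the pending shootdown requests, take the maximum requested
 * generation without touching mm->tlb_gen, and do a single catch-up
 * flush instead of one flush per request.
 */
static unsigned long coalesce_and_flush(const struct flush_request *reqs,
                                        int nr, unsigned long local_tlb_gen)
{
        unsigned long target = local_tlb_gen;
        int i;

        for (i = 0; i < nr; i++) {
                if (reqs[i].new_tlb_gen > target)
                        target = reqs[i].new_tlb_gen;
        }

        if (target > local_tlb_gen)
                printf("one flush, catching up from gen %lu to %lu\n",
                       local_tlb_gen, target);

        return target;
}

int main(void)
{
        struct flush_request pending[] = {
                { .new_tlb_gen = 4 }, { .new_tlb_gen = 6 }, { .new_tlb_gen = 5 },
        };
        unsigned long local_tlb_gen = 3;

        local_tlb_gen = coalesce_and_flush(pending, 3, local_tlb_gen);
        printf("local generation is now %lu\n", local_tlb_gen);
        return 0;
}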

Having said all of that, I think that the performance improvement is
considerable even in this configuration (which, as I noted above, is less
favorable for this patch).