LinuxLists.cc - [PATCH v3 0/9] x86: Concurrent TLB flushes

2019-07-19 00:59:34

Subject: [PATCH v3 0/9] x86: Concurrent TLB flushes

[ Cover-letter is identical to v2, including benchmark results,
excluding the change log. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead: each INVLPG can take hundreds of
cycles. This patch-set allows TLB flushes to run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
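
To make the ordering argument concrete, here is an idealized latency model
of a single shootdown. It is a sketch, not kernel code: the function names
are hypothetical, the cycle counts are caller-supplied, and it assumes the
remote CPUs work in parallel with each other and ignores the cost of sending
the IPIs.

```c
/* Idealized latency model of one TLB shootdown; not kernel code. */
#include <assert.h>
#include <stddef.h>

/* Longest remote round trip (IPI delivery + flush + acknowledgment), in cycles. */
static unsigned long slowest_remote(const unsigned long *remote_cycles, size_t n)
{
    unsigned long worst = 0;

    assert(remote_cycles != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (remote_cycles[i] > worst)
            worst = remote_cycles[i];
    return worst;
}

/* Serialized: the local flush and the remote round trip never overlap. */
static unsigned long serialized_cycles(unsigned long local_cycles,
                                       const unsigned long *remote_cycles,
                                       size_t n)
{
    return local_cycles + slowest_remote(remote_cycles, n);
}

/* Concurrent: kick the remote CPUs, flush locally, then wait for stragglers. */
static unsigned long concurrent_cycles(unsigned long local_cycles,
                                       const unsigned long *remote_cycles,
                                       size_t n)
{
    unsigned long remote = slowest_remote(remote_cycles, n);

    return local_cycles > remote ? local_cycles : remote;
}
```

With made-up inputs of a 200-cycle local flush and remote round trips of
300, 500 and 400 cycles, the model gives 700 cycles serialized and 500
cycles concurrent; the local flush is hidden behind the slowest remote CPU.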

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.
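
As an illustration of the false-sharing point, the sketch below keeps state
that only the owning CPU touches on a different cache line from state that
other CPUs read. The struct and field names (other than is_lazy, which
appears in the patch titles) are hypothetical, and the 64-byte line size is
an assumption; this is not the layout used by the patches.

```c
/* Layout sketch only; type and field names are hypothetical. */
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE_BYTES 64  /* assumed cache-line size */

/* State that only the owning CPU reads and writes. */
struct tlb_private {
    unsigned long generation;
    unsigned short asid;
};

/* State that other CPUs read when deciding whether to request a flush. */
struct tlb_shared {
    bool is_lazy;
};

struct percpu_tlb {
    alignas(CACHELINE_BYTES) struct tlb_private priv;
    alignas(CACHELINE_BYTES) struct tlb_shared shared;
};

/* Index of the cache line (relative to the struct start) holding an offset. */
static size_t cacheline_index(size_t offset)
{
    return offset / CACHELINE_BYTES;
}

/* Compile-time check that the shared part starts on a line boundary. */
static_assert(offsetof(struct percpu_tlb, shared) % CACHELINE_BYTES == 0,
              "shared state must start on its own cache line");
```

With this layout, a remote CPU polling shared.is_lazy does not pull in the
cache line that the owner keeps dirtying through priv.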

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one relies on the current
behavior of on_each_cpu(), which invokes the function first remotely and
only then locally [Peter says he remembers that someone might, but
without further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
---   ---------------------   ----------------------  ------
1     1267765 (14146)         1299253 (5715)          +2.4%
2     1734644 (11936)         1799225 (19577)         +3.7%
4     2821268 (41184)         2919132 (40149)         +3.4%
8     4171652 (31243)         4376925 (65416)         +4.9%
16    5590729 (24160)         5829866 (8127)          +4.2%
24    6250212 (24481)         6522303 (28044)         +4.3%
32    3994314 (26606)         4077543 (10685)         +2.0%
48    4345177 (28091)         4417821 (41337)         +1.6%

(Note that on configurations with up to 24 threads, numactl was used to
place all threads on socket 1, which explains the drop in performance
when going to 32 threads.)
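
For readers who want to re-derive the change column from the averages: the
figures appear to be the relative difference truncated (not rounded) to one
decimal place. That is an inference from the numbers, not something stated
in this letter, and the helper name below is hypothetical.

```c
#include <assert.h>

/*
 * Relative change of the patched average versus the baseline average,
 * in tenths of a percent. C integer division truncates toward zero,
 * which matches the apparent truncation in the table.
 */
static long long change_tenths_of_percent(long long baseline, long long patched)
{
    assert(baseline > 0);
    return (patched - baseline) * 1000 / baseline;
}
```

For the 1-thread row, change_tenths_of_percent(1267765, 1299253) yields 24,
i.e. +2.4%; for the 24-thread row, change_tenths_of_percent(6250212,
6522303) yields 43, i.e. +4.3%.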

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
---   ---------------------   ----------------------  ------
1     1598896 (5174)          1607903 (4091)          +0.5%
2     2109472 (17827)         2224726 (4372)          +5.4%
4     3448587 (11952)         3668551 (30219)         +6.3%
8     5425778 (29641)         5606266 (33519)         +3.3%
16    6931232 (34677)         7054052 (27873)         +1.7%
24    7612473 (23482)         7783138 (13871)         +2.2%
32    4296274 (18029)         4283279 (32323)         -0.3%
48    4770029 (35541)         4764760 (13575)         -0.1%

Presumably, PTI requires two invalidations of each mapping, which lets
concurrency yield higher benefits when PTI is on. At the same time, when
mitigations are on, other overheads reduce the potential speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Dave Hansen
Cc: Haiyang Zhang
Cc: Ingo Molnar
Cc: Josh Poimboeuf
Cc: Juergen Gross
Cc: "K. Y. Srinivasan"
Cc: Paolo Bonzini
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sasha Levin
Cc: Stephen Hemminger
Cc: Thomas Gleixner

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  10 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  47 ++++-----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 ++-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 133 ++++++++++++++++----------
 arch/x86/xen/mmu_pv.c                 |  11 +--
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  27 ++++--
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 133 ++++++++++++--------------
 14 files changed, 218 insertions(+), 178 deletions(-)

--
2.20.1
