From: Alex Shi
To: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)), linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: Alex Shi, Andrew Morton, Andy Lutomirski, Rik van Riel
Subject: [REF PATCH] x86/tlb: just do tlb flush on one of siblings of SMT
Date: Wed, 6 Apr 2016 11:14:17 +0800
Message-Id: <1459912457-5630-1-git-send-email-alex.shi@linaro.org>

It seems the SMT siblings of an Intel core share a single TLB pool, so
flushing the TLB on both siblings just costs an extra, useless IPI plus
an extra flush. Worse, the extra flush evicts TLB entries the other
sibling has just loaded. That is a double waste. Micro benchmarking
shows memory access saving about 25% of its time on my Haswell i7
desktop. (A minimal userspace sketch of the sibling-pruning loop
follows the patch below.)

munmap source code is here:
https://lkml.org/lkml/2012/5/17/59

test result on Kernel v4.5.0:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses     #  2.47% of all dTLB cache hits  (43.05%)
       757,380,911      dTLB-loads                                            (34.34%)
         2,125,275      dTLB-store-misses                                     (32.23%)
       318,307,759      dTLB-stores                                           (46.32%)
            32,765      iTLB-load-misses     #  2.03% of all iTLB cache hits  (56.90%)
         1,616,237      iTLB-loads                                            (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses     #  1.89% of all dTLB cache hits  (41.72%)
       844,099,241      dTLB-loads                                            (33.30%)
         1,328,102      dTLB-store-misses                                     (52.13%)
       280,902,875      dTLB-stores                                           (52.03%)
            27,678      iTLB-load-misses     #  1.67% of all iTLB cache hits  (35.35%)
         1,659,550      iTLB-loads                                            (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912

The remote flush IPIs received drop roughly in half (32262 -> 15912),
as expected when each SMT core now gets one IPI instead of two, and the
per-access cost falls from 20ns to 16ns.

BTW, this behavior isn't architecturally guaranteed.

Signed-off-by: Alex Shi
Cc: Andrew Morton
To: linux-kernel@vger.kernel.org
To: Mel Gorman
To: x86@kernel.org
Peter Anvin" To: Thomas Gleixner Cc: Andy Lutomirski Cc: Rik van Riel Cc: Alex Shi --- arch/x86/mm/tlb.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8f4cc3d..6510316 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, unsigned long start, unsigned long end) { + int cpu; struct flush_tlb_info info; + cpumask_t flush_mask, *sblmask; + info.flush_mm = mm; info.flush_start = start; info.flush_end = end; @@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask, &info, 1); return; } - smp_call_function_many(cpumask, flush_tlb_func, &info, 1); + + if (unlikely(smp_num_siblings <= 1)) { + smp_call_function_many(cpumask, flush_tlb_func, &info, 1); + return; + } + + /* Only one flush needed on both siblings of SMT */ + cpumask_copy(&flush_mask, cpumask); + for_each_cpu(cpu, &flush_mask) { + sblmask = topology_sibling_cpumask(cpu); + if (!cpumask_subset(sblmask, &flush_mask)) + continue; + + cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask); + } + + smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1); } void flush_tlb_current_task(void) -- 2.7.2.333.g70bd996