From: Alex Shi
To: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)), linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: Alex Shi, Andrew Morton, Andy Lutomirski, Rik van Riel
Subject: [REF PATCH] x86/tlb: just do tlb flush on one of siblings of SMT
Date: Wed, 6 Apr 2016 11:14:17 +0800
Message-Id: <1459912457-5630-1-git-send-email-alex.shi@linaro.org>

It seems the SMT siblings of an Intel core share a single TLB pool, so
flushing the TLB on both siblings just costs an extra, useless IPI plus
an extra flush. Worse, the extra flush evicts TLB entries the other
sibling has just loaded. That is a double waste. Micro benchmarking
shows memory access saving about 25% of its time on my Haswell i7
desktop. (A minimal userspace sketch of the sibling-pruning loop
follows the patch below.)

munmap source code is here:
https://lkml.org/lkml/2012/5/17/59

test result on Kernel v4.5.0:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        18,739,808      dTLB-load-misses     #  2.47% of all dTLB cache hits  (43.05%)
       757,380,911      dTLB-loads                                            (34.34%)
         2,125,275      dTLB-store-misses                                     (32.23%)
       318,307,759      dTLB-stores                                           (46.32%)
            32,765      iTLB-load-misses     #  2.03% of all iTLB cache hits  (56.90%)
         1,616,237      iTLB-loads                                            (44.47%)
            41,476      tlb:tlb_flush

       1.443484546 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262

test result on Kernel v4.5.0 + this patch:

$ /home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time

 Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':

        15,984,772      dTLB-load-misses     #  1.89% of all dTLB cache hits  (41.72%)
       844,099,241      dTLB-loads                                            (33.30%)
         1,328,102      dTLB-store-misses                                     (52.13%)
       280,902,875      dTLB-stores                                           (52.03%)
            27,678      iTLB-load-misses     #  1.67% of all iTLB cache hits  (35.35%)
         1,659,550      iTLB-loads                                            (38.38%)
            25,137      tlb:tlb_flush

       1.428880301 seconds time elapsed

/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912

The remote flush IPIs received drop roughly in half (32262 -> 15912),
as expected when each SMT core now gets one IPI instead of two, and the
per-access cost falls from 20ns to 16ns.

BTW, this behavior isn't architecturally guaranteed.

Signed-off-by: Alex Shi
Cc: Andrew Morton
To: linux-kernel@vger.kernel.org
To: Mel Gorman
To: x86@kernel.org
Peter Anvin" To: Thomas Gleixner Cc: Andy Lutomirski Cc: Rik van Riel Cc: Alex Shi --- arch/x86/mm/tlb.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 8f4cc3d..6510316 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, unsigned long start, unsigned long end) { + int cpu; struct flush_tlb_info info; + cpumask_t flush_mask, *sblmask; + info.flush_mm = mm; info.flush_start = start; info.flush_end = end; @@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask, &info, 1); return; } - smp_call_function_many(cpumask, flush_tlb_func, &info, 1); + + if (unlikely(smp_num_siblings <= 1)) { + smp_call_function_many(cpumask, flush_tlb_func, &info, 1); + return; + } + + /* Only one flush needed on both siblings of SMT */ + cpumask_copy(&flush_mask, cpumask); + for_each_cpu(cpu, &flush_mask) { + sblmask = topology_sibling_cpumask(cpu); + if (!cpumask_subset(sblmask, &flush_mask)) + continue; + + cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask); + } + + smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1); } void flush_tlb_current_task(void) -- 2.7.2.333.g70bd996