It seems Intel cores still share the TLB pool between SMT siblings, so
flushing both threads' TLBs just causes an extra, useless IPI and an
extra flush. The extra flush also evicts TLB entries that the sibling
thread has just brought in. That's a double waste.
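The idea in a standalone sketch (illustration only, not part of the
patch: uint64_t bitmaps stand in for cpumask_t, and sibling_of() is a
made-up stand-in for topology_sibling_cpumask(), assuming 2-way SMT
with siblings numbered (0,1), (2,3), ...):

/*
 * Drop one CPU from each fully-present SMT sibling pair so only one
 * flush IPI per core is sent.
 */
#include <stdint.h>
#include <stdio.h>

static int sibling_of(int cpu)
{
	return cpu ^ 1;		/* hypothetical sibling numbering */
}

static uint64_t prune_to_one_sibling(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		int sib = sibling_of(cpu);

		/* Prune only when both siblings need the flush. */
		if ((mask & (1ULL << cpu)) && (mask & (1ULL << sib)))
			mask &= ~(1ULL << (cpu > sib ? cpu : sib));
	}
	return mask;
}

int main(void)
{
	uint64_t mask = 0xff;	/* CPUs 0-7: four SMT pairs */

	/* Prints "before 0xff after 0x55": one sibling kept per core. */
	printf("before %#llx after %#llx\n",
	       (unsigned long long)mask,
	       (unsigned long long)prune_to_one_sibling(mask));
	return 0;
}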
The microbenchmark shows memory access can save about 25% of its time
on my Haswell i7 desktop.
The munmap test's source code is here: https://lkml.org/lkml/2012/5/17/59
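For reference, a minimal sketch of this kind of microbenchmark (the
real test is only linked above; the region size, round count, and
thread count here are assumptions, not the actual parameters):

/*
 * Worker threads hammer their own pages while the main thread
 * repeatedly mmap()s and munmap()s, so each unmap sends remote TLB
 * flush IPIs to the CPUs running the workers.  Build with -pthread.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_LEN   (64 * 4096)	/* region mapped/unmapped each round */
#define ROUNDS    4096		/* munmap iterations (assumed) */
#define NTHREADS  16		/* access threads (assumed) */

static atomic_int stop;
static atomic_long accesses;

static void *access_loop(void *arg)
{
	volatile char *buf = malloc(MAP_LEN);
	long n = 0;

	while (!atomic_load(&stop)) {
		for (size_t i = 0; i < MAP_LEN; i += 4096)
			buf[i]++;	/* one access per page */
		n += MAP_LEN / 4096;
	}
	atomic_fetch_add(&accesses, n);
	free((void *)buf);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, access_loop, NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int r = 0; r < ROUNDS; r++) {
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(p, 1, MAP_LEN);	/* populate the PTEs */
		munmap(p, MAP_LEN);	/* triggers remote TLB flushes */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	atomic_store(&stop, 1);

	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
		  (t1.tv_nsec - t0.tv_nsec);
	printf("munmap: %ld ns/round, accesses: %ld\n",
	       ns / ROUNDS, atomic_load(&accesses));
	return 0;
}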
test result on Kernel v4.5.0:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 57ms 14072ns/time, memory access uses 48356 times/thread/ms, cost 20ns/time
Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':
18,739,808 dTLB-load-misses # 2.47% of all dTLB cache hits (43.05%)
757,380,911 dTLB-loads (34.34%)
2,125,275 dTLB-store-misses (32.23%)
318,307,759 dTLB-stores (46.32%)
32,765 iTLB-load-misses # 2.03% of all iTLB cache hits (56.90%)
1,616,237 iTLB-loads (44.47%)
41,476 tlb:tlb_flush
1.443484546 seconds time elapsed
/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 32262
test result on Kernel v4.5.0 + this patch:
$/home/alexs/bin/perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads -e tlb:tlb_flush munmap -n 64 -t 16
munmap use 48ms 11933ns/time, memory access uses 59966 times/thread/ms, cost 16ns/time
Performance counter stats for '/home/alexs/backups/exec-laptop/tlb/munmap -n 64 -t 16':
15,984,772 dTLB-load-misses # 1.89% of all dTLB cache hits (41.72%)
844,099,241 dTLB-loads (33.30%)
1,328,102 dTLB-store-misses (52.13%)
280,902,875 dTLB-stores (52.03%)
27,678 iTLB-load-misses # 1.67% of all iTLB cache hits (35.35%)
1,659,550 iTLB-loads (38.38%)
25,137 tlb:tlb_flush
1.428880301 seconds time elapsed
/proc/vmstat/nr_tlb_remote_flush increased: 4616
/proc/vmstat/nr_tlb_remote_flush_received increased: 15912
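A quick check on the counters: tlb:tlb_flush drops from 41,476 to
25,137, i.e. about 39% fewer flushes, and nr_tlb_remote_flush_received
drops from 32262 to 15912, roughly half, which is about what skipping
one of the two SMT siblings per core should give.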
BTW, this change isn't architecturally guaranteed.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
To: [email protected]
To: Mel Gorman <[email protected]>
To: [email protected]
To: "H. Peter Anvin" <[email protected]>
To: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Alex Shi <[email protected]>
---
arch/x86/mm/tlb.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3d..6510316 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -134,7 +134,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm, unsigned long start,
unsigned long end)
{
+ int cpu;
struct flush_tlb_info info;
+ cpumask_t flush_mask, *sblmask;
+
info.flush_mm = mm;
info.flush_start = start;
info.flush_end = end;
@@ -151,7 +154,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
&info, 1);
return;
}
- smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+
+ if (unlikely(smp_num_siblings <= 1)) {
+ smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+ return;
+ }
+
+ /* Only one flush needed on both siblings of SMT */
+ cpumask_copy(&flush_mask, cpumask);
+ for_each_cpu(cpu, &flush_mask) {
+ sblmask = topology_sibling_cpumask(cpu);
+ if (!cpumask_subset(sblmask, &flush_mask))
+ continue;
+
+ cpumask_clear_cpu(cpumask_next(cpu, sblmask), &flush_mask);
+ }
+
+ smp_call_function_many(&flush_mask, flush_tlb_func, &info, 1);
}
void flush_tlb_current_task(void)
--
2.7.2.333.g70bd996
On Apr 5, 2016 8:17 PM, "Alex Shi" <[email protected]> wrote:
>
> It seems Intel cores still share the TLB pool between SMT siblings, so
> flushing both threads' TLBs just causes an extra, useless IPI and an
> extra flush. The extra flush also evicts TLB entries that the sibling
> thread has just brought in. That's a double waste.
Do you have a reference in both the SDM and the APM for this?
Do we have a guarantee that this serializes the front end such that
the non-targeted sibling won't execute an instruction that it decoded
from a stale translation?
This will conflict rather deeply with my PCID series, too.
--Andy
On 04/06/2016 12:47 PM, Andy Lutomirski wrote:
> On Apr 5, 2016 8:17 PM, "Alex Shi" <[email protected]> wrote:
>>
>> It seems Intel cores still share the TLB pool between SMT siblings, so
>> flushing both threads' TLBs just causes an extra, useless IPI and an
>> extra flush. The extra flush also evicts TLB entries that the sibling
>> thread has just brought in. That's a double waste.
>
> Do you have a reference in both the SDM and the APM for this?
No. As I said at the end of the commit log, there is no official
guarantee for this usage, but it seems to work widely on Intel CPUs.
And the performance benefit is quite tempting...
Would any Intel folks like to dig into it more? :)
>
> Do we have a guarantee that this serializes the front end such that
> the non-targeted sibling won't execute an instruction that it decoded
> from a stale translation?
Isn't your worry itself evidence for my guess? Otherwise the stale
instruction could just as well execute before the IPI comes in... :)
>
> This will conflict rather deeply with my PCID series, too.
>
> --Andy
>