2014-07-01 16:50:35

by Dave Hansen

Subject: [PATCH 0/7] [RESEND][v4] x86: rework tlb range flushing code

x86 Maintainers,

Could this get picked up in to the x86 tree, please? That way,
it will get plenty of time to bake before the 3.17 merge window.

Changes from v3:
* Include the patch I was using to gather detailed statistics
about the length of the ranged TLB flushes
* Fix some documentation typos
* Add a patch to rework the remote tlb flush code to plumb the
tracepoints in easier, and add missing tracepoints
* use __print_symbolic() for the human-readable tracepoint
descriptions
* change an int to bool in patch 1
* Specifically call out that we removed itlb vs. dtlb logic

Changes from v2:
* Added a brief comment above the ceiling tunable
* Updated the documentation to mention large pages and say
"individual flush" instead of invlpg in most cases.

I guess the x86 tree is probably the right place to queue this
up.

I've run this through a variety of systems in the LKP harness,
as well as running it on my desktop for a few days. So far, I have
not seen any performance regressions (or gains) show up.

Without the last (instrumentation/debugging) patch:

arch/x86/include/asm/mmu_context.h | 6 ++
arch/x86/include/asm/processor.h | 1
arch/x86/kernel/cpu/amd.c | 7 --
arch/x86/kernel/cpu/common.c | 13 ----
arch/x86/kernel/cpu/intel.c | 26 ---------
arch/x86/mm/tlb.c | 106 ++++++++++++++++++-------------------
include/linux/mm_types.h | 8 ++
7 files changed, 68 insertions(+), 99 deletions(-)
[davehans@viggo linux.git]$

--

I originally went to look at this because I realized that newer
CPUs were not present in the intel_tlb_flushall_shift_set() code.

I went to try to figure out where to stick newer CPUs (do we
consider them more like SandyBridge or IvyBridge), and was not
able to repeat the original experiments.

Instead, this set does:
1. Rework the code a bit to ready it for tracepoints
2. Add tracepoints
3. Add a new tunable and set it to a sane value


2014-07-01 16:48:50

by Dave Hansen

Subject: [PATCH 1/7] x86: mm: clean up tlb flushing code


From: Dave Hansen <[email protected]>

The

if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels. This eliminates
one of the copy-n-paste versions. It also gives us a unified
exit point for each path through this function. We need this in
a minute for our tracepoint.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

b/arch/x86/mm/tlb.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-06-30 16:18:05.525535882 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:05.535536335 -0700
@@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
+ bool need_flush_others_all = true;
unsigned long addr;
unsigned act_entries, tlb_entries = 0;
unsigned long nr_base_pages;

preempt_disable();
if (current->active_mm != mm)
- goto flush_all;
+ goto out;

if (!current->mm) {
leave_mm(smp_processor_id());
- goto flush_all;
+ goto out;
}

if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
|| vmflag & VM_HUGETLB) {
local_flush_tlb();
- goto flush_all;
+ goto out;
}

/* In modern CPU, last level tlb used for both data/ins */
@@ -196,22 +197,20 @@ void flush_tlb_mm_range(struct mm_struct
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
+ need_flush_others_all = false;
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}
-
- if (cpumask_any_but(mm_cpumask(mm),
- smp_processor_id()) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), mm, start, end);
- preempt_enable();
- return;
}
-
-flush_all:
+out:
+ if (need_flush_others_all) {
+ start = 0UL;
+ end = TLB_FLUSH_ALL;
+ }
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+ flush_tlb_others(mm_cpumask(mm), mm, start, end);
preempt_enable();
}

_

2014-07-01 16:49:00

by Dave Hansen

Subject: [PATCH 3/7] x86: mm: fix missed global TLB flush stat


From: Dave Hansen <[email protected]>

If we take the

if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto out;
}

path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

b/arch/x86/mm/tlb.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff -puN arch/x86/mm/tlb.c~fix-missed-global-flush-stat arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~fix-missed-global-flush-stat 2014-06-30 16:18:26.857507178 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:26.864507497 -0700
@@ -164,8 +164,9 @@ unsigned long tlb_single_page_flush_ceil
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
- int need_flush_others_all = 1;
unsigned long addr;
+ /* do a global flush by default */
+ unsigned long base_pages_to_flush = TLB_FLUSH_ALL;

preempt_disable();
if (current->active_mm != mm)
@@ -176,16 +177,14 @@ void flush_tlb_mm_range(struct mm_struct
goto out;
}

- if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
- local_flush_tlb();
- goto out;
- }
+ if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
+ base_pages_to_flush = (end - start) >> PAGE_SHIFT;

- if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
+ if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
+ base_pages_to_flush = TLB_FLUSH_ALL;
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
- need_flush_others_all = 0;
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
@@ -193,7 +192,7 @@ void flush_tlb_mm_range(struct mm_struct
}
}
out:
- if (need_flush_others_all) {
+ if (base_pages_to_flush == TLB_FLUSH_ALL) {
start = 0UL;
end = TLB_FLUSH_ALL;
}
_

2014-07-01 16:49:16

by Dave Hansen

Subject: [PATCH 6/7] x86: mm: new tunable for single vs full TLB flush


From: Dave Hansen <[email protected]>

Most of the logic here is in the documentation file. Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler. I challenge anyone to describe in one
sentence how the old one worked. Here's the way the new one
works:

If we are flushing more pages than the ceiling, we use
the full flush, otherwise we use per-page flushes.
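
Concretely, the check (as it lands in flush_tlb_mm_range() via
patch 3 of this series) boils down to the sketch below. The
variable and helper names are the real kernel ones; the standalone
wrapper function is only illustrative:

    /* Sketch of the ceiling check; see flush_tlb_mm_range() in patch 3. */
    static void flush_range_sketch(unsigned long start, unsigned long end)
    {
        unsigned long base_pages_to_flush = (end - start) >> PAGE_SHIFT;
        unsigned long addr;

        if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
            /* over the ceiling: one CR3 write flushes everything */
            local_flush_tlb();
        } else {
            /* at or under the ceiling: one invlpg per base page */
            for (addr = start; addr < end; addr += PAGE_SIZE)
                __flush_tlb_single(addr);
        }
    }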

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

b/Documentation/x86/tlb.txt | 75 ++++++++++++++++++++++++++++++++++++++++++++
b/arch/x86/mm/tlb.c | 46 ++++++++++++++++++++++++++
2 files changed, 121 insertions(+)

diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush 2014-06-30 16:18:42.105201432 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:42.111201706 -0700
@@ -265,3 +265,49 @@ void flush_tlb_kernel_range(unsigned lon
on_each_cpu(do_kernel_range_flush, &info, 1);
}
}
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char buf[32];
+ unsigned int len;
+
+ len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[32];
+ ssize_t len;
+ int ceiling;
+
+ len = min(count, sizeof(buf) - 1);
+ if (copy_from_user(buf, user_buf, len))
+ return -EFAULT;
+
+ buf[len] = '\0';
+ if (kstrtoint(buf, 0, &ceiling))
+ return -EINVAL;
+
+ if (ceiling < 0)
+ return -EINVAL;
+
+ tlb_single_page_flush_ceiling = ceiling;
+ return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+ .read = tlbflush_read_file,
+ .write = tlbflush_write_file,
+ .llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+ debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+ arch_debugfs_dir, NULL, &fops_tlbflush);
+ return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
diff -puN /dev/null Documentation/x86/tlb.txt
--- /dev/null 2014-04-10 11:28:14.066815724 -0700
+++ b/Documentation/x86/tlb.txt 2014-06-30 16:18:42.112201751 -0700
@@ -0,0 +1,75 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence. This is
+ a quick operation, but it causes collateral damage: TLB entries
+ from areas other than the one we are trying to flush will be
+ destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+ time. This could potentially cost many more instructions, but
+ it is a much more precise operation, causing no collateral
+ damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed. A flush of the entire
+ address space is obviously better performed by flushing the
+ entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB. If the TLB is empty, then there will
+ be no collateral damage caused by doing the global flush, and
+ all of the individual flushes will have ended up being wasted
+ work.
+ 3. The size of the TLB. The larger the TLB, the more collateral
+ damage we do with a full flush. So, the larger the TLB, the
+ more attractive an individual flush looks. Data and
+ instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture. The TLB has become a multi-level
+ cache on modern CPUs, and the global flushes have become more
+ expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush. The
+sizes of the flush will vary greatly depending on the workload as
+well. There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles. If you believe that individual invalidations are being
+called too often, you can lower the tunable:
+
+ /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of the individual flushes.
+Setting it to 1 is a very conservative setting and it should
+never need to be 0 under normal circumstances.
+
+Despite the fact that a single individual flush on x86 is
+guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
+flushes. THP is treated exactly the same as normal memory.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+ cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+ cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+ cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+ cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+ cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+ cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
+may have differently-named counters, but they should at least
+be there in some form. You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
+1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
+ says: "One execution of INVLPG is sufficient even for a page
+ with size greater than 4 KBytes."
_

2014-07-01 16:49:34

by Dave Hansen

Subject: [PATCH 7/7] x86: mm: set TLB flush tunable to sane value (33)


From: Dave Hansen <[email protected]>

This has been run through Intel's LKP tests across a wide range
of modern systems and workloads and it wasn't shown to make a
measurable performance difference positive or negative.

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page. It breaks down like this:

size percent percent<=
V V V
GLOBAL: 2.20% 2.20% avg cycles: 2283
1: 56.92% 59.12% avg cycles: 1276
2: 13.78% 72.90% avg cycles: 1505
3: 8.26% 81.16% avg cycles: 1880
4: 7.41% 88.58% avg cycles: 2447
5: 1.73% 90.31% avg cycles: 2358
6: 1.32% 91.63% avg cycles: 2563
7: 1.14% 92.77% avg cycles: 2862
8: 0.62% 93.39% avg cycles: 3542
9: 0.08% 93.47% avg cycles: 3289
10: 0.43% 93.90% avg cycles: 3570
11: 0.20% 94.10% avg cycles: 3767
12: 0.08% 94.18% avg cycles: 3996
13: 0.03% 94.20% avg cycles: 4077
14: 0.02% 94.23% avg cycles: 4836
15: 0.04% 94.26% avg cycles: 5699
16: 0.06% 94.32% avg cycles: 5041
17: 0.57% 94.89% avg cycles: 5473
18: 0.02% 94.91% avg cycles: 5396
19: 0.03% 94.95% avg cycles: 5296
20: 0.02% 94.96% avg cycles: 6749
21: 0.18% 95.14% avg cycles: 6225
22: 0.01% 95.15% avg cycles: 6393
23: 0.01% 95.16% avg cycles: 6861
24: 0.12% 95.28% avg cycles: 6912
25: 0.05% 95.32% avg cycles: 7190
26: 0.01% 95.33% avg cycles: 7793
27: 0.01% 95.34% avg cycles: 7833
28: 0.01% 95.35% avg cycles: 8253
29: 0.08% 95.42% avg cycles: 8024
30: 0.03% 95.45% avg cycles: 9670
31: 0.01% 95.46% avg cycles: 8949
32: 0.01% 95.46% avg cycles: 9350
33: 3.11% 98.57% avg cycles: 8534
34: 0.02% 98.60% avg cycles: 10977
35: 0.02% 98.62% avg cycles: 11400

We get into diminishing returns pretty quickly. On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge. That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB. Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version. Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
tracing. Probably counts for a couple hundred cycles in each of
these tests, but it should be fairly _even_ across all of them.
The smallest delta between the tracepoints I have ever seen is
335 cycles. This is one reason the cycles/page cost goes down in
general as the flushes get larger. The true cost is nearer to
100 cycles.
2. A full flush is more expensive than a single invlpg, but not
by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles). At those rates, refilling the 512-entry dTLB takes
22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
invlpg operations (the arithmetic is spelled out just after this
list). But, the odds are that the TLB can actually be filled up
faster than that because TLB misses that are close in time also
tend to leverage the same caches.
6. ~98% of flushes are <=33 pages. There are a lot of flushes of
33 pages, probably because libc's M_TRIM_THRESHOLD is set to
128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
vs. SandyBridge tunable by a factor of 16
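
To spell out the arithmetic behind points 3 and 4 (these are just
the numbers above rearranged; the ~260-cycle invlpg figure is
inferred from the cycles/page column in the table further down):

    512 dTLB entries * ~45 cycles per miss  ~= 23,000 cycles to refill
    ~22,000 cycles / ~260 cycles per invlpg ~= 85 invlpg operations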

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

7,720,030,970 dtlb_load_misses_walk_duration [57.13%]
169,856,353 dtlb_load_misses_walk_completed [57.15%]
708,832,859 dtlb_store_misses_walk_duration [57.17%]
19,346,823 dtlb_store_misses_walk_completed [57.17%]
2,779,687,402 itlb_misses_walk_duration [57.15%]
82,241,148 itlb_misses_walk_completed [57.13%]
770,717 itlb_itlb_flush [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles). At those rates, refilling the 512-entry dTLB takes
22,000 cycles. On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g'
| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
/itlb_misses_walk_completed/ { imiss+=$1 }
/dtlb_.*_walk_duration/ { dcyc+=$1 }
/dtlb_.*.*completed/ { dmiss+=$1 }
END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns. Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
Each additional invlpg operation seems to lengthen the flush
range operation by about 200 cycles. Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714
11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145

How to generate this table:

echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages. Only data points with >=50 samples were printed:

Flush % of %<=
in flush this
pages es size
------------------------------------------------------------------------------
-1: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960
1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895
2: 13.78% 72.90% avg cycles: 1505 cycles/page: 752 samples: 150335
3: 8.26% 81.16% avg cycles: 1880 cycles/page: 626 samples: 90131
4: 7.41% 88.58% avg cycles: 2447 cycles/page: 611 samples: 80877
5: 1.73% 90.31% avg cycles: 2358 cycles/page: 471 samples: 18885
6: 1.32% 91.63% avg cycles: 2563 cycles/page: 427 samples: 14397
7: 1.14% 92.77% avg cycles: 2862 cycles/page: 408 samples: 12441
8: 0.62% 93.39% avg cycles: 3542 cycles/page: 442 samples: 6721
9: 0.08% 93.47% avg cycles: 3289 cycles/page: 365 samples: 917
10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714
11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145
12: 0.08% 94.18% avg cycles: 3996 cycles/page: 333 samples: 864
13: 0.03% 94.20% avg cycles: 4077 cycles/page: 313 samples: 289
14: 0.02% 94.23% avg cycles: 4836 cycles/page: 345 samples: 236
15: 0.04% 94.26% avg cycles: 5699 cycles/page: 379 samples: 390
16: 0.06% 94.32% avg cycles: 5041 cycles/page: 315 samples: 643
17: 0.57% 94.89% avg cycles: 5473 cycles/page: 321 samples: 6229
18: 0.02% 94.91% avg cycles: 5396 cycles/page: 299 samples: 224
19: 0.03% 94.95% avg cycles: 5296 cycles/page: 278 samples: 367
20: 0.02% 94.96% avg cycles: 6749 cycles/page: 337 samples: 185
21: 0.18% 95.14% avg cycles: 6225 cycles/page: 296 samples: 1964
22: 0.01% 95.15% avg cycles: 6393 cycles/page: 290 samples: 83
23: 0.01% 95.16% avg cycles: 6861 cycles/page: 298 samples: 61
24: 0.12% 95.28% avg cycles: 6912 cycles/page: 288 samples: 1307
25: 0.05% 95.32% avg cycles: 7190 cycles/page: 287 samples: 533
26: 0.01% 95.33% avg cycles: 7793 cycles/page: 299 samples: 94
27: 0.01% 95.34% avg cycles: 7833 cycles/page: 290 samples: 66
28: 0.01% 95.35% avg cycles: 8253 cycles/page: 294 samples: 73
29: 0.08% 95.42% avg cycles: 8024 cycles/page: 276 samples: 846
30: 0.03% 95.45% avg cycles: 9670 cycles/page: 322 samples: 296
31: 0.01% 95.46% avg cycles: 8949 cycles/page: 288 samples: 79
32: 0.01% 95.46% avg cycles: 9350 cycles/page: 292 samples: 60
33: 3.11% 98.57% avg cycles: 8534 cycles/page: 258 samples: 33936
34: 0.02% 98.60% avg cycles: 10977 cycles/page: 322 samples: 268
35: 0.02% 98.62% avg cycles: 11400 cycles/page: 325 samples: 177
36: 0.01% 98.63% avg cycles: 11504 cycles/page: 319 samples: 161
37: 0.02% 98.65% avg cycles: 11596 cycles/page: 313 samples: 182
38: 0.02% 98.66% avg cycles: 11850 cycles/page: 311 samples: 195
39: 0.01% 98.68% avg cycles: 12158 cycles/page: 311 samples: 128
40: 0.01% 98.68% avg cycles: 11626 cycles/page: 290 samples: 78
41: 0.04% 98.73% avg cycles: 11435 cycles/page: 278 samples: 477
42: 0.01% 98.73% avg cycles: 12571 cycles/page: 299 samples: 74
43: 0.01% 98.74% avg cycles: 12562 cycles/page: 292 samples: 78
44: 0.01% 98.75% avg cycles: 12991 cycles/page: 295 samples: 108
45: 0.01% 98.76% avg cycles: 13169 cycles/page: 292 samples: 78
46: 0.02% 98.78% avg cycles: 12891 cycles/page: 280 samples: 261
47: 0.01% 98.79% avg cycles: 13099 cycles/page: 278 samples: 67
48: 0.01% 98.80% avg cycles: 13851 cycles/page: 288 samples: 77
49: 0.01% 98.80% avg cycles: 13749 cycles/page: 280 samples: 66
50: 0.01% 98.81% avg cycles: 13949 cycles/page: 278 samples: 73
52: 0.00% 98.82% avg cycles: 14243 cycles/page: 273 samples: 52
54: 0.01% 98.83% avg cycles: 15312 cycles/page: 283 samples: 87
55: 0.01% 98.84% avg cycles: 15197 cycles/page: 276 samples: 109
56: 0.02% 98.86% avg cycles: 15234 cycles/page: 272 samples: 208
57: 0.00% 98.86% avg cycles: 14888 cycles/page: 261 samples: 53
58: 0.01% 98.87% avg cycles: 15037 cycles/page: 259 samples: 59
59: 0.01% 98.87% avg cycles: 15752 cycles/page: 266 samples: 63
62: 0.00% 98.89% avg cycles: 16222 cycles/page: 261 samples: 54
64: 0.02% 98.91% avg cycles: 17179 cycles/page: 268 samples: 248
65: 0.12% 99.03% avg cycles: 18762 cycles/page: 288 samples: 1324
85: 0.00% 99.10% avg cycles: 21649 cycles/page: 254 samples: 50
127: 0.01% 99.18% avg cycles: 32397 cycles/page: 255 samples: 75
128: 0.13% 99.31% avg cycles: 31711 cycles/page: 247 samples: 1466
129: 0.18% 99.49% avg cycles: 33017 cycles/page: 255 samples: 1927
181: 0.33% 99.84% avg cycles: 2489 cycles/page: 13 samples: 3547
256: 0.05% 99.91% avg cycles: 2305 cycles/page: 9 samples: 550
512: 0.03% 99.95% avg cycles: 2133 cycles/page: 4 samples: 304
1512: 0.01% 99.99% avg cycles: 3038 cycles/page: 2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system. It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

10,873,007,282 dtlb_load_misses_walk_duration
250,711,333 dtlb_load_misses_walk_completed
1,212,395,865 dtlb_store_misses_walk_duration
31,615,772 dtlb_store_misses_walk_completed
5,091,010,274 itlb_misses_walk_duration
163,193,511 itlb_misses_walk_completed
1,321,980 itlb_itlb_flush

10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

b/arch/x86/mm/tlb.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value 2014-06-30 16:19:41.069885885 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:19:41.074886112 -0700
@@ -164,8 +164,17 @@ void flush_tlb_current_task(void)
preempt_enable();
}

-/* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 1;
+/*
+ * See Documentation/x86/tlb.txt for details. We choose 33
+ * because it is large enough to cover the vast majority (at
+ * least 95%) of allocations, and is small enough that we are
+ * confident it will not cause too much overhead. Each single
+ * flush is about 100 ns, so this caps the maximum overhead at
+ * _about_ 3,000 ns.
+ *
+ * This is in units of pages.
+ */
+unsigned long tlb_single_page_flush_ceiling = 33;

void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
_

2014-07-01 16:48:57

by Dave Hansen

Subject: [PATCH 5/7] x86: mm: add tracepoints for TLB flushes


Changes from v3:
* remove trace_tlb.c and use __print_symbolic() instead
* make sure to cover all cases in flush_tlb_func()
* remove _DONE "reason" since it was not precise enough

--

From: Dave Hansen <[email protected]>

We don't have any good way to figure out what kinds of flushes
are being attempted. Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.

This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
---

b/arch/x86/include/asm/mmu_context.h | 6 +++++
b/arch/x86/mm/init.c | 7 +++++
b/arch/x86/mm/tlb.c | 11 +++++++--
b/include/linux/mm_types.h | 8 ++++++
b/include/trace/events/tlb.h | 41 +++++++++++++++++++++++++++++++++++
5 files changed, 71 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-06-30 16:18:33.294800281 -0700
+++ b/arch/x86/include/asm/mmu_context.h 2014-06-30 16:18:33.306800827 -0700
@@ -3,6 +3,10 @@

#include <asm/desc.h>
#include <linux/atomic.h>
+#include <linux/mm_types.h>
+
+#include <trace/events/tlb.h>
+
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
@@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s

/* Re-load page tables */
load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

/* Stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev));
@@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
load_LDT_nolock(&next->context);
}
}
diff -puN arch/x86/mm/init.c~tlb-trace-flushes arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~tlb-trace-flushes 2014-06-30 16:18:33.298800464 -0700
+++ b/arch/x86/mm/init.c 2014-06-30 16:18:37.515992478 -0700
@@ -18,6 +18,13 @@
#include <asm/dma.h> /* for MAX_DMA_PFN */
#include <asm/microcode.h>

+/*
+ * We need to define the tracepoints somewhere, and tlb.c
+ * is only compiled when SMP=y.
+ */
+#define CREATE_TRACE_POINTS
+#include <trace/events/tlb.h>
+
#include "mm_internal.h"

static unsigned long __initdata pgt_buf_start;
diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~tlb-trace-flushes 2014-06-30 16:18:33.301800600 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:33.307800873 -0700
@@ -49,6 +49,7 @@ void leave_mm(int cpu)
if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
load_cr3(swapper_pg_dir);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -107,15 +108,19 @@ static void flush_tlb_func(void *info)

count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
- if (f->flush_end == TLB_FLUSH_ALL)
+ if (f->flush_end == TLB_FLUSH_ALL) {
local_flush_tlb();
- else {
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+ } else {
unsigned long addr;
+ unsigned long nr_pages =
+ (f->flush_end - f->flush_start) / PAGE_SIZE;
addr = f->flush_start;
while (addr < f->flush_end) {
__flush_tlb_single(addr);
addr += PAGE_SIZE;
}
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
}
} else
leave_mm(smp_processor_id());
@@ -153,6 +158,7 @@ void flush_tlb_current_task(void)

count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
+ trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
preempt_enable();
@@ -191,6 +197,7 @@ void flush_tlb_mm_range(struct mm_struct
__flush_tlb_single(addr);
}
}
+ trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
out:
if (base_pages_to_flush == TLB_FLUSH_ALL) {
start = 0UL;
diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h
--- a/include/linux/mm_types.h~tlb-trace-flushes 2014-06-30 16:18:33.303800691 -0700
+++ b/include/linux/mm_types.h 2014-06-30 16:18:35.089882014 -0700
@@ -516,4 +516,12 @@ struct vm_special_mapping
struct page **pages;
};

+enum tlb_flush_reason {
+ TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_REMOTE_SHOOTDOWN,
+ TLB_LOCAL_SHOOTDOWN,
+ TLB_LOCAL_MM_SHOOTDOWN,
+ NR_TLB_FLUSH_REASONS,
+};
+
#endif /* _LINUX_MM_TYPES_H */
diff -puN /dev/null include/trace/events/tlb.h
--- /dev/null 2014-04-10 11:28:14.066815724 -0700
+++ b/include/trace/events/tlb.h 2014-06-30 16:18:37.476990703 -0700
@@ -0,0 +1,41 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tlb
+
+#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TLB_H
+
+#include <linux/mm_types.h>
+#include <linux/tracepoint.h>
+
+#define TLB_FLUSH_REASON \
+ { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" }, \
+ { TLB_REMOTE_SHOOTDOWN, "remote shootdown" }, \
+ { TLB_LOCAL_SHOOTDOWN, "local shootdown" }, \
+ { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" }
+
+TRACE_EVENT(tlb_flush,
+
+ TP_PROTO(int reason, unsigned long pages),
+ TP_ARGS(reason, pages),
+
+ TP_STRUCT__entry(
+ __field( int, reason)
+ __field(unsigned long, pages)
+ ),
+
+ TP_fast_assign(
+ __entry->reason = reason;
+ __entry->pages = pages;
+ ),
+
+ TP_printk("pages:%ld reason:%s (%d)",
+ __entry->pages,
+ __print_symbolic(__entry->reason, TLB_FLUSH_REASON),
+ __entry->reason)
+);
+
+#endif /* _TRACE_TLB_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
_

2014-07-01 16:50:38

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 4/7] x86: mm: unify remote invlpg code


From: Dave Hansen <[email protected]>

There are currently three paths through the remote flush code:

1. full invalidation
2. single page invalidation using invlpg
3. ranged invalidation using invlpg

This takes 2 and 3 and combines them into a single path by
making the single-page case simply pass a range from start to
start plus a single page. This makes placement of our tracepoint easier.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
---

b/arch/x86/mm/tlb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/tlb.c~x86-tlb-simplify-remote-flush-code arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~x86-tlb-simplify-remote-flush-code 2014-06-30 16:18:28.009559635 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:28.013559817 -0700
@@ -102,13 +102,13 @@ static void flush_tlb_func(void *info)

if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
return;
+ if (!f->flush_end)
+ f->flush_end = f->flush_start + PAGE_SIZE;

count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
if (f->flush_end == TLB_FLUSH_ALL)
local_flush_tlb();
- else if (!f->flush_end)
- __flush_tlb_single(f->flush_start);
else {
unsigned long addr;
addr = f->flush_start;
_

2014-07-01 16:48:54

by Dave Hansen

Subject: [PATCH 2/7] x86: mm: rip out complicated, out-of-date, buggy TLB flushing


From: Dave Hansen <[email protected]>

I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy. It uses mm->total_vm to judge the
task's footprint in the TLB. It should certainly be using
some measure of RSS, *NOT* ->total_vm since only resident
memory can populate the TLB.
2. Haswell, and several other CPUs are missing from the
intel_tlb_flushall_shift_set() function. Thus, it has been
demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] tlb_flushall_shift: 6
Which leads it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
http://lkml.kernel.org/r/[email protected]
(more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
I believe the sample times were too short. Running the
benchmark in a loop yields times that vary quite a bit.

Note that this leaves us with a static ceiling of 1 page. This
is a conservative, dumb setting, and will be revised in a later
patch.

This also removes the code which attempts to predict whether we
are flushing data or instructions. We expect instruction flushes
to be relatively rare and not worth tuning for explicitly.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---

b/arch/x86/include/asm/processor.h | 1
b/arch/x86/kernel/cpu/amd.c | 7 --
b/arch/x86/kernel/cpu/common.c | 13 -----
b/arch/x86/kernel/cpu/intel.c | 26 -----------
b/arch/x86/mm/tlb.c | 87 ++++---------------------------------
5 files changed, 13 insertions(+), 121 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-06-30 16:18:14.284934720 -0700
+++ b/arch/x86/include/asm/processor.h 2014-06-30 16:18:14.300935449 -0700
@@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I
extern u16 __read_mostly tlb_lld_2m[NR_INFO];
extern u16 __read_mostly tlb_lld_4m[NR_INFO];
extern u16 __read_mostly tlb_lld_1g[NR_INFO];
-extern s8 __read_mostly tlb_flushall_shift;

/*
* CPU type and hardware bug flags. Kept separately for each CPU.
diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c
--- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-06-30 16:18:14.285934765 -0700
+++ b/arch/x86/kernel/cpu/amd.c 2014-06-30 16:18:14.300935449 -0700
@@ -741,11 +741,6 @@ static unsigned int amd_size_cache(struc
}
#endif

-static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c)
-{
- tlb_flushall_shift = 6;
-}
-
static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
{
u32 ebx, eax, ecx, edx;
@@ -793,8 +788,6 @@ static void cpu_detect_tlb_amd(struct cp
tlb_lli_2m[ENTRIES] = eax & mask;

tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
-
- cpu_set_tlb_flushall_shift(c);
}

static const struct cpu_dev amd_cpu_dev = {
diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-06-30 16:18:14.287934857 -0700
+++ b/arch/x86/kernel/cpu/common.c 2014-06-30 16:18:15.537991775 -0700
@@ -481,26 +481,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO];
u16 __read_mostly tlb_lld_4m[NR_INFO];
u16 __read_mostly tlb_lld_1g[NR_INFO];

-/*
- * tlb_flushall_shift shows the balance point in replacing cr3 write
- * with multiple 'invlpg'. It will do this replacement when
- * flush_tlb_lines <= active_lines/2^tlb_flushall_shift.
- * If tlb_flushall_shift is -1, means the replacement will be disabled.
- */
-s8 __read_mostly tlb_flushall_shift = -1;
-
void cpu_detect_tlb(struct cpuinfo_x86 *c)
{
if (this_cpu->c_detect_tlb)
this_cpu->c_detect_tlb(c);

printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n"
- "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n"
- "tlb_flushall_shift: %d\n",
+ "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n",
tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES],
- tlb_lld_1g[ENTRIES], tlb_flushall_shift);
+ tlb_lld_1g[ENTRIES]);
}

void detect_ht(struct cpuinfo_x86 *c)
diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-06-30 16:18:14.289934948 -0700
+++ b/arch/x86/kernel/cpu/intel.c 2014-06-30 16:18:15.698999106 -0700
@@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsig
}
}

-static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
-{
- switch ((c->x86 << 8) + c->x86_model) {
- case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
- case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
- case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
- case 0x61d: /* six-core 45 nm xeon "Dunnington" */
- tlb_flushall_shift = -1;
- break;
- case 0x63a: /* Ivybridge */
- tlb_flushall_shift = 2;
- break;
- case 0x61a: /* 45 nm nehalem, "Bloomfield" */
- case 0x61e: /* 45 nm nehalem, "Lynnfield" */
- case 0x625: /* 32 nm nehalem, "Clarkdale" */
- case 0x62c: /* 32 nm nehalem, "Gulftown" */
- case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
- case 0x62f: /* 32 nm Xeon E7 */
- case 0x62a: /* SandyBridge */
- case 0x62d: /* SandyBridge, "Romely-EP" */
- default:
- tlb_flushall_shift = 6;
- }
-}
-
static void intel_detect_tlb(struct cpuinfo_x86 *c)
{
int i, j, n;
@@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpui
for (j = 1 ; j < 16 ; j++)
intel_tlb_lookup(desc[j]);
}
- intel_tlb_flushall_shift_set(c);
}

static const struct cpu_dev intel_cpu_dev = {
diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-06-30 16:18:14.292935085 -0700
+++ b/arch/x86/mm/tlb.c 2014-06-30 16:18:18.233114490 -0700
@@ -158,13 +158,14 @@ void flush_tlb_current_task(void)
preempt_enable();
}

+/* in units of pages */
+unsigned long tlb_single_page_flush_ceiling = 1;
+
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
- bool need_flush_others_all = true;
+ int need_flush_others_all = 1;
unsigned long addr;
- unsigned act_entries, tlb_entries = 0;
- unsigned long nr_base_pages;

preempt_disable();
if (current->active_mm != mm)
@@ -175,29 +176,16 @@ void flush_tlb_mm_range(struct mm_struct
goto out;
}

- if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
- || vmflag & VM_HUGETLB) {
+ if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto out;
}

- /* In modern CPU, last level tlb used for both data/ins */
- if (vmflag & VM_EXEC)
- tlb_entries = tlb_lli_4k[ENTRIES];
- else
- tlb_entries = tlb_lld_4k[ENTRIES];
-
- /* Assume all of TLB entries was occupied by this task */
- act_entries = tlb_entries >> tlb_flushall_shift;
- act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
- nr_base_pages = (end - start) >> PAGE_SHIFT;
-
- /* tlb_flushall_shift is on balance point, details in commit log */
- if (nr_base_pages > act_entries) {
+ if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
- need_flush_others_all = false;
+ need_flush_others_all = 0;
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
@@ -259,68 +247,15 @@ static void do_kernel_range_flush(void *

void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- unsigned act_entries;
- struct flush_tlb_info info;
-
- /* In modern CPU, last level tlb used for both data/ins */
- act_entries = tlb_lld_4k[ENTRIES];

/* Balance as user space task's flush, a bit conservative */
- if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 ||
- (end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift)
-
+ if (end == TLB_FLUSH_ALL ||
+ (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
on_each_cpu(do_flush_tlb_all, NULL, 1);
- else {
+ } else {
+ struct flush_tlb_info info;
info.flush_start = start;
info.flush_end = end;
on_each_cpu(do_kernel_range_flush, &info, 1);
}
}
-
-#ifdef CONFIG_DEBUG_TLBFLUSH
-static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
- size_t count, loff_t *ppos)
-{
- char buf[32];
- unsigned int len;
-
- len = sprintf(buf, "%hd\n", tlb_flushall_shift);
- return simple_read_from_buffer(user_buf, count, ppos, buf, len);
-}
-
-static ssize_t tlbflush_write_file(struct file *file,
- const char __user *user_buf, size_t count, loff_t *ppos)
-{
- char buf[32];
- ssize_t len;
- s8 shift;
-
- len = min(count, sizeof(buf) - 1);
- if (copy_from_user(buf, user_buf, len))
- return -EFAULT;
-
- buf[len] = '\0';
- if (kstrtos8(buf, 0, &shift))
- return -EINVAL;
-
- if (shift < -1 || shift >= BITS_PER_LONG)
- return -EINVAL;
-
- tlb_flushall_shift = shift;
- return count;
-}
-
-static const struct file_operations fops_tlbflush = {
- .read = tlbflush_read_file,
- .write = tlbflush_write_file,
- .llseek = default_llseek,
-};
-
-static int __init create_tlb_flushall_shift(void)
-{
- debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR,
- arch_debugfs_dir, NULL, &fops_tlbflush);
- return 0;
-}
-late_initcall(create_tlb_flushall_shift);
-#endif
_

2014-07-02 18:24:18

by Dave Hansen

Subject: Re: [PATCH 7/7] x86: mm: set TLB flush tunable to sane value (33)

On 07/02/2014 11:16 AM, David Nellans wrote:
> Intuition here is that invalidate caused refills will almost always
> be serviced from the L2 or better since we've recently walked to
> modify the page needing flush and thus pre-warmed the caches for any
> refill? Or is this an artifact of the flush/refill test setup?

There are lots of caches in place, not just the CPU's normal L1/2/3
memory caches. See "4.10.3 Paging-Structure Caches" in the Intel SDM.
I _believe_ TLB misses can be serviced from these caches and their
purpose is to avoid going out to memory (or the memory caches).

So I think the effect that we're seeing is from _all_ of the caches,
plus prefetching. If you start a prefetch for a TLB miss before you
actually start to run the instruction needing the TLB entry, you will
pay less than the entire cost of going out to memory (or the memory caches).

2014-07-02 18:26:05

by David Nellans

Subject: Re: [PATCH 7/7] x86: mm: set TLB flush tunable to sane value (33)


On 07/01/2014 11:48 AM, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> This has been run through Intel's LKP tests across a wide range
> of modern systems and workloads and it wasn't shown to make a
> measurable performance difference positive or negative.
>
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
>
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page. It breaks down like this:
>
> size percent percent<=
> V V V
> GLOBAL: 2.20% 2.20% avg cycles: 2283
> 1: 56.92% 59.12% avg cycles: 1276
> 2: 13.78% 72.90% avg cycles: 1505
> 3: 8.26% 81.16% avg cycles: 1880
> 4: 7.41% 88.58% avg cycles: 2447
> 5: 1.73% 90.31% avg cycles: 2358
> 6: 1.32% 91.63% avg cycles: 2563
> 7: 1.14% 92.77% avg cycles: 2862
> 8: 0.62% 93.39% avg cycles: 3542
> 9: 0.08% 93.47% avg cycles: 3289
> 10: 0.43% 93.90% avg cycles: 3570
> 11: 0.20% 94.10% avg cycles: 3767
> 12: 0.08% 94.18% avg cycles: 3996
> 13: 0.03% 94.20% avg cycles: 4077
> 14: 0.02% 94.23% avg cycles: 4836
> 15: 0.04% 94.26% avg cycles: 5699
> 16: 0.06% 94.32% avg cycles: 5041
> 17: 0.57% 94.89% avg cycles: 5473
> 18: 0.02% 94.91% avg cycles: 5396
> 19: 0.03% 94.95% avg cycles: 5296
> 20: 0.02% 94.96% avg cycles: 6749
> 21: 0.18% 95.14% avg cycles: 6225
> 22: 0.01% 95.15% avg cycles: 6393
> 23: 0.01% 95.16% avg cycles: 6861
> 24: 0.12% 95.28% avg cycles: 6912
> 25: 0.05% 95.32% avg cycles: 7190
> 26: 0.01% 95.33% avg cycles: 7793
> 27: 0.01% 95.34% avg cycles: 7833
> 28: 0.01% 95.35% avg cycles: 8253
> 29: 0.08% 95.42% avg cycles: 8024
> 30: 0.03% 95.45% avg cycles: 9670
> 31: 0.01% 95.46% avg cycles: 8949
> 32: 0.01% 95.46% avg cycles: 9350
> 33: 3.11% 98.57% avg cycles: 8534
> 34: 0.02% 98.60% avg cycles: 10977
> 35: 0.02% 98.62% avg cycles: 11400
>
> We get into diminishing returns pretty quickly. On pre-IvyBridge
> CPUs, we used to set the limit at 8 pages, and it was set at 128
> on IvyBridge. That 128 number looks pretty silly considering that
> less than 0.5% of the flushes are that large.
>
> The previous code tried to size this number based on the size of
> the TLB. Good idea, but it's error-prone, needs maintenance
> (which it didn't get up to now), and probably would not matter in
> practice much.
>
> Setting it to 33 means that we cover the mallopt
> M_TRIM_THRESHOLD, which is the most universally common size to do
> flushes.
>
> That's the short version. Here's the long one for why I chose 33:
>
> 1. These numbers have a constant bias in the timestamps from the
> tracing. Probably counts for a couple hundred cycles in each of
> these tests, but it should be fairly _even_ across all of them.
> The smallest delta between the tracepoints I have ever seen is
> 335 cycles. This is one reason the cycles/page cost goes down in
> general as the flushes get larger. The true cost is nearer to
> 100 cycles.
> 2. A full flush is more expensive than a single invlpg, but not
> by much (single percentages).
> 3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
> (~34 cycles). At those rates, refilling the 512-entry dTLB takes
> 22,000 cycles.
> 4. 22,000 cycles is approximately the equivalent of doing 85
> invlpg operations. But, the odds are that the TLB can
> actually be filled up faster than that because TLB misses that
> are close in time also tend to leverage the same caches.
> 6. ~98% of flushes are <=33 pages. There are a lot of flushes of
> 33 pages, probably because libc's M_TRIM_THRESHOLD is set to
> 128k (32 pages)
> 7. I've found no consistent data to support changing the IvyBridge
> vs. SandyBridge tunable by a factor of 16
>
> I used the performance counters on this hardware (IvyBridge i5-3320M)
> to figure out the tlb miss costs:
>
> ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush
>
> 7,720,030,970 dtlb_load_misses_walk_duration [57.13%]
> 169,856,353 dtlb_load_misses_walk_completed [57.15%]
> 708,832,859 dtlb_store_misses_walk_duration [57.17%]
> 19,346,823 dtlb_store_misses_walk_completed [57.17%]
> 2,779,687,402 itlb_misses_walk_duration [57.15%]
> 82,241,148 itlb_misses_walk_completed [57.13%]
> 770,717 itlb_itlb_flush [57.11%]
>
> These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
> (~34 cycles). At those rates, refilling the 512-entry dTLB takes
> 22,000 cycles. On a SandyBridge system with more cores and larger
> caches, those are dtlb=13.4ns and itlb=9.5ns.

Intuition here is that invalidate caused refills will almost always be serviced from the L2
or better since we've recently walked to modify the page needing flush and thus pre-warmed the caches
for any refill? Or is this an artifact of the flush/refill test setup? Main mem latency even on Ivybridge is ~100
clocks, worse in previous generations, so to get down to average ~30 cycle refill you basically can never be
missing in the L1 or maybe L2 which seems optimistic.

2014-07-02 19:04:55

by Davidlohr Bueso

Subject: Re: [PATCH 0/7] [RESEND][v4] x86: rework tlb range flushing code

On Tue, 2014-07-01 at 09:48 -0700, Dave Hansen wrote:
> x86 Maintainers,
>
> Could this get picked up in to the x86 tree, please? That way,
> it will get plenty of time to bake before the 3.17 merge window.

I had originally tried out this series (~v1, v2 iirc) on large KVM
configurations. Setting the TLB flush tunable to 33 seemed prudent and
didn't cause anything weird. Feel free to add my:

Tested-by: Davidlohr Bueso <[email protected]>