2014-01-09 14:35:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/5] Fix ebizzy performance regression due to X86 TLB range flush v3

Changelog since v2
o Rebase to v3.13-rc7 to pick up scheduler-related fixes
o Describe methodology in changelog
o Reset tlb flush shift for all models except Ivybridge

Changelog since v1
o Drop a pagetable walk that seems redundant
o Account for TLB flushes only when debugging
o Drop the patch that took number of CPUs to flush into account

ebizzy regressed between 3.4 and 3.10 while testing on a new
machine. Bisection initially found at least three problems of which the
first was commit 611ae8e3 (x86/tlb: enable tlb flush range support for
x86). Second was related to TLB flush accounting. The third was related
to ACPI cpufreq and so it was disabled for the purposes of this series.

The intent of the TLB range flush series was to preserve existing TLB
entries by flushing a range one page at a time instead of flushing the
address space. This makes a certain amount of sense if the address space
being flushed was known to have existing hot entries. The decision on
whether to do a full mm flush or a number of single page flushes depends
on the size of the relevant TLB and how many of these hot entries would
be preserved by a targeted flush. This implicitly assumes a lot including
the following examples

o That the full TLB is in use by the task being flushed
o The TLB has hot entries that are going to be used in the near future
o The TLB has entries for the range being cached
o The cost of the per-page flushes is similar to a single mm flush
o Large pages are unimportant and can always be globally flushed
o Small flushes from workloads are very common

The first three are completely unknowable but unfortunately they are
probably true of micro benchmarks designed to exercise these
paths. The fourth one depends completely on the hardware. The large page
check used to make sense but now the number of entries required to do
a range flush is so small that it is a redundant check. The last one is
the strangest because generally only a process that was mapping/unmapping
very small regions would hit this. It's possible it is the common case
for a virtualised workload that is managing the address space of its
guests. Maybe this was the real original motivation of the TLB range flush
support for x86. If this is the case then the patches need to be revisited
and clearly flagged as being of benefit to virtualisation.
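
For reference, the balance point being discussed boils down to roughly the
sketch below. This is a simplified illustration of the current
flush_tlb_mm_range() logic rather than the exact kernel code; accounting,
preemption handling and the large page check are omitted, and tlb_entries
stands in for the detected number of 4K TLB entries.

/* Simplified sketch of the existing full-flush vs per-page balance point */
static void flush_range_sketch(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
{
        unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
        unsigned long act_entries = mm->total_vm < tlb_entries ?
                                        mm->total_vm : tlb_entries;
        unsigned long addr;

        if (nr_pages > (act_entries >> tlb_flushall_shift)) {
                /* Range too large: dump the whole TLB for this mm */
                local_flush_tlb();
        } else {
                /* Preserve unrelated hot entries: flush page by page */
                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_single(addr);
        }
}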

As things currently stand, Ebizzy sees very little benefit as it discards
newly allocated memory very quickly and regressed badly on Ivybridge where
it constantly flushes ranges of 128 pages one page at a time. Earlier
machines may not have seen this problem as the balance point was at a
different location. While I'm wary of optimising for such a benchmark,
it's commonly tested and it's apparent that the worst case defaults for
Ivybridge need to be re-examined.

The following small series brings ebizzy closer to 3.4-era performance
for the very limited set of machines tested. It does not bring
performance fully back in line but the recent idle power regression
fix has already been identified as regressing ebizzy performance
(http://www.spinics.net/lists/stable/msg31352.html) and would need to be
addressed first. Benchmark results are included in the relevant patch's
changelog.

arch/x86/include/asm/tlbflush.h | 6 ++---
arch/x86/kernel/cpu/amd.c | 5 +---
arch/x86/kernel/cpu/intel.c | 10 +++-----
arch/x86/kernel/cpu/mtrr/generic.c | 4 +--
arch/x86/mm/tlb.c | 52 ++++++++++----------------------------
include/linux/vm_event_item.h | 4 +--
include/linux/vmstat.h | 8 ++++++
7 files changed, 32 insertions(+), 57 deletions(-)

--
1.8.4


2014-01-09 14:35:15

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/5] x86: mm: Account for TLB flushes only when debugging

Bisection between 3.11 and 3.12 fingered commit 9824cf97 (mm: vmstats:
tlb flush counters). The counters are undeniably useful but how often
do we really need to debug TLB flush related issues? It does not justify
taking the penalty everywhere so make it a debugging option.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 6 +++---
arch/x86/kernel/cpu/mtrr/generic.c | 4 ++--
arch/x86/mm/tlb.c | 14 +++++++-------
include/linux/vm_event_item.h | 4 ++--
include/linux/vmstat.h | 8 ++++++++
5 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6d90ba..04905bf 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -62,7 +62,7 @@ static inline void __flush_tlb_all(void)

static inline void __flush_tlb_one(unsigned long addr)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}

@@ -93,13 +93,13 @@ static inline void __flush_tlb_one(unsigned long addr)
*/
static inline void __flush_tlb_up(void)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();
}

static inline void flush_tlb_all(void)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb_all();
}

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index ce2d0a2..0e25a1b 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -683,7 +683,7 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
}

/* Flush all TLBs via a mov %cr3, %reg; mov %reg, %cr3 */
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();

/* Save MTRR state */
@@ -697,7 +697,7 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
static void post_set(void) __releases(set_atomicity_lock)
{
/* Flush TLBs (no need to flush caches - they are disabled) */
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();

/* Intel (P6) standard MTRRs */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ae699b3..05446c1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -103,7 +103,7 @@ static void flush_tlb_func(void *info)
if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
return;

- count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
if (f->flush_end == TLB_FLUSH_ALL)
local_flush_tlb();
@@ -131,7 +131,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
info.flush_start = start;
info.flush_end = end;

- count_vm_event(NR_TLB_REMOTE_FLUSH);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (is_uv_system()) {
unsigned int cpu;

@@ -151,7 +151,7 @@ void flush_tlb_current_task(void)

preempt_disable();

- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
@@ -215,7 +215,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

/* tlb_flushall_shift is on balance point, details in commit log */
if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
if (has_large_page(mm, start, end)) {
@@ -224,7 +224,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
}
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
- count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}

@@ -262,7 +262,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)

static void do_flush_tlb_all(void *info)
{
- count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
leave_mm(smp_processor_id());
@@ -270,7 +270,7 @@ static void do_flush_tlb_all(void *info)

void flush_tlb_all(void)
{
- count_vm_event(NR_TLB_REMOTE_FLUSH);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
on_each_cpu(do_flush_tlb_all, NULL, 1);
}

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index c557c6d..070de3d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,12 +71,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
#endif
-#ifdef CONFIG_SMP
+#ifdef CONFIG_DEBUG_TLBFLUSH
NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */
NR_TLB_REMOTE_FLUSH_RECEIVED,/* cpu received ipi for flush */
-#endif
NR_TLB_LOCAL_FLUSH_ALL,
NR_TLB_LOCAL_FLUSH_ONE,
+#endif
NR_VM_EVENT_ITEMS
};

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e4b9480..80ebba9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -83,6 +83,14 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
#endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_DEBUG_TLBFLUSH
+#define count_vm_tlb_event(x) count_vm_event(x)
+#define count_vm_tlb_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_tlb_event(x) do {} while (0)
+#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
+#endif
+
#define __count_zone_vm_events(item, zone, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
zone_idx(zone), delta)
--
1.8.4

2014-01-09 14:35:23

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/5] x86: mm: Clean up inconsistencies when flushing TLB ranges

NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the
comparison with total_vm is done before taking tlb_flushall_shift into
account. Clean it up.
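
As a hypothetical illustration of the ordering problem: with 512 4K TLB
entries, a tlb_flushall_shift of 2 and a task whose total_vm is only 64
pages, the current code computes a threshold of min(64, 512) >> 2 = 16
pages whereas the corrected code computes min(512 >> 2, 64) = 64 pages,
so a 32-page flush moves from a full TLB flush to 32 individual invlpg
operations. The large page path also calls local_flush_tlb() without
bumping NR_TLB_LOCAL_FLUSH_ALL, which the rewritten condition fixes.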

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
---
arch/x86/mm/tlb.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 05446c1..5176526 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -189,6 +189,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
{
unsigned long addr;
unsigned act_entries, tlb_entries = 0;
+ unsigned long nr_base_pages;

preempt_disable();
if (current->active_mm != mm)
@@ -210,18 +211,17 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
tlb_entries = tlb_lli_4k[ENTRIES];
else
tlb_entries = tlb_lld_4k[ENTRIES];
+
/* Assume all of TLB entries was occupied by this task */
- act_entries = mm->total_vm > tlb_entries ? tlb_entries : mm->total_vm;
+ act_entries = tlb_entries >> tlb_flushall_shift;
+ act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
+ nr_base_pages = (end - start) >> PAGE_SHIFT;

/* tlb_flushall_shift is on balance point, details in commit log */
- if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
+ if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
- if (has_large_page(mm, start, end)) {
- local_flush_tlb();
- goto flush_all;
- }
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
--
1.8.4

2014-01-09 14:36:11

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 5/5] mm: x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge

There was a large ebizzy performance regression that was bisected to commit
611ae8e3 (x86/tlb: enable tlb flush range support for x86). The problem
was related to the tlb_flushall_shift tuning for IvyBridge which was
altered. The problem is that it is not clear whether the tuning values for each
CPU family are correct as the methodology used to tune the values is unclear.

This patch uses a conservative tlb_flushall_shift value for all CPU families
except IvyBridge so the decision can be revisited if any regression is found
as a result of this change. IvyBridge is an exception as testing with one
methodology determined that the value of 2 is acceptable. Details are in the
changelog for the patch "x86: mm: Change tlb_flushall_shift for IvyBridge".
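
To put the values in perspective, assuming a hypothetical 512-entry 4K TLB:
a shift of 6 means individual page flushes are only attempted for ranges of
at most 512 >> 6 = 8 pages, with anything larger falling back to a full
flush, whereas the IvyBridge shift of 2 allows ranges of up to
512 >> 2 = 128 pages to be flushed one page at a time.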

One important aspect of this to watch out for is Xen. The original commit
log mentioned large performance gains on Xen. It's possible Xen is more
sensitive to this value if it flushes small ranges of pages more frequently
than workloads on bare metal typically do.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 5 +----
arch/x86/kernel/cpu/intel.c | 10 +++-------
2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bca023b..7aa2545 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -758,10 +758,7 @@ static unsigned int amd_size_cache(struct cpuinfo_x86 *c, unsigned int size)

static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c)
{
- tlb_flushall_shift = 5;
-
- if (c->x86 <= 0x11)
- tlb_flushall_shift = 4;
+ tlb_flushall_shift = 6;
}

static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index bbe1b8b..d358a39 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -615,21 +615,17 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
case 0x61d: /* six-core 45 nm xeon "Dunnington" */
tlb_flushall_shift = -1;
break;
+ case 0x63a: /* Ivybridge */
+ tlb_flushall_shift = 2;
+ break;
case 0x61a: /* 45 nm nehalem, "Bloomfield" */
case 0x61e: /* 45 nm nehalem, "Lynnfield" */
case 0x625: /* 32 nm nehalem, "Clarkdale" */
case 0x62c: /* 32 nm nehalem, "Gulftown" */
case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
case 0x62f: /* 32 nm Xeon E7 */
- tlb_flushall_shift = 6;
- break;
case 0x62a: /* SandyBridge */
case 0x62d: /* SandyBridge, "Romely-EP" */
- tlb_flushall_shift = 5;
- break;
- case 0x63a: /* Ivybridge */
- tlb_flushall_shift = 2;
- break;
default:
tlb_flushall_shift = 6;
}
--
1.8.4

2014-01-09 14:36:22

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 4/5] x86: mm: Change tlb_flushall_shift for IvyBridge

There was a large performance regression that was bisected to commit 611ae8e3
(x86/tlb: enable tlb flush range support for x86). This patch simply changes
the default balance point between a local and global flush for IvyBridge.

In the interest of allowing the tests to be reproduced, this patch was
tested using mmtests 0.15 with the following configurations

configs/config-global-dhp__tlbflush-performance
configs/config-global-dhp__scheduler-performance
configs/config-global-dhp__network-performance

Results are from two machines

Ivybridge 4 threads: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
Ivybridge 8 threads: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Page fault microbenchmark showed nothing interesting.

Ebizzy was configured to run multiple iterations and threads. Thread counts
ranged from 1 to NR_CPUS*2. For each thread count, it ran 100 iterations and
each iteration lasted 10 seconds.

Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 6395.44 ( 0.00%) 6789.09 ( 6.16%)
Mean 2 7012.85 ( 0.00%) 8052.16 ( 14.82%)
Mean 3 6403.04 ( 0.00%) 6973.74 ( 8.91%)
Mean 4 6135.32 ( 0.00%) 6582.33 ( 7.29%)
Mean 5 6095.69 ( 0.00%) 6526.68 ( 7.07%)
Mean 6 6114.33 ( 0.00%) 6416.64 ( 4.94%)
Mean 7 6085.10 ( 0.00%) 6448.51 ( 5.97%)
Mean 8 6120.62 ( 0.00%) 6462.97 ( 5.59%)

Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 7336.65 ( 0.00%) 7787.02 ( 6.14%)
Mean 2 8218.41 ( 0.00%) 9484.13 ( 15.40%)
Mean 3 7973.62 ( 0.00%) 8922.01 ( 11.89%)
Mean 4 7798.33 ( 0.00%) 8567.03 ( 9.86%)
Mean 5 7158.72 ( 0.00%) 8214.23 ( 14.74%)
Mean 6 6852.27 ( 0.00%) 7952.45 ( 16.06%)
Mean 7 6774.65 ( 0.00%) 7536.35 ( 11.24%)
Mean 8 6510.50 ( 0.00%) 6894.05 ( 5.89%)
Mean 12 6182.90 ( 0.00%) 6661.29 ( 7.74%)
Mean 16 6100.09 ( 0.00%) 6608.69 ( 8.34%)

Ebizzy hits the worst case scenario for TLB range flushing every time and
it shows, for these Ivybridge CPUs at least, that the default choice is a
poor one. The patch addresses the problem.

Next was a tlbflush microbenchmark written by Alex Shi at
http://marc.info/?l=linux-kernel&m=133727348217113 . It measures access
costs while the TLB is being flushed. The expectation is that the benchmark
would suffer if there were always full TLB flushes and that it benefits
from range flushing.

There are 320 iterations of the test per thread count. The number of
entries is randomly selected with a min of 1 and max of 512. To ensure
a reasonably even spread of entries, the full range is broken up into 8
sections and a random number selected within that section.

iteration 1, random number between 0-64
iteration 2, random number between 64-128 etc
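
A rough sketch of that selection scheme is below. It is illustrative only,
not the actual mmtests driver; the helper name and the use of rand() are
assumptions.

#include <stdlib.h>

/* Pick the number of TLB entries to flush for a 0-based iteration. */
static unsigned int pick_nr_entries(unsigned int iteration)
{
        unsigned int width = 512 / 8;                   /* 64 entries per section */
        unsigned int base = (iteration % 8) * width;    /* section start */

        /* Random value within this iteration's section, minimum of 1. */
        return base + (rand() % width) + 1;
}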

This is still a very weak methodology. When you do not know what the
typical ranges are, random is a reasonable choice but it can be easily argued
that the optimisation was for smaller ranges and an even spread is not
representative of any workload that matters. To improve this, we'd need to
know the probability distribution of TLB flush range sizes for a set of
workloads that are considered "common", build a synthetic trace and feed
that into this benchmark. Even that is not perfect because it would not
account for the time between flushes but there are limits of what can be
reasonably done and still be doing something useful. If a representative
synthetic trace is provided then this benchmark could be revisited and
the shift values retuned.

Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 10.50 ( 0.00%) 10.50 ( 0.03%)
Mean 2 17.59 ( 0.00%) 17.18 ( 2.34%)
Mean 3 22.98 ( 0.00%) 21.74 ( 5.41%)
Mean 5 47.13 ( 0.00%) 46.23 ( 1.92%)
Mean 8 43.30 ( 0.00%) 42.56 ( 1.72%)

Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 9.45 ( 0.00%) 9.36 ( 0.93%)
Mean 2 9.37 ( 0.00%) 9.70 ( -3.54%)
Mean 3 9.36 ( 0.00%) 9.29 ( 0.70%)
Mean 5 14.49 ( 0.00%) 15.04 ( -3.75%)
Mean 8 41.08 ( 0.00%) 38.73 ( 5.71%)
Mean 13 32.04 ( 0.00%) 31.24 ( 2.49%)
Mean 16 40.05 ( 0.00%) 39.04 ( 2.51%)

For both CPUs, average access time is reduced, which is good as this is
the benchmark that was used to tune the shift values in the first place,
albeit it is not known *how* the benchmark was used.

The scheduler benchmarks were somewhat inconclusive. They showed gains
and losses and make me reconsider how stable those benchmarks really
are or whether something else might be interfering with the test results
recently.

Network benchmarks were inconclusive. Almost all results were flat
except for netperf-udp tests on the 4 thread machine. These results
were unstable and showed large variations between reboots. It is
unknown if this is a recent problem but I've noticed before that
netperf-udp results tend to vary.

Based on these results, changing the default for Ivybridge seems
like a logical choice.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index ea04b34..bbe1b8b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -628,7 +628,7 @@ static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
tlb_flushall_shift = 5;
break;
case 0x63a: /* Ivybridge */
- tlb_flushall_shift = 1;
+ tlb_flushall_shift = 2;
break;
default:
tlb_flushall_shift = 6;
--
1.8.4

2014-01-09 14:37:16

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/5] x86: mm: Eliminate redundant page table walk during TLB range flushing

When choosing between doing an address space or ranged flush, the x86
implementation of flush_tlb_mm_range takes into account whether there are
any large pages in the range. A per-page flush typically requires fewer
entries than would be covered by a single large page and the check is redundant.
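
For scale, assuming a hypothetical 512-entry TLB and the conservative shift
of 6 used elsewhere in this series: a per-page flush issues at most
512 >> 6 = 8 invlpg operations, far fewer than the 512 base pages covered
by one 2M page, so little is gained by walking the page tables to detect a
huge page first.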

There is one potential exception. THP migration flushes single THP entries
and it conceivably would benefit from flushing a single entry instead
of the mm. However, this flush is after a THP allocation, copy and page
table update potentially with any other threads serialised behind it. In
comparison to that, the flush is noise. It makes more sense to optimise
balancing to require fewer flushes than to optimise the flush itself.

This patch deletes the redundant huge page check.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/mm/tlb.c | 28 +---------------------------
1 file changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5176526..dd8dda1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -158,32 +158,6 @@ void flush_tlb_current_task(void)
preempt_enable();
}

-/*
- * It can find out the THP large page, or
- * HUGETLB page in tlb_flush when THP disabled
- */
-static inline unsigned long has_large_page(struct mm_struct *mm,
- unsigned long start, unsigned long end)
-{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- unsigned long addr = ALIGN(start, HPAGE_SIZE);
- for (; addr < end; addr += HPAGE_SIZE) {
- pgd = pgd_offset(mm, addr);
- if (likely(!pgd_none(*pgd))) {
- pud = pud_offset(pgd, addr);
- if (likely(!pud_none(*pud))) {
- pmd = pmd_offset(pud, addr);
- if (likely(!pmd_none(*pmd)))
- if (pmd_large(*pmd))
- return addr;
- }
- }
- }
- return 0;
-}
-
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
@@ -218,7 +192,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
nr_base_pages = (end - start) >> PAGE_SHIFT;

/* tlb_flushall_shift is on balance point, details in commit log */
- if (nr_base_pages > act_entries || has_large_page(mm, start, end)) {
+ if (nr_base_pages > act_entries) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
--
1.8.4

2014-01-09 19:46:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/5] x86: mm: Account for TLB flushes only when debugging

On 01/09/2014 09:34 AM, Mel Gorman wrote:
> Bisection between 3.11 and 3.12 fingered commit 9824cf97 (mm: vmstats:
> tlb flush counters). The counters are undeniably useful but how often
> do we really need to debug TLB flush related issues? It does not justify
> taking the penalty everywhere so make it a debugging option.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2014-01-09 19:48:50

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/5] x86: mm: Clean up inconsistencies when flushing TLB ranges

On 01/09/2014 09:34 AM, Mel Gorman wrote:
> NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the
> comparison with total_vm is done before taking tlb_flushall_shift into
> account. Clean it up.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Alex Shi <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2014-01-09 20:01:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: mm: Change tlb_flushall_shift for IvyBridge

On 01/09/2014 09:34 AM, Mel Gorman wrote:
> There was a large performance regression that was bisected to commit 611ae8e3
> (x86/tlb: enable tlb flush range support for x86). This patch simply changes
> the default balance point between a local and global flush for IvyBridge.
>
> In the interest of allowing the tests to be reproduced, this patch was
> tested using mmtests 0.15 with the following configurations
>
> configs/config-global-dhp__tlbflush-performance
> configs/config-global-dhp__scheduler-performance
> configs/config-global-dhp__network-performance


> Based on these results, changing the default for Ivybridge seems
> like a logical choice.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Alex Shi <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2014-01-09 20:02:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 5/5] mm: x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge

On 01/09/2014 09:34 AM, Mel Gorman wrote:
> There was a large ebizzy performance regression that was bisected to commit
> 611ae8e3 (x86/tlb: enable tlb flush range support for x86). The problem
> was related to the tlb_flushall_shift tuning for IvyBridge which was
> altered. The problem is that it is not clear whether the tuning values for each
> CPU family are correct as the methodology used to tune the values is unclear.
>
> This patch uses a conservative tlb_flushall_shift value for all CPU families
> except IvyBridge so the decision can be revisited if any regression is found
> as a result of this change. IvyBridge is an exception as testing with one
> methodology determined that the value of 2 is acceptable. Details are in the
> changelog for the patch "x86: mm: Change tlb_flushall_shift for IvyBridge".
>
> One important aspect of this to watch out for is Xen. The original commit
> log mentioned large performance gains on Xen. It's possible Xen is more
> sensitive to this value if it flushes small ranges of pages more frequently
> than workloads on bare metal typically do.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2014-01-09 20:13:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/5] x86: mm: Eliminate redundant page table walk during TLB range flushing

On 01/09/2014 09:34 AM, Mel Gorman wrote:
> When choosing between doing an address space or ranged flush, the x86
> implementation of flush_tlb_mm_range takes into account whether there are
> any large pages in the range. A per-page flush typically requires fewer
> entries than would be covered by a single large page and the check is redundant.
>
> There is one potential exception. THP migration flushes single THP entries
> and it conceivably would benefit from flushing a single entry instead
> of the mm. However, this flush is after a THP allocation, copy and page
> table update potentially with any other threads serialised behind it. In
> comparison to that, the flush is noise. It makes more sense to optimise
> balancing to require fewer flushes than to optimise the flush itself.
>
> This patch deletes the redundant huge page check.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2014-01-09 21:40:06

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 0/5] Fix ebizzy performance regression due to X86 TLB range flush v3

On Thu, 2014-01-09 at 14:34 +0000, Mel Gorman wrote:
> Changelog since v2
> o Rebase to v3.13-rc7 to pick up scheduler-related fixes
> o Describe methodology in changelog
> o Reset tlb flush shift for all models except Ivybridge
>
> Changelog since v1
> o Drop a pagetable walk that seems redundant
> o Account for TLB flushes only when debugging
> o Drop the patch that took number of CPUs to flush into account
>
> ebizzy regressed between 3.4 and 3.10 while testing on a new
> machine. Bisection initially found at least three problems of which the
> first was commit 611ae8e3 (x86/tlb: enable tlb flush range support for
> x86). Second was related to TLB flush accounting. The third was related
> to ACPI cpufreq and so it was disabled for the purposes of this series.
>
> The intent of the TLB range flush series was to preserve existing TLB
> entries by flushing a range one page at a time instead of flushing the
> address space. This makes a certain amount of sense if the address space
> being flushed was known to have existing hot entries. The decision on
> whether to do a full mm flush or a number of single page flushes depends
> on the size of the relevant TLB and how many of these hot entries would
> be preserved by a targeted flush. This implicitly assumes a lot including
> the following examples
>
> o That the full TLB is in use by the task being flushed
> o The TLB has hot entries that are going to be used in the near future
> o The TLB has entries for the range being cached
> o The cost of the per-page flushes is similar to a single mm flush
> o Large pages are unimportant and can always be globally flushed
> o Small flushes from workloads are very common
>
> The first three are completely unknowable but unfortunately they are
> probably true of micro benchmarks designed to exercise these
> paths. The fourth one depends completely on the hardware. The large page
> check used to make sense but now the number of entries required to do
> a range flush is so small that it is a redundant check. The last one is
> the strangest because generally only a process that was mapping/unmapping
> very small regions would hit this. It's possible it is the common case
> for a virtualised workload that is managing the address space of its
> guests. Maybe this was the real original motivation of the TLB range flush
> support for x86. If this is the case then the patches need to be revisited
> and clearly flagged as being of benefit to virtualisation.
>
> As things currently stand, Ebizzy sees very little benefit as it discards
> newly allocated memory very quickly and regressed badly on Ivybridge where
> it constantly flushes ranges of 128 pages one page at a time. Earlier
> machines may not have seen this problem as the balance point was at a
> different location. While I'm wary of optimising for such a benchmark,
> it's commonly tested and it's apparent that the worst case defaults for
> Ivybridge need to be re-examined.
>
> The following small series brings ebizzy closer to 3.4-era performance
> for the very limited set of machines tested. It does not bring
> performance fully back in line but the recent idle power regression
> fix has already been identified as regressing ebizzy performance
> (http://www.spinics.net/lists/stable/msg31352.html) and would need to be
> addressed first. Benchmark results are included in the relevant patch's
> changelog.
>
> arch/x86/include/asm/tlbflush.h | 6 ++---
> arch/x86/kernel/cpu/amd.c | 5 +---
> arch/x86/kernel/cpu/intel.c | 10 +++-----
> arch/x86/kernel/cpu/mtrr/generic.c | 4 +--
> arch/x86/mm/tlb.c | 52 ++++++++++----------------------------
> include/linux/vm_event_item.h | 4 +--
> include/linux/vmstat.h | 8 ++++++
> 7 files changed, 32 insertions(+), 57 deletions(-)

I tried this set on a couple of workloads, no performance regressions.
So, fwiw:

Tested-by: Davidlohr Bueso <[email protected]>

2014-01-16 11:12:13

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] mm: vmstat: Do not display stats for TLB flushes unless debugging

The patch "x86: mm: Account for TLB flushes only when debugging" removed
vmstat counters related to TLB flushes unless CONFIG_DEBUG_TLBFLUSH was
set from the vm_event_item enum but not the vmstat_text text.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmstat.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7249614..def5dd2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -851,12 +851,14 @@ const char * const vmstat_text[] = {
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
#endif
+#ifdef CONFIG_DEBUG_TLBFLUSH
#ifdef CONFIG_SMP
"nr_tlb_remote_flush",
"nr_tlb_remote_flush_received",
-#endif
+#endif /* CONFIG_SMP */
"nr_tlb_local_flush_all",
"nr_tlb_local_flush_one",
+#endif /* CONFIG_DEBUG_TLBFLUSH */

#endif /* CONFIG_VM_EVENTS_COUNTERS */
};

2014-01-16 12:25:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: vmstat: Do not display stats for TLB flushes unless debugging

On 01/16/2014 06:12 AM, Mel Gorman wrote:
> The patch "x86: mm: Account for TLB flushes only when debugging" removed
> vmstat counters related to TLB flushes unless CONFIG_DEBUG_TLBFLUSH was
> set from the vm_event_item enum but not the vmstat_text text.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2014-01-16 14:01:33

by Fengguang Wu

[permalink] [raw]
Subject: [TLB range flush] +34.7% hackbench.throughput

Hi Mel,

I applied your patchset on v3.13-rc7 and got some test results. The
results are encouraging: hackbench throughput increased by 34.7% with
parameters 1600%-threads-pipe on a 2S SNB server.

In case you are interested, here is the full list of changes.
kconfig is attached.

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
171792 ~ 0% +34.7% 231378 lkp-snb01/micro/hackbench/1600%-threads-pipe
171792 +34.7% 231378 TOTAL hackbench.throughput

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
2296537 ~ 1% -100.0% 5 grantley/micro/kbuild/200%
291546 ~ 0% +1.3e+14% 3.85e+17 lkp-a04/micro/netperf/120s-200%-TCP_CRR
96565 ~ 0% -100.0% 0 lkp-a04/micro/netperf/120s-200%-TCP_MAERTS
97525 ~ 1% +1.7e+14% 1.692e+17 lkp-a04/micro/netperf/120s-200%-TCP_RR
97140 ~ 0% +1.8e+16% 1.76e+19 lkp-a04/micro/netperf/120s-200%-TCP_SENDFILE
97303 ~ 0% -100.0% 0 lkp-a04/micro/netperf/120s-200%-UDP_RR
6294840 ~ 2% +4.2e+12% 2.617e+17 ~ 3% lkp-snb01/micro/hackbench/1600%-process-pipe
1384593 ~ 1% +6.9e+12% 9.551e+16 lkp-snb01/micro/hackbench/1600%-threads-pipe
1119351 ~ 2% +1.8e+13% 2.038e+17 lkp-snb01/micro/hackbench/1600%-threads-socket
186442 ~ 0% +3.5e+13% 6.473e+16 ~ 0% xps2/micro/pigz/100%
11961847 +1.6e+14% 1.878e+19 TOTAL proc-vmstat.nr_tlb_local_flush_one

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
150 ~ 4% +2.6e+17% 3.85e+17 lkp-a04/micro/netperf/120s-200%-TCP_CRR
153 ~ 5% -100.0% 0 lkp-a04/micro/netperf/120s-200%-TCP_MAERTS
148 ~ 5% +1.1e+17% 1.692e+17 lkp-a04/micro/netperf/120s-200%-TCP_RR
153 ~ 3% +1.1e+13% 1.679e+13 lkp-a04/micro/netperf/120s-200%-TCP_SENDFILE
154 ~ 5% -100.0% 0 lkp-a04/micro/netperf/120s-200%-UDP_RR
24275 ~12% +5.6e+14% 1.361e+17 lkp-snb01/micro/hackbench/1600%-threads-pipe
25035 +2.8e+15% 6.903e+17 TOTAL proc-vmstat.nr_tlb_remote_flush

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
215 ~ 4% +1.8e+17% 3.85e+17 lkp-a04/micro/netperf/120s-200%-TCP_CRR
222 ~ 4% -100.0% 0 lkp-a04/micro/netperf/120s-200%-TCP_MAERTS
213 ~ 3% +7.9e+16% 1.692e+17 lkp-a04/micro/netperf/120s-200%-TCP_RR
221 ~ 3% +7.9e+18% 1.76e+19 lkp-a04/micro/netperf/120s-200%-TCP_SENDFILE
221 ~ 3% -100.0% 0 lkp-a04/micro/netperf/120s-200%-UDP_RR
275020 ~16% +6e+13% 1.663e+17 lkp-snb01/micro/hackbench/1600%-threads-pipe
276115 +6.6e+15% 1.832e+19 TOTAL proc-vmstat.nr_tlb_remote_flush_received

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
497469 ~ 1% -97.8% 10855 grantley/micro/kbuild/200%
10025 ~ 0% +3.8e+15% 3.85e+17 lkp-a04/micro/netperf/120s-200%-TCP_CRR
9772 ~ 0% -51.4% 4752 lkp-a04/micro/netperf/120s-200%-TCP_MAERTS
9877 ~ 1% +1.7e+15% 1.692e+17 lkp-a04/micro/netperf/120s-200%-TCP_RR
9818 ~ 0% +1.7e+11% 1.679e+13 lkp-a04/micro/netperf/120s-200%-TCP_SENDFILE
9850 ~ 0% -40.6% 5848 lkp-a04/micro/netperf/120s-200%-UDP_RR
16816 ~ 1% +5.8e+14% 9.8e+16 lkp-snb01/micro/hackbench/1600%-threads-pipe
8659 ~ 1% +2071.0% 187996 lkp-snb01/micro/hackbench/1600%-threads-socket
572289 +1.1e+14% 6.522e+17 TOTAL proc-vmstat.nr_tlb_local_flush_all

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1.807e+08 ~ 1% +56.2% 2.822e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
1.807e+08 +56.2% 2.822e+08 TOTAL proc-vmstat.numa_local

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1.807e+08 ~ 1% +56.2% 2.822e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
1.807e+08 +56.2% 2.822e+08 TOTAL proc-vmstat.numa_hit

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1.818e+08 ~ 1% +56.0% 2.836e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
1.818e+08 +56.0% 2.836e+08 TOTAL proc-vmstat.pgfree

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
2228224 ~15% +41.2% 3145728 ~ 0% nhm8/micro/dbench/100%
2228224 +41.2% 3145728 TOTAL meminfo.DirectMap1G

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
8.696e+08 ~ 1% -33.0% 5.827e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
8.696e+08 -33.0% 5.827e+08 TOTAL interrupts.RES

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1.771e+08 ~ 1% +50.4% 2.664e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
1.771e+08 +50.4% 2.664e+08 TOTAL proc-vmstat.pgalloc_normal

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
4022784 ~ 8% -22.7% 3107840 ~ 0% nhm8/micro/dbench/100%
4022784 -22.7% 3107840 TOTAL meminfo.DirectMap2M

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
4821300 ~ 1% -14.4% 4128651 lkp-snb01/micro/hackbench/1600%-threads-pipe
4821300 -14.4% 4128651 TOTAL proc-vmstat.pgfault

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1413677 ~ 0% -31.9% 962827 lkp-snb01/micro/hackbench/1600%-threads-pipe
1413677 -31.9% 962827 TOTAL vmstat.system.in

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
2.386e+09 ~ 0% -27.2% 1.737e+09 lkp-snb01/micro/hackbench/1600%-threads-pipe
2.386e+09 -27.2% 1.737e+09 TOTAL time.voluntary_context_switches

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
5575434 ~ 0% -26.3% 4108849 lkp-snb01/micro/hackbench/1600%-threads-pipe
5575434 -26.3% 4108849 TOTAL vmstat.system.cs

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
9.359e+08 ~ 1% -25.2% 6.999e+08 lkp-snb01/micro/hackbench/1600%-threads-pipe
9.359e+08 -25.2% 6.999e+08 TOTAL time.involuntary_context_switches

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1229364 ~ 1% +32.5% 1629469 lkp-snb01/micro/hackbench/1600%-threads-pipe
1229364 +32.5% 1629469 TOTAL time.minor_page_faults

v3.13-rc7 eb9bbbe145c10a3b28a249c4a
--------------- -------------------------
1638 ~ 1% +25.4% 2054 lkp-snb01/micro/hackbench/1600%-threads-pipe
1638 +25.4% 2054 TOTAL time.user_time

Thanks,
Fengguang


Attachments:
x86_64-lkp (78.69 kB)

2014-01-16 18:49:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [TLB range flush] +34.7% hackbench.throughput

On Thu, Jan 16, 2014 at 10:01:18PM +0800, Fengguang Wu wrote:
> Hi Mel,
>
> I applied your patchset on v3.13-rc7 and got some test results. The
> results are encouraging: hackbench throughput increased by 34.7% with
> parameters 1600%-threads-pipe on a 2S SNB server.
>
> In case you are interested, here is the full list of changes.
> kconfig is attached.
>

I am interested and thanks very much for the report. It's very encouraging.

--
Mel Gorman
SUSE Labs

2014-01-16 23:22:19

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] mm: vmstat: Do not display stats for TLB flushes unless debugging

On Thu, 16 Jan 2014, Mel Gorman wrote:

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7249614..def5dd2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -851,12 +851,14 @@ const char * const vmstat_text[] = {
> "thp_zero_page_alloc",
> "thp_zero_page_alloc_failed",
> #endif
> +#ifdef CONFIG_DEBUG_TLBFLUSH
> #ifdef CONFIG_SMP
> "nr_tlb_remote_flush",
> "nr_tlb_remote_flush_received",
> -#endif
> +#endif /* CONFIG_SMP */
> "nr_tlb_local_flush_all",
> "nr_tlb_local_flush_one",
> +#endif /* CONFIG_DEBUG_TLBFLUSH */
>
> #endif /* CONFIG_VM_EVENTS_COUNTERS */
> };

Hmm, so why are NR_TLB_REMOTE_FLUSH{,_RECEIVED} defined for !CONFIG_SMP in
linux-next?

2014-01-17 08:53:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: vmstat: Do not display stats for TLB flushes unless debugging

On Thu, Jan 16, 2014 at 03:22:12PM -0800, David Rientjes wrote:
> On Thu, 16 Jan 2014, Mel Gorman wrote:
>
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7249614..def5dd2 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -851,12 +851,14 @@ const char * const vmstat_text[] = {
> > "thp_zero_page_alloc",
> > "thp_zero_page_alloc_failed",
> > #endif
> > +#ifdef CONFIG_DEBUG_TLBFLUSH
> > #ifdef CONFIG_SMP
> > "nr_tlb_remote_flush",
> > "nr_tlb_remote_flush_received",
> > -#endif
> > +#endif /* CONFIG_SMP */
> > "nr_tlb_local_flush_all",
> > "nr_tlb_local_flush_one",
> > +#endif /* CONFIG_DEBUG_TLBFLUSH */
> >
> > #endif /* CONFIG_VM_EVENTS_COUNTERS */
> > };
>
> Hmm, so why are NR_TLB_REMOTE_FLUSH{,_RECEIVED} defined for !CONFIG_SMP in
> linux-next?

Because there are times when I am a complete muppet and this
is one of them. This is a revised version of the patch "x86:
mm: Account for TLB flushes only when debugging" which is
x86-mm-account-for-tlb-flushes-only-when-debugging.patch in mmotm

Thanks David.

---8<---
x86: mm: Account for TLB flushes only when debugging

Bisection between 3.11 and 3.12 fingered commit 9824cf97 (mm: vmstats:
tlb flush counters). The counters are undeniably useful but how often
do we really need to debug TLB flush related issues? It does not justify
taking the penalty everywhere so make it a debugging option.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 6 +++---
arch/x86/kernel/cpu/mtrr/generic.c | 4 ++--
arch/x86/mm/tlb.c | 14 +++++++-------
include/linux/vm_event_item.h | 4 +++-
include/linux/vmstat.h | 8 ++++++++
mm/vmstat.c | 4 +++-
6 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6d90ba..04905bf 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -62,7 +62,7 @@ static inline void __flush_tlb_all(void)

static inline void __flush_tlb_one(unsigned long addr)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}

@@ -93,13 +93,13 @@ static inline void __flush_tlb_one(unsigned long addr)
*/
static inline void __flush_tlb_up(void)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();
}

static inline void flush_tlb_all(void)
{
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb_all();
}

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index ce2d0a2..0e25a1b 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -683,7 +683,7 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
}

/* Flush all TLBs via a mov %cr3, %reg; mov %reg, %cr3 */
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();

/* Save MTRR state */
@@ -697,7 +697,7 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
static void post_set(void) __releases(set_atomicity_lock)
{
/* Flush TLBs (no need to flush caches - they are disabled) */
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
__flush_tlb();

/* Intel (P6) standard MTRRs */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ae699b3..05446c1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -103,7 +103,7 @@ static void flush_tlb_func(void *info)
if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
return;

- count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
if (f->flush_end == TLB_FLUSH_ALL)
local_flush_tlb();
@@ -131,7 +131,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
info.flush_start = start;
info.flush_end = end;

- count_vm_event(NR_TLB_REMOTE_FLUSH);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (is_uv_system()) {
unsigned int cpu;

@@ -151,7 +151,7 @@ void flush_tlb_current_task(void)

preempt_disable();

- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
@@ -215,7 +215,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

/* tlb_flushall_shift is on balance point, details in commit log */
if ((end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift) {
- count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
if (has_large_page(mm, start, end)) {
@@ -224,7 +224,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
}
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
- count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}

@@ -262,7 +262,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)

static void do_flush_tlb_all(void *info)
{
- count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
leave_mm(smp_processor_id());
@@ -270,7 +270,7 @@ static void do_flush_tlb_all(void *info)

void flush_tlb_all(void)
{
- count_vm_event(NR_TLB_REMOTE_FLUSH);
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
on_each_cpu(do_flush_tlb_all, NULL, 1);
}

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index c557c6d..3a712e2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,12 +71,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
#endif
+#ifdef CONFIG_DEBUG_TLBFLUSH
#ifdef CONFIG_SMP
NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */
NR_TLB_REMOTE_FLUSH_RECEIVED,/* cpu received ipi for flush */
-#endif
+#endif /* CONFIG_SMP */
NR_TLB_LOCAL_FLUSH_ALL,
NR_TLB_LOCAL_FLUSH_ONE,
+#endif /* CONFIG_DEBUG_TLBFLUSH */
NR_VM_EVENT_ITEMS
};

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e4b9480..80ebba9 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -83,6 +83,14 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
#endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_DEBUG_TLBFLUSH
+#define count_vm_tlb_event(x) count_vm_event(x)
+#define count_vm_tlb_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_tlb_event(x) do {} while (0)
+#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
+#endif
+
#define __count_zone_vm_events(item, zone, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
zone_idx(zone), delta)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7249614..def5dd2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -851,12 +851,14 @@ const char * const vmstat_text[] = {
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
#endif
+#ifdef CONFIG_DEBUG_TLBFLUSH
#ifdef CONFIG_SMP
"nr_tlb_remote_flush",
"nr_tlb_remote_flush_received",
-#endif
+#endif /* CONFIG_SMP */
"nr_tlb_local_flush_all",
"nr_tlb_local_flush_one",
+#endif /* CONFIG_DEBUG_TLBFLUSH */

#endif /* CONFIG_VM_EVENTS_COUNTERS */
};