2017-08-02 07:34:29

by Nadav Amit

Subject: [PATCH v6 0/7] fixes of TLB batching races

It turns out that the Linux TLB batching mechanism suffers from various races.
The races caused by batching during reclamation were recently handled by
Mel, and this patch set deals with the others. The more fundamental issue
is that concurrent updates of the page tables allow TLB flushes to be
batched on one core while another core changes the page tables. This other
core may assume, based on the updated PTE value, that a PTE change does not
require a flush, while it is unaware that TLB flushes are still pending.

This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily reproduce the incorrect behavior of
MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice,
but it was observed by hacking the kernel and adding a delay before
flushing and replacing the KSM page.
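
To illustrate the MADV_DONTNEED case, here is a simplified sketch of the
race described above (an illustration, not code from the patches): two
threads call madvise(MADV_DONTNEED) on the same range while holding
mmap_sem for read.

CPU0                            CPU1
----                            ----
zap the PTEs of the range,
batch the TLB flush
                                find the PTEs already clear,
                                gather nothing, decide that no
                                flush is needed
                                return to userspace
                                [userspace can still touch the old
                                 pages through stale TLB entries]
perform the batched TLB flush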

Finally, one memory barrier is also missing, which may affect
architectures with a weak memory model.

v5 -> v6:
* Combining with Minchan Kim's patch set, adding acks (Andrew)
* Minor: missing header, typos (Nadav)
* Renaming arch_generic_tlb_finish_mmu (Mel)

Minchan's v1 -> v2 (combined):
* Separate the TLB batching API core part from the arch-specific one (Mel)
* Introduce mm_tlb_flush_nested (Mel)

v4 -> v5:
* Fixing embarrassing build mistake (0day)

v3 -> v4:
* Change function names to indicate they inc/dec and not set/clear
(Sergey)
* Avoid additional barriers, and instead revert the patch that accessed
mm_tlb_flush_pending without a lock (Mel)

v2 -> v3:
* Do not init tlb_flush_pending if it is not defined (Sergey)
* Internalize memory barriers to mm_tlb_flush_pending (Minchan)

v1 -> v2:
* Explain the implications of the race (Andrew)
* Mark the patch that addresses the race for stable (Andrew)
* Add another patch to clean the use of barriers (Andrew)

Minchan Kim (4):
mm: refactoring TLB gathering API
mm: make tlb_flush_pending global
mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
mm: fix KSM data corruption

Nadav Amit (3):
mm: migrate: prevent racy access to tlb_flush_pending
mm: migrate: fix barriers around tlb_flush_pending
Revert "mm: numa: defer TLB flush for THP migration as long as
possible"

arch/arm/include/asm/tlb.h | 11 ++++++--
arch/ia64/include/asm/tlb.h | 8 ++++--
arch/s390/include/asm/tlb.h | 17 +++++++-----
arch/sh/include/asm/tlb.h | 8 +++---
arch/um/include/asm/tlb.h | 13 ++++++---
fs/proc/task_mmu.c | 7 +++--
include/asm-generic/tlb.h | 7 ++---
include/linux/mm_types.h | 64 +++++++++++++++++++++++++++------------------
kernel/fork.c | 2 +-
mm/debug.c | 4 +--
mm/huge_memory.c | 7 +++++
mm/ksm.c | 3 ++-
mm/memory.c | 41 ++++++++++++++++++++++++-----
mm/migrate.c | 6 -----
mm/mprotect.c | 4 +--
15 files changed, 135 insertions(+), 67 deletions(-)

--
2.11.0


2017-08-02 07:33:46

by Nadav Amit

Subject: [PATCH v6 3/7] Revert "mm: numa: defer TLB flush for THP migration as long as possible"

While deferring TLB flushes is a good practice, the reverted patch
caused pending TLB flushes to be checked while the page-table lock is
not taken. As a result, on architectures with a weak memory model (PPC),
Linux may miss a memory barrier, miss the fact that TLB flushes are
pending, and (in theory) cause memory corruption.

Since the alternative of using smp_mb__after_unlock_lock() was
considered a bit open-coded, and the performance impact is expected to
be small, the patch is reverted.

This reverts commit b0943d61b8fa420180f92f64ef67662b4f6cc493.

Suggested-by: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
mm/huge_memory.c | 7 +++++++
mm/migrate.c | 6 ------
2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88c6167f194d..b51d83e410eb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1496,6 +1496,13 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
}

/*
+ * The page_table_lock above provides a memory barrier
+ * with change_protection_range.
+ */
+ if (mm_tlb_flush_pending(vma->vm_mm))
+ flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+ /*
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 89a0a1707f4c..1f6c2f41b3cb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1935,12 +1935,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
put_page(new_page);
goto out_fail;
}
- /*
- * We are not sure a pending tlb flush here is for a huge page
- * mapping or not. Hence use the tlb range variant
- */
- if (mm_tlb_flush_pending(mm))
- flush_tlb_range(vma, mmun_start, mmun_end);

/* Prepare a page as a migration target */
__SetPageLocked(new_page);
--
2.11.0

2017-08-02 07:33:51

by Nadav Amit

Subject: [PATCH v6 6/7] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

From: Minchan Kim <[email protected]>

Nadav reported that parallel MADV_DONTNEED on the same range has a stale
TLB problem; Mel fixed it [1] and found the same problem in MADV_FREE [2].

Quote from Mel Gorman

"The race in question is CPU 0 running madv_free and updating some PTEs
while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1
may have writable TLB entries for a page but fail the pte_dirty check
(because CPU 0 has updated it already) and potentially fail to flush.
Hence, when madv_free on CPU 1 returns, there are still potentially
writable TLB entries and the underlying PTE is still present so that a
subsequent write does not necessarily propagate the dirty bit to the
underlying PTE any more. Reclaim at some unknown time at the future may
then see that the PTE is still clean and discard the page even though a
write has happened in the meantime. I think this is possible but I could
have missed some protection in madv_free that prevents it happening."

This patch aims to solve both problems at once, and it is also needed for
the remaining problem involving KSM, MADV_FREE and soft-dirty [3].

The TLB batch API (tlb_[gather|finish]_mmu) uses [inc|dec]_tlb_flush_pending
and mm_tlb_flush_nested so that, when tlb_finish_mmu is called, we can
detect that parallel batching threads are running. In that case, the TLB
is flushed forcefully to prevent the user from accessing memory through a
stale TLB entry, even though this batch may have failed to gather any
page-table entries.

I confirmed this patch works with the test program Nadav provided [4], so
it supersedes "mm: Always flush VMA ranges affected by zap_page_range v2"
in the current mmotm.
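
As a simplified usage sketch (illustrative only; the surrounding code is
an assumption, not taken from the patch), a batching caller now looks
roughly like this:

        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);   /* inc_tlb_flush_pending(mm) */
        /* ... clear PTEs and gather the pages to be freed ... */
        /*
         * If another thread is batching on the same mm at this point,
         * mm_tlb_flush_nested(mm) is true and tlb_finish_mmu() flushes
         * the range even though this gather may have collected nothing.
         */
        tlb_finish_mmu(&tlb, start, end);       /* forced flush if nested */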

NOTE:

This patch modifies the arch-specific TLB gathering interface (arm, ia64,
s390, sh, um). Most architectures are straightforward, but s390 needs care
because tlb_flush_mmu works only if mm->context.flush_mm is set to
non-zero, which happens only when a PTE entry is actually cleared by
ptep_get_and_clear and friends. However, the scenario above never changes
the PTE entries, yet it still needs a flush to prevent memory accesses
through stale TLB entries.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
[4] https://patchwork.kernel.org/patch/9861621/

Cc: Ingo Molnar <[email protected]>
Cc: Russell King <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: [email protected]
Reported-by: Nadav Amit <[email protected]>
Reported-by: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
arch/arm/include/asm/tlb.h | 7 ++++++-
arch/ia64/include/asm/tlb.h | 4 +++-
arch/s390/include/asm/tlb.h | 7 ++++++-
arch/sh/include/asm/tlb.h | 4 ++--
arch/um/include/asm/tlb.h | 7 ++++++-
include/asm-generic/tlb.h | 2 +-
include/linux/mm_types.h | 8 ++++++++
mm/memory.c | 17 +++++++++++++++--
8 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 7f5b2a2d3861..d5562f9ce600 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -168,8 +168,13 @@ arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

static inline void
arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
+ if (force) {
+ tlb->range_start = start;
+ tlb->range_end = end;
+ }
+
tlb_flush_mmu(tlb);

/* keep the page table cache within bounds */
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index 93cadc04ac62..cbe5ac3699bf 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -187,8 +187,10 @@ arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
*/
static inline void
arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
+ if (force)
+ tlb->need_flush = 1;
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
* tlb->end_addr.
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 0e59ef57e234..b20a3621024f 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -77,8 +77,13 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)

static inline void
arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
+ if (force) {
+ tlb->start = start;
+ tlb->end = end;
+ }
+
tlb_flush_mmu(tlb);
}

diff --git a/arch/sh/include/asm/tlb.h b/arch/sh/include/asm/tlb.h
index 89786560dbd4..51a8bc967e75 100644
--- a/arch/sh/include/asm/tlb.h
+++ b/arch/sh/include/asm/tlb.h
@@ -49,9 +49,9 @@ arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,

static inline void
arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
- if (tlb->fullmm)
+ if (tlb->fullmm || force)
flush_tlb_mm(tlb->mm);

/* keep the page table cache within bounds */
diff --git a/arch/um/include/asm/tlb.h b/arch/um/include/asm/tlb.h
index 2a901eca7145..344d95619d03 100644
--- a/arch/um/include/asm/tlb.h
+++ b/arch/um/include/asm/tlb.h
@@ -87,8 +87,13 @@ tlb_flush_mmu(struct mmu_gather *tlb)
*/
static inline void
arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
+ if (force) {
+ tlb->start = start;
+ tlb->end = end;
+ tlb->need_flush = 1;
+ }
tlb_flush_mmu(tlb);

/* keep the page table cache within bounds */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 8f71521e7a44..faddde44de8c 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -116,7 +116,7 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb,
struct mm_struct *mm, unsigned long start, unsigned long end);
void tlb_flush_mmu(struct mmu_gather *tlb);
void arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end, bool force);
extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
int page_size);

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fc44315df47a..664c1e553228 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -534,6 +534,14 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
return atomic_read(&mm->tlb_flush_pending) > 0;
}

+/*
+ * Returns true if there are two above TLB batching threads in parallel.
+ */
+static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
+{
+ return atomic_read(&mm->tlb_flush_pending) > 1;
+}
+
static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
atomic_set(&mm->tlb_flush_pending, 0);
diff --git a/mm/memory.c b/mm/memory.c
index 7848b5030be0..d7a620dd183a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,10 +272,13 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
* that were required.
*/
void arch_tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool force)
{
struct mmu_gather_batch *batch, *next;

+ if (force)
+ __tlb_adjust_range(tlb, start, end - start);
+
tlb_flush_mmu(tlb);

/* keep the page table cache within bounds */
@@ -404,12 +407,22 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
unsigned long start, unsigned long end)
{
arch_tlb_gather_mmu(tlb, mm, start, end);
+ inc_tlb_flush_pending(tlb->mm);
}

void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end)
{
- arch_tlb_finish_mmu(tlb, start, end);
+ /*
+ * If there are parallel threads are doing PTE changes on same range
+ * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
+ * flush by batching, a thread has stable TLB entry can fail to flush
+ * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
+ * forcefully if we detect parallel PTE batching threads.
+ */
+ bool force = mm_tlb_flush_nested(tlb->mm);
+
+ arch_tlb_finish_mmu(tlb, start, end, force);
}

/*
--
2.11.0

2017-08-02 07:33:49

by Nadav Amit

Subject: [PATCH v6 2/7] mm: migrate: fix barriers around tlb_flush_pending

Reading tlb_flush_pending while the page-table lock is taken does not
require a barrier, since the lock/unlock already acts as one. The barrier
in mm_tlb_flush_pending() is therefore removed.

However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending()
after the page-table lock has already been released, which may present a
problem on architectures with a weak memory model (PPC). To deal with
this case, a new parameter is added to mm_tlb_flush_pending() to indicate
whether it is read without the page-table lock taken, in which case
smp_mb__after_unlock_lock() is called.
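
To illustrate the ordering argument (a simplified sketch; the "reader"
column stands for code such as do_huge_pmd_numa_page() or KSM's
write_protect_page(), which check the flag while holding the PTL):

        change_protection_range()       reader holding the PTL
        -------------------------       ----------------------
        inc_tlb_flush_pending(mm)
        [must be visible before the
         PTE changes below]
        take PTL, change PTEs,          take PTL
        release PTL                     observe the changed PTE
                                        mm_tlb_flush_pending(mm) == true
                                        -> flush before relying on the PTE
                                        release PTL
        flush_tlb_range()
        dec_tlb_flush_pending(mm)
        [must not be reordered before
         the flush above]

A reader that checks the flag without holding the PTL (the
migrate_misplaced_transhuge_page() case above) gets no ordering from the
lock; that caller is removed by the revert in patch 3/7 of this series.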

Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Mel Gorman <[email protected]>

Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/mm_types.h | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f5263dd0f1bc..2956513619a7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -522,12 +522,12 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
/*
* Memory barriers to keep this state in sync are graciously provided by
* the page table locks, outside of which no page table modifications happen.
- * The barriers below prevent the compiler from re-ordering the instructions
- * around the memory barriers that are already present in the code.
+ * The barriers are used to ensure the order between tlb_flush_pending updates,
+ * which happen while the lock is not taken, and the PTE updates, which happen
+ * while the lock is taken, are serialized.
*/
static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
- barrier();
return atomic_read(&mm->tlb_flush_pending) > 0;
}

@@ -550,7 +550,13 @@ static inline void inc_tlb_flush_pending(struct mm_struct *mm)
/* Clearing is done after a TLB flush, which also provides a barrier. */
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
- barrier();
+ /*
+ * Guarantee that the tlb_flush_pending does not not leak into the
+ * critical section, since we must order the PTE change and changes to
+ * the pending TLB flush indication. We could have relied on TLB flush
+ * as a memory barrier, but this behavior is not clearly documented.
+ */
+ smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
}
#else
--
2.11.0

2017-08-02 07:34:13

by Nadav Amit

Subject: [PATCH v6 1/7] mm: migrate: prevent racy access to tlb_flush_pending

From: Nadav Amit <[email protected]>

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of that race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was confirmed by
adding an assertion that tlb_flush_pending is not set by two threads
concurrently, adding artificial latency in change_protection_range(), and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
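
For illustration, a simplified timeline of the problem with the old
boolean flag (threads A and B both run change_protection_range() with
mmap_sem held for read):

        Thread A                        Thread B
        --------                        --------
        set_tlb_flush_pending(mm)
                                        set_tlb_flush_pending(mm)
        change PTEs
        flush_tlb_range()
        clear_tlb_flush_pending(mm)
                                        still changing PTEs,
                                        TLB flush still pending

At this point mm_tlb_flush_pending(mm) returns false although B's flush
has not been performed yet, so a concurrent reader (e.g. the migration
path) may skip a flush it actually needs.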

Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")

Cc: Andy Lutomirski <[email protected]>

Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 31 ++++++++++++++++++++++---------
kernel/fork.c | 2 +-
mm/debug.c | 2 +-
mm/mprotect.c | 4 ++--
4 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..f5263dd0f1bc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -493,7 +493,7 @@ struct mm_struct {
* can move process memory needs to flush the TLB when moving a
* PROT_NONE or PROT_NUMA mapped page.
*/
- bool tlb_flush_pending;
+ atomic_t tlb_flush_pending;
#endif
struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
@@ -528,33 +528,46 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
barrier();
- return mm->tlb_flush_pending;
+ return atomic_read(&mm->tlb_flush_pending) > 0;
}
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
- mm->tlb_flush_pending = true;
+ atomic_set(&mm->tlb_flush_pending, 0);
+}
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+ atomic_inc(&mm->tlb_flush_pending);

/*
- * Guarantee that the tlb_flush_pending store does not leak into the
+ * Guarantee that the tlb_flush_pending increase does not leak into the
* critical section updating the page tables
*/
smp_mb__before_spinlock();
}
+
/* Clearing is done after a TLB flush, which also provides a barrier. */
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
barrier();
- mm->tlb_flush_pending = false;
+ atomic_dec(&mm->tlb_flush_pending);
}
#else
static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
return false;
}
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
}
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
}
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..840e7a7132e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -809,7 +809,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
- clear_tlb_flush_pending(mm);
+ init_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
#endif
diff --git a/mm/debug.c b/mm/debug.c
index db1cd26d8752..d70103bb4731 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm)
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
- mm->tlb_flush_pending,
+ atomic_read(&mm->tlb_flush_pending),
#endif
mm->def_flags, &mm->def_flags
);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..0c413774c1e3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -245,7 +245,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
- set_tlb_flush_pending(mm);
+ inc_tlb_flush_pending(mm);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
@@ -257,7 +257,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
/* Only flush the TLB if we actually modified any entries: */
if (pages)
flush_tlb_range(vma, start, end);
- clear_tlb_flush_pending(mm);
+ dec_tlb_flush_pending(mm);

return pages;
}
--
2.11.0

2017-08-02 07:34:31

by Nadav Amit

Subject: [PATCH v6 4/7] mm: refactoring TLB gathering API

From: Minchan Kim <[email protected]>

This patch is a preparatory patch for solving race problems caused by
TLB batching. For that, we will increase/decrease the TLB flush pending
count of mm_struct whenever tlb_[gather|finish]_mmu is called.

As a first step, this patch separates the architecture-specific part,
renames it to arch_tlb_[gather|finish]_mmu, and has the generic part
simply call it.

It should not change any behavior.

Cc: Ingo Molnar <[email protected]>
Cc: Russell King <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: [email protected]
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
arch/arm/include/asm/tlb.h | 6 ++++--
arch/ia64/include/asm/tlb.h | 6 ++++--
arch/s390/include/asm/tlb.h | 12 ++++++------
arch/sh/include/asm/tlb.h | 6 ++++--
arch/um/include/asm/tlb.h | 8 +++++---
include/asm-generic/tlb.h | 7 ++++---
include/linux/mm_types.h | 6 ++++++
mm/memory.c | 28 +++++++++++++++++++++-------
8 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 3f2eb76243e3..7f5b2a2d3861 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -148,7 +148,8 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)
}

static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
+arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;
tlb->fullmm = !(start | (end+1));
@@ -166,7 +167,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
}

static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
tlb_flush_mmu(tlb);

diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index fced197b9626..93cadc04ac62 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -168,7 +168,8 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)


static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
+arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;
tlb->max = ARRAY_SIZE(tlb->local);
@@ -185,7 +186,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
* collected.
*/
static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
/*
* Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 853b2a3d8dee..0e59ef57e234 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -47,10 +47,9 @@ struct mmu_table_batch {
extern void tlb_table_flush(struct mmu_gather *tlb);
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);

-static inline void tlb_gather_mmu(struct mmu_gather *tlb,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
+static inline void
+arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;
tlb->start = start;
@@ -76,8 +75,9 @@ static inline void tlb_flush_mmu(struct mmu_gather *tlb)
tlb_flush_mmu_free(tlb);
}

-static inline void tlb_finish_mmu(struct mmu_gather *tlb,
- unsigned long start, unsigned long end)
+static inline void
+arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
tlb_flush_mmu(tlb);
}
diff --git a/arch/sh/include/asm/tlb.h b/arch/sh/include/asm/tlb.h
index 46e0d635e36f..89786560dbd4 100644
--- a/arch/sh/include/asm/tlb.h
+++ b/arch/sh/include/asm/tlb.h
@@ -36,7 +36,8 @@ static inline void init_tlb_gather(struct mmu_gather *tlb)
}

static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
+arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;
tlb->start = start;
@@ -47,7 +48,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
}

static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
if (tlb->fullmm)
flush_tlb_mm(tlb->mm);
diff --git a/arch/um/include/asm/tlb.h b/arch/um/include/asm/tlb.h
index 600a2e9bfee2..2a901eca7145 100644
--- a/arch/um/include/asm/tlb.h
+++ b/arch/um/include/asm/tlb.h
@@ -45,7 +45,8 @@ static inline void init_tlb_gather(struct mmu_gather *tlb)
}

static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
+arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;
tlb->start = start;
@@ -80,12 +81,13 @@ tlb_flush_mmu(struct mmu_gather *tlb)
tlb_flush_mmu_free(tlb);
}

-/* tlb_finish_mmu
+/* arch_tlb_finish_mmu
* Called at the end of the shootdown operation to free up any resources
* that were required.
*/
static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
tlb_flush_mmu(tlb);

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 8afa4335e5b2..8f71521e7a44 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -112,10 +112,11 @@ struct mmu_gather {

#define HAVE_GENERIC_MMU_GATHER

-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end);
+void arch_tlb_gather_mmu(struct mmu_gather *tlb,
+ struct mm_struct *mm, unsigned long start, unsigned long end);
void tlb_flush_mmu(struct mmu_gather *tlb);
-void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
- unsigned long end);
+void arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end);
extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
int page_size);

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2956513619a7..248f4ed1f3e1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,12 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return mm->cpu_vm_mask_var;
}

+struct mmu_gather;
+extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+extern void tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end);
+
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* Memory barriers to keep this state in sync are graciously provided by
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..7848b5030be0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -215,12 +215,8 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
return true;
}

-/* tlb_gather_mmu
- * Called to initialize an (on-stack) mmu_gather structure for page-table
- * tear-down from @mm. The @fullmm argument is used when @mm is without
- * users and we're going to destroy the full address space (exit/execve).
- */
-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
+void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
tlb->mm = mm;

@@ -275,7 +271,8 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
* Called at the end of the shootdown operation to free up any resources
* that were required.
*/
-void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+void arch_tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
{
struct mmu_gather_batch *batch, *next;

@@ -398,6 +395,23 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)

#endif /* CONFIG_HAVE_RCU_TABLE_FREE */

+/* tlb_gather_mmu
+ * Called to initialize an (on-stack) mmu_gather structure for page-table
+ * tear-down from @mm. The @fullmm argument is used when @mm is without
+ * users and we're going to destroy the full address space (exit/execve).
+ */
+void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ arch_tlb_gather_mmu(tlb, mm, start, end);
+}
+
+void tlb_finish_mmu(struct mmu_gather *tlb,
+ unsigned long start, unsigned long end)
+{
+ arch_tlb_finish_mmu(tlb, start, end);
+}
+
/*
* Note: this doesn't free the actual pages themselves. That
* has been handled earlier when unmapping all the memory regions.
--
2.11.0

2017-08-02 07:34:58

by Nadav Amit

Subject: [PATCH v6 7/7] mm: fix KSM data corruption

From: Minchan Kim <[email protected]>

Nadav reported that KSM can corrupt user data via the TLB batching
race [1], which means data the user has written can be lost.

Quote from Nadav Amit
"
For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
the dirty-bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs, clearing the PTEs' soft-dirty bit.
We care about the fact that it clears the PTE write-bit, and of course,
batches TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():

if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
(pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

Passing this test means KSM will avoid the TLB flush, and we want that to
happen while the PTE is stale. Later, and before the page is replaced, we
would still be able to change the page.

Note that all the operations CPU1-3 perform can happen in parallel since
they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0                CPU1                CPU2                CPU3
----                ----                ----                ----
Write the same
value on page

[cache PTE as
 dirty in TLB]
                    MADV_FREE
                    pte_mkclean()
                                        4 > clear_refs
                                        pte_wrprotect()
                                                            write_protect_page()
                                                            [ success, no flush ]

                                                            pages_identical()
                                                            [ ok ]
Write to page
different value

[Ok, using stale
 PTE]
                                                            replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.
"

In the above scenario, MADV_FREE is fixed by changing the TLB batching
API, including [inc|dec]_tlb_flush_pending. What remains is the
soft-dirty part.

This patch makes the soft-dirty code use the TLB batching API instead of
flush_tlb_mm, and makes KSM check for a pending TLB flush by using
mm_tlb_flush_pending, so that the TLB is flushed and data loss is avoided
when other parallel threads have TLB flushes pending.

[1] http://lkml.kernel.org/r/[email protected]

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Reported-by: Nadav Amit <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Tested-by: Nadav Amit <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++--
mm/ksm.c | 3 ++-
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 520802da059c..aa20da220973 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -16,9 +16,10 @@
#include <linux/mmu_notifier.h>
#include <linux/page_idle.h>
#include <linux/shmem_fs.h>
+#include <linux/uaccess.h>

#include <asm/elf.h>
-#include <linux/uaccess.h>
+#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include "internal.h"

@@ -1009,6 +1010,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
struct mm_struct *mm;
struct vm_area_struct *vma;
enum clear_refs_types type;
+ struct mmu_gather tlb;
int itype;
int rv;

@@ -1055,6 +1057,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}

down_read(&mm->mmap_sem);
+ tlb_gather_mmu(&tlb, mm, 0, -1);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (!(vma->vm_flags & VM_SOFTDIRTY))
@@ -1076,7 +1079,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_end(mm, 0, -1);
- flush_tlb_mm(mm);
+ tlb_finish_mmu(&tlb, 0, -1);
up_read(&mm->mmap_sem);
out_mm:
mmput(mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index 216184af0e19..e5bf02e39752 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -883,7 +883,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
goto out_unlock;

if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
- (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
+ (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
+ mm_tlb_flush_pending(mm)) {
pte_t entry;

swapped = PageSwapCache(page);
--
2.11.0

2017-08-02 07:35:54

by Nadav Amit

Subject: [PATCH v6 5/7] mm: make tlb_flush_pending global

From: Minchan Kim <[email protected]>

Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION], but upcoming patches to solve a subtle TLB flush batching
problem will use it regardless of compaction/NUMA, so this patch removes
the dependency.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 21 ---------------------
mm/debug.c | 2 --
2 files changed, 23 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 248f4ed1f3e1..fc44315df47a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -487,14 +487,12 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* An operation with batched TLB flushing is going on. Anything that
* can move process memory needs to flush the TLB when moving a
* PROT_NONE or PROT_NUMA mapped page.
*/
atomic_t tlb_flush_pending;
-#endif
struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
@@ -524,7 +522,6 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);

-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* Memory barriers to keep this state in sync are graciously provided by
* the page table locks, outside of which no page table modifications happen.
@@ -565,24 +562,6 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
}
-#else
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
-{
- return false;
-}
-
-static inline void init_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void inc_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void dec_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-#endif

struct vm_fault;

diff --git a/mm/debug.c b/mm/debug.c
index d70103bb4731..18a9b15b1e37 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -158,9 +158,7 @@ void dump_mm(const struct mm_struct *mm)
#ifdef CONFIG_NUMA_BALANCING
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
atomic_read(&mm->tlb_flush_pending),
-#endif
mm->def_flags, &mm->def_flags
);
}
--
2.11.0

2017-08-02 14:29:45

by kernel test robot

Subject: Re: [PATCH v6 5/7] mm: make tlb_flush_pending global

Hi Minchan,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc3]
[cannot apply to next-20170802]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
config: sh-allyesconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sh

All warnings (new ones prefixed by >>):

In file included from include/linux/printk.h:6:0,
from include/linux/kernel.h:13,
from mm/debug.c:8:
mm/debug.c: In function 'dump_mm':
>> include/linux/kern_levels.h:4:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 40 has type 'int' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/kern_levels.h:7:20: note: in expansion of macro 'KERN_SOH'
#define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~~~~~
>> include/linux/printk.h:295:9: note: in expansion of macro 'KERN_EMERG'
printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
>> mm/debug.c:102:2: note: in expansion of macro 'pr_emerg'
pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
^~~~~~~~
>> include/linux/kern_levels.h:4:18: warning: format '%p' expects argument of type 'void *', but argument 41 has type 'long unsigned int' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/kern_levels.h:7:20: note: in expansion of macro 'KERN_SOH'
#define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~~~~~
>> include/linux/printk.h:295:9: note: in expansion of macro 'KERN_EMERG'
printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
>> mm/debug.c:102:2: note: in expansion of macro 'pr_emerg'
pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
^~~~~~~~
>> include/linux/kern_levels.h:4:18: warning: too many arguments for format [-Wformat-extra-args]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/kern_levels.h:7:20: note: in expansion of macro 'KERN_SOH'
#define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~~~~~
>> include/linux/printk.h:295:9: note: in expansion of macro 'KERN_EMERG'
printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
>> mm/debug.c:102:2: note: in expansion of macro 'pr_emerg'
pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
^~~~~~~~

vim +/pr_emerg +102 mm/debug.c

82742a3a5 Sasha Levin 2014-10-09 99
31c9afa6d Sasha Levin 2014-10-09 100 void dump_mm(const struct mm_struct *mm)
31c9afa6d Sasha Levin 2014-10-09 101 {
7a82ca0d6 Andrew Morton 2014-10-09 @102 pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
31c9afa6d Sasha Levin 2014-10-09 103 #ifdef CONFIG_MMU
31c9afa6d Sasha Levin 2014-10-09 104 "get_unmapped_area %p\n"
31c9afa6d Sasha Levin 2014-10-09 105 #endif
31c9afa6d Sasha Levin 2014-10-09 106 "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
dc6c9a35b Kirill A. Shutemov 2015-02-11 107 "pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n"
31c9afa6d Sasha Levin 2014-10-09 108 "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
846383359 Konstantin Khlebnikov 2016-01-14 109 "pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
31c9afa6d Sasha Levin 2014-10-09 110 "start_code %lx end_code %lx start_data %lx end_data %lx\n"
31c9afa6d Sasha Levin 2014-10-09 111 "start_brk %lx brk %lx start_stack %lx\n"
31c9afa6d Sasha Levin 2014-10-09 112 "arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
31c9afa6d Sasha Levin 2014-10-09 113 "binfmt %p flags %lx core_state %p\n"
31c9afa6d Sasha Levin 2014-10-09 114 #ifdef CONFIG_AIO
31c9afa6d Sasha Levin 2014-10-09 115 "ioctx_table %p\n"
31c9afa6d Sasha Levin 2014-10-09 116 #endif
31c9afa6d Sasha Levin 2014-10-09 117 #ifdef CONFIG_MEMCG
31c9afa6d Sasha Levin 2014-10-09 118 "owner %p "
31c9afa6d Sasha Levin 2014-10-09 119 #endif
31c9afa6d Sasha Levin 2014-10-09 120 "exe_file %p\n"
31c9afa6d Sasha Levin 2014-10-09 121 #ifdef CONFIG_MMU_NOTIFIER
31c9afa6d Sasha Levin 2014-10-09 122 "mmu_notifier_mm %p\n"
31c9afa6d Sasha Levin 2014-10-09 123 #endif
31c9afa6d Sasha Levin 2014-10-09 124 #ifdef CONFIG_NUMA_BALANCING
31c9afa6d Sasha Levin 2014-10-09 125 "numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
31c9afa6d Sasha Levin 2014-10-09 126 #endif
31c9afa6d Sasha Levin 2014-10-09 127 #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
31c9afa6d Sasha Levin 2014-10-09 128 "tlb_flush_pending %d\n"
31c9afa6d Sasha Levin 2014-10-09 129 #endif
b8eceeb99 Vlastimil Babka 2016-03-15 130 "def_flags: %#lx(%pGv)\n",
31c9afa6d Sasha Levin 2014-10-09 131
31c9afa6d Sasha Levin 2014-10-09 132 mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
31c9afa6d Sasha Levin 2014-10-09 133 #ifdef CONFIG_MMU
31c9afa6d Sasha Levin 2014-10-09 134 mm->get_unmapped_area,
31c9afa6d Sasha Levin 2014-10-09 135 #endif
31c9afa6d Sasha Levin 2014-10-09 136 mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
31c9afa6d Sasha Levin 2014-10-09 137 mm->pgd, atomic_read(&mm->mm_users),
31c9afa6d Sasha Levin 2014-10-09 138 atomic_read(&mm->mm_count),
31c9afa6d Sasha Levin 2014-10-09 139 atomic_long_read((atomic_long_t *)&mm->nr_ptes),
dc6c9a35b Kirill A. Shutemov 2015-02-11 140 mm_nr_pmds((struct mm_struct *)mm),
31c9afa6d Sasha Levin 2014-10-09 141 mm->map_count,
31c9afa6d Sasha Levin 2014-10-09 142 mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
846383359 Konstantin Khlebnikov 2016-01-14 143 mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
31c9afa6d Sasha Levin 2014-10-09 144 mm->start_code, mm->end_code, mm->start_data, mm->end_data,
31c9afa6d Sasha Levin 2014-10-09 145 mm->start_brk, mm->brk, mm->start_stack,
31c9afa6d Sasha Levin 2014-10-09 146 mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
31c9afa6d Sasha Levin 2014-10-09 147 mm->binfmt, mm->flags, mm->core_state,
31c9afa6d Sasha Levin 2014-10-09 148 #ifdef CONFIG_AIO
31c9afa6d Sasha Levin 2014-10-09 149 mm->ioctx_table,
31c9afa6d Sasha Levin 2014-10-09 150 #endif
31c9afa6d Sasha Levin 2014-10-09 151 #ifdef CONFIG_MEMCG
31c9afa6d Sasha Levin 2014-10-09 152 mm->owner,
31c9afa6d Sasha Levin 2014-10-09 153 #endif
31c9afa6d Sasha Levin 2014-10-09 154 mm->exe_file,
31c9afa6d Sasha Levin 2014-10-09 155 #ifdef CONFIG_MMU_NOTIFIER
31c9afa6d Sasha Levin 2014-10-09 156 mm->mmu_notifier_mm,
31c9afa6d Sasha Levin 2014-10-09 157 #endif
31c9afa6d Sasha Levin 2014-10-09 158 #ifdef CONFIG_NUMA_BALANCING
31c9afa6d Sasha Levin 2014-10-09 159 mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
31c9afa6d Sasha Levin 2014-10-09 160 #endif
fd2fc6e1f Nadav Amit 2017-08-01 161 atomic_read(&mm->tlb_flush_pending),
b8eceeb99 Vlastimil Babka 2016-03-15 162 mm->def_flags, &mm->def_flags
31c9afa6d Sasha Levin 2014-10-09 163 );
31c9afa6d Sasha Levin 2014-10-09 164 }
31c9afa6d Sasha Levin 2014-10-09 165

:::::: The code at line 102 was first introduced by commit
:::::: 7a82ca0d6437261d0727ce472ae4f3a05a9ce5f7 mm/debug.c: use pr_emerg()

:::::: TO: Andrew Morton <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2017-08-02 23:23:20

by Minchan Kim

Subject: Re: [PATCH v6 5/7] mm: make tlb_flush_pending global

On Wed, Aug 02, 2017 at 10:28:47PM +0800, kbuild test robot wrote:
> Hi Minchan,
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.13-rc3]
> [cannot apply to next-20170802]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
> config: sh-allyesconfig (attached as .config)
> compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=sh
>
> All warnings (new ones prefixed by >>):
>
> In file included from include/linux/printk.h:6:0,
> from include/linux/kernel.h:13,
> from mm/debug.c:8:
> mm/debug.c: In function 'dump_mm':
> >> include/linux/kern_levels.h:4:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 40 has type 'int' [-Wformat=]

Thanks. lkp.

This patch should fix it.

From 5be44c3cbe0e4149cc8b438f2e3fcad046091a29 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 1 Aug 2017 14:04:58 +0900
Subject: [PATCH v6 5/7] mm: make tlb_flush_pending global

Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION], but upcoming patches to solve a subtle TLB flush batching
problem will use it regardless of compaction/NUMA, so this patch removes
the dependency.

Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 21 ---------------------
mm/debug.c | 4 ----
2 files changed, 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c605f2a3a68e..892a7b0196fd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -487,14 +487,12 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* An operation with batched TLB flushing is going on. Anything that
* can move process memory needs to flush the TLB when moving a
* PROT_NONE or PROT_NUMA mapped page.
*/
atomic_t tlb_flush_pending;
-#endif
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
bool tlb_flush_batched;
@@ -528,7 +526,6 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);

-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* Memory barriers to keep this state in sync are graciously provided by
* the page table locks, outside of which no page table modifications happen.
@@ -569,24 +566,6 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
}
-#else
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
-{
- return false;
-}
-
-static inline void init_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void inc_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void dec_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-#endif

struct vm_fault;

diff --git a/mm/debug.c b/mm/debug.c
index d70103bb4731..5715448ab0b5 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -124,9 +124,7 @@ void dump_mm(const struct mm_struct *mm)
#ifdef CONFIG_NUMA_BALANCING
"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
"tlb_flush_pending %d\n"
-#endif
"def_flags: %#lx(%pGv)\n",

mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
@@ -158,9 +156,7 @@ void dump_mm(const struct mm_struct *mm)
#ifdef CONFIG_NUMA_BALANCING
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
atomic_read(&mm->tlb_flush_pending),
-#endif
mm->def_flags, &mm->def_flags
);
}
--
2.7.4


2017-08-02 23:26:27

by Minchan Kim

Subject: Re: [PATCH v6 0/7] fixes of TLB batching races

On Tue, Aug 01, 2017 at 05:08:11PM -0700, Nadav Amit wrote:
> It turns out that Linux TLB batching mechanism suffers from various races.
> Races that are caused due to batching during reclamation were recently
> handled by Mel and this patch-set deals with others. The more fundamental
> issue is that concurrent updates of the page-tables allow for TLB flushes
> to be batched on one core, while another core changes the page-tables.
> This other core may assume a PTE change does not require a flush based on
> the updated PTE value, while it is unaware that TLB flushes are still
> pending.
>
> This behavior affects KSM (which may result in memory corruption) and
> MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
> proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
> Memory corruption in KSM is harder to produce in practice, but was observed
> by hacking the kernel and adding a delay before flushing and replacing the
> KSM page.
>
> Finally, there is also one memory barrier missing, which may affect
> architectures with weak memory model.
>
> v5 -> v6:
> * Combining with Minchan Kim's patch set, adding ack's (Andrew)
> * Minor: missing header, typos (Nadav)
> * Renaming arch_generic_tlb_finish_mmu (Mel)

Thanks for integrating/correcting, Nadav.

2017-08-02 23:28:01

by Andrew Morton

Subject: Re: [PATCH v6 5/7] mm: make tlb_flush_pending global

On Wed, 2 Aug 2017 22:28:47 +0800 kbuild test robot <[email protected]> wrote:

> Hi Minchan,
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.13-rc3]
> [cannot apply to next-20170802]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
> config: sh-allyesconfig (attached as .config)
> compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=sh
>
> All warnings (new ones prefixed by >>):
>
> In file included from include/linux/printk.h:6:0,
> from include/linux/kernel.h:13,
> from mm/debug.c:8:
> mm/debug.c: In function 'dump_mm':
> >> include/linux/kern_levels.h:4:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 40 has type 'int' [-Wformat=]
>
> ...
>

This?

From: Andrew Morton <[email protected]>
Subject: mm-make-tlb_flush_pending-global-fix

remove more ifdefs from world's ugliest printk statement

Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/debug.c | 2 --
1 file changed, 2 deletions(-)

diff -puN include/linux/mm_types.h~mm-make-tlb_flush_pending-global-fix include/linux/mm_types.h
diff -puN mm/debug.c~mm-make-tlb_flush_pending-global-fix mm/debug.c
--- a/mm/debug.c~mm-make-tlb_flush_pending-global-fix
+++ a/mm/debug.c
@@ -124,9 +124,7 @@ void dump_mm(const struct mm_struct *mm)
#ifdef CONFIG_NUMA_BALANCING
"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
#endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
"tlb_flush_pending %d\n"
-#endif
"def_flags: %#lx(%pGv)\n",

mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
_

2017-08-02 23:34:41

by Minchan Kim

Subject: Re: [PATCH v6 5/7] mm: make tlb_flush_pending global

On Wed, Aug 02, 2017 at 04:27:58PM -0700, Andrew Morton wrote:
> On Wed, 2 Aug 2017 22:28:47 +0800 kbuild test robot <[email protected]> wrote:
>
> > Hi Minchan,
> >
> > [auto build test WARNING on linus/master]
> > [also build test WARNING on v4.13-rc3]
> > [cannot apply to next-20170802]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> >
> > url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
> > config: sh-allyesconfig (attached as .config)
> > compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> > reproduce:
> > wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > make.cross ARCH=sh
> >
> > All warnings (new ones prefixed by >>):
> >
> > In file included from include/linux/printk.h:6:0,
> > from include/linux/kernel.h:13,
> > from mm/debug.c:8:
> > mm/debug.c: In function 'dump_mm':
> > >> include/linux/kern_levels.h:4:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 40 has type 'int' [-Wformat=]
> >
> > ...
> >
>
> This?
>
> From: Andrew Morton <[email protected]>
> Subject: mm-make-tlb_flush_pending-global-fix
>
> remove more ifdefs from world's ugliest printk statement
>
> Cc: Mel Gorman <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

I'm a bit late.
Thanks for the fix, Andrew!

2017-08-03 16:41:24

by kernel test robot

Subject: Re: [PATCH v6 5/7] mm: make tlb_flush_pending global

Hi Minchan,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc3]
[cannot apply to next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
config: x86_64-randconfig-a0-08032207 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

mm/debug.c: In function 'dump_mm':
>> mm/debug.c:102: warning: format '%#lx' expects type 'long unsigned int', but argument 40 has type 'int'
>> mm/debug.c:102: warning: format '%p' expects type 'void *', but argument 41 has type 'long unsigned int'
mm/debug.c:102: warning: too many arguments for format

vim +102 mm/debug.c

82742a3a5 Sasha Levin 2014-10-09 99
31c9afa6d Sasha Levin 2014-10-09 100 void dump_mm(const struct mm_struct *mm)
31c9afa6d Sasha Levin 2014-10-09 101 {
7a82ca0d6 Andrew Morton 2014-10-09 @102 pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n"
31c9afa6d Sasha Levin 2014-10-09 103 #ifdef CONFIG_MMU
31c9afa6d Sasha Levin 2014-10-09 104 "get_unmapped_area %p\n"
31c9afa6d Sasha Levin 2014-10-09 105 #endif
31c9afa6d Sasha Levin 2014-10-09 106 "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
dc6c9a35b Kirill A. Shutemov 2015-02-11 107 "pgd %p mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n"
31c9afa6d Sasha Levin 2014-10-09 108 "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
846383359 Konstantin Khlebnikov 2016-01-14 109 "pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
31c9afa6d Sasha Levin 2014-10-09 110 "start_code %lx end_code %lx start_data %lx end_data %lx\n"
31c9afa6d Sasha Levin 2014-10-09 111 "start_brk %lx brk %lx start_stack %lx\n"
31c9afa6d Sasha Levin 2014-10-09 112 "arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
31c9afa6d Sasha Levin 2014-10-09 113 "binfmt %p flags %lx core_state %p\n"
31c9afa6d Sasha Levin 2014-10-09 114 #ifdef CONFIG_AIO
31c9afa6d Sasha Levin 2014-10-09 115 "ioctx_table %p\n"
31c9afa6d Sasha Levin 2014-10-09 116 #endif
31c9afa6d Sasha Levin 2014-10-09 117 #ifdef CONFIG_MEMCG
31c9afa6d Sasha Levin 2014-10-09 118 "owner %p "
31c9afa6d Sasha Levin 2014-10-09 119 #endif
31c9afa6d Sasha Levin 2014-10-09 120 "exe_file %p\n"
31c9afa6d Sasha Levin 2014-10-09 121 #ifdef CONFIG_MMU_NOTIFIER
31c9afa6d Sasha Levin 2014-10-09 122 "mmu_notifier_mm %p\n"
31c9afa6d Sasha Levin 2014-10-09 123 #endif
31c9afa6d Sasha Levin 2014-10-09 124 #ifdef CONFIG_NUMA_BALANCING
31c9afa6d Sasha Levin 2014-10-09 125 "numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
31c9afa6d Sasha Levin 2014-10-09 126 #endif
31c9afa6d Sasha Levin 2014-10-09 127 #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
31c9afa6d Sasha Levin 2014-10-09 128 "tlb_flush_pending %d\n"
31c9afa6d Sasha Levin 2014-10-09 129 #endif
b8eceeb99 Vlastimil Babka 2016-03-15 130 "def_flags: %#lx(%pGv)\n",
31c9afa6d Sasha Levin 2014-10-09 131
31c9afa6d Sasha Levin 2014-10-09 132 mm, mm->mmap, mm->vmacache_seqnum, mm->task_size,
31c9afa6d Sasha Levin 2014-10-09 133 #ifdef CONFIG_MMU
31c9afa6d Sasha Levin 2014-10-09 134 mm->get_unmapped_area,
31c9afa6d Sasha Levin 2014-10-09 135 #endif
31c9afa6d Sasha Levin 2014-10-09 136 mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
31c9afa6d Sasha Levin 2014-10-09 137 mm->pgd, atomic_read(&mm->mm_users),
31c9afa6d Sasha Levin 2014-10-09 138 atomic_read(&mm->mm_count),
31c9afa6d Sasha Levin 2014-10-09 139 atomic_long_read((atomic_long_t *)&mm->nr_ptes),
dc6c9a35b Kirill A. Shutemov 2015-02-11 140 mm_nr_pmds((struct mm_struct *)mm),
31c9afa6d Sasha Levin 2014-10-09 141 mm->map_count,
31c9afa6d Sasha Levin 2014-10-09 142 mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
846383359 Konstantin Khlebnikov 2016-01-14 143 mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
31c9afa6d Sasha Levin 2014-10-09 144 mm->start_code, mm->end_code, mm->start_data, mm->end_data,
31c9afa6d Sasha Levin 2014-10-09 145 mm->start_brk, mm->brk, mm->start_stack,
31c9afa6d Sasha Levin 2014-10-09 146 mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
31c9afa6d Sasha Levin 2014-10-09 147 mm->binfmt, mm->flags, mm->core_state,
31c9afa6d Sasha Levin 2014-10-09 148 #ifdef CONFIG_AIO
31c9afa6d Sasha Levin 2014-10-09 149 mm->ioctx_table,
31c9afa6d Sasha Levin 2014-10-09 150 #endif
31c9afa6d Sasha Levin 2014-10-09 151 #ifdef CONFIG_MEMCG
31c9afa6d Sasha Levin 2014-10-09 152 mm->owner,
31c9afa6d Sasha Levin 2014-10-09 153 #endif
31c9afa6d Sasha Levin 2014-10-09 154 mm->exe_file,
31c9afa6d Sasha Levin 2014-10-09 155 #ifdef CONFIG_MMU_NOTIFIER
31c9afa6d Sasha Levin 2014-10-09 156 mm->mmu_notifier_mm,
31c9afa6d Sasha Levin 2014-10-09 157 #endif
31c9afa6d Sasha Levin 2014-10-09 158 #ifdef CONFIG_NUMA_BALANCING
31c9afa6d Sasha Levin 2014-10-09 159 mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
31c9afa6d Sasha Levin 2014-10-09 160 #endif
fd2fc6e1f Nadav Amit 2017-08-01 161 atomic_read(&mm->tlb_flush_pending),
b8eceeb99 Vlastimil Babka 2016-03-15 162 mm->def_flags, &mm->def_flags
31c9afa6d Sasha Levin 2014-10-09 163 );
31c9afa6d Sasha Levin 2014-10-09 164 }
31c9afa6d Sasha Levin 2014-10-09 165

:::::: The code at line 102 was first introduced by commit
:::::: 7a82ca0d6437261d0727ce472ae4f3a05a9ce5f7 mm/debug.c: use pr_emerg()

:::::: TO: Andrew Morton <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2017-08-08 01:20:38

by kernel test robot

[permalink] [raw]
Subject: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression


Greeting,

FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:


commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715


in testcase: will-it-scale
on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
with following parameters:

nr_task: 16
mode: process
test: brk1
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

testcase/path_params/tbox_group/run: will-it-scale/16-process-brk1-performance/lkp-bdw-ep3d

378005bdbac0a2ec 76742700225cad9df49f053993
---------------- --------------------------
%stddev change %stddev
\ | \
3405093 -19% 2747088 will-it-scale.per_process_ops
2.396e+08 8283% 2.008e+10 perf-stat.iTLB-loads
0.03 26% 0.04 ? 10% perf-stat.branch-miss-rate%
0.00 25% 0.00 ? 11% perf-stat.dTLB-store-miss-rate%
0.56 18% 0.66 perf-stat.cpi
2.859e+13 -13% 2.479e+13 perf-stat.instructions
5.906e+12 -14% 5.087e+12 perf-stat.branch-instructions
4.831e+12 -14% 4.154e+12 perf-stat.dTLB-stores
1.575e+10 ? 6% -15% 1.345e+10 perf-stat.iTLB-load-misses
1.77 -15% 1.51 perf-stat.ipc
8.476e+12 ? 5% -16% 7.079e+12 perf-stat.dTLB-loads
98.49 -59% 40.10 perf-stat.iTLB-load-miss-rate%



will-it-scale.per_process_ops

3.5e+06 ++----------------------------------------------------------------+
*..*.*..*.*.. .*.*..*.*..*. .* *..*.*..*.*..*.*..*. .*. .*
3.4e+06 ++ *.*. *. : : *. *. |
3.3e+06 ++ : : |
| : : |
3.2e+06 ++ :: |
3.1e+06 ++ * |
| |
3e+06 ++ |
2.9e+06 ++ |
| |
2.8e+06 ++ |
2.7e+06 O+ O O O O O O O O O O O O O O O O O O O O O O O O O
| O |
2.6e+06 ++----------------------------------------------------------------+


perf-stat.branch-instructions

6.8e+12 ++----------------------------------------------------------------+
6.6e+12 ++ * |
| :: |
6.4e+12 ++ : : |
6.2e+12 ++ : : |
| .*.*..*.*..* * |
6e+12 *+.*.*..*.*..*.*. + *..*.*..*.*..*.*..*.*..*.*..*
5.8e+12 ++ + + |
5.6e+12 ++ * |
| |
5.4e+12 ++ |
5.2e+12 O+ O O O O O O O O O O O O O |
| O O O O O O O O O O O
5e+12 ++ O O |
4.8e+12 ++----------------------------------------------------------------+


perf-stat.iTLB-loads

3.5e+10 ++----------------------------------------------------------------+
O O O O O |
3e+10 ++ |
| O O |
2.5e+10 ++ O |
| |
2e+10 ++ O O O O O O O O O O O O O O O O O O O
| |
1.5e+10 ++ |
| |
1e+10 ++ |
| |
5e+09 ++ |
| |
0 *+-*-*--*-*--*-*--*-*--*-*--*-*--*--*-*--*-*--*-*--*-*--*-*--*-*--*


perf-stat.iTLB-load-miss-rate_

100 *+-*-*--*--*-*-----*--*-*--*--*-*--*--*-*--*--*-*--*--*--*-*--*--*-*--*
| *. |
90 ++ |
80 ++ |
| |
70 ++ |
| |
60 ++ |
| |
50 ++ |
40 ++ O O O O O O
| O O O O O O O O O O O O O |
30 ++ O O O |
O O O O O |
20 ++--------------------------------------------------------------------+

[*] bisect-good sample
[O] bisect-bad sample


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Xiaolong



2017-08-08 02:28:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

Hi,

On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>
>
> in testcase: will-it-scale
> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> with following parameters:
>
> nr_task: 16
> mode: process
> test: brk1
> cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale

Thanks for the report.
Could you explain what kind of workload you are testing?

Does it frequently call madvise(MADV_DONTNEED) in parallel from multiple
threads?

2017-08-08 04:23:40

by Nadav Amit

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

Minchan Kim <[email protected]> wrote:

> Hi,
>
> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>>
>>
>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>>
>>
>> in testcase: will-it-scale
>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> with following parameters:
>>
>> nr_task: 16
>> mode: process
>> test: brk1
>> cpufreq_governor: performance
>>
>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> test-url: https://github.com/antonblanchard/will-it-scale
>
> Thanks for the report.
> Could you explain what kinds of workload you are testing?
>
> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
> threads?

According to the description it is "testcase:brk increase/decrease of one
page”. According to the mode it spawns multiple processes, not threads.
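
If I read the will-it-scale brk1 testcase right, each process sits in a tight
loop along these lines (a rough sketch of the benchmark assuming 4KB pages; not
the exact source):

        #define _DEFAULT_SOURCE
        #include <unistd.h>

        int main(void)
        {
                char *base = sbrk(0);           /* current program break */

                for (;;) {
                        brk(base + 4096);       /* grow the heap by one page */
                        brk(base);              /* shrink back: one-page unmap */
                }
        }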

Since a single page is unmapped each time, and the iTLB-loads increase
dramatically, I would suspect that for some reason a full TLB flush is
caused during do_munmap().

If I find some free time, I’ll try to profile the workload - but feel free
to beat me to it.

Nadav


2017-08-08 05:51:09

by Nadav Amit

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

Nadav Amit <[email protected]> wrote:

> Minchan Kim <[email protected]> wrote:
>
>> Hi,
>>
>> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>>> Greeting,
>>>
>>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>>>
>>>
>>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>>>
>>>
>>> in testcase: will-it-scale
>>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>>> with following parameters:
>>>
>>> nr_task: 16
>>> mode: process
>>> test: brk1
>>> cpufreq_governor: performance
>>>
>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>> test-url: https://github.com/antonblanchard/will-it-scale
>>
>> Thanks for the report.
>> Could you explain what kinds of workload you are testing?
>>
>> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
>> threads?
>
> According to the description it is "testcase:brk increase/decrease of one
> page”. According to the mode it spawns multiple processes, not threads.
>
> Since a single page is unmapped each time, and the iTLB-loads increase
> dramatically, I would suspect that for some reason a full TLB flush is
> caused during do_munmap().
>
> If I find some free time, I’ll try to profile the workload - but feel free
> to beat me to it.

The root-cause appears to be that tlb_finish_mmu() does not call
dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
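
To spell out why that shows up as a 19% regression and an iTLB-load explosion:
tlb_gather_mmu() bumps mm->tlb_flush_pending but nothing ever brings it back
down, so the count only grows and, if I read the series right,
mm_tlb_flush_nested() keeps reporting concurrent batchers, forcing a full flush
on every subsequent tlb_finish_mmu(). The intended pairing is roughly this
(zap_one_range() is a made-up caller, shown only to illustrate the balance; it
is not the actual patch):

        static void zap_one_range(struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
        {
                struct mmu_gather tlb;

                tlb_gather_mmu(&tlb, mm, start, end);   /* inc_tlb_flush_pending(mm) */
                /* ... clear PTEs, queue pages for freeing ... */
                tlb_finish_mmu(&tlb, start, end);       /* must dec_tlb_flush_pending(mm) */
        }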

Having said that it appears that cpumask_any_but() is really inefficient
since it does not have an optimization for the case in which
small_const_nbits(nbits)==true. When I find some free time, I’ll try to deal
with it.
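
The kind of thing I have in mind is a single-word fast path, something along
these lines (plain C illustration of the idea; word_any_but() is a made-up name
and this is not the kernel's cpumask code):

        /*
         * "Any set bit except 'cpu'" when the whole mask fits in one long:
         * a couple of word operations instead of a generic bitmap walk.
         * Returns nbits when no other bit is set.
         */
        static inline unsigned int word_any_but(unsigned long word,
                                                unsigned int nbits,
                                                unsigned int cpu)
        {
                word &= ~(1UL << cpu);                  /* drop the excluded CPU */
                if (nbits < 8 * sizeof(unsigned long))
                        word &= (1UL << nbits) - 1;     /* ignore bits past nbits */
                return word ? __builtin_ctzl(word) : nbits;
        }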

Thanks,
Nadav

2017-08-08 08:08:26

by Minchan Kim

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

On Mon, Aug 07, 2017 at 10:51:00PM -0700, Nadav Amit wrote:
> Nadav Amit <[email protected]> wrote:
>
> > Minchan Kim <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
> >>> Greeting,
> >>>
> >>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
> >>>
> >>>
> >>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> >>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
> >>>
> >>>
> >>> in testcase: will-it-scale
> >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> >>> with following parameters:
> >>>
> >>> nr_task: 16
> >>> mode: process
> >>> test: brk1
> >>> cpufreq_governor: performance
> >>>
> >>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> >>> test-url: https://github.com/antonblanchard/will-it-scale
> >>
> >> Thanks for the report.
> >> Could you explain what kinds of workload you are testing?
> >>
> >> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
> >> threads?
> >
> > According to the description it is "testcase:brk increase/decrease of one
> > page”. According to the mode it spawns multiple processes, not threads.
> >
> > Since a single page is unmapped each time, and the iTLB-loads increase
> > dramatically, I would suspect that for some reason a full TLB flush is
> > caused during do_munmap().
> >
> > If I find some free time, I’ll try to profile the workload - but feel free
> > to beat me to it.
>
> The root-cause appears to be that tlb_finish_mmu() does not call
> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?

Oops, but on second look, it seems it's not my fault. ;-)
https://marc.info/?l=linux-mm&m=150156699114088&w=2

Anyway, thanks for pointing it out.
xiaolong.ye, could you retest with this fix?

From 83012114c9cd9304f0d55d899bb4b9329d0e22ac Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 8 Aug 2017 17:05:19 +0900
Subject: [PATCH] mm: decrease tlb flush pending count in tlb_finish_mmu

The tlb pending count increased by tlb_gather_mmu should be decreased
at tlb_finish_mmu. Otherwise, a lot of TLB flushes happen, which causes
a performance regression.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 34b1fcb829e4..ad2617552f55 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -423,6 +423,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
bool force = mm_tlb_flush_nested(tlb->mm);

arch_tlb_finish_mmu(tlb, start, end, force);
+ dec_tlb_flush_pending(tlb->mm);
}

/*
--
2.7.4

2017-08-09 01:26:26

by kernel test robot

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

On 08/08, Minchan Kim wrote:
>On Mon, Aug 07, 2017 at 10:51:00PM -0700, Nadav Amit wrote:
>> Nadav Amit <[email protected]> wrote:
>>
>> > Minchan Kim <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>> >>> Greeting,
>> >>>
>> >>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>> >>>
>> >>>
>> >>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>> >>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>> >>>
>> >>>
>> >>> in testcase: will-it-scale
>> >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> >>> with following parameters:
>> >>>
>> >>> nr_task: 16
>> >>> mode: process
>> >>> test: brk1
>> >>> cpufreq_governor: performance
>> >>>
>> >>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> >>> test-url: https://github.com/antonblanchard/will-it-scale
>> >>
>> >> Thanks for the report.
>> >> Could you explain what kinds of workload you are testing?
>> >>
>> >> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
>> >> threads?
>> >
>> > According to the description it is "testcase:brk increase/decrease of one
>> > page”. According to the mode it spawns multiple processes, not threads.
>> >
>> > Since a single page is unmapped each time, and the iTLB-loads increase
>> > dramatically, I would suspect that for some reason a full TLB flush is
>> > caused during do_munmap().
>> >
>> > If I find some free time, I’ll try to profile the workload - but feel free
>> > to beat me to it.
>>
>> The root-cause appears to be that tlb_finish_mmu() does not call
>> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
>
>Oops, but with second looking, it seems it's not my fault. ;-)
>https://marc.info/?l=linux-mm&m=150156699114088&w=2
>
>Anyway, thanks for the pointing out.
>xiaolong.ye, could you retest with this fix?

Sure, I'll provide the result later.

Thanks,
Xiaolong
>
>From 83012114c9cd9304f0d55d899bb4b9329d0e22ac Mon Sep 17 00:00:00 2001
>From: Minchan Kim <[email protected]>
>Date: Tue, 8 Aug 2017 17:05:19 +0900
>Subject: [PATCH] mm: decrease tlb flush pending count in tlb_finish_mmu
>
>The tlb pending count increased by tlb_gather_mmu should be decreased
>at tlb_finish_mmu. Otherwise, A lot of TLB happens which makes
>performance regression.
>
>Signed-off-by: Minchan Kim <[email protected]>
>---
> mm/memory.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/mm/memory.c b/mm/memory.c
>index 34b1fcb829e4..ad2617552f55 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -423,6 +423,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> bool force = mm_tlb_flush_nested(tlb->mm);
>
> arch_tlb_finish_mmu(tlb, start, end, force);
>+ dec_tlb_flush_pending(tlb->mm);
> }
>
> /*
>--
>2.7.4
>

2017-08-09 03:00:10

by kernel test robot

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

On 08/08, Minchan Kim wrote:
>On Mon, Aug 07, 2017 at 10:51:00PM -0700, Nadav Amit wrote:
>> Nadav Amit <[email protected]> wrote:
>>
>> > Minchan Kim <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>> >>> Greeting,
>> >>>
>> >>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>> >>>
>> >>>
>> >>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>> >>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>> >>>
>> >>>
>> >>> in testcase: will-it-scale
>> >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> >>> with following parameters:
>> >>>
>> >>> nr_task: 16
>> >>> mode: process
>> >>> test: brk1
>> >>> cpufreq_governor: performance
>> >>>
>> >>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> >>> test-url: https://github.com/antonblanchard/will-it-scale
>> >>
>> >> Thanks for the report.
>> >> Could you explain what kinds of workload you are testing?
>> >>
>> >> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
>> >> threads?
>> >
>> > According to the description it is "testcase:brk increase/decrease of one
>> > page”. According to the mode it spawns multiple processes, not threads.
>> >
>> > Since a single page is unmapped each time, and the iTLB-loads increase
>> > dramatically, I would suspect that for some reason a full TLB flush is
>> > caused during do_munmap().
>> >
>> > If I find some free time, I’ll try to profile the workload - but feel free
>> > to beat me to it.
>>
>> The root-cause appears to be that tlb_finish_mmu() does not call
>> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
>
>Oops, but with second looking, it seems it's not my fault. ;-)
>https://marc.info/?l=linux-mm&m=150156699114088&w=2
>
>Anyway, thanks for the pointing out.
>xiaolong.ye, could you retest with this fix?
>

I've queued the tests 5 times, and the results show this patch (e8f682574e4 "mm:
decrease tlb flush pending count in tlb_finish_mmu") does help recover the
performance.

378005bdbac0a2ec 76742700225cad9df49f053993 e8f682574e45b6406dadfffeb4
---------------- -------------------------- --------------------------
%stddev change %stddev change %stddev
\ | \ | \
3405093 -19% 2747088 -2% 3348752 will-it-scale.per_process_ops
1280 ± 3% -2% 1257 ± 3% -6% 1207 vmstat.system.cs
2702 ± 18% 11% 3002 ± 19% 17% 3156 ± 18% numa-vmstat.node0.nr_mapped
10765 ± 18% 11% 11964 ± 19% 17% 12588 ± 18% numa-meminfo.node0.Mapped
0.00 ± 47% -40% 0.00 ± 45% -84% 0.00 ± 42% mpstat.cpu.soft%

Thanks,
Xiaolong


>From 83012114c9cd9304f0d55d899bb4b9329d0e22ac Mon Sep 17 00:00:00 2001
>From: Minchan Kim <[email protected]>
>Date: Tue, 8 Aug 2017 17:05:19 +0900
>Subject: [PATCH] mm: decrease tlb flush pending count in tlb_finish_mmu
>
>The tlb pending count increased by tlb_gather_mmu should be decreased
>at tlb_finish_mmu. Otherwise, A lot of TLB happens which makes
>performance regression.
>
>Signed-off-by: Minchan Kim <[email protected]>
>---
> mm/memory.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/mm/memory.c b/mm/memory.c
>index 34b1fcb829e4..ad2617552f55 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -423,6 +423,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> bool force = mm_tlb_flush_nested(tlb->mm);
>
> arch_tlb_finish_mmu(tlb, start, end, force);
>+ dec_tlb_flush_pending(tlb->mm);
> }
>
> /*
>--
>2.7.4
>

2017-08-10 04:13:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

On Wed, Aug 09, 2017 at 10:59:02AM +0800, Ye Xiaolong wrote:
> On 08/08, Minchan Kim wrote:
> >On Mon, Aug 07, 2017 at 10:51:00PM -0700, Nadav Amit wrote:
> >> Nadav Amit <[email protected]> wrote:
> >>
> >> > Minchan Kim <[email protected]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
> >> >>> Greeting,
> >> >>>
> >> >>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
> >> >>>
> >> >>>
> >> >>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> >> >>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
> >> >>>
> >> >>>
> >> >>> in testcase: will-it-scale
> >> >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> >> >>> with following parameters:
> >> >>>
> >> >>> nr_task: 16
> >> >>> mode: process
> >> >>> test: brk1
> >> >>> cpufreq_governor: performance
> >> >>>
> >> >>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> >> >>> test-url: https://github.com/antonblanchard/will-it-scale
> >> >>
> >> >> Thanks for the report.
> >> >> Could you explain what kinds of workload you are testing?
> >> >>
> >> >> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
> >> >> threads?
> >> >
> >> > According to the description it is "testcase:brk increase/decrease of one
> >> > page”. According to the mode it spawns multiple processes, not threads.
> >> >
> >> > Since a single page is unmapped each time, and the iTLB-loads increase
> >> > dramatically, I would suspect that for some reason a full TLB flush is
> >> > caused during do_munmap().
> >> >
> >> > If I find some free time, I’ll try to profile the workload - but feel free
> >> > to beat me to it.
> >>
> >> The root-cause appears to be that tlb_finish_mmu() does not call
> >> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
> >
> >Oops, but with second looking, it seems it's not my fault. ;-)
> >https://marc.info/?l=linux-mm&m=150156699114088&w=2
> >
> >Anyway, thanks for the pointing out.
> >xiaolong.ye, could you retest with this fix?
> >
>
> I've queued tests for 5 times and results show this patch (e8f682574e4 "mm:
> decrease tlb flush pending count in tlb_finish_mmu") does help recover the
> performance back.
>
> 378005bdbac0a2ec 76742700225cad9df49f053993 e8f682574e45b6406dadfffeb4
> ---------------- -------------------------- --------------------------
> %stddev change %stddev change %stddev
> \ | \ | \
> 3405093 -19% 2747088 -2% 3348752 will-it-scale.per_process_ops
> 1280 ± 3% -2% 1257 ± 3% -6% 1207 vmstat.system.cs
> 2702 ± 18% 11% 3002 ± 19% 17% 3156 ± 18% numa-vmstat.node0.nr_mapped
> 10765 ± 18% 11% 11964 ± 19% 17% 12588 ± 18% numa-meminfo.node0.Mapped
> 0.00 ± 47% -40% 0.00 ± 45% -84% 0.00 ± 42% mpstat.cpu.soft%
>
> Thanks,
> Xiaolong

Thanks for the testing!

2017-08-10 04:14:57

by Nadav Amit

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

Minchan Kim <[email protected]> wrote:

> On Wed, Aug 09, 2017 at 10:59:02AM +0800, Ye Xiaolong wrote:
>> On 08/08, Minchan Kim wrote:
>>> On Mon, Aug 07, 2017 at 10:51:00PM -0700, Nadav Amit wrote:
>>>> Nadav Amit <[email protected]> wrote:
>>>>
>>>>> Minchan Kim <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>>>>>>> Greeting,
>>>>>>>
>>>>>>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>>>>>>>
>>>>>>>
>>>>>>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>>>>>>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>>>>>>>
>>>>>>>
>>>>>>> in testcase: will-it-scale
>>>>>>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>>>>>>> with following parameters:
>>>>>>>
>>>>>>> nr_task: 16
>>>>>>> mode: process
>>>>>>> test: brk1
>>>>>>> cpufreq_governor: performance
>>>>>>>
>>>>>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>>>>> test-url: https://github.com/antonblanchard/will-it-scale
>>>>>>
>>>>>> Thanks for the report.
>>>>>> Could you explain what kinds of workload you are testing?
>>>>>>
>>>>>> Does it calls frequently madvise(MADV_DONTNEED) in parallel on multiple
>>>>>> threads?
>>>>>
>>>>> According to the description it is "testcase:brk increase/decrease of one
>>>>> page”. According to the mode it spawns multiple processes, not threads.
>>>>>
>>>>> Since a single page is unmapped each time, and the iTLB-loads increase
>>>>> dramatically, I would suspect that for some reason a full TLB flush is
>>>>> caused during do_munmap().
>>>>>
>>>>> If I find some free time, I’ll try to profile the workload - but feel free
>>>>> to beat me to it.
>>>>
>>>> The root-cause appears to be that tlb_finish_mmu() does not call
>>>> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
>>>
>>> Oops, but with second looking, it seems it's not my fault. ;-)
>>> https://marc.info/?l=linux-mm&m=150156699114088&w=2
>>>
>>> Anyway, thanks for the pointing out.
>>> xiaolong.ye, could you retest with this fix?
>>
>> I've queued tests for 5 times and results show this patch (e8f682574e4 "mm:
>> decrease tlb flush pending count in tlb_finish_mmu") does help recover the
>> performance back.
>>
>> 378005bdbac0a2ec 76742700225cad9df49f053993 e8f682574e45b6406dadfffeb4
>> ---------------- -------------------------- --------------------------
>> %stddev change %stddev change %stddev
>> \ | \ | \
>> 3405093 -19% 2747088 -2% 3348752 will-it-scale.per_process_ops
>> 1280 ± 3% -2% 1257 ± 3% -6% 1207 vmstat.system.cs
>> 2702 ± 18% 11% 3002 ± 19% 17% 3156 ± 18% numa-vmstat.node0.nr_mapped
>> 10765 ± 18% 11% 11964 ± 19% 17% 12588 ± 18% numa-meminfo.node0.Mapped
>> 0.00 ± 47% -40% 0.00 ± 45% -84% 0.00 ± 42% mpstat.cpu.soft%
>>
>> Thanks,
>> Xiaolong
>
> Thanks for the testing!

Sorry again for screwing up your patch, Minchan.



2017-08-10 04:20:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

On Wed, Aug 09, 2017 at 09:14:50PM -0700, Nadav Amit wrote:

Hi Nadav,

< snip >

> >>>>> According to the description it is "testcase:brk increase/decrease of one
> >>>>> page”. According to the mode it spawns multiple processes, not threads.
> >>>>>
> >>>>> Since a single page is unmapped each time, and the iTLB-loads increase
> >>>>> dramatically, I would suspect that for some reason a full TLB flush is
> >>>>> caused during do_munmap().
> >>>>>
> >>>>> If I find some free time, I’ll try to profile the workload - but feel free
> >>>>> to beat me to it.
> >>>>
> >>>> The root-cause appears to be that tlb_finish_mmu() does not call
> >>>> dec_tlb_flush_pending() - as it should. Any chance you can take care of it?
> >>>
> >>> Oops, but with second looking, it seems it's not my fault. ;-)
> >>> https://marc.info/?l=linux-mm&m=150156699114088&w=2
> >>>
> >>> Anyway, thanks for the pointing out.
> >>> xiaolong.ye, could you retest with this fix?
> >>
> >> I've queued tests for 5 times and results show this patch (e8f682574e4 "mm:
> >> decrease tlb flush pending count in tlb_finish_mmu") does help recover the
> >> performance back.
> >>
> >> 378005bdbac0a2ec 76742700225cad9df49f053993 e8f682574e45b6406dadfffeb4
> >> ---------------- -------------------------- --------------------------
> >> %stddev change %stddev change %stddev
> >> \ | \ | \
> >> 3405093 -19% 2747088 -2% 3348752 will-it-scale.per_process_ops
> >> 1280 ± 3% -2% 1257 ± 3% -6% 1207 vmstat.system.cs
> >> 2702 ± 18% 11% 3002 ± 19% 17% 3156 ± 18% numa-vmstat.node0.nr_mapped
> >> 10765 ± 18% 11% 11964 ± 19% 17% 12588 ± 18% numa-meminfo.node0.Mapped
> >> 0.00 ± 47% -40% 0.00 ± 45% -84% 0.00 ± 42% mpstat.cpu.soft%
> >>
> >> Thanks,
> >> Xiaolong
> >
> > Thanks for the testing!
>
> Sorry again for screwing your patch, Minchan.

Never mind! It always happens. :)
I really appreciate your insight/testing/cooperation on this!

2017-08-11 09:24:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 4/7] mm: refactoring TLB gathering API

On Tue, Aug 01, 2017 at 05:08:15PM -0700, Nadav Amit wrote:
> From: Minchan Kim <[email protected]>
>
> This patch is a preparatory patch for solving race problems caused by
> TLB batch. For that, we will increase/decrease TLB flush pending count
> of mm_struct whenever tlb_[gather|finish]_mmu is called.
>
> Before making it simple, this patch separates architecture specific
> part and rename it to arch_tlb_[gather|finish]_mmu and generic part
> just calls it.

I absolutely hate this. We should unify this stuff, not diverge it
further.

2017-08-11 10:50:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 3/7] Revert "mm: numa: defer TLB flush for THP migration as long as possible"

On Tue, Aug 01, 2017 at 05:08:14PM -0700, Nadav Amit wrote:
> While deferring TLB flushes is a good practice, the reverted patch
> caused pending TLB flushes to be checked while the page-table lock is
> not taken. As a result, in architectures with weak memory model (PPC),
> Linux may miss a memory-barrier, miss the fact TLB flushes are pending,
> and cause (in theory) a memory corruption.
>
> Since the alternative of using smp_mb__after_unlock_lock() was
> considered a bit open-coded, and the performance impact is expected to
> be small, the previous patch is reverted.

FWIW this Changelog sucks arse; you completely fail to explain the
broken ordering.

2017-08-11 13:31:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

On Tue, Aug 01, 2017 at 05:08:17PM -0700, Nadav Amit wrote:
> void tlb_finish_mmu(struct mmu_gather *tlb,
> unsigned long start, unsigned long end)
> {
> - arch_tlb_finish_mmu(tlb, start, end);
> + /*
> + * If there are parallel threads are doing PTE changes on same range
> + * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> + * flush by batching, a thread has stable TLB entry can fail to flush
> + * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> + * forcefully if we detect parallel PTE batching threads.
> + */
> + bool force = mm_tlb_flush_nested(tlb->mm);
> +
> + arch_tlb_finish_mmu(tlb, start, end, force);
> }

I don't understand the comment nor the ordering. What guarantees we see
the increment if we need to?

2017-08-11 17:12:52

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v6 4/7] mm: refactoring TLB gathering API

Peter Zijlstra <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 05:08:15PM -0700, Nadav Amit wrote:
>> From: Minchan Kim <[email protected]>
>>
>> This patch is a preparatory patch for solving race problems caused by
>> TLB batch. For that, we will increase/decrease TLB flush pending count
>> of mm_struct whenever tlb_[gather|finish]_mmu is called.
>>
>> Before making it simple, this patch separates architecture specific
>> part and rename it to arch_tlb_[gather|finish]_mmu and generic part
>> just calls it.
>
> I absolutely hate this. We should unify this stuff, not diverge it
> further.

Agreed, but I don’t see how this patch makes the situation any worse.

I’ll review your other comments by tomorrow due to some personal
constraints.


2017-08-13 06:14:27

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

Peter Zijlstra <[email protected]> wrote:

> On Tue, Aug 01, 2017 at 05:08:17PM -0700, Nadav Amit wrote:
>> void tlb_finish_mmu(struct mmu_gather *tlb,
>> unsigned long start, unsigned long end)
>> {
>> - arch_tlb_finish_mmu(tlb, start, end);
>> + /*
>> + * If there are parallel threads are doing PTE changes on same range
>> + * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> + * flush by batching, a thread has stable TLB entry can fail to flush
>> + * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> + * forcefully if we detect parallel PTE batching threads.
>> + */
>> + bool force = mm_tlb_flush_nested(tlb->mm);
>> +
>> + arch_tlb_finish_mmu(tlb, start, end, force);
>> }
>
> I don't understand the comment nor the ordering. What guarantees we see
> the increment if we need to?

The comment regards the problem that is described in the change-log, and a
long thread that is referenced in it. So the question is whether “I don’t
understand” means “I don’t understand” or “it is not clear enough”. I’ll
be glad to address either one - just say which.

As for the ordering - I tried to clarify it in the thread of the commit. Let
me know if it is clear now.

Regards,
Nadav


2017-08-13 12:08:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

On Sun, Aug 13, 2017 at 06:14:21AM +0000, Nadav Amit wrote:
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Aug 01, 2017 at 05:08:17PM -0700, Nadav Amit wrote:
> >> void tlb_finish_mmu(struct mmu_gather *tlb,
> >> unsigned long start, unsigned long end)
> >> {
> >> - arch_tlb_finish_mmu(tlb, start, end);
> >> + /*
> >> + * If there are parallel threads are doing PTE changes on same range
> >> + * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >> + * flush by batching, a thread has stable TLB entry can fail to flush
> >> + * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >> + * forcefully if we detect parallel PTE batching threads.
> >> + */
> >> + bool force = mm_tlb_flush_nested(tlb->mm);
> >> +
> >> + arch_tlb_finish_mmu(tlb, start, end, force);
> >> }
> >
> > I don't understand the comment nor the ordering. What guarantees we see
> > the increment if we need to?
>
> The comment regards the problem that is described in the change-log, and a
> long thread that is referenced in it. So the question is whether “I don’t
> understand” means “I don’t understand” or “it is not clear enough”. I’ll
> be glad to address either one - just say which.

I only read the comment; that _should_ be sufficient. Comments that rely
on Changelogs and random threads are useless.

The comment on its own simply doesn't make sense.

> As for the ordering - I tried to clarify it in the thread of the commit. Let
> me know if it is clear now.

Yeah, I'll do a new patch because if it only cares about _the_ PTL, we
can do away with that extra smp_mb__after_atomic().
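
In other words (if I follow): when every reader of tlb_flush_pending samples it
while holding the same PTL that the incrementing side takes and releases around
its PTE changes, the unlock/lock pair already orders the increment before the
read, so the extra barrier buys nothing. A userspace C11 analogy of that
ordering argument (updater/reader/pending/ptl are made-up names; this is not
the kernel code):

        #include <pthread.h>
        #include <stdatomic.h>

        static atomic_int pending;
        static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER; /* stands in for the PTL */

        static void *updater(void *arg)         /* ~ tlb_gather_mmu() + PTE changes */
        {
                (void)arg;
                atomic_fetch_add_explicit(&pending, 1, memory_order_relaxed);
                pthread_mutex_lock(&ptl);
                /* ... clear PTEs ... */
                pthread_mutex_unlock(&ptl);     /* release: publishes the increment */
                return NULL;
        }

        static void *reader(void *arg)          /* ~ a later PTE changer checking for nesting */
        {
                (void)arg;
                pthread_mutex_lock(&ptl);       /* acquire: if taken after the unlock
                                                 * above, the increment is visible here */
                int nested = atomic_load_explicit(&pending, memory_order_relaxed) > 1;

                (void)nested;
                pthread_mutex_unlock(&ptl);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, updater, NULL);
                pthread_create(&b, NULL, reader, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }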

2017-08-14 00:49:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 4/7] mm: refactoring TLB gathering API

On Fri, Aug 11, 2017 at 10:12:45AM -0700, Nadav Amit wrote:
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Aug 01, 2017 at 05:08:15PM -0700, Nadav Amit wrote:
> >> From: Minchan Kim <[email protected]>
> >>
> >> This patch is a preparatory patch for solving race problems caused by
> >> TLB batch. For that, we will increase/decrease TLB flush pending count
> >> of mm_struct whenever tlb_[gather|finish]_mmu is called.
> >>
> >> Before making it simple, this patch separates architecture specific
> >> part and rename it to arch_tlb_[gather|finish]_mmu and generic part
> >> just calls it.
> >
> > I absolutely hate this. We should unify this stuff, not diverge it
> > further.
>
> Agreed, but I don’t see how this patch makes the situation any worse.

Agree with Nadav. I don't think this patch makes things diverge further.
Peter, if you are strongly against it, please tell us which part you
dislike.

2017-08-14 01:26:21

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

Hi Peter,

On Fri, Aug 11, 2017 at 03:30:20PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 05:08:17PM -0700, Nadav Amit wrote:
> > void tlb_finish_mmu(struct mmu_gather *tlb,
> > unsigned long start, unsigned long end)
> > {
> > - arch_tlb_finish_mmu(tlb, start, end);
> > + /*
> > + * If there are parallel threads are doing PTE changes on same range
> > + * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > + * flush by batching, a thread has stable TLB entry can fail to flush
> > + * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > + * forcefully if we detect parallel PTE batching threads.
> > + */
> > + bool force = mm_tlb_flush_nested(tlb->mm);
> > +
> > + arch_tlb_finish_mmu(tlb, start, end, force);
> > }
>
> I don't understand the comment nor the ordering. What guarantees we see
> the increment if we need to?

How about this for the commenting part?

From 05f06fd6aba14447a9ca2df8b810fbcf9a58e14b Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Mon, 14 Aug 2017 10:16:56 +0900
Subject: [PATCH] mm: add describable comment for TLB batch race

[1] is a rather subtle/complicated bug, so it is hard to
understand with only a limited code comment.

This patch adds a sequence diagram to explain the problem
more easily, I hope.

[1] 99baac21e458, mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

Cc: Peter Zijlstra <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index bcbe56f52163..f571b0eb9816 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -413,12 +413,37 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end)
{
+
+
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
* flush by batching, a thread has stable TLB entry can fail to flush
* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
* forcefully if we detect parallel PTE batching threads.
+ *
+ * Example: MADV_DONTNEED stale TLB problem on same range
+ *
+ * CPU 0 CPU 1
+ * *a = 1;
+ * MADV_DONTNEED
+ * MADV_DONTNEED tlb_gather_mmu
+ * tlb_gather_mmu
+ * down_read(mmap_sem) down_read(mmap_sem)
+ * pte_lock
+ * pte_get_and_clear
+ * tlb_remove_tlb_entry
+ * pte_unlock
+ * pte_lock
+ * found out the pte is none
+ * pte_unlock
+ * tlb_finish_mmu doesn't flush
+ *
+ * Access the address with stale TLB
+ * *a = 2;ie, success without segfault
+ * tlb_finish_mmu flush on range
+ * but it is too late.
+ *
*/
bool force = mm_tlb_flush_nested(tlb->mm);

--
2.7.4