2017-08-11 07:53:30

by Stephen Rothwell

Subject: linux-next: manual merge of the akpm-current tree with the tip tree

Hi all,

Today's linux-next merge of the akpm-current tree got conflicts in:

include/linux/mm_types.h
mm/huge_memory.c

between commit:

8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

from the tip tree and commits:

16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")

from the akpm-current tree.

The latter 2 are now in Linus' tree as well (but were not when I started
the day).

The only way forward I could see was to revert

8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

and the three following commits

ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")

before merging the akpm-current tree again.

--
Cheers,
Stephen Rothwell


2017-08-11 09:35:04

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> Hi all,
>
> Today's linux-next merge of the akpm-current tree got conflicts in:
>
> include/linux/mm_types.h
> mm/huge_memory.c
>
> between commit:
>
> 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
>
> from the tip tree and commits:
>
> 16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
>
> from the akpm-current tree.
>
> The latter 2 are now in Linus' tree as well (but were not when I started
> the day).
>
> The only way forward I could see was to revert
>
> 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
>
> and the three following commits
>
> ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
> d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
> a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
>
> before merging the akpm-current tree again.

Here's two patches that apply on top of tip.


Attachments:
nadav_amit-mm-migrate__prevent_racy_access_to_tlb_flush_pending.patch (4.81 kB)
nadav_amit-revert__mm-numa__defer_tlb_flush_for_thp_migration_as_long_as_possible_.patch (2.52 kB)

2017-08-11 10:48:48

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Fri, Aug 11, 2017 at 11:34:49AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > Hi all,
> >
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> >
> > include/linux/mm_types.h
> > mm/huge_memory.c
> >
> > between commit:
> >
> > 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> >
> > from the tip tree and commits:
> >
> > 16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> > a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
> >
> > from the akpm-current tree.
> >
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
> >
> > The only way forward I could see was to revert
> >
> > 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> >
> > and the three following commits
> >
> > ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
> > d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
> > a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> >
> > before merging the akpm-current tree again.
>
> Here's two patches that apply on top of tip.
>


And here's one to fix the PPC ordering issue I found while doing those
patches.


---
Subject: mm: Fix barrier for inc_tlb_flush_pending() for PPC
From: Peter Zijlstra <[email protected]>
Date: Fri Aug 11 12:43:33 CEST 2017

When we have SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of
one does not in fact order against the LOCK of another lock. Therefore
the documented scheme does not work.

Add an explicit smp_mb__after_atomic() to cure things.

Also update the comment to reflect the new inc/dec thing.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/mm_types.h | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -533,7 +533,7 @@ static inline bool mm_tlb_flush_pending(
{
/*
* Must be called with PTL held; such that our PTL acquire will have
- * observed the store from set_tlb_flush_pending().
+ * observed the increment from inc_tlb_flush_pending().
*/
return atomic_read(&mm->tlb_flush_pending);
}
@@ -547,13 +547,11 @@ static inline void inc_tlb_flush_pending
{
atomic_inc(&mm->tlb_flush_pending);
/*
- * The only time this value is relevant is when there are indeed pages
- * to flush. And we'll only flush pages after changing them, which
- * requires the PTL.
- *
* So the ordering here is:
*
- * mm->tlb_flush_pending = true;
+ * atomic_inc(&mm->tlb_flush_pending)
+ * smp_mb__after_atomic();
+ *
* spin_lock(&ptl);
* ...
* set_pte_at();
@@ -565,17 +563,33 @@ static inline void inc_tlb_flush_pending
* spin_unlock(&ptl);
*
* flush_tlb_range();
- * mm->tlb_flush_pending = false;
+ * atomic_dec(&mm->tlb_flush_pending);
*
- * So the =true store is constrained by the PTL unlock, and the =false
- * store is constrained by the TLB invalidate.
+ * Where we order the increment against the PTE modification with the
+ * smp_mb__after_atomic(). It would appear that the spin_unlock(&ptl)
+ * is sufficient to constrain the inc, because we only care about the
+ * value if there is indeed a pending PTE modification. However with
+ * SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+ * not order against the LOCK of another lock.
+ *
+ * The decrement is ordered by the flush_tlb_range(), such that
+ * mm_tlb_flush_pending() will not return false unless all flushes have
+ * completed.
*/
+ smp_mb__after_atomic();
}

/* Clearing is done after a TLB flush, which also provides a barrier. */
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
- /* see set_tlb_flush_pending */
+ /*
+ * See inc_tlb_flush_pending().
+ *
+ * This cannot be smp_mb__before_atomic() because smp_mb() simply does
+ * not order against TLB invalidate completion, which is what we need.
+ *
+ * Therefore we must rely on tlb_flush_*() to guarantee order.
+ */
atomic_dec(&mm->tlb_flush_pending);
}
#else
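
To make the documented ordering concrete, here is a minimal kernel-style
sketch of the two sides the comment above describes. It is not part of the
patch: change_prot_sketch()/needs_flush_sketch() and the single
write-protected PTE are invented for illustration, while the inc/dec/
mm_tlb_flush_pending() calls and the barrier placement follow the scheme
above.

#include <linux/mm.h>
#include <linux/mm_types.h>
#include <asm/tlbflush.h>

/* Writer: modifies a PTE and is responsible for the TLB flush. */
static void change_prot_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
			       pte_t *ptep, spinlock_t *ptl, unsigned long addr)
{
	inc_tlb_flush_pending(mm);	/* A: increment + smp_mb__after_atomic() */

	spin_lock(ptl);
	set_pte_at(mm, addr, ptep, pte_wrprotect(*ptep));	/* B: PTE change */
	spin_unlock(ptl);				/* RELEASE */

	flush_tlb_range(vma, addr, addr + PAGE_SIZE);	/* C: invalidate */
	dec_tlb_flush_pending(mm);	/* D: only meaningful once C completed */
}

/* Reader: must hold the PTL covering the PTE it cares about. */
static bool needs_flush_sketch(struct mm_struct *mm, spinlock_t *ptl)
{
	bool pending;

	spin_lock(ptl);			/* ACQUIRE */
	/*
	 * If our PTL acquire observed the writer's release, the PTE change
	 * from B is visible here, and so (with the smp_mb__after_atomic()
	 * added by the patch above) is the increment from A.
	 */
	pending = mm_tlb_flush_pending(mm);
	spin_unlock(ptl);

	return pending;
}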

2017-08-11 11:46:02

by Stephen Rothwell

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Hi Peter,

On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> >
> > Today's linux-next merge of the akpm-current tree got conflicts in:
> >
> > include/linux/mm_types.h
> > mm/huge_memory.c
> >
> > between commit:
> >
> > 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> >
> > from the tip tree and commits:
> >
> > 16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> > a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
> >
> > from the akpm-current tree.
> >
> > The latter 2 are now in Linus' tree as well (but were not when I started
> > the day).
>
> Here's two patches that apply on top of tip.

What I will really need (on Monday) is a merge resolution between
Linus' tree and the tip tree ...

--
Cheers,
Stephen Rothwell

2017-08-11 11:56:13

by Ingo Molnar

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree


* Stephen Rothwell <[email protected]> wrote:

> Hi Peter,
>
> On Fri, 11 Aug 2017 11:34:49 +0200 Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> > >
> > > Today's linux-next merge of the akpm-current tree got conflicts in:
> > >
> > > include/linux/mm_types.h
> > > mm/huge_memory.c
> > >
> > > between commit:
> > >
> > > 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> > >
> > > from the tip tree and commits:
> > >
> > > 16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
> > > a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
> > >
> > > from the akpm-current tree.
> > >
> > > The latter 2 are now in Linus' tree as well (but were not when I started
> > > the day).
> >
> > Here's two patches that apply on top of tip.
>
> What I will really need (on Monday) is a merge resolution between
> Linus' tree and the tip tree ...

I've done a minimal conflict resolution merge locally. Peter, could you please
double check my resolution, in:

040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

Thanks,

Ingo

2017-08-11 12:17:58

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> I've done a minimal conflict resolution merge locally. Peter, could you please
> double check my resolution, in:
>
> 040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts

That merge is a bit wonky, but not terminally broken afaict.

It now does two TLB flushes, the below cleans that up.

---
mm/huge_memory.c | 22 +++++-----------------
1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce883459e246..08f6c1993832 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,7 +1410,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
- bool need_flush = false;
bool page_locked;
bool migrated = false;
bool was_writable;
@@ -1497,22 +1496,18 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
}

/*
- * The page_table_lock above provides a memory barrier
- * with change_protection_range.
- */
- if (mm_tlb_flush_pending(vma->vm_mm))
- flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
- /*
* Since we took the NUMA fault, we must have observed the !accessible
* bit. Make sure all other CPUs agree with that, to avoid them
* modifying the page we're about to migrate.
*
* Must be done under PTL such that we'll observe the relevant
- * set_tlb_flush_pending().
+ * inc_tlb_flush_pending().
+ *
+ * We are not sure a pending tlb flush here is for a huge page
+ * mapping or not. Hence use the tlb range variant
*/
if (mm_tlb_flush_pending(vma->vm_mm))
- need_flush = true;
+ flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);

/*
* Migrate the THP to the requested node, returns with page unlocked
@@ -1520,13 +1515,6 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
*/
spin_unlock(vmf->ptl);

- /*
- * We are not sure a pending tlb flush here is for a huge page
- * mapping or not. Hence use the tlb range variant
- */
- if (need_flush)
- flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
vmf->pmd, pmd, vmf->address, page, target_nid);
if (migrated) {

2017-08-11 12:44:31

by Ingo Molnar

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree


* Peter Zijlstra <[email protected]> wrote:

> On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> > I've done a minimal conflict resolution merge locally. Peter, could you please
> > double check my resolution, in:
> >
> > 040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts
>
> That merge is a bit wonky, but not terminally broken afaict.
>
> It now does two TLB flushes, the below cleans that up.

Cool, thanks - I've applied it as a separate commit, to reduce the evilness of the
merge commit.

Will push it all out in time to make Stephen's Monday morning a bit less of a
Monday morning.

Thanks,

Ingo

2017-08-11 13:49:25

by Stephen Rothwell

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Hi Ingo,

On Fri, 11 Aug 2017 14:44:25 +0200 Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Aug 11, 2017 at 01:56:07PM +0200, Ingo Molnar wrote:
> > > I've done a minimal conflict resolution merge locally. Peter, could you please
> > > double check my resolution, in:
> > >
> > > 040cca3ab2f6: Merge branch 'linus' into locking/core, to resolve conflicts
> >
> > That merge is a bit wonky, but not terminally broken afaict.
> >
> > It now does two TLB flushes, the below cleans that up.
>
> Cool, thanks - I've applied it as a separate commit, to reduce the evilness of the
> merge commit.
>
> Will push it all out in time to make Stephen's Monday morning a bit less of a
> Monday morning.

Thank you very much.

--
Cheers,
Stephen Rothwell

2017-08-11 14:05:04

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree


Ok, so I have the below to still go on-top.

Ideally someone would clarify the situation around
mm_tlb_flush_nested(), because ideally we'd remove the
smp_mb__after_atomic() and go back to relying on PTL alone.

This also removes the pointless smp_mb__before_atomic()

---
Subject: mm: Fix barriers for the tlb_flush_pending thing
From: Peter Zijlstra <[email protected]>
Date: Fri Aug 11 12:43:33 CEST 2017

I'm not 100% sure we always care about the same PTL and when we have
SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
in fact order against the LOCK of another lock. Therefore the
documented scheme does not work if we care about multiple PTLs

mm_tlb_flush_pending() appears to only care about a single PTL:

- arch pte_accessible() (x86, arm64) only cares about that one PTE.
- do_huge_pmd_numa_page() also only cares about a single (huge) page.
- ksm write_protect_page() also only cares about a single page.

however mm_tlb_flush_nested() is a mystery, it appears to care about
anything inside the range. For now rely on it doing at least _a_ PTL
lock instead of taking _the_ PTL lock.

Therefore add an explicit smp_mb__after_atomic() to cure things.

Also remove the smp_mb__before_atomic() on the dec side, as it's
completely pointless. We must rely on flush_tlb_range() to DTRT.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/mm_types.h | 38 ++++++++++++++++++++++----------------
1 file changed, 22 insertions(+), 16 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -537,13 +537,13 @@ static inline bool mm_tlb_flush_pending(
{
/*
* Must be called with PTL held; such that our PTL acquire will have
- * observed the store from set_tlb_flush_pending().
+ * observed the increment from inc_tlb_flush_pending().
*/
- return atomic_read(&mm->tlb_flush_pending) > 0;
+ return atomic_read(&mm->tlb_flush_pending);
}

/*
- * Returns true if there are two above TLB batching threads in parallel.
+ * Returns true if there are two or more TLB batching threads in parallel.
*/
static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
@@ -558,15 +558,12 @@ static inline void init_tlb_flush_pendin
static inline void inc_tlb_flush_pending(struct mm_struct *mm)
{
atomic_inc(&mm->tlb_flush_pending);
-
/*
- * The only time this value is relevant is when there are indeed pages
- * to flush. And we'll only flush pages after changing them, which
- * requires the PTL.
- *
* So the ordering here is:
*
* atomic_inc(&mm->tlb_flush_pending);
+ * smp_mb__after_atomic();
+ *
* spin_lock(&ptl);
* ...
* set_pte_at();
@@ -580,21 +577,30 @@ static inline void inc_tlb_flush_pending
* flush_tlb_range();
* atomic_dec(&mm->tlb_flush_pending);
*
- * So the =true store is constrained by the PTL unlock, and the =false
- * store is constrained by the TLB invalidate.
+ * Where we order the increment against the PTE modification with the
+ * smp_mb__after_atomic(). It would appear that the spin_unlock(&ptl)
+ * is sufficient to constrain the inc, because we only care about the
+ * value if there is indeed a pending PTE modification. However with
+ * SPLIT_PTE_PTLOCKS and RCpc locks (PPC) the UNLOCK of one lock does
+ * not order against the LOCK of another lock.
+ *
+ * The decrement is ordered by the flush_tlb_range(), such that
+ * mm_tlb_flush_pending() will not return false unless all flushes have
+ * completed.
*/
+ smp_mb__after_atomic();
}

-/* Clearing is done after a TLB flush, which also provides a barrier. */
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
/*
- * Guarantee that the tlb_flush_pending does not not leak into the
- * critical section, since we must order the PTE change and changes to
- * the pending TLB flush indication. We could have relied on TLB flush
- * as a memory barrier, but this behavior is not clearly documented.
+ * See inc_tlb_flush_pending().
+ *
+ * This cannot be smp_mb__before_atomic() because smp_mb() simply does
+ * not order against TLB invalidate completion, which is what we need.
+ *
+ * Therefore we must rely on tlb_flush_*() to guarantee order.
*/
- smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
}
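
As a concrete example of the "only cares about that one PTE" users listed in
the changelog, this is approximately what the x86 pte_accessible() mentioned
above looks like (reproduced from memory as an illustration, not quoted from
the tree):

/* arch/x86/include/asm/pgtable.h (approximate) */
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/*
	 * A PROTNONE PTE written by a pending protection change must still
	 * be treated as accessible: other CPUs can hold stale TLB entries
	 * for it until the changing thread finishes flush_tlb_range() and
	 * drops tlb_flush_pending. The caller holds the PTL for this PTE,
	 * which is what makes the mm_tlb_flush_pending() read meaningful.
	 */
	if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm))
		return true;

	return false;
}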


2017-08-13 06:06:38

by Nadav Amit

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Peter Zijlstra <[email protected]> wrote:

>
> Ok, so I have the below to still go on-top.
>
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
>
> This also removes the pointless smp_mb__before_atomic()
>
> ---
> Subject: mm: Fix barriers for the tlb_flush_pending thing
> From: Peter Zijlstra <[email protected]>
> Date: Fri Aug 11 12:43:33 CEST 2017
>
> I'm not 100% sure we always care about the same PTL and when we have
> SPLIT_PTE_PTLOCKS and have RCpc locks (PPC) the UNLOCK of one does not
> in fact order against the LOCK of another lock. Therefore the
> documented scheme does not work if we care about multiple PTLs
>
> mm_tlb_flush_pending() appears to only care about a single PTL:
>
> - arch pte_accessible() (x86, arm64) only cares about that one PTE.
> - do_huge_pmd_numa_page() also only cares about a single (huge) page.
> - ksm write_protect_page() also only cares about a single page.
>
> however mm_tlb_flush_nested() is a mystery, it appears to care about
> anything inside the range. For now rely on it doing at least _a_ PTL
> lock instead of taking _the_ PTL lock.

It does not care about “anything” inside the range, but only on situations
in which there is at least one (same) PT that was modified by one core and
then read by the other. So, yes, it will always be _the_ same PTL, and not
_a_ PTL - in the cases that flush is really needed.

The issue that might require additional barriers is that
inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
not held. IIUC, since the release-acquire might not behave as a full memory
barrier, this requires an explicit memory barrier.

> Therefore add an explicit smp_mb__after_atomic() to cure things.
>
> Also remove the smp_mb__before_atomic() on the dec side, as it's
> completely pointless. We must rely on flush_tlb_range() to DTRT.

Good. It seemed fishy to me, but I was focused on the TLB consistency and
less on the barriers (that’s my excuse).

Nadav


2017-08-13 12:50:33

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Sun, Aug 13, 2017 at 06:06:32AM +0000, Nadav Amit wrote:
> > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > anything inside the range. For now rely on it doing at least _a_ PTL
> > lock instead of taking _the_ PTL lock.
>
> It does not care about “anything” inside the range, but only on situations
> in which there is at least one (same) PT that was modified by one core and
> then read by the other. So, yes, it will always be _the_ same PTL, and not
> _a_ PTL - in the cases that flush is really needed.
>
> The issue that might require additional barriers is that
> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> not held. IIUC, since the release-acquire might not behave as a full memory
> barrier, this requires an explicit memory barrier.

So I'm not entirely clear about this yet.

How about:


CPU0 CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn

tlb_gather_mmu()

lock PTLm
mod
include in tlb range
unlock PTLm

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
force = mm_tlb_flush_nested(tlb->mm);
arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()



In this case you also want CPU1's mm_tlb_flush_nested() call to return
true, right?

But even with an smp_mb__after_atomic() at CPU0's tlb_gather_mmu()
you're not guaranteed CPU1 sees the increment. The only way to do that
is to make the PTL locks RCsc and that is a much more expensive
proposition.


What about:


CPU0 CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

tlb_finish_mmu()
force = mm_tlb_flush_nested(tlb->mm);
arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()

Do we want CPU1 to see it here? If so, where does it end?


CPU0 CPU1

tlb_gather_mmu()

lock PTLn
no mod
unlock PTLn


lock PTLm
mod
include in tlb range
unlock PTLm

tlb_finish_mmu()
force = mm_tlb_flush_nested(tlb->mm);

tlb_gather_mmu()

lock PTLn
mod
unlock PTLn

arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


This?


Could you clarify under what exact condition mm_tlb_flush_nested() must
return true?

2017-08-14 03:09:17

by Minchan Kim

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Hi Peter,

On Fri, Aug 11, 2017 at 04:04:50PM +0200, Peter Zijlstra wrote:
>
> Ok, so I have the below to still go on-top.
>
> Ideally someone would clarify the situation around
> mm_tlb_flush_nested(), because ideally we'd remove the
> smp_mb__after_atomic() and go back to relying on PTL alone.
>
> This also removes the pointless smp_mb__before_atomic()

I'm not an expert on barrier stuff, but IIUC, a full memory barrier on the
mm_tlb_flush_nested() side could go along with removing the
smp_mb__after_atomic() on the inc_tlb_flush_pending() side?


diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 490af494c2da..5ad0e66df363 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -544,7 +544,12 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
*/
static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
- return atomic_read(&mm->tlb_flush_pending) > 1;
+ /*
+ * atomic_dec_and_test's full memory barrier guarantees
+ * to see uptodate tlb_flush_pending count in other CPU
+ * without relying on page table lock.
+ */
+ return !atomic_dec_and_test(&mm->tlb_flush_pending);
}

static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/mm/memory.c b/mm/memory.c
index f571b0eb9816..e90b57bc65fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -407,6 +407,10 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
unsigned long start, unsigned long end)
{
arch_tlb_gather_mmu(tlb, mm, start, end);
+ /*
+ * counterpart is mm_tlb_flush_nested in tlb_finish_mmu,
+ * which decreases the pending count.
+ */
inc_tlb_flush_pending(tlb->mm);
}

@@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
*
*/
bool force = mm_tlb_flush_nested(tlb->mm);
-
arch_tlb_finish_mmu(tlb, start, end, force);
- dec_tlb_flush_pending(tlb->mm);
}

/*

2017-08-14 03:16:19

by Minchan Kim

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 13, 2017 at 06:06:32AM +0000, Nadav Amit wrote:
> > > however mm_tlb_flush_nested() is a mystery, it appears to care about
> > > anything inside the range. For now rely on it doing at least _a_ PTL
> > > lock instead of taking _the_ PTL lock.
> >
> > It does not care about “anything” inside the range, but only on situations
> > in which there is at least one (same) PT that was modified by one core and
> > then read by the other. So, yes, it will always be _the_ same PTL, and not
> > _a_ PTL - in the cases that flush is really needed.
> >
> > The issue that might require additional barriers is that
> > inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
> > not held. IIUC, since the release-acquire might not behave as a full memory
> > barrier, this requires an explicit memory barrier.
>
> So I'm not entirely clear about this yet.
>
> How about:
>
>
> CPU0 CPU1
>
> tlb_gather_mmu()
>
> lock PTLn
> no mod
> unlock PTLn
>
> tlb_gather_mmu()
>
> lock PTLm
> mod
> include in tlb range
> unlock PTLm
>
> lock PTLn
> mod
> unlock PTLn
>
> tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
>
>
> ... more ...
>
> tlb_finish_mmu()
>
>
>
> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> true, right?

No, because CPU 1 modified the pte and added it into the tlb range
so regardless of nested, it will flush TLB so there is no stale
TLB problem.

>
> But even with an smp_mb__after_atomic() at CPU0's tlb_gather_mmu()
> you're not guaranteed CPU1 sees the increment. The only way to do that
> is to make the PTL locks RCsc and that is a much more expensive
> proposition.
>
>
> What about:
>
>
> CPU0 CPU1
>
> tlb_gather_mmu()
>
> lock PTLn
> no mod
> unlock PTLn
>
>
> lock PTLm
> mod
> include in tlb range
> unlock PTLm
>
> tlb_gather_mmu()
>
> lock PTLn
> mod
> unlock PTLn
>
> tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
>
>
> ... more ...
>
> tlb_finish_mmu()
>
> Do we want CPU1 to see it here? If so, where does it end?

Ditto. Since CPU 1 has added range, it will flush TLB regardless
of nested condition.

>
> CPU0 CPU1
>
> tlb_gather_mmu()
>
> lock PTLn
> no mod
> unlock PTLn
>
>
> lock PTLm
> mod
> include in tlb range
> unlock PTLm
>
> tlb_finish_mmu()
> force = mm_tlb_flush_nested(tlb->mm);
>
> tlb_gather_mmu()
>
> lock PTLn
> mod
> unlock PTLn
>
> arch_tlb_finish_mmu(force);
>
>
> ... more ...
>
> tlb_finish_mmu()
>
>
> This?
>
>
> Could you clarify under what exact condition mm_tlb_flush_nested() must
> return true?

mm_tlb_flush_nested aims for the CPU side where there is no pte update
but a TLB flush is still needed.
As I wrote at https://marc.info/?l=linux-mm&m=150267398226529&w=2,
there is a stale TLB problem if we don't flush the TLB although there is no
pte modification.
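
For reference, this is the tlb_finish_mmu() shape quoted earlier in the
thread, with comments added here (editorially, not from the kernel source) to
spell out the stale-TLB case being described:

void tlb_finish_mmu(struct mmu_gather *tlb,
		    unsigned long start, unsigned long end)
{
	/*
	 * Another thread may be batching flushes against the same mm. In
	 * that case our own gather may have found nothing to do (the other
	 * thread already cleared the PTEs we looked at), yet stale TLB
	 * entries can survive until *its* flush completes. Detect the
	 * concurrent batcher and force a flush even though we made no PTE
	 * modification ourselves.
	 */
	bool force = mm_tlb_flush_nested(tlb->mm);

	arch_tlb_finish_mmu(tlb, start, end, force);
	dec_tlb_flush_pending(tlb->mm);
}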

2017-08-14 05:07:23

by Nadav Amit

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Minchan Kim <[email protected]> wrote:

> On Sun, Aug 13, 2017 at 02:50:19PM +0200, Peter Zijlstra wrote:
>> On Sun, Aug 13, 2017 at 06:06:32AM +0000, Nadav Amit wrote:
>>>> however mm_tlb_flush_nested() is a mystery, it appears to care about
>>>> anything inside the range. For now rely on it doing at least _a_ PTL
>>>> lock instead of taking _the_ PTL lock.
>>>
>>> It does not care about “anything” inside the range, but only on situations
>>> in which there is at least one (same) PT that was modified by one core and
>>> then read by the other. So, yes, it will always be _the_ same PTL, and not
>>> _a_ PTL - in the cases that flush is really needed.
>>>
>>> The issue that might require additional barriers is that
>>> inc_tlb_flush_pending() and mm_tlb_flush_nested() are called when the PTL is
>>> not held. IIUC, since the release-acquire might not behave as a full memory
>>> barrier, this requires an explicit memory barrier.
>>
>> So I'm not entirely clear about this yet.
>>
>> How about:
>>
>>
>> CPU0 CPU1
>>
>> tlb_gather_mmu()
>>
>> lock PTLn
>> no mod
>> unlock PTLn
>>
>> tlb_gather_mmu()
>>
>> lock PTLm
>> mod
>> include in tlb range
>> unlock PTLm
>>
>> lock PTLn
>> mod
>> unlock PTLn
>>
>> tlb_finish_mmu()
>> force = mm_tlb_flush_nested(tlb->mm);
>> arch_tlb_finish_mmu(force);
>>
>>
>> ... more ...
>>
>> tlb_finish_mmu()
>>
>>
>>
>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>> true, right?
>
> No, because CPU 1 modified the pte and added it into the tlb range
> so regardless of nested, it will flush TLB so there is no stale
> TLB problem.
>
>> But even with an smp_mb__after_atomic() at CPU0's tlb_gather_mmu()
>> you're not guaranteed CPU1 sees the increment. The only way to do that
>> is to make the PTL locks RCsc and that is a much more expensive
>> proposition.
>>
>>
>> What about:
>>
>>
>> CPU0 CPU1
>>
>> tlb_gather_mmu()
>>
>> lock PTLn
>> no mod
>> unlock PTLn
>>
>>
>> lock PTLm
>> mod
>> include in tlb range
>> unlock PTLm
>>
>> tlb_gather_mmu()
>>
>> lock PTLn
>> mod
>> unlock PTLn
>>
>> tlb_finish_mmu()
>> force = mm_tlb_flush_nested(tlb->mm);
>> arch_tlb_finish_mmu(force);
>>
>>
>> ... more ...
>>
>> tlb_finish_mmu()
>>
>> Do we want CPU1 to see it here? If so, where does it end?
>
> Ditto. Since CPU 1 has added range, it will flush TLB regardless
> of nested condition.
>
>> CPU0 CPU1
>>
>> tlb_gather_mmu()
>>
>> lock PTLn
>> no mod
>> unlock PTLn
>>
>>
>> lock PTLm
>> mod
>> include in tlb range
>> unlock PTLm
>>
>> tlb_finish_mmu()
>> force = mm_tlb_flush_nested(tlb->mm);
>>
>> tlb_gather_mmu()
>>
>> lock PTLn
>> mod
>> unlock PTLn
>>
>> arch_tlb_finish_mmu(force);
>>
>>
>> ... more ...
>>
>> tlb_finish_mmu()
>>
>>
>> This?
>>
>>
>> Could you clarify under what exact condition mm_tlb_flush_nested() must
>> return true?
>
> mm_tlb_flush_nested aims for the CPU side where there is no pte update
> but a TLB flush is still needed.
> As I wrote at https://marc.info/?l=linux-mm&m=150267398226529&w=2,
> there is a stale TLB problem if we don't flush the TLB although there is no
> pte modification.

To clarify: the main problem that these patches address is when the first
CPU updates the PTE, and second CPU sees the updated value and thinks: “the
PTE is already what I wanted - no flush is needed”.

For some reason (I would assume intentional), all the examples here first
“do not modify” the PTE, and then modify it - which is not an “interesting”
case. However, based on what I understand on the memory barriers, I think
there is indeed a missing barrier before reading it in
mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
before reading, would solve the problem with least impact on systems with
strong memory ordering.
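
To spell that race out, here is a hypothetical check loosely modelled on the
write_protect_page()-style users mentioned earlier in the thread; the function
name and surrounding logic are illustrative only, not the upstream code:

/*
 *  CPU0 (change_protection-like)        CPU1 (this helper)
 *
 *  inc_tlb_flush_pending(mm)
 *  lock PTL; write-protect PTE; unlock
 *                                       lock PTL
 *                                       sees PTE already !writable
 *                                       -> tempted to skip its own flush
 *  flush_tlb_range()   <-- not done yet
 *  dec_tlb_flush_pending(mm)
 *
 * Without the mm_tlb_flush_pending() check, CPU1 would go ahead while other
 * CPUs may still hold stale *writable* TLB entries.
 */
static bool must_flush_sketch(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl)
{
	bool flush;

	spin_lock(ptl);
	/* flush if the PTE is still writable, or if a protector's flush is pending */
	flush = pte_write(*ptep) || mm_tlb_flush_pending(mm);
	spin_unlock(ptl);

	return flush;
}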

Minchan, as for the solution you proposed, it seems to open again a race,
since the “pending” indication is removed before the actual TLB flush is
performed.

Nadav

2017-08-14 05:23:21

by Minchan Kim

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
< snip >

> Minchan, as for the solution you proposed, it seems to open again a race,
> since the “pending” indication is removed before the actual TLB flush is
> performed.

Oops, you're right!

2017-08-14 08:38:45

by Minchan Kim

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Hi Nadav,

On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
< snip >

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case. However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,

memory-barriers.txt always scares me. I have read it for a while,
and IIUC, it seems the semantics of spin_unlock(&same_pte) would be
enough without some memory barrier inside mm_tlb_flush_nested.

I could be missing something totally.

Could you explain what kind of sequence you have in mind that would
cause such a problem?

Thanks.

2017-08-14 18:54:40

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Mon, Aug 14, 2017 at 12:09:14PM +0900, Minchan Kim wrote:
> @@ -446,9 +450,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> *
> */
> bool force = mm_tlb_flush_nested(tlb->mm);
> -
> arch_tlb_finish_mmu(tlb, start, end, force);
> - dec_tlb_flush_pending(tlb->mm);
> }

No, I think this breaks all the mm_tlb_flush_pending() users. They need
the decrement to not be visible until the TLB flush is complete.

2017-08-14 19:38:39

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
> >> So I'm not entirely clear about this yet.
> >>
> >> How about:
> >>
> >>
> >> CPU0 CPU1
> >>
> >> tlb_gather_mmu()
> >>
> >> lock PTLn
> >> no mod
> >> unlock PTLn
> >>
> >> tlb_gather_mmu()
> >>
> >> lock PTLm
> >> mod
> >> include in tlb range
> >> unlock PTLm
> >>
> >> lock PTLn
> >> mod
> >> unlock PTLn
> >>
> >> tlb_finish_mmu()
> >> force = mm_tlb_flush_nested(tlb->mm);
> >> arch_tlb_finish_mmu(force);
> >>
> >>
> >> ... more ...
> >>
> >> tlb_finish_mmu()
> >>
> >>
> >>
> >> In this case you also want CPU1's mm_tlb_flush_nested() call to return
> >> true, right?
> >
> > No, because CPU 1 modified the pte and added it into the tlb range
> > so regardless of nested, it will flush TLB so there is no stale
> > TLB problem.

> To clarify: the main problem that these patches address is when the first
> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
> PTE is already what I wanted - no flush is needed”.

OK, that simplifies things.

> For some reason (I would assume intentional), all the examples here first
> “do not modify” the PTE, and then modify it - which is not an “interesting”
> case.

Depends on what you call 'interesting' :-) They are 'interesting' to
make work from a memory ordering POV. And since I didn't get they were
excluded from the set, I worried.

In fact, if they were to be included, I couldn't make it work at all. So
I'm really glad to hear we can disregard them.

> However, based on what I understand on the memory barriers, I think
> there is indeed a missing barrier before reading it in
> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
> before reading, would solve the problem with least impact on systems with
> strong memory ordering.

No, all is well. If, as you say, we're naturally constrained to the case
where we only care about prior modification we can rely on the RCpc PTL
locks.

Consider:


CPU0 CPU1

tlb_gather_mmu()

tlb_gather_mmu()
inc --------.
| (inc is constrained by RELEASE)
lock PTLn |
mod ^
unlock PTLn -----------------> lock PTLn
v no mod
| unlock PTLn
|
| lock PTLm
| mod
| include in tlb range
| unlock PTLm
|
(read is constrained |
by ACQUIRE) |
| tlb_finish_mmu()
`---- force = mm_tlb_flush_nested(tlb->mm);
arch_tlb_finish_mmu(force);


... more ...

tlb_finish_mmu()


Then CPU1's acquire of PTLn orders against CPU0's release of that same
PTLn which guarantees we observe both its (prior) modified PTE and the
mm->tlb_flush_pending increment from tlb_gather_mmu().

So all we need for mm_tlb_flush_nested() to work is having acquired the
right PTL at least once before calling it.

At the same time, the decrements need to be after the TLB invalidate is
complete, this ensures that _IF_ we observe the decrement, we must've
also observed the corresponding invalidate.

Something like the below is then sufficient.

---
Subject: mm: Clarify tlb_flush_pending barriers
From: Peter Zijlstra <[email protected]>
Date: Fri, 11 Aug 2017 16:04:50 +0200

Better document the ordering around tlb_flush_pending.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/mm_types.h | 78 +++++++++++++++++++++++++++--------------------
1 file changed, 45 insertions(+), 33 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);

-/*
- * Memory barriers to keep this state in sync are graciously provided by
- * the page table locks, outside of which no page table modifications happen.
- * The barriers are used to ensure the order between tlb_flush_pending updates,
- * which happen while the lock is not taken, and the PTE updates, which happen
- * while the lock is taken, are serialized.
- */
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
-{
- /*
- * Must be called with PTL held; such that our PTL acquire will have
- * observed the store from set_tlb_flush_pending().
- */
- return atomic_read(&mm->tlb_flush_pending) > 0;
-}
-
-/*
- * Returns true if there are two above TLB batching threads in parallel.
- */
-static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
-{
- return atomic_read(&mm->tlb_flush_pending) > 1;
-}
-
static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
atomic_set(&mm->tlb_flush_pending, 0);
@@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
static inline void inc_tlb_flush_pending(struct mm_struct *mm)
{
atomic_inc(&mm->tlb_flush_pending);
-
/*
* The only time this value is relevant is when there are indeed pages
* to flush. And we'll only flush pages after changing them, which
@@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
* flush_tlb_range();
* atomic_dec(&mm->tlb_flush_pending);
*
- * So the =true store is constrained by the PTL unlock, and the =false
- * store is constrained by the TLB invalidate.
+ * Where the increment is constrained by the PTL unlock, it thus
+ * ensures that the increment is visible if the PTE modification is
+ * visible. After all, if there is no PTE modification, nobody cares
+ * about TLB flushes either.
+ *
+ * This very much relies on users (mm_tlb_flush_pending() and
+ * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
+ * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
+ * locks (PPC) the unlock of one doesn't order against the lock of
+ * another PTL.
+ *
+ * The decrement is ordered by the flush_tlb_range(), such that
+ * mm_tlb_flush_pending() will not return false unless all flushes have
+ * completed.
*/
}

-/* Clearing is done after a TLB flush, which also provides a barrier. */
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
/*
- * Guarantee that the tlb_flush_pending does not not leak into the
- * critical section, since we must order the PTE change and changes to
- * the pending TLB flush indication. We could have relied on TLB flush
- * as a memory barrier, but this behavior is not clearly documented.
+ * See inc_tlb_flush_pending().
+ *
+ * This cannot be smp_mb__before_atomic() because smp_mb() simply does
+ * not order against TLB invalidate completion, which is what we need.
+ *
+ * Therefore we must rely on tlb_flush_*() to guarantee order.
*/
- smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
}

+static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
+{
+ /*
+ * Must be called after having acquired the PTL; orders against that
+ * PTLs release and therefore ensures that if we observe the modified
+ * PTE we must also observe the increment from inc_tlb_flush_pending().
+ *
+ * That is, it only guarantees to return true if there is a flush
+ * pending for _this_ PTL.
+ */
+ return atomic_read(&mm->tlb_flush_pending);
+}
+
+static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
+{
+ /*
+ * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
+ * for which there is a TLB flush pending in order to guarantee
+ * we've seen both that PTE modification and the increment.
+ *
+ * (no requirement on actually still holding the PTL, that is irrelevant)
+ */
+ return atomic_read(&mm->tlb_flush_pending) > 1;
+}
+
struct vm_fault;

struct vm_special_mapping {

2017-08-14 19:57:35

by Peter Zijlstra

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> memory-barriers.txt always scares me. I have read it for a while,
> and IIUC, it seems the semantics of spin_unlock(&same_pte) would be
> enough without some memory barrier inside mm_tlb_flush_nested.

Indeed, see the email I just send. Its both spin_lock() and
spin_unlock() that we care about.

Aside from the semi permeable barrier of these primitives, RCpc ensures
these orderings only work against the _same_ lock variable.

Let me try and explain the ordering for PPC (which is by far the worst
we have in this regard):


spin_lock(lock)
{
while (test_and_set(lock))
cpu_relax();
lwsync();
}


spin_unlock(lock)
{
lwsync();
clear(lock);
}

Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
means that it allows 'stores reordered after loads'.

For the spin_lock() that implies that all load/store's inside the lock
do indeed stay in, but the ACQUIRE is only on the LOAD of the
test_and_set(). That is, the actual _set_ can leak in. After all it can
re-order stores after load (inside the lock).

For unlock it again means all load/store's prior stay prior, and the
RELEASE is on the store clearing the lock state (nothing surprising
here).

Now the _local_ part, the main take-away is that these orderings are
strictly CPU local. What makes the spinlock work across CPUs (as we'd
very much expect it to) is the address dependency on the lock variable.

In order for the spin_lock() to succeed, it must observe the clear. It's
this link that crosses between the CPUs and builds the ordering. But
only the two CPUs agree on this order. A third CPU not involved in
this transaction can disagree on the order of events.

2017-08-15 07:52:02

by Nadav Amit

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

Peter Zijlstra <[email protected]> wrote:

> On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
>>>> So I'm not entirely clear about this yet.
>>>>
>>>> How about:
>>>>
>>>>
>>>> CPU0 CPU1
>>>>
>>>> tlb_gather_mmu()
>>>>
>>>> lock PTLn
>>>> no mod
>>>> unlock PTLn
>>>>
>>>> tlb_gather_mmu()
>>>>
>>>> lock PTLm
>>>> mod
>>>> include in tlb range
>>>> unlock PTLm
>>>>
>>>> lock PTLn
>>>> mod
>>>> unlock PTLn
>>>>
>>>> tlb_finish_mmu()
>>>> force = mm_tlb_flush_nested(tlb->mm);
>>>> arch_tlb_finish_mmu(force);
>>>>
>>>>
>>>> ... more ...
>>>>
>>>> tlb_finish_mmu()
>>>>
>>>>
>>>>
>>>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>>>> true, right?
>>>
>>> No, because CPU 1 modified the pte and added it into the tlb range
>>> so regardless of nested, it will flush TLB so there is no stale
>>> TLB problem.
>
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and second CPU sees the updated value and thinks: “the
>> PTE is already what I wanted - no flush is needed”.
>
> OK, that simplifies things.
>
>> For some reason (I would assume intentional), all the examples here first
>> “do not modify” the PTE, and then modify it - which is not an “interesting”
>> case.
>
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get they were
> excluded from the set, I worried.
>
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
>
>> However, based on what I understand on the memory barriers, I think
>> there is indeed a missing barrier before reading it in
>> mm_tlb_flush_nested(). IIUC using smp_mb__after_unlock_lock() in this case,
>> before reading, would solve the problem with least impact on systems with
>> strong memory ordering.
>
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification we can rely on the RCpc PTL
> locks.
>
> Consider:
>
>
> CPU0 CPU1
>
> tlb_gather_mmu()
>
> tlb_gather_mmu()
> inc --------.
> | (inc is constrained by RELEASE)
> lock PTLn |
> mod ^
> unlock PTLn -----------------> lock PTLn
> v no mod
> | unlock PTLn
> |
> | lock PTLm
> | mod
> | include in tlb range
> | unlock PTLm
> |
> (read is constrained |
> by ACQUIRE) |
> | tlb_finish_mmu()
> `---- force = mm_tlb_flush_nested(tlb->mm);
> arch_tlb_finish_mmu(force);
>
>
> ... more ...
>
> tlb_finish_mmu()
>
>
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
>
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
>
> At the same time, the decrements need to be after the TLB invalidate is
> complete, this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
>
> Something like the below is then sufficient.
>
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra <[email protected]>
> Date: Fri, 11 Aug 2017 16:04:50 +0200
>
> Better document the ordering around tlb_flush_pending.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/mm_types.h | 78 +++++++++++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 33 deletions(-)
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
> unsigned long start, unsigned long end);
>
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending updates,
> - * which happen while the lock is not taken, and the PTE updates, which happen
> - * while the lock is taken, are serialized.
> - */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> -{
> - /*
> - * Must be called with PTL held; such that our PTL acquire will have
> - * observed the store from set_tlb_flush_pending().
> - */
> - return atomic_read(&mm->tlb_flush_pending) > 0;
> -}
> -
> -/*
> - * Returns true if there are two above TLB batching threads in parallel.
> - */
> -static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> -{
> - return atomic_read(&mm->tlb_flush_pending) > 1;
> -}
> -
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_set(&mm->tlb_flush_pending, 0);
> @@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
> static inline void inc_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_inc(&mm->tlb_flush_pending);
> -
> /*
> * The only time this value is relevant is when there are indeed pages
> * to flush. And we'll only flush pages after changing them, which
> @@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
> * flush_tlb_range();
> * atomic_dec(&mm->tlb_flush_pending);
> *
> - * So the =true store is constrained by the PTL unlock, and the =false
> - * store is constrained by the TLB invalidate.
> + * Where the increment is constrained by the PTL unlock, it thus
> + * ensures that the increment is visible if the PTE modification is
> + * visible. After all, if there is no PTE modification, nobody cares
> + * about TLB flushes either.
> + *
> + * This very much relies on users (mm_tlb_flush_pending() and
> + * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
> + * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
> + * locks (PPC) the unlock of one doesn't order against the lock of
> + * another PTL.
> + *
> + * The decrement is ordered by the flush_tlb_range(), such that
> + * mm_tlb_flush_pending() will not return false unless all flushes have
> + * completed.
> */
> }
>
> -/* Clearing is done after a TLB flush, which also provides a barrier. */
> static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> {
> /*
> - * Guarantee that the tlb_flush_pending does not not leak into the
> - * critical section, since we must order the PTE change and changes to
> - * the pending TLB flush indication. We could have relied on TLB flush
> - * as a memory barrier, but this behavior is not clearly documented.
> + * See inc_tlb_flush_pending().
> + *
> + * This cannot be smp_mb__before_atomic() because smp_mb() simply does
> + * not order against TLB invalidate completion, which is what we need.
> + *
> + * Therefore we must rely on tlb_flush_*() to guarantee order.
> */
> - smp_mb__before_atomic();
> atomic_dec(&mm->tlb_flush_pending);
> }
>
> +static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +{
> + /*
> + * Must be called after having acquired the PTL; orders against that
> + * PTLs release and therefore ensures that if we observe the modified
> + * PTE we must also observe the increment from inc_tlb_flush_pending().
> + *
> + * That is, it only guarantees to return true if there is a flush
> + * pending for _this_ PTL.
> + */
> + return atomic_read(&mm->tlb_flush_pending);
> +}
> +
> +static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> +{
> + /*
> + * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
> + * for which there is a TLB flush pending in order to guarantee
> + * we've seen both that PTE modification and the increment.
> + *
> + * (no requirement on actually still holding the PTL, that is irrelevant)
> + */
> + return atomic_read(&mm->tlb_flush_pending) > 1;
> +}
> +
> struct vm_fault;
>
> struct vm_special_mapping {

Thanks for the detailed explanation. I will pay more attention next time.


2017-08-16 04:14:23

by Minchan Kim

Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree

On Mon, Aug 14, 2017 at 09:57:23PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 05:38:39PM +0900, Minchan Kim wrote:
> > memory-barrier.txt always scares me. I have read it for a while
> > and IIUC, it seems semantic of spin_unlock(&same_pte) would be
> > enough without some memory-barrier inside mm_tlb_flush_nested.
>
> Indeed, see the email I just send. Its both spin_lock() and
> spin_unlock() that we care about.
>
> Aside from the semi permeable barrier of these primitives, RCpc ensures
> these orderings only work against the _same_ lock variable.
>
> Let me try and explain the ordering for PPC (which is by far the worst
> we have in this regard):
>
>
> spin_lock(lock)
> {
> while (test_and_set(lock))
> cpu_relax();
> lwsync();
> }
>
>
> spin_unlock(lock)
> {
> lwsync();
> clear(lock);
> }
>
> Now LWSYNC has fairly 'simple' semantics, but with fairly horrible
> ramifications. Consider LWSYNC to provide _local_ TSO ordering, this
> means that it allows 'stores reordered after loads'.
>
> For the spin_lock() that implies that all load/store's inside the lock
> do indeed stay in, but the ACQUIRE is only on the LOAD of the
> test_and_set(). That is, the actual _set_ can leak in. After all it can
> re-order stores after load (inside the lock).
>
> For unlock it again means all load/store's prior stay prior, and the
> RELEASE is on the store clearing the lock state (nothing surprising
> here).
>
> Now the _local_ part, the main take-away is that these orderings are
> strictly CPU local. What makes the spinlock work across CPUs (as we'd
> very much expect it to) is the address dependency on the lock variable.
>
> In order for the spin_lock() to succeed, it must observe the clear. It's
> this link that crosses between the CPUs and builds the ordering. But
> only the two CPUs agree on this order. A third CPU not involved in
> this transaction can disagree on the order of events.

The detailed explanation in your previous reply made me comfortable
with the scary memory-barriers.txt, but this reply makes me scared again. ;-)

Thanks for the kind clarification, Peter!