Working on some GUP cleanups (e.g., getting rid of some FOLL_ flags) and
preparing for other GUP changes (getting rid of FOLL_FORCE|FOLL_WRITE
for taking a R/O longterm pin), this is something I can easily send out
independently.
Get rid of FOLL_NUMA, allow FOLL_FORCE access to PROT_NONE mapped pages
in GUP-fast, and fixup some documentation around NUMA hinting.
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
David Hildenbrand (3):
mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()
mm/gup: use gup_can_follow_protnone() also in GUP-fast
mm: fixup documentation regarding pte_numa() and PROT_NUMA
include/linux/mm.h | 16 +++++++++++++++-
include/linux/mm_types.h | 12 ++++++------
mm/gup.c | 26 +++++---------------------
mm/huge_memory.c | 2 +-
4 files changed, 27 insertions(+), 29 deletions(-)
--
2.37.1
pte_numa() no longer exists -- replaced by pte_protnone() -- and PROT_NUMA
probably never existed: MM_CP_PROT_NUMA also ends up using PROT_NONE.
Let's fixup the doc.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf97f3884fda..85a6e9853b7b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -631,22 +631,22 @@ struct mm_struct {
#endif
#ifdef CONFIG_NUMA_BALANCING
/*
- * numa_next_scan is the next time that the PTEs will be marked
- * pte_numa. NUMA hinting faults will gather statistics and
- * migrate pages to new nodes if necessary.
+ * numa_next_scan is the next time that PTEs will be remapped
+ * PROT_NONE to trigger NUMA hinting faults; such faults gather
+ * statistics and migrate pages to new nodes if necessary.
*/
unsigned long numa_next_scan;
- /* Restart point for scanning and setting pte_numa */
+ /* Restart point for scanning and remapping PTEs. */
unsigned long numa_scan_offset;
- /* numa_scan_seq prevents two threads setting pte_numa */
+ /* numa_scan_seq prevents two threads remapping PTEs. */
int numa_scan_seq;
#endif
/*
* An operation with batched TLB flushing is going on. Anything
* that can move process memory needs to flush the TLB when
- * moving a PROT_NONE or PROT_NUMA mapped page.
+ * moving a PROT_NONE mapped page.
*/
atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
--
2.37.1
There seems to be no reason why FOLL_FORCE during GUP-fast would have to
fall back to the slow path when stumbling over a PROT_NONE mapped page. We
only have to trigger hinting faults in case FOLL_FORCE is not set, and any
kind of fault handling naturally happens from the slow path -- where
NUMA hinting accounting/handling would be performed.
Note that the comment regarding THP migration is outdated:
commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against
THP migration") described that this was required for THP due to lack of PMD
migration entries. Nowadays, we do have proper PMD migration entries in
place -- see set_pmd_migration_entry(), which does a proper
pmdp_invalidate() when placing the migration entry.
So let's just reuse gup_can_follow_protnone() here to make it
consistent and drop the somewhat outdated comments.
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index a1355dbd848e..dfef23071dc8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2350,11 +2350,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
struct page *page;
struct folio *folio;
- /*
- * Similar to the PMD case below, NUMA hinting must take slow
- * path using the pte_protnone check.
- */
- if (pte_protnone(pte))
+ if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
goto pte_unmap;
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
@@ -2736,12 +2732,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
pmd_devmap(pmd))) {
- /*
- * NUMA hinting faults need to be handled in the GUP
- * slowpath for accounting purposes and so that they
- * can be serialised against THP migration.
- */
- if (pmd_protnone(pmd))
+ if (pmd_protnone(pmd) &&
+ !gup_can_follow_protnone(flags))
return 0;
if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
--
2.37.1
No need for a special flag that is not even properly documented to be
internal-only.
Let's just factor this check out and get rid of this flag. The separate
function has the nice benefit that we can centralize comments.
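The semantics can be sketched as a tiny userspace model (the flag value and the caller-side helper name here are illustrative only; in the kernel, pte_protnone() inspects the real page table entry):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value; the kernel defines FOLL_FORCE elsewhere. */
#define FOLL_FORCE 0x10u

/* Same logic as the factored-out helper: FOLL_FORCE must make
 * progress even on inaccessible (PROT_NONE) VMAs, and such access
 * should not trigger NUMA hinting faults. */
static bool gup_can_follow_protnone(unsigned int flags)
{
        return flags & FOLL_FORCE;
}

/* Hypothetical caller-side check, mirroring the pattern used at the
 * call sites: only bail out (so a NUMA hinting fault gets triggered)
 * when the PTE is PROT_NONE and FOLL_FORCE is not set. */
static bool can_follow(bool pte_protnone, unsigned int flags)
{
        if (pte_protnone && !gup_can_follow_protnone(flags))
                return false;
        return true;
}
```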
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 16 +++++++++++++++-
mm/gup.c | 12 ++----------
mm/huge_memory.c | 2 +-
3 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 982f2607180b..8b85765d7a98 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2881,7 +2881,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
* and return without waiting upon it */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
-#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
@@ -2997,6 +2996,21 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
return !PageAnonExclusive(page);
}
+/*
+ * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
+ * a (NUMA hinting) fault is required.
+ */
+static inline bool gup_can_follow_protnone(unsigned int flags)
+{
+ /*
+ * FOLL_FORCE has to be able to make progress even if the VMA is
+ * inaccessible. Further, FOLL_FORCE access usually does not represent
+ * application behaviour and we should avoid triggering NUMA hinting
+ * faults.
+ */
+ return flags & FOLL_FORCE;
+}
+
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..a1355dbd848e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -554,7 +554,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
migration_entry_wait(mm, pmd, address);
goto retry;
}
- if ((flags & FOLL_NUMA) && pte_protnone(pte))
+ if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
goto no_page;
page = vm_normal_page(vma, address, pte);
@@ -707,7 +707,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
- if ((flags & FOLL_NUMA) && pmd_protnone(pmdval))
+ if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
return no_page_table(vma, flags);
retry_locked:
@@ -1153,14 +1153,6 @@ static long __get_user_pages(struct mm_struct *mm,
VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
- /*
- * If FOLL_FORCE is set then do not force a full fault as the hinting
- * fault information is unrelated to the reference behaviour of a task
- * using the address space
- */
- if (!(gup_flags & FOLL_FORCE))
- gup_flags |= FOLL_NUMA;
-
do {
struct page *page;
unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..482c1826e723 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1449,7 +1449,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
return ERR_PTR(-EFAULT);
/* Full NUMA hinting faults to serialise migration in fault paths */
- if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
+ if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
return NULL;
if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
--
2.37.1
On 25.08.22 18:46, David Hildenbrand wrote:
> There seems to be no reason why FOLL_FORCE during GUP-fast would have to
> fall back to the slow path when stumbling over a PROT_NONE mapped page. We
> only have to trigger hinting faults in case FOLL_FORCE is not set, and any
> kind of fault handling naturally happens from the slow path -- where
> NUMA hinting accounting/handling would be performed.
>
> Note that the comment regarding THP migration is outdated:
> commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against
> THP migration") described that this was required for THP due to lack of PMD
> migration entries. Nowadays, we do have proper PMD migration entries in
> place -- see set_pmd_migration_entry(), which does a proper
> pmdp_invalidate() when placing the migration entry.
>
> So let's just reuse gup_can_follow_protnone() here to make it
> consistent and drop the somewhat outdated comments.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/gup.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a1355dbd848e..dfef23071dc8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2350,11 +2350,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> struct page *page;
> struct folio *folio;
>
> - /*
> - * Similar to the PMD case below, NUMA hinting must take slow
> - * path using the pte_protnone check.
> - */
> - if (pte_protnone(pte))
> + if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
> goto pte_unmap;
>
> if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> @@ -2736,12 +2732,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>
> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
> pmd_devmap(pmd))) {
> - /*
> - * NUMA hinting faults need to be handled in the GUP
> - * slowpath for accounting purposes and so that they
> - * can be serialised against THP migration.
> - */
> - if (pmd_protnone(pmd))
> + if (pmd_protnone(pmd) &&
> + !gup_can_follow_protnone(flags))
> return 0;
>
> if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
I just stumbled over something interesting. If we have a pte_protnone()
entry, ptep_clear_flush() might not flush, because the !pte_accessible()
does not hold.
Consequently, we could be in trouble when using ptep_clear_flush() on a
pte_protnone() PTE to make sure that GUP cannot run anymore.
Will give this a better thought, but most probably I'll replace this
patch by a proper documentation update here.
--
Thanks,
David / dhildenb
On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
> ... and looking into the details of TLB flush and GUP-fast interaction
> nowadays, that case is no longer relevant. A TLB flush is no longer
> sufficient to stop concurrent GUP-fast ever since we introduced generic
> RCU GUP-fast.
Yes, we've had RCU GUP fast for a while, and it is more widely used
now, IIRC.
It has been a bit, but if I remember, GUP fast in RCU mode worked on a
few principles:
- The PTE page must not be freed without RCU
- The PTE page content must be convertible to a struct page using the
usual rules (eg PTE Special)
- That struct page refcount may go from 0->1 inside the RCU
- In the case the refcount goes from 0->1 there must be sufficient
barriers such that GUP fast observing the refcount of 1 will also
observe the PTE entry has changed. ie before the refcount is
dropped in the zap it has to clear the PTE entry, the refcount
decr has to be a 'release' and the refcount incr in gup fast has
to be an 'acquire'.
- The rest of the system must tolerate speculative refcount
increments from GUP on any random page
The basic idea being that if GUP fast obtains a valid reference on a
page *and* the PTE entry has not changed then everything is fine.
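As a toy userspace model of that pairing (all names here are made up; the kernel's folio refcounting is far more involved): the zap side clears the "PTE" before dropping its reference with release semantics, and GUP-fast takes a speculative reference with acquire semantics before re-reading the "PTE".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_page {
        _Atomic unsigned long pte;      /* 0 == cleared */
        atomic_int refcount;
};

static void zap_page(struct model_page *p)
{
        /* Clear the PTE entry first ... */
        atomic_store_explicit(&p->pte, 0, memory_order_relaxed);
        /* ... then drop the ref; the release pairs with GUP's acquire:
         * a GUP increment that observes this decrement also observes
         * the cleared PTE. */
        atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release);
}

static bool gup_fast_try_get(struct model_page *p, unsigned long seen)
{
        /* Speculative increment (acquire), then validate the PTE. */
        atomic_fetch_add_explicit(&p->refcount, 1, memory_order_acquire);
        if (atomic_load_explicit(&p->pte, memory_order_relaxed) != seen) {
                atomic_fetch_sub_explicit(&p->refcount, 1,
                                          memory_order_relaxed);
                return false;   /* PTE changed under us: back off */
        }
        return true;
}
```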
The tricks with TLB invalidation are just a "poor man's" RCU, and
arguably these days aren't really needed since I think we could make
everything use real RCU always without penalty if we really wanted.
Today we can create a unique 'struct pagetable_page' as Matthew has
been doing in other places that guarantees a rcu_head is always
available for every page used in a page table. Using that we could
drop the code in the TLB flusher that allocates memory for the
rcu_head and hopes for the best. (Or even is the common struct page
rcu_head already guaranteed to exist for pagetable pages nowadays?)
IMHO that is the main reason we still have the non-RCU mode at all..
Jason
On 26.08.22 16:59, David Hildenbrand wrote:
> On 25.08.22 18:46, David Hildenbrand wrote:
>> There seems to be no reason why FOLL_FORCE during GUP-fast would have to
>> fall back to the slow path when stumbling over a PROT_NONE mapped page. We
>> only have to trigger hinting faults in case FOLL_FORCE is not set, and any
>> kind of fault handling naturally happens from the slow path -- where
>> NUMA hinting accounting/handling would be performed.
>>
>> Note that the comment regarding THP migration is outdated:
>> commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against
>> THP migration") described that this was required for THP due to lack of PMD
>> migration entries. Nowadays, we do have proper PMD migration entries in
>> place -- see set_pmd_migration_entry(), which does a proper
>> pmdp_invalidate() when placing the migration entry.
>>
>> So let's just reuse gup_can_follow_protnone() here to make it
>> consistent and drop the somewhat outdated comments.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/gup.c | 14 +++-----------
>> 1 file changed, 3 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a1355dbd848e..dfef23071dc8 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2350,11 +2350,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> struct page *page;
>> struct folio *folio;
>>
>> - /*
>> - * Similar to the PMD case below, NUMA hinting must take slow
>> - * path using the pte_protnone check.
>> - */
>> - if (pte_protnone(pte))
>> + if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
>> goto pte_unmap;
>>
>> if (!pte_access_permitted(pte, flags & FOLL_WRITE))
>> @@ -2736,12 +2732,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
>>
>> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
>> pmd_devmap(pmd))) {
>> - /*
>> - * NUMA hinting faults need to be handled in the GUP
>> - * slowpath for accounting purposes and so that they
>> - * can be serialised against THP migration.
>> - */
>> - if (pmd_protnone(pmd))
>> + if (pmd_protnone(pmd) &&
>> + !gup_can_follow_protnone(flags))
>> return 0;
>>
>> if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
>
>
> I just stumbled over something interesting. If we have a pte_protnone()
> entry, ptep_clear_flush() might not flush, because the !pte_accessible()
> does not hold.
>
> Consequently, we could be in trouble when using ptep_clear_flush() on a
> pte_protnone() PTE to make sure that GUP cannot run anymore.
>
> Will give this a better thought, but most probably I'll replace this
> patch by a proper documentation update here.
... and looking into the details of TLB flush and GUP-fast interaction
nowadays, that case is no longer relevant. A TLB flush is no longer
sufficient to stop concurrent GUP-fast ever since we introduced generic
RCU GUP-fast.
--
Thanks,
David / dhildenb
On 30.08.22 20:45, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
>> ... and looking into the details of TLB flush and GUP-fast interaction
>> nowadays, that case is no longer relevant. A TLB flush is no longer
>> sufficient to stop concurrent GUP-fast ever since we introduced generic
>> RCU GUP-fast.
>
> Yes, we've had RCU GUP fast for a while, and it is more widely used
> now, IIRC.
>
> It has been a bit, but if I remember, GUP fast in RCU mode worked on a
> few principles:
>
> - The PTE page must not be freed without RCU
> - The PTE page content must be convertible to a struct page using the
> usual rules (eg PTE Special)
> - That struct page refcount may go from 0->1 inside the RCU
> - In the case the refcount goes from 0->1 there must be sufficient
> barriers such that GUP fast observing the refcount of 1 will also
> observe the PTE entry has changed. ie before the refcount is
> dropped in the zap it has to clear the PTE entry, the refcount
> decr has to be a 'release' and the refcount incr in gup fast has
> to be an 'acquire'.
> - The rest of the system must tolerate speculative refcount
> increments from GUP on any random page
>
> The basic idea being that if GUP fast obtains a valid reference on a
> page *and* the PTE entry has not changed then everything is fine.
>
> The tricks with TLB invalidation are just a "poor man's" RCU, and
> arguably these days aren't really needed since I think we could make
> everything use real RCU always without penalty if we really wanted.
>
> Today we can create a unique 'struct pagetable_page' as Matthew has
> been doing in other places that guarantees a rcu_head is always
> available for every page used in a page table. Using that we could
> drop the code in the TLB flusher that allocates memory for the
> rcu_head and hopes for the best. (Or even is the common struct page
> rcu_head already guaranteed to exist for pagetable pages nowadays?)
>
> IMHO that is the main reason we still have the non-RCU mode at all..
Good, I managed to attract the attention of someone who understands that machinery :)
While validating whether GUP-fast and PageAnonExclusive code work correctly,
I started looking at the whole RCU GUP-fast machinery. I do have a patch to
improve PageAnonExclusive clearing (I think we're missing memory barriers to
make it work as expected in any possible case), but I also eventually
stumbled over a more generic issue that might need memory barriers.
Any thoughts whether I am missing something or this is actually missing
memory barriers?
From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Mon, 29 Aug 2022 16:57:07 +0200
Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
changed
mm/ksm.c:write_protect_page() has to make sure that no unknown
references to a mapped page exist and that no additional ones with write
permissions are possible -- unknown references could have write permissions
and modify the page afterwards.
Conceptually, mm/ksm.c:write_protect_page() consists of:
(1) Clear/invalidate PTE
(2) Check if there are unknown references; back off if so.
(3) Update PTE (e.g., map it R/O)
Conceptually, GUP-fast code consists of:
(1) Read the PTE
(2) Increment refcount/pincount of the mapped page
(3) Check if the PTE changed by re-reading it; back off if so.
To make sure GUP-fast won't be able to grab additional references after
clearing the PTE, but will properly detect the change and back off, we
need a memory barrier between updating the refcount/pincount and checking
if it changed.
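A single-threaded model of the two step lists above (names and PTE encoding are illustrative; a sequential model cannot reorder, so the barrier this patch adds is only marked as a comment):

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
        unsigned long pte;      /* 0 == cleared/invalidated */
        int refcount;           /* includes the mapping's reference */
};

/* GUP-fast: (2) grab a reference, then (3) re-check the PTE read in
 * step (1); smp_mb__after_atomic() goes between (2) and (3). */
static bool gup_fast_pin(struct model_page *p, unsigned long seen_pte)
{
        p->refcount++;
        /* smp_mb__after_atomic() would go here */
        if (p->pte != seen_pte) {
                p->refcount--;  /* back off */
                return false;
        }
        return true;
}

/* write_protect_page(): (1) clear the PTE, (2) check for unknown
 * references, (3) remap R/O (modeled as an arbitrary new value). */
static bool ksm_write_protect(struct model_page *p)
{
        p->pte = 0;
        if (p->refcount > 1)
                return false;   /* unknown reference: back off */
        p->pte = 0x1;
        return true;
}
```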
try_grab_folio() doesn't necessarily imply a memory barrier, so add an
explicit smp_mb__after_atomic() after the atomic RMW operation to
increment the refcount and pincount.
ptep_clear_flush(), used to clear the PTE and flush the TLB, should imply
a memory barrier for flushing the TLB, so don't add another one for now.
PageAnonExclusive handling requires further care and will be handled
separately.
Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..0008b808f484 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
goto pte_unmap;
}
+ /*
+ * Update refcount/pincount before testing for changed PTE. This
+ * is required for code like mm/ksm.c:write_protect_page() that
+ * wants to make sure that a page has no unknown references
+ * after clearing the PTE.
+ */
+ smp_mb__after_atomic();
+
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
@@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
if (!folio)
return 0;
+ /* See gup_pte_range(). */
+ smp_mb__after_atomic();
+
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2643,6 +2654,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (!folio)
return 0;
+ /* See gup_pte_range(). */
+ smp_mb__after_atomic();
+
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2683,6 +2697,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
if (!folio)
return 0;
+ /* See gup_pte_range(). */
+ smp_mb__after_atomic();
+
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
gup_put_folio(folio, refs, flags);
return 0;
--
2.37.1
--
Thanks,
David / dhildenb
On 8/30/22 11:53, David Hildenbrand wrote:
> Good, I managed to attract the attention of someone who understands that machinery :)
>
> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> make it work as expected in any possible case), but I also stumbled eventually
> over a more generic issue that might need memory barriers.
>
> Any thoughts whether I am missing something or this is actually missing
> memory barriers?
>
It's actually missing memory barriers.
In fact, others have had that same thought! [1] :) In that 2019 thread,
I recall that this got dismissed because of a focus on the IPI-based
aspect of gup fast synchronization (there was some hand waving, perhaps
accurate waving, about memory barriers vs. CPU interrupts). But now that
the RCU (non-IPI) implementation is more widely used than it used to be,
the issue is clearer.
>
> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <[email protected]>
> Date: Mon, 29 Aug 2022 16:57:07 +0200
> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
> changed
>
> mm/ksm.c:write_protect_page() has to make sure that no unknown
> references to a mapped page exist and that no additional ones with write
> permissions are possible -- unknown references could have write permissions
> and modify the page afterwards.
>
> Conceptually, mm/ksm.c:write_protect_page() consists of:
> (1) Clear/invalidate PTE
> (2) Check if there are unknown references; back off if so.
> (3) Update PTE (e.g., map it R/O)
>
> Conceptually, GUP-fast code consists of:
> (1) Read the PTE
> (2) Increment refcount/pincount of the mapped page
> (3) Check if the PTE changed by re-reading it; back off if so.
>
> To make sure GUP-fast won't be able to grab additional references after
> clearing the PTE, but will properly detect the change and back off, we
> need a memory barrier between updating the refcount/pincount and checking
> if it changed.
>
> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> explicit smp_mb__after_atomic() after the atomic RMW operation to
> increment the refcount and pincount.
>
> ptep_clear_flush(), used to clear the PTE and flush the TLB, should imply
> a memory barrier for flushing the TLB, so don't add another one for now.
>
> PageAnonExclusive handling requires further care and will be handled
> separately.
>
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/gup.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..0008b808f484 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> goto pte_unmap;
> }
>
> + /*
> + * Update refcount/pincount before testing for changed PTE. This
> + * is required for code like mm/ksm.c:write_protect_page() that
> + * wants to make sure that a page has no unknown references
> + * after clearing the PTE.
> + */
> + smp_mb__after_atomic();
> +
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> gup_put_folio(folio, 1, flags);
> goto pte_unmap;
> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
Don't we usually also identify what each mb pairs with, in the comments? That would help.
> + smp_mb__after_atomic();
> +
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2643,6 +2654,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
> + smp_mb__after_atomic();
> +
> if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2683,6 +2697,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
> + smp_mb__after_atomic();
> +
> if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> gup_put_folio(folio, refs, flags);
> return 0;
[1] https://lore.kernel.org/lkml/[email protected]/
thanks,
--
John Hubbard
NVIDIA
On 30.08.22 21:18, John Hubbard wrote:
> On 8/30/22 11:53, David Hildenbrand wrote:
>> Good, I managed to attract the attention of someone who understands that machinery :)
>>
>> While validating whether GUP-fast and PageAnonExclusive code work correctly,
>> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
>> improve PageAnonExclusive clearing (I think we're missing memory barriers to
>> make it work as expected in any possible case), but I also stumbled eventually
>> over a more generic issue that might need memory barriers.
>>
>> Any thoughts whether I am missing something or this is actually missing
>> memory barriers?
>>
>
> It's actually missing memory barriers.
>
> In fact, others have had that same thought! [1] :) In that 2019 thread,
> I recall that this got dismissed because of a focus on the IPI-based
> aspect of gup fast synchronization (there was some hand waving, perhaps
> accurate waving, about memory barriers vs. CPU interrupts). But now that
> the RCU (non-IPI) implementation is more widely used than it used to be,
> the issue is clearer.
>
>>
>> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <[email protected]>
>> Date: Mon, 29 Aug 2022 16:57:07 +0200
>> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
>> changed
>>
>> mm/ksm.c:write_protect_page() has to make sure that no unknown
>> references to a mapped page exist and that no additional ones with write
>> permissions are possible -- unknown references could have write permissions
>> and modify the page afterwards.
>>
>> Conceptually, mm/ksm.c:write_protect_page() consists of:
>> (1) Clear/invalidate PTE
>> (2) Check if there are unknown references; back off if so.
>> (3) Update PTE (e.g., map it R/O)
>>
>> Conceptually, GUP-fast code consists of:
>> (1) Read the PTE
>> (2) Increment refcount/pincount of the mapped page
>> (3) Check if the PTE changed by re-reading it; back off if so.
>>
>> To make sure GUP-fast won't be able to grab additional references after
>> clearing the PTE, but will properly detect the change and back off, we
>> need a memory barrier between updating the refcount/pincount and checking
>> if it changed.
>>
>> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
>> explicit smp_mb__after_atomic() after the atomic RMW operation to
>> increment the refcount and pincount.
>>
>> ptep_clear_flush(), used to clear the PTE and flush the TLB, should imply
>> a memory barrier for flushing the TLB, so don't add another one for now.
>>
>> PageAnonExclusive handling requires further care and will be handled
>> separately.
>>
>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/gup.c | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5abdaf487460..0008b808f484 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> goto pte_unmap;
>> }
>>
>> + /*
>> + * Update refcount/pincount before testing for changed PTE. This
>> + * is required for code like mm/ksm.c:write_protect_page() that
>> + * wants to make sure that a page has no unknown references
>> + * after clearing the PTE.
>> + */
>> + smp_mb__after_atomic();
>> +
>> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> gup_put_folio(folio, 1, flags);
>> goto pte_unmap;
>> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>> if (!folio)
>> return 0;
>>
>> + /* See gup_pte_range(). */
>
> Don't we usually also identify what each mb pairs with, in the comments? That would help.
Yeah, if only I could locate them reliably (as documented ptep_clear_flush()
should imply one I guess) ... but it will depend on the context.
As I now have the attention of two people that understand that machinery,
here goes the PageAnonExclusive thing I *think* should be correct.
The IPI-based mechanism really did make such synchronization with
GUP-fast easier ...
From 8f91ef3555178149ad560b5424a9854b2ceee2d6 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Sat, 27 Aug 2022 10:44:13 +0200
Subject: [PATCH] mm: rework PageAnonExclusive() interaction with GUP-fast
commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages
with PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
cleared during temporary unmapping of a page, that the PTE is
cleared/invalidated and that the TLB is flushed.
That handling was inspired by an outdated comment in
mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
the past to synchronize with GUP-fast. However, ever since general RCU GUP
fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
get_user_pages_fast()"), a TLB flush is neither sufficient nor
required to synchronize with concurrent GUP-fast.
Peter pointed out that a TLB flush is not required, and looking into
details it turns out that he's right. To synchronize with GUP-fast, it's
sufficient to clear the PTE only: GUP-fast will either detect that the PTE
changed or that PageAnonExclusive is not set and back off. However, we
rely on a given memory order and should make sure that that order is
always respected.
Conceptually, GUP-fast pinning code of anon pages consists of:
(1) Read the PTE
(2) Pin the mapped page
(3) Check if the PTE changed by re-reading it; back off if so.
(4) Check if PageAnonExclusive is not set; back off if so.
Conceptually, PageAnonExclusive clearing code consists of:
(1) Clear PTE
(2) Check if the page is pinned; back off if so.
(3) Clear PageAnonExclusive
(4) Restore PTE (optional)
As GUP-fast temporarily pins the page before validating whether the PTE
changed, and PageAnonExclusive clearing code clears the PTE before
checking if the page is pinned, GUP-fast cannot end up pinning an anon
page that is not exclusive.
One corner case to consider is when we restore the PTE to the same value
after PageAnonExclusive was cleared, as it can happen in
mm/ksm.c:write_protect_page(). In that case, GUP-fast might not detect
a PTE change (because there was none). However, as restoring the PTE
happens after clearing PageAnonExclusive, GUP-fast would detect that
PageAnonExclusive was cleared in that case and would properly back off.
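That corner case can be modeled sequentially like this (all names are illustrative, not the kernel's): the PTE is restored to its old value, so GUP-fast sees no PTE change, but it still backs off because the exclusive flag was cleared in between.

```c
#include <assert.h>
#include <stdbool.h>

struct anon_page {
        unsigned long pte;
        bool anon_exclusive;    /* models PG_anon_exclusive */
        int pincount;
};

/* GUP-fast pinning of an anon page: (2) pin, (3) re-check the PTE
 * read in step (1), (4) check the exclusive flag. */
static bool gup_fast_pin_anon(struct anon_page *p, unsigned long seen)
{
        p->pincount++;                  /* (2) pin the page */
        if (p->pte != seen ||           /* (3) PTE changed? */
            !p->anon_exclusive) {       /* (4) no longer exclusive? */
                p->pincount--;          /* back off */
                return false;
        }
        return true;
}

/* PageAnonExclusive clearing: (1) clear the PTE, (2) back off if the
 * page is pinned, (3) clear the flag, (4) restore the same PTE. */
static bool clear_anon_exclusive(struct anon_page *p)
{
        unsigned long old = p->pte;

        p->pte = 0;                     /* (1) clear PTE */
        if (p->pincount > 0) {          /* (2) pinned? back off */
                p->pte = old;
                return false;
        }
        p->anon_exclusive = false;      /* (3) clear the flag */
        p->pte = old;                   /* (4) restore the same value */
        return true;
}
```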
Let's document that, avoid the TLB flush where possible, and use proper
explicit memory barriers where required. We shouldn't really care about the
additional memory barriers here, as we're not on extremely hot paths.
The possible issues due to reordering are of a theoretical nature so far,
but they had better be addressed.
Note that we don't need a memory barrier between checking if the page is
pinned and clearing PageAnonExclusive, because stores are not
speculated.
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 9 +++++--
include/linux/rmap.h | 58 ++++++++++++++++++++++++++++++++++++++++----
mm/huge_memory.c | 3 +++
mm/ksm.c | 1 +
mm/migrate_device.c | 22 +++++++----------
mm/rmap.c | 11 +++++----
6 files changed, 79 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..f7e8f4b34fb5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2975,8 +2975,8 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
* PageAnonExclusive() has to protect against concurrent GUP:
* * Ordinary GUP: Using the PT lock
* * GUP-fast and fork(): mm->write_protect_seq
- * * GUP-fast and KSM or temporary unmapping (swap, migration):
- * clear/invalidate+flush of the page table entry
+ * * GUP-fast and KSM or temporary unmapping (swap, migration): see
+ * page_try_share_anon_rmap()
*
* Must be called with the (sub)page that's actually referenced via the
* page table entry, which might not necessarily be the head page for a
@@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
*/
if (!PageAnon(page))
return false;
+
+ /* See page_try_share_anon_rmap() for GUP-fast details. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
+ smp_rmb();
+
/*
* Note that PageKsm() pages cannot be exclusive, and consequently,
* cannot get pinned.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf80adca980b..454c159f2aae 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
* @page: the exclusive anonymous page to try marking possibly shared
*
* The caller needs to hold the PT lock and has to have the page table entry
- * cleared/invalidated+flushed, to properly sync against GUP-fast.
+ * cleared/invalidated.
*
* This is similar to page_try_dup_anon_rmap(), however, not used during fork()
* to duplicate a mapping, but instead to prepare for KSM or temporarily
@@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
{
VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
- /* See page_try_dup_anon_rmap(). */
- if (likely(!is_device_private_page(page) &&
- unlikely(page_maybe_dma_pinned(page))))
- return -EBUSY;
+ /* device private pages cannot get pinned via GUP. */
+ if (unlikely(is_device_private_page(page))) {
+ ClearPageAnonExclusive(page);
+ return 0;
+ }
+ /*
+ * We have to make sure that while we clear PageAnonExclusive, that
+ * the page is not pinned and that concurrent GUP-fast won't succeed in
+ * concurrently pinning the page.
+ *
+ * Conceptually, GUP-fast pinning code of anon pages consists of:
+ * (1) Read the PTE
+ * (2) Pin the mapped page
+ * (3) Check if the PTE changed by re-reading it; back off if so.
+ * (4) Check if PageAnonExclusive is not set; back off if so.
+ *
+ * Conceptually, PageAnonExclusive clearing code consists of:
+ * (1) Clear PTE
+ * (2) Check if the page is pinned; back off if so.
+ * (3) Clear PageAnonExclusive
+ * (4) Restore PTE (optional)
+ *
+ * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
+ * the right order. Memory order between (2) and (3) is handled by
+ * GUP-fast, independent of PageAnonExclusive.
+ *
+ * When clearing PageAnonExclusive(), we have to make sure that (1),
+ * (2), (3) and (4) happen in the right order.
+ *
+ * Note that (4) has to happen after (3) in both cases to handle the
+ * corner case whereby the PTE is restored to the original value after
+ * clearing PageAnonExclusive and while GUP-fast might not detect the
+ * PTE change, it will detect the PageAnonExclusive change.
+ *
+ * We assume that there might not be a memory barrier after
+ * clearing/invalidating the PTE (1) and before restoring the PTE (4),
+ * so we use explicit ones here.
+ *
+ * These memory barriers are paired with memory barriers in GUP-fast
+ * code, including gup_must_unshare().
+ */
+
+ /* Clear/invalidate the PTE before checking for PINs. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
+ smp_mb();
+
+ if (unlikely(page_maybe_dma_pinned(page)))
+ return -EBUSY;
ClearPageAnonExclusive(page);
+
+ /* Clear PageAnonExclusive() before eventually restoring the PTE. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
+ smp_mb__after_atomic();
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..2aef8d76fcf2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*
* In case we cannot clear PageAnonExclusive(), split the PMD
* only and let try_to_migrate_one() fail later.
+ *
+ * See page_try_share_anon_rmap(): invalidate PMD first.
*/
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
@@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+ /* See page_try_share_anon_rmap(): invalidate PMD first. */
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
if (anon_exclusive && page_try_share_anon_rmap(page)) {
set_pmd_at(mm, address, pvmw->pmd, pmdval);
diff --git a/mm/ksm.c b/mm/ksm.c
index d7526c705081..971cf923c0eb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
goto out_unlock;
}
+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive && page_try_share_anon_rmap(page)) {
set_pte_at(mm, pvmw.address, pvmw.pte, entry);
goto out_unlock;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..47e955212f15 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
bool anon_exclusive;
pte_t swp_pte;
+ ptep_get_and_clear(mm, addr, ptep);
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
- if (anon_exclusive) {
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
-
- if (page_try_share_anon_rmap(page)) {
- set_pte_at(mm, addr, ptep, pte);
- unlock_page(page);
- put_page(page);
- mpfn = 0;
- goto next;
- }
- } else {
- ptep_get_and_clear(mm, addr, ptep);
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
+ set_pte_at(mm, addr, ptep, pte);
+ unlock_page(page);
+ put_page(page);
+ mpfn = 0;
+ goto next;
}
migrate->cpages++;
diff --git a/mm/rmap.c b/mm/rmap.c
index edc06c52bc82..b3a21ea9f3d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1579,11 +1579,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /*
- * Nuke the page table entry. When having to clear
- * PageAnonExclusive(), we always have to flush.
- */
- if (should_defer_flush(mm, flags) && !anon_exclusive) {
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
/*
* We clear the PTE but do not flush so potentially
* a remote CPU could still be writing to the folio.
@@ -1714,6 +1711,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
swap_free(entry);
@@ -2045,6 +2044,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
}
VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
!anon_exclusive, subpage);
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
if (folio_test_hugetlb(folio))
--
2.37.1
--
Thanks,
David / dhildenb
On Tue, Aug 30, 2022 at 08:53:06PM +0200, David Hildenbrand wrote:
> On 30.08.22 20:45, Jason Gunthorpe wrote:
> > On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
> >> ... and looking into the details of TLB flush and GUP-fast interaction
> >> nowadays, that case is no longer relevant. A TLB flush is no longer
> >> sufficient to stop concurrent GUP-fast ever since we introduced generic
> >> RCU GUP-fast.
> >
> > Yes, we've had RCU GUP fast for a while, and it is more widely used
> > now, IIRC.
> >
> > It has been a bit, but if I remember, GUP fast in RCU mode worked on a
> > few principles:
> >
> > - The PTE page must not be freed without RCU
> > - The PTE page content must be convertible to a struct page using the
> > usual rules (eg PTE Special)
> > - That struct page refcount may go from 0->1 inside the RCU
> > - In the case the refcount goes from 0->1 there must be sufficient
> > barriers such that GUP fast observing the refcount of 1 will also
> > observe the PTE entry has changed. ie before the refcount is
> > dropped in the zap it has to clear the PTE entry, the refcount
> > decr has to be a 'release' and the refcount incr in gup fast has
> > to be an 'acquire'.
> > - The rest of the system must tolerate speculative refcount
> > increments from GUP on any random page
> > The basic idea being that if GUP fast obtains a valid reference on a
> > page *and* the PTE entry has not changed then everything is fine.
> >
> > The tricks with TLB invalidation are just a "poor mans" RCU, and
> > arguably these days aren't really needed since I think we could make
> > everything use real RCU always without penalty if we really wanted.
> >
> > Today we can create a unique 'struct pagetable_page' as Matthew has
> > been doing in other places that guarantees a rcu_head is always
> > available for every page used in a page table. Using that we could
> > drop the code in the TLB flusher that allocates memory for the
> > rcu_head and hopes for the best. (Or even is the common struct page
> > rcu_head already guaranteed to exist for pagetable pages nowadays?)
> >
> > IMHO that is the main reason we still have the non-RCU mode at all..
>
>
> Good, I managed to attract the attention of someone who understands that machinery :)
>
> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> make it work as expected in any possible case), but I also stumbled eventually
> over a more generic issue that might need memory barriers.
>
> Any thoughts whether I am missing something or this is actually missing
> memory barriers?
I don't like the use of smp_mb very much; I deliberately chose the
more modern language of release/acquire because it makes it a lot
clearer what the barriers are doing.
So, if we dig into it, using what I said above, the atomic refcount is:
gup_pte_range()
try_grab_folio()
try_get_folio()
folio_ref_try_add_rcu()
folio_ref_add_unless()
page_ref_add_unless()
atomic_add_unless()
So that wants to be an acquire
The pairing release is in the page table code that does the put_page,
it wants to be an atomic_dec_return() as a release.
Now, we go and look at Documentation/atomic_t.txt to try to understand
what are the ordering semantics of the atomics we are using and become
dazed-confused like me:
ORDERING (go read memory-barriers.txt first)
--------
- RMW operations that have a return value are fully ordered;
- RMW operations that are conditional are unordered on FAILURE,
otherwise the above rules apply.
Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.
So, I take that to mean that both atomic_add_unless() and
atomic_dec_return() are "fully ordered" and "fully ordered" is a super
set of acquire/release.
Thus, we already have the necessary barriers integrated into the
atomic being used.
The smp_mb__after_atomic stuff is to be used with atomics that don't
return values; there are some examples in the doc
Jason
On 8/30/22 12:57, Jason Gunthorpe wrote:
> I don't like the use of smb_mb very much, I deliberately choose the
> more modern language of release/acquire because it makes it a lot
> clearer what barriers are doing..
>
> So, if we dig into it, using what I said above, the atomic refcount is:
>
> gup_pte_range()
> try_grab_folio()
> try_get_folio()
> folio_ref_try_add_rcu()
> folio_ref_add_unless()
> page_ref_add_unless()
> atomic_add_unless()
>
> So that wants to be an acquire
>
> The pairing release is in the page table code that does the put_page,
> it wants to be an atomic_dec_return() as a release.
Thanks for making that a lot clearer, at least for me anyway!
>
> Now, we go and look at Documentation/atomic_t.txt to try to understand
> what are the ordering semantics of the atomics we are using and become
> dazed-confused like me:
>
> ORDERING (go read memory-barriers.txt first)
> --------
>
> - RMW operations that have a return value are fully ordered;
>
> - RMW operations that are conditional are unordered on FAILURE,
> otherwise the above rules apply.
>
> Fully ordered primitives are ordered against everything prior and everything
> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
> before and an smp_mb() after the primitive.
>
> So, I take that to mean that both atomic_add_unless() and
> atomic_dec_return() are "fully ordered" and "fully ordered" is a super
> set of acquire/release.
>
> Thus, we already have the necessary barriers integrated into the
> atomic being used.
As long as we continue to sort-of-accidentally use atomic_add_unless(),
which returns a value, instead of atomic_add(), which does not. :)
Likewise on the put_page() side: we are depending on the somewhat
accidental (from the perspective of memory barriers) use of
atomics that return values.
Maybe it would be good to add a little note at each site, to that
effect?
>
> The smb_mb_after_atomic stuff is to be used with atomics that don't
> return values, there are some examples in the doc
>
> Jason
thanks,
--
John Hubbard
NVIDIA
On Tue, Aug 30, 2022 at 01:12:53PM -0700, John Hubbard wrote:
> As long as we continue to sort-of-accidentally use atomic_add_unless(),
> which returns a value, instead of atomic_add(), which does not. :)
I should say I didn't have time to carefully check what put_page was
doing, but IIRC it is an atomic_dec_return to decide if it should free
the page.
The conditional is pretty much inherent to the model: because 0 is
special, it always has to be checked. Not so accidental
But you might make the case that these could be the relaxed versions
of the atomic...
> Likewise on the put_page() side: we are depending on the somewhat
> accidental (from the perspective of memory barriers) use of
> atomics that return values.
>
> Maybe it would be good to add a little note at each site, to that
> effect?
It would be fantastic if things that are required to be
acquire/releases are documented as acquire/release :)
It is incredibly subtle stuff and all carefully inlined for maximum
performance.
Jason
On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> + /* Clear/invalidate the PTE before checking for PINs. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;
It is usually a bad sign to see an attempt to create a "read release"..
> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }
I don't know enough about the memory model to say if this is OK..
Generally, I've never seen an algorithm be successfull with these
kinds of multi-atomic gyrations.
If we break it down a bit, and replace the 'read release' with an
actual atomic for discussion:
CPU0                                    CPU1
clear pte
incr_return ref // release & acquire
                                        add_ref // acquire
This seems OK, if CPU0 views !dma then CPU1 must view the cleared pte due to
the atomic's release/acquire semantic
If CPU0 views dma then it just exits
Now the second phase:
CPU0                                    CPU1
clear anon_exclusive
restore pte // release
                                        read_pte // acquire
                                        read anon_exclusive
If CPU1 observes the restored PTE then it must observe the cleared
anon_exclusive
Otherwise CPU1 must observe the cleared PTE.
So, maybe I could convince myself it is OK, but I think your placement
of barriers is confusing as to what data the barrier is actually
linked to.
We are using a barrier around the ref - acquire on the CPU0 and a full
barrier on the CPU1 (eg atomic_read(); smp_mb__after_atomic())
The second phase uses a smp_store_release/load_acquire on the PTE.
It is the same barriers you sketched but integrated with the data they
are ordering.
Jason
On 30.08.22 21:57, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 08:53:06PM +0200, David Hildenbrand wrote:
>> On 30.08.22 20:45, Jason Gunthorpe wrote:
>>> On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
>>>> ... and looking into the details of TLB flush and GUP-fast interaction
>>>> nowadays, that case is no longer relevant. A TLB flush is no longer
>>>> sufficient to stop concurrent GUP-fast ever since we introduced generic
>>>> RCU GUP-fast.
>>>
>>> Yes, we've had RCU GUP fast for a while, and it is more widely used
>>> now, IIRC.
>>>
>>> It has been a bit, but if I remember, GUP fast in RCU mode worked on a
>>> few principles:
>>>
>>> - The PTE page must not be freed without RCU
>>> - The PTE page content must be convertable to a struct page using the
>>> usual rules (eg PTE Special)
>>> - That struct page refcount may go from 0->1 inside the RCU
>>> - In the case the refcount goes from 0->1 there must be sufficient
>>> barriers such that GUP fast observing the refcount of 1 will also
>>> observe the PTE entry has changed. ie before the refcount is
>>> dropped in the zap it has to clear the PTE entry, the refcount
>>> decr has to be a 'release' and the refcount incr in gup fast has be
>>> to be an 'acquire'.
>>> - The rest of the system must tolerate speculative refcount
>>> increments from GUP on any random page
> >> The basic idea being that if GUP fast obtains a valid reference on a
>>> page *and* the PTE entry has not changed then everything is fine.
>>>
>>> The tricks with TLB invalidation are just a "poor mans" RCU, and
>>> arguably these days aren't really needed since I think we could make
>>> everything use real RCU always without penalty if we really wanted.
>>>
>>> Today we can create a unique 'struct pagetable_page' as Matthew has
>>> been doing in other places that guarentees a rcu_head is always
>>> available for every page used in a page table. Using that we could
>>> drop the code in the TLB flusher that allocates memory for the
>>> rcu_head and hopes for the best. (Or even is the common struct page
>>> rcu_head already guarenteed to exist for pagetable pages now a days?)
>>>
>>> IMHO that is the main reason we still have the non-RCU mode at all..
>>
>>
>> Good, I managed to attract the attention of someone who understands that machinery :)
>>
>> While validating whether GUP-fast and PageAnonExclusive code work correctly,
>> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
>> improve PageAnonExclusive clearing (I think we're missing memory barriers to
>> make it work as expected in any possible case), but I also stumbled eventually
>> over a more generic issue that might need memory barriers.
>>
>> Any thoughts whether I am missing something or this is actually missing
>> memory barriers?
>
> I don't like the use of smb_mb very much, I deliberately choose the
> more modern language of release/acquire because it makes it a lot
> clearer what barriers are doing..
>
> So, if we dig into it, using what I said above, the atomic refcount is:
>
> gup_pte_range()
> try_grab_folio()
> try_get_folio()
> folio_ref_try_add_rcu()
> folio_ref_add_unless()
> page_ref_add_unless()
> atomic_add_unless()
Right, that seems to work as expected for checking the refcount after
clearing the PTE.
Unfortunately, it's not sufficient to identify whether a page may be
pinned, because the flow continues as

folio = try_get_folio(page, refs)
...
if (folio_test_large(folio))
        atomic_add(refs, folio_pincount_ptr(folio));
else
        folio_ref_add(folio, ...)

So I guess we'd need a smp_mb() before re-checking the PTE for that case.
>
> So that wants to be an acquire
>
> The pairing release is in the page table code that does the put_page,
> it wants to be an atomic_dec_return() as a release.
>
> Now, we go and look at Documentation/atomic_t.txt to try to understand
> what are the ordering semantics of the atomics we are using and become
> dazed-confused like me:
I read that 3 times and got dizzy. Thanks for double-checking, very much
appreciated!
>
> ORDERING (go read memory-barriers.txt first)
> --------
>
> - RMW operations that have a return value are fully ordered;
>
> - RMW operations that are conditional are unordered on FAILURE,
> otherwise the above rules apply.
>
> Fully ordered primitives are ordered against everything prior and everything
> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
> before and an smp_mb() after the primitive.
>
> So, I take that to mean that both atomic_add_unless() and
> atomic_dec_return() are "fully ordered" and "fully ordered" is a super
> set of acquire/release.
>
> Thus, we already have the necessary barriers integrated into the
> atomic being used.
>
> The smb_mb_after_atomic stuff is to be used with atomics that don't
> return values, there are some examples in the doc
Yes, I missed the point that RMW operations that return a value are
fully ordered and imply smp_mb() before / after.
--
Thanks,
David / dhildenb
On 31.08.22 01:44, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
>> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
>> */
>> if (!PageAnon(page))
>> return false;
>> +
>> + /* See page_try_share_anon_rmap() for GUP-fast details. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
>> + smp_rmb();
>> +
>> /*
>> * Note that PageKsm() pages cannot be exclusive, and consequently,
>> * cannot get pinned.
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index bf80adca980b..454c159f2aae 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
>> * @page: the exclusive anonymous page to try marking possibly shared
>> *
>> * The caller needs to hold the PT lock and has to have the page table entry
>> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
>> + * cleared/invalidated.
>> *
>> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
>> * to duplicate a mapping, but instead to prepare for KSM or temporarily
>> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
>> {
>> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>>
>> - /* See page_try_dup_anon_rmap(). */
>> - if (likely(!is_device_private_page(page) &&
>> - unlikely(page_maybe_dma_pinned(page))))
>> - return -EBUSY;
>> + /* device private pages cannot get pinned via GUP. */
>> + if (unlikely(is_device_private_page(page))) {
>> + ClearPageAnonExclusive(page);
>> + return 0;
>> + }
>>
>> + /*
>> + * We have to make sure that while we clear PageAnonExclusive, that
>> + * the page is not pinned and that concurrent GUP-fast won't succeed in
>> + * concurrently pinning the page.
>> + *
>> + * Conceptually, GUP-fast pinning code of anon pages consists of:
>> + * (1) Read the PTE
>> + * (2) Pin the mapped page
>> + * (3) Check if the PTE changed by re-reading it; back off if so.
>> + * (4) Check if PageAnonExclusive is not set; back off if so.
>> + *
>> + * Conceptually, PageAnonExclusive clearing code consists of:
>> + * (1) Clear PTE
>> + * (2) Check if the page is pinned; back off if so.
>> + * (3) Clear PageAnonExclusive
>> + * (4) Restore PTE (optional)
>> + *
>> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
>> + * the right order. Memory order between (2) and (3) is handled by
>> + * GUP-fast, independent of PageAnonExclusive.
>> + *
>> + * When clearing PageAnonExclusive(), we have to make sure that (1),
>> + * (2), (3) and (4) happen in the right order.
>> + *
>> + * Note that (4) has to happen after (3) in both cases to handle the
>> + * corner case whereby the PTE is restored to the original value after
>> + * clearing PageAnonExclusive and while GUP-fast might not detect the
>> + * PTE change, it will detect the PageAnonExclusive change.
>> + *
>> + * We assume that there might not be a memory barrier after
>> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
>> + * so we use explicit ones here.
>> + *
>> + * These memory barriers are paired with memory barriers in GUP-fast
>> + * code, including gup_must_unshare().
>> + */
>> +
>> + /* Clear/invalidate the PTE before checking for PINs. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb();
>> +
>> + if (unlikely(page_maybe_dma_pinned(page)))
>> + return -EBUSY;
>
> It is usually a bad sign to see an attempt to create a "read release"..
I still have to get used to the acquire/release semantics ... :)
>
>> ClearPageAnonExclusive(page);
>> +
>> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb__after_atomic();
>> return 0;
>> }
>
> I don't know enough about the memory model to say if this is OK..
I guess it's best to include some memory model folks once we have something
that looks reasonable.
>
> Generally, I've never seen an algorithm be successfull with these
> kinds of multi-atomic gyrations.
Yeah, I'm absolutely looking for a nicer alternative to sync with
RCU GUP-fast. So far I wasn't successful.
>
> If we break it down a bit, and replace the 'read release' with an
> actual atomic for discussion:
>
>
> CPU0                                    CPU1
> clear pte
> incr_return ref // release & acquire
>                                         add_ref // acquire
>
> This seems OK, if CPU0 views !dma then CPU1 must view the cleared pte due to
> the atomic's release/acquire semantic
>
> If CPU0 views dma then it just exits
>
>
> Now the second phase:
>
> CPU0                                    CPU1
> clear anon_exclusive
> restore pte // release
>
>                                         read_pte // acquire
>                                         read anon_exclusive
>
> If CPU1 observes the restored PTE then it must observe the cleared
> anon_exclusive
>
> Otherwise CPU1 must observe the cleared PTE.
>
> So, maybe I could convince myself it is OK, but I think your placement
> of barriers is confusing as to what data the barrier is actually
> linked to.
>
> We are using a barrier around the ref - acquire on the CPU0 and a full
> barrier on the CPU1 (eg atomic_read(); smp_mb__after_atomic())
When dropping the other patch, I think we still need something like
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..8c5ff41d5e56 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -158,6 +158,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
else
folio_ref_add(folio,
refs * (GUP_PIN_COUNTING_BIAS - 1));
+ /*
+ * Adjust the pincount before re-checking the PTE for changes.
+ *
+ * Paired with a memory barrier in page_try_share_anon_rmap().
+ */
+ smp_mb__after_atomic();
+
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
return folio;
>
> The second phase uses a smp_store_release/load_acquire on the PTE.
>
> It is the same barriers you sketched but integrated with the data they
> are ordering.
Sorry for having to ask, but what exactly would be your suggestion?
Thanks for having a look!
--
Thanks,
David / dhildenb
On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> On 30.08.22 21:18, John Hubbard wrote:
> > On 8/30/22 11:53, David Hildenbrand wrote:
> >> Good, I managed to attract the attention of someone who understands that machinery :)
> >>
> >> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> >> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> >> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> >> make it work as expected in any possible case), but I also stumbled eventually
> >> over a more generic issue that might need memory barriers.
> >>
> >> Any thoughts whether I am missing something or this is actually missing
> >> memory barriers?
> >>
> >
> > It's actually missing memory barriers.
> >
> > In fact, others have had that same thought! [1] :) In that 2019 thread,
> > I recall that this got dismissed because of a focus on the IPI-based
> > aspect of gup fast synchronization (there was some hand waving, perhaps
> > accurate waving, about memory barriers vs. CPU interrupts). But now that the
> > RCU (non-IPI) implementation is more widely used than it used to be, the
> > issue is clearer.
> >
> >>
> >> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> >> From: David Hildenbrand <[email protected]>
> >> Date: Mon, 29 Aug 2022 16:57:07 +0200
> >> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
> >> changed
> >>
> >> mm/ksm.c:write_protect_page() has to make sure that no unknown
> >> references to a mapped page exist and that no additional ones with write
> >> permissions are possible -- unknown references could have write permissions
> >> and modify the page afterwards.
> >>
> >> Conceptually, mm/ksm.c:write_protect_page() consists of:
> >> (1) Clear/invalidate PTE
> >> (2) Check if there are unknown references; back off if so.
> >> (3) Update PTE (e.g., map it R/O)
> >>
> >> Conceptually, GUP-fast code consists of:
> >> (1) Read the PTE
> >> (2) Increment refcount/pincount of the mapped page
> >> (3) Check if the PTE changed by re-reading it; back off if so.
> >>
> >> To make sure GUP-fast won't be able to grab additional references after
> >> clearing the PTE, but will properly detect the change and back off, we
> >> need a memory barrier between updating the refcount/pincount and checking
> >> if it changed.
> >>
> >> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> >> explicit smp_mb__after_atomic() after the atomic RMW operation to
> >> increment the refcount and pincount.
> >>
> >> ptep_clear_flush() used to clear the PTE and flush the TLB should imply
> >> a memory barrier for flushing the TLB, so don't add another one for now.
> >>
> >> PageAnonExclusive handling requires further care and will be handled
> >> separately.
> >>
> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> >> Signed-off-by: David Hildenbrand <[email protected]>
> >> ---
> >> mm/gup.c | 17 +++++++++++++++++
> >> 1 file changed, 17 insertions(+)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 5abdaf487460..0008b808f484 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >> goto pte_unmap;
> >> }
> >>
> >> + /*
> >> + * Update refcount/pincount before testing for changed PTE. This
> >> + * is required for code like mm/ksm.c:write_protect_page() that
> >> + * wants to make sure that a page has no unknown references
> >> + * after clearing the PTE.
> >> + */
> >> + smp_mb__after_atomic();
> >> +
> >> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >> gup_put_folio(folio, 1, flags);
> >> goto pte_unmap;
> >> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> >> if (!folio)
> >> return 0;
> >>
> >> + /* See gup_pte_range(). */
> >
> > Don't we usually also identify what each mb pairs with, in the comments? That would help.
>
> Yeah, if only I could locate them reliably (as documented ptep_clear_flush()
> should imply one I guess) ... but it will depend on the context.
>
>
> As I now have the attention of two people that understand that machinery,
> here goes the PageAnonExclusive thing I *think* should be correct.
>
> The IPI-based mechanism really did make such synchronization with
> GUP-fast easier ...
>
>
>
> From 8f91ef3555178149ad560b5424a9854b2ceee2d6 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <[email protected]>
> Date: Sat, 27 Aug 2022 10:44:13 +0200
> Subject: [PATCH] mm: rework PageAnonExclusive() interaction with GUP-fast
>
> commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with
> PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
> cleared during temporary unmapping of a page, that the PTE is
> cleared/invalidated and that the TLB is flushed.
>
> That handling was inspired by an outdated comment in
> mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
> the past to synchronize with GUP-fast. However, ever since general RCU GUP
> fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
> get_user_pages_fast()"), a TLB flush is neither sufficient nor required to
> synchronize with concurrent GUP-fast.
> 
> Peter pointed out that the TLB flush is not required, and looking into
> details it turns out that he's right. To synchronize with GUP-fast, it's
> sufficient to clear the PTE only: GUP-fast will either detect that the PTE
> changed or that PageAnonExclusive is not set and back off. However, we
> rely on a given memory order and should make sure that that order is
> always respected.
>
> Conceptually, GUP-fast pinning code of anon pages consists of:
> (1) Read the PTE
> (2) Pin the mapped page
> (3) Check if the PTE changed by re-reading it; back off if so.
> (4) Check if PageAnonExclusive is not set; back off if so.
>
> Conceptually, PageAnonExclusive clearing code consists of:
> (1) Clear PTE
> (2) Check if the page is pinned; back off if so.
> (3) Clear PageAnonExclusive
> (4) Restore PTE (optional)
>
> As GUP-fast temporarily pins the page before validating whether the PTE
> changed, and PageAnonExclusive clearing code clears the PTE before
> checking if the page is pinned, GUP-fast cannot end up pinning an anon
> page that is not exclusive.
>
> One corner case to consider is when we restore the PTE to the same value
> after PageAnonExclusive was cleared, as it can happen in
> mm/ksm.c:write_protect_page(). In that case, GUP-fast might not detect
> a PTE change (because there was none). However, as restoring the PTE
> happens after clearing PageAnonExclusive, GUP-fast would detect that
> PageAnonExclusive was cleared in that case and would properly back off.
>
> Let's document that, avoid the TLB flush where possible and use proper
> explicit memory barriers where required. We shouldn't really care about the
> additional memory barriers here, as we're not on extremely hot paths.
>
> The possible issues due to reordering are of a theoretical nature so far,
> but they had better be addressed.
>
> Note that we don't need a memory barrier between checking if the page is
> pinned and clearing PageAnonExclusive, because stores are not
> speculated.
>
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/mm.h | 9 +++++--
> include/linux/rmap.h | 58 ++++++++++++++++++++++++++++++++++++++++----
> mm/huge_memory.c | 3 +++
> mm/ksm.c | 1 +
> mm/migrate_device.c | 22 +++++++----------
> mm/rmap.c | 11 +++++----
> 6 files changed, 79 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..f7e8f4b34fb5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2975,8 +2975,8 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> * PageAnonExclusive() has to protect against concurrent GUP:
> * * Ordinary GUP: Using the PT lock
> * * GUP-fast and fork(): mm->write_protect_seq
> - * * GUP-fast and KSM or temporary unmapping (swap, migration):
> - * clear/invalidate+flush of the page table entry
> + * * GUP-fast and KSM or temporary unmapping (swap, migration): see
> + * page_try_share_anon_rmap()
> *
> * Must be called with the (sub)page that's actually referenced via the
> * page table entry, which might not necessarily be the head page for a
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> + /* Clear/invalidate the PTE before checking for PINs. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
Wondering whether this could be smp_mb__before_atomic().
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;
> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..2aef8d76fcf2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> *
> * In case we cannot clear PageAnonExclusive(), split the PMD
> * only and let try_to_migrate_one() fail later.
> + *
> + * See page_try_share_anon_rmap(): invalidate PMD first.
> */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
> @@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>
> + /* See page_try_share_anon_rmap(): invalidate PMD first. */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> if (anon_exclusive && page_try_share_anon_rmap(page)) {
> set_pmd_at(mm, address, pvmw->pmd, pmdval);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d7526c705081..971cf923c0eb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> goto out_unlock;
> }
>
> + /* See page_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive && page_try_share_anon_rmap(page)) {
> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> goto out_unlock;
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..47e955212f15 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> bool anon_exclusive;
> pte_t swp_pte;
>
flush_cache_page() missing here?
Better copy Alistair too when posting formally, since this will have a slight
conflict with the other thread.
> + ptep_get_and_clear(mm, addr, ptep);
> +
> + /* See page_try_share_anon_rmap(): clear PTE first. */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> - if (anon_exclusive) {
> - flush_cache_page(vma, addr, pte_pfn(*ptep));
> - ptep_clear_flush(vma, addr, ptep);
> -
> - if (page_try_share_anon_rmap(page)) {
> - set_pte_at(mm, addr, ptep, pte);
> - unlock_page(page);
> - put_page(page);
> - mpfn = 0;
> - goto next;
> - }
> - } else {
> - ptep_get_and_clear(mm, addr, ptep);
> + if (anon_exclusive && page_try_share_anon_rmap(page)) {
> + set_pte_at(mm, addr, ptep, pte);
> + unlock_page(page);
> + put_page(page);
> + mpfn = 0;
> + goto next;
> }
>
> migrate->cpages++;
Thanks,
--
Peter Xu
[...]
>> + /*
>> + * We have to make sure that while we clear PageAnonExclusive, that
>> + * the page is not pinned and that concurrent GUP-fast won't succeed in
>> + * concurrently pinning the page.
>> + *
>> + * Conceptually, GUP-fast pinning code of anon pages consists of:
>> + * (1) Read the PTE
>> + * (2) Pin the mapped page
>> + * (3) Check if the PTE changed by re-reading it; back off if so.
>> + * (4) Check if PageAnonExclusive is not set; back off if so.
>> + *
>> + * Conceptually, PageAnonExclusive clearing code consists of:
>> + * (1) Clear PTE
>> + * (2) Check if the page is pinned; back off if so.
>> + * (3) Clear PageAnonExclusive
>> + * (4) Restore PTE (optional)
>> + *
>> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
>> + * the right order. Memory order between (2) and (3) is handled by
>> + * GUP-fast, independent of PageAnonExclusive.
>> + *
>> + * When clearing PageAnonExclusive(), we have to make sure that (1),
>> + * (2), (3) and (4) happen in the right order.
>> + *
>> + * Note that (4) has to happen after (3) in both cases to handle the
>> + * corner case whereby the PTE is restored to the original value after
>> + * clearing PageAnonExclusive and while GUP-fast might not detect the
>> + * PTE change, it will detect the PageAnonExclusive change.
>> + *
>> + * We assume that there might not be a memory barrier after
>> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
>> + * so we use explicit ones here.
>> + *
>> + * These memory barriers are paired with memory barriers in GUP-fast
>> + * code, including gup_must_unshare().
>> + */
>> +
>> + /* Clear/invalidate the PTE before checking for PINs. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb();
>
> Wondering whether this could be smp_mb__before_atomic().
We'll read via atomic_read().
That's a non-RMW operation. smp_mb__before_atomic() only applies to
RMW (Read Modify Write) operations.
I have an updated patch with improved description/comments, that includes the
following explanation/example and showcases how the two barrier pairs
interact:
Thread 0 (KSM)                     Thread 1 (GUP-fast)

                                   (B1) Read the PTE
                                   # (B2) skipped without FOLL_WRITE
(A1) Clear PTE
smp_mb()
(A2) Check pinned
                                   (B3) Pin the mapped page
                                   smp_mb()
(A3) Clear PageAnonExclusive
smp_wmb()
(A4) Restore PTE
                                   (B4) Check if the PTE changed
                                   smp_rmb()
                                   (B5) Check PageAnonExclusive
>
>> +
>> + if (unlikely(page_maybe_dma_pinned(page)))
>> + return -EBUSY;
>> ClearPageAnonExclusive(page);
>> +
>> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb__after_atomic();
>> return 0;
>> }
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..2aef8d76fcf2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> *
>> * In case we cannot clear PageAnonExclusive(), split the PMD
>> * only and let try_to_migrate_one() fail later.
>> + *
>> + * See page_try_share_anon_rmap(): invalidate PMD first.
>> */
>> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>> if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
>> @@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>> pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>>
>> + /* See page_try_share_anon_rmap(): invalidate PMD first. */
>> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>> if (anon_exclusive && page_try_share_anon_rmap(page)) {
>> set_pmd_at(mm, address, pvmw->pmd, pmdval);
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index d7526c705081..971cf923c0eb 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>> goto out_unlock;
>> }
>>
>> + /* See page_try_share_anon_rmap(): clear PTE first. */
>> if (anon_exclusive && page_try_share_anon_rmap(page)) {
>> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
>> goto out_unlock;
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d65476..47e955212f15 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> bool anon_exclusive;
>> pte_t swp_pte;
>>
>
> flush_cache_page() missing here?
Hmm, wouldn't that already be missing on the !anon path right now?
>
> Better copy Alistair too when posting formally, since this will have a slight
> conflict with the other thread.
Yes, I'll give him a heads-up right away: full patch in
https://lkml.kernel.org/r/[email protected]
Thanks for having a look, Peter!
--
Thanks,
David / dhildenb
On Wed, Aug 31, 2022 at 06:31:23PM +0200, David Hildenbrand wrote:
> >> + /* Clear/invalidate the PTE before checking for PINs. */
> >> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> >> + smp_mb();
> >
> > Wondering whether this could be smp_mb__before_atomic().
>
> We'll read via atomic_read().
>
> That's a non-RMW operation. smp_mb__before_atomic() only applies to
> RMW (Read Modify Write) operations.
Ah right.
> >> diff --git a/mm/ksm.c b/mm/ksm.c
> >> index d7526c705081..971cf923c0eb 100644
> >> --- a/mm/ksm.c
> >> +++ b/mm/ksm.c
> >> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >> goto out_unlock;
> >> }
> >>
> >> + /* See page_try_share_anon_rmap(): clear PTE first. */
> >> if (anon_exclusive && page_try_share_anon_rmap(page)) {
> >> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> >> goto out_unlock;
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 27fb37d65476..47e955212f15 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> bool anon_exclusive;
> >> pte_t swp_pte;
> >>
> >
> > flush_cache_page() missing here?
>
> Hmm, wouldn't that already be missing on the !anon path right now?
Yes, I think Alistair plans to fix it too in the other patchset. So either
this will rebase to that or it should fix it too. Thanks,
--
Peter Xu
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index d7526c705081..971cf923c0eb 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>>>> goto out_unlock;
>>>> }
>>>>
>>>> + /* See page_try_share_anon_rmap(): clear PTE first. */
>>>> if (anon_exclusive && page_try_share_anon_rmap(page)) {
>>>> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
>>>> goto out_unlock;
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 27fb37d65476..47e955212f15 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> bool anon_exclusive;
>>>> pte_t swp_pte;
>>>>
>>>
>>> flush_cache_page() missing here?
>>
>> Hmm, wouldn't that already be missing on the !anon path right now?
>
> Yes, I think Alistair plans to fix it too in the other patchset. So either
> this will rebase to that or it should fix it too. Thanks,
>
I'll include it in this patch for now, because by dropping it I would
make the situation "worse". But most probably we want a separate fix
upfront that we can properly backport to older kernels.
Thanks!
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> writes:
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index d7526c705081..971cf923c0eb 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>>>>> goto out_unlock;
>>>>> }
>>>>>
>>>>> + /* See page_try_share_anon_rmap(): clear PTE first. */
>>>>> if (anon_exclusive && page_try_share_anon_rmap(page)) {
>>>>> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
>>>>> goto out_unlock;
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 27fb37d65476..47e955212f15 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> bool anon_exclusive;
>>>>> pte_t swp_pte;
>>>>>
>>>>
>>>> flush_cache_page() missing here?
>>>
>>> Hmm, wouldn't that already be missing on the !anon path right now?
>>
>> Yes, I think Alistair plans to fix it too in the other patchset. So either
>> this will rebase to that or it should fix it too. Thanks,
>>
Thanks for the heads up. I'm still digesting this thread, but I do plan
on fixing that up in my series, an updated version of which I hope to
post tomorrow.
> I'll include it in this patch for now, because by dropping it I would
> make the situation "worse". But most probably we want a separate fix
> upfront that we can properly backport to older kernels.
Yeah, probably best if we can rebase this on my fix-up series.
> Thanks!