Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which neither
triggers write notify nor enforces that the caller marks the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned, and the caller then writes to the
folio again via the GUP interface.
As a result of the use of this secondary, direct mapping to the folio, no
write notify will occur, and if the caller does mark the folio dirty, it
will do so unexpectedly.
For example, consider the following scenario:
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This series updates both the slow and fast pin_user_pages() (PUP)
FOLL_LONGTERM paths. As no pin_user_pages_fast_only() API exists, we can
rely on a slightly imperfect whitelist in the PUP-fast case and fall back
to the slow path should this fail.
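As a rough illustration, the problematic pattern looks something like the
following (a minimal sketch with a hypothetical helper; assumes len fits
within one page and the pinned page is accessible via the direct map):

	/* Hypothetical sketch of the broken long-term pin pattern. */
	static int pin_and_scribble(unsigned long uaddr, const char *data,
				    size_t len)
	{
		struct page *page;
		int ret;

		/* Step 1: write-faults the mapping, notifying the file system. */
		ret = pin_user_pages_fast(uaddr, 1,
					  FOLL_WRITE | FOLL_LONGTERM, &page);
		if (ret != 1)
			return ret < 0 ? ret : -EFAULT;

		/*
		 * Step 2 can occur at any point from here: writeback cleans
		 * the folio and write-protects the PTE. The kernel mapping
		 * is unaffected.
		 */

		/* Step 3: write via the direct mapping - no write notify. */
		memcpy(page_address(page), data, len);

		/* Step 4: unpin and (optionally) dirty the folio unexpectedly. */
		unpin_user_pages_dirty_lock(&page, 1, true);
		return 0;
	}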
v7:
- Fixed very silly bug in writeable_file_mapping_allowed() inverting the
logic.
- Removed unnecessary RCU lock code and replaced with adaptation of Peter's
idea.
- Removed unnecessary open-coded folio_test_anon() in
folio_longterm_write_pin_allowed() and restructured to generally permit
NULL folio_mapping().
v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
correctly.
- Split patch series into 3 to make it more digestible.
https://lore.kernel.org/all/[email protected]/
v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/
v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
  reduce duplication and update to use this in the GUP check. Note that
  both separately check vm_ops_needs_writenotify(), as the latter needs to
  test this before the vm_pgprot_modify() test, resulting in
  vma_wants_writenotify() checking this twice; however, it is such a small
  check that this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/
v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
vma_wants_writenotify() could not be used directly as it is very much focused
on determining if the PTE r/w should be set (e.g. assuming private mapping
does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/
v2:
- Add accidentally excluded ptrace_access_vm() use of
FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/
v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/
Lorenzo Stoakes (3):
mm/mmap: separate writenotify and dirty tracking logic
mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
mappings
mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
mappings
include/linux/mm.h | 1 +
mm/gup.c | 105 +++++++++++++++++++++++++++++++++++++++++++--
mm/mmap.c | 36 ++++++++++++----
3 files changed, 130 insertions(+), 12 deletions(-)
--
2.40.1
vma_wants_writenotify() is specifically intended for setting PTE page table
flags, accounting for existing page table flag state and whether the PTE
might already be read-only, while mixing this check with a check of whether
the filesystem performs dirty tracking.
Separate out the notions of dirty tracking and PTE write notify checking so
that the dirty tracking check can be invoked from elsewhere.
Note that this change introduces a very small duplicate check of the
separated out vm_ops_needs_writenotify(). This is necessary to avoid making
vma_needs_dirty_tracking() needlessly complicated (e.g. passing a
check_writenotify flag or having it assume this check was already
performed). This is such a small check that it doesn't seem too egregious
to do this.
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
include/linux/mm.h | 1 +
mm/mmap.c | 36 +++++++++++++++++++++++++++---------
2 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..7b1d4e7393ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index 5522130ae606..295c5f2e9bd9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */
+/* Do VMA operations imply write notify is required? */
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+ return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+ /* Does the filesystem need to be notified? */
+ if (vm_ops_needs_writenotify(vma->vm_ops))
+ return true;
+
+ /* Specialty mapping? */
+ if (vma->vm_flags & VM_PFNMAP)
+ return false;
+
+ /* Can the mapping track the dirty pages? */
+ return vma->vm_file && vma->vm_file->f_mapping &&
+ mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
/*
* Some shared mappings will want the pages marked read-only
* to track write events. If so, we'll downgrade vm_page_prot
@@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
{
vm_flags_t vm_flags = vma->vm_flags;
- const struct vm_operations_struct *vm_ops = vma->vm_ops;
/* If it was private or non-writable, the write bit is already clear */
if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
return 0;
/* The backer wishes to know when pages are first written to? */
- if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+ if (vm_ops_needs_writenotify(vma->vm_ops))
return 1;
/* The open routine did something to the protections that pgprot_modify
@@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
if (userfaultfd_wp(vma))
return 1;
- /* Specialty mapping? */
- if (vm_flags & VM_PFNMAP)
- return 0;
-
- /* Can the mapping track the dirty pages? */
- return vma->vm_file && vma->vm_file->f_mapping &&
- mapping_can_writeback(vma->vm_file->f_mapping);
+ return vma_needs_dirty_tracking(vma);
}
/*
--
2.40.1
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which neither
triggers write notify nor enforces that the caller marks the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned, and the caller then writes to the
folio again via the GUP interface.
As a result of the use of this secondary, direct mapping to the folio, no
write notify will occur, and if the caller does mark the folio dirty, it
will do so unexpectedly.
For example, consider the following scenario:
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).
This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.
This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.
Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.
However, for the time being we introduce this check to prevent the most
egregious case of this occurring: use of the FOLL_LONGTERM pin.
These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.
This patch changes only the slow-path GUP functions, a following patch
adapts the GUP-fast path along similar lines.
[1]: https://lore.kernel.org/linux-mm/[email protected]/
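As a rough sketch of the caller-visible effect (hypothetical setup; the
vmas argument reflects the signature as of this series' base):

	struct page *page;
	long ret;

	/*
	 * A FOLL_LONGTERM write pin of e.g. an ext4-backed MAP_SHARED
	 * mapping now fails with -EFAULT from check_vma_flags(), while
	 * anonymous, shmem and hugetlb memory remain pinnable.
	 */
	ret = pin_user_pages(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page, NULL);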
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
mm/gup.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..6e209ca10967 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,53 @@ static int faultin_page(struct vm_area_struct *vma,
return 0;
}
+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * do not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ * the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ * the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ * direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ * (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
+ unsigned long gup_flags)
+{
+ /*
+ * If we aren't pinning then no problematic write can occur. A long term
+ * pin is the most egregious case so this is the case we disallow.
+ */
+ if (!(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
+ return true;
+
+ /*
+ * If the VMA does not require dirty tracking then no problematic write
+ * can occur either.
+ */
+ return !vma_needs_dirty_tracking(vma);
+}
+
static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
vm_flags_t vm_flags = vma->vm_flags;
int write = (gup_flags & FOLL_WRITE);
int foreign = (gup_flags & FOLL_REMOTE);
+ bool vma_anon = vma_is_anonymous(vma);
if (vm_flags & (VM_IO | VM_PFNMAP))
return -EFAULT;
- if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+ if ((gup_flags & FOLL_ANON) && !vma_anon)
return -EFAULT;
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1015,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
return -EFAULT;
if (write) {
+ if (!vma_anon &&
+ !writeable_file_mapping_allowed(vma, gup_flags))
+ return -EFAULT;
+
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
--
2.40.1
Writing to file-backed dirty-tracked mappings via GUP is inherently broken,
as we cannot rule out folios being cleaned and a GUP user then writing to
them again, possibly marking them dirty unexpectedly.
This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.
We have access to less information in the fast path as we cannot examine
the VMA containing the mapping; however, we can determine whether the
folio is anonymous and then whitelist known-good mappings - specifically
hugetlb and shmem mappings.
While we obtain a stable folio for this check, the mapping might not
remain stable, as a truncate could nullify it at any time. Since a
truncate requires the mappings to be zapped, we can synchronise against a
TLB shootdown operation.
On some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. Equally, we are protected on architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE, as interrupts being disabled implies an
RCU read lock as well.
We whitelist anonymous folios (and those which otherwise do not have a
valid mapping), as well as shmem and hugetlb mappings, none of which
require dirty tracking and so are safe to long-term pin.
It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast call, let alone in combination with
FOLL_LONGTERM, so we can always rely on the fact that if we fail to pin on
the fast path, the code will fall back to the slow path, which can perform
the more thorough check.
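The fall-back flow is roughly as follows (a simplified sketch of
internal_get_user_pages_fast() with argument lists elided, not verbatim
kernel code; slow_path_fallback() is a placeholder for the real
__gup_longterm_locked() call):

	nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
	if (nr_pinned == nr_pages)
		return nr_pinned;	/* everything pinned on the fast path */

	/*
	 * Anything the whitelist (or any other fast-path check) rejects
	 * lands here: the remainder is retried via the slow path, which
	 * takes the mmap lock and performs the full check_vma_flags() check.
	 */
	return slow_path_fallback(/* remaining pages ... */);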
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
mm/gup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 6e209ca10967..93b4aa39e5a5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
#include <linux/migrate.h>
#include <linux/mm_inline.h>
#include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -95,6 +96,52 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
return folio;
}
+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+ struct address_space *mapping;
+
+ /*
+ * GUP-fast disables IRQs - this prevents IPIs from causing page tables
+ * to disappear from under us, as well as preventing RCU grace periods
+ * from making progress (i.e. implying rcu_read_lock()).
+ *
+ * This means we can rely on the folio remaining stable for all
+ * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
+ * and those that do not.
+ *
+ * We get the added benefit that given inodes, and thus address_space,
+ * objects are RCU freed, we can rely on the mapping remaining stable
+ * here with no risk of a truncation or similar race.
+ */
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * If no mapping can be found, this implies an anonymous or otherwise
+ * non-file backed folio so in this instance we permit the pin.
+ *
+ * shmem and hugetlb mappings do not require dirty-tracking so we
+ * explicitly whitelist these.
+ *
+ * Other non dirty-tracked folios will be picked up on the slow path.
+ */
+ mapping = folio_mapping(folio);
+ return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
+}
+
/**
* try_grab_folio() - Attempt to get or pin a folio.
* @page: pointer to page to be grabbed
@@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
*/
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
+ bool is_longterm = flags & FOLL_LONGTERM;
+
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return NULL;
@@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
* right zone, so fail and let the caller fall back to the slow
* path.
*/
- if (unlikely((flags & FOLL_LONGTERM) &&
- !is_longterm_pinnable_page(page)))
+ if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
return NULL;
/*
@@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
if (!folio)
return NULL;
+ /*
+ * Can this folio be safely pinned? We need to perform this
+ * check after the folio is stabilised.
+ */
+ if ((flags & FOLL_WRITE) && is_longterm &&
+ !folio_longterm_write_pin_allowed(folio)) {
+ folio_put_refs(folio, refs);
+ return NULL;
+ }
+
/*
* When pinning a large folio, use an exact count to track it.
*
--
2.40.1
On 02.05.23 18:34, Lorenzo Stoakes wrote:
[...]
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> + unsigned long gup_flags)
> +{
> + /*
> + * If we aren't pinning then no problematic write can occur. A long term
> + * pin is the most egregious case so this is the case we disallow.
> + */
> + if (!(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
> + return true;
If you really want to keep FOLL_PIN here ... this has to be
if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) != (FOLL_PIN | FOLL_LONGTERM))
or two separate checks.
Otherwise you'd also proceed if only FOLL_PIN is set.
Unless my tired eyes betrayed me.
--
Thanks,
David / dhildenb
On 02.05.23 18:34, Lorenzo Stoakes wrote:
> vma_wants_writenotify() is specifically intended for setting PTE page table
> flags, accounting for existing PTE flag state and whether that might
> already be read-only while mixing this check with a check whether the
> filesystem performs dirty tracking.
>
> Separate out the notions of dirty tracking and a PTE write notify checking
> in order that we can invoke the dirty tracking check from elsewhere.
>
> Note that this change introduces a very small duplicate check of the
> separated out vm_ops_needs_writenotify(). This is necessary to avoid making
> vma_needs_dirty_tracking() needlessly complicated (e.g. passing a
> check_writenotify flag or having it assume this check was already
> performed). This is such a small check that it doesn't seem too egregious
> to do this.
[...]
> +/*
> + * Does this VMA require the underlying folios to have their dirty state
> + * tracked?
> + */
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> +{
Sorry for not noticing this earlier, but ...
what about MAP_PRIVATE mappings? When we write, we populate an anon
page, which will work as expected ... because we don't have to notify
the fs?
I think you really also want the "If it was private or non-writable, the
write bit is already clear" part as well, and return false in that case.
Or was there a good reason to disallow private mappings as well?
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 06:38:53PM +0200, David Hildenbrand wrote:
> On 02.05.23 18:34, Lorenzo Stoakes wrote:
[...]
> > +/*
> > + * Does this VMA require the underlying folios to have their dirty state
> > + * tracked?
> > + */
> > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> > +{
>
> Sorry for not noticing this earlier, but ...
pints_owed++
>
> what about MAP_PRIVATE mappings? When we write, we populate an anon page,
> which will work as expected ... because we don't have to notify the fs?
>
> I think you really also want the "If it was private or non-writable, the
> write bit is already clear */" part as well and remove "false" in that case.
>
Not sure a 'write bit is already clear' case is relevant to checking
whether a filesystem dirty-tracks? That seems specific entirely to the page
table bits.
That's why I didn't include it.
A !VM_WRITE mapping shouldn't be GUP-writable except via FOLL_FORCE, and
that surely could be problematic if the VMA is VM_MAYWRITE and made
writable later?
Thinking about it though, a !VM_SHARED mapping can probably be safely
assumed not to be dirty-trackable, so we probably do need to add a check
for !VM_SHARED -> !vma_needs_dirty_tracking
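Concretely, the addition to vma_needs_dirty_tracking() might look like
this (a sketch of the idea, not a posted patch):

	/*
	 * Hypothetical addition: private (CoW) mappings never require the
	 * file system to dirty-track folios, as writes populate anonymous
	 * pages rather than touching the page cache.
	 */
	if (!(vma->vm_flags & VM_SHARED))
		return false;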
> Or was there a good reason to disallow private mappings as well?
>
Until the page is CoW'd, walking the page tables will get you to the page
cache page, right? This was the reason I (perhaps rather too quickly) felt
MAP_PRIVATE should be excluded.
However, a FOLL_WRITE would trigger CoW... and then we'd be trivially OK.
So yeah, OK, perhaps I dismissed that a little too soon. I was concerned
about some sort of egregious FOLL_FORCE case where somehow we'd end up with
the page cache folio. But actually, that probably can't happen...
On Tue, May 02, 2023 at 06:42:42PM +0200, David Hildenbrand wrote:
> On 02.05.23 18:34, Lorenzo Stoakes wrote:
[...]
> > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> > + unsigned long gup_flags)
> > +{
> > + /*
> > + * If we aren't pinning then no problematic write can occur. A long term
> > + * pin is the most egregious case so this is the case we disallow.
> > + */
> > + if (!(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
> > + return true;
>
> If you really want to keep FOLL_PIN here ... this has to be
>
> if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) != (FOLL_PIN | FOLL_LONGTERM))
>
> or two separate checks.
>
> Otherwise you'd also proceed if only FOLL_PIN is set.
>
> Unless my tired eyes betrayed me.
Your tired eyes are rapidly taking on the fiery visage of the dark lord
Sauron... but also, ugh god, pints_owed_to_myself++.
Sorry, this was me rushing it out of shame. Will fix on
respin... apologies for the spam, everyone :)
On Tue, May 02, 2023 at 05:53:46PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 06:38:53PM +0200, David Hildenbrand wrote:
> > On 02.05.23 18:34, Lorenzo Stoakes wrote:
[...]
> > > +/*
> > > + * Does this VMA require the underlying folios to have their dirty state
> > > + * tracked?
> > > + */
> > > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> > > +{
> >
> > Sorry for not noticing this earlier, but ...
>
> pints_owed++
>
> >
> > what about MAP_PRIVATE mappings? When we write, we populate an anon page,
> > which will work as expected ... because we don't have to notify the fs?
> >
> > I think you really also want the "If it was private or non-writable, the
> > write bit is already clear */" part as well and remove "false" in that case.
> >
>
> Not sure a 'write bit is already clear' case is relevant to checking
> whether a filesystem dirty tracks? That seems specific entirely to the page
> table bits.
>
> That's why I didn't include it,
>
> A !VM_WRITE shouldn't be GUP-writable except for FOLL_FORCE, and that
> surely could be problematic if VM_MAYWRITE later?
>
> Thinking about it though a !VM_SHARE should probably can be safely assumed
> to not be dirty-trackable, so we probably do need to add a check for
> !VM_SHARED -> !vma_needs_dirty_tracking
>
On second thoughts, we explicitly check FOLL_FORCE && !is_cow_mapping() in
check_vma_flags() so that case cannot occur.
So actually yes we should probably include this on the basis of that and
the fact that a FOLL_WRITE operation will CoW the MAP_PRIVATE mapping.
This was an (over)abundance of caution.
Will fix on respin.
> > Or was there a good reason to disallow private mappings as well?
> >
>
> Until the page is CoW'd walking the page tables will get you to the page
> cache page right? This was the reason I (perhaps rather too quickly) felt
> MAP_PRIVATE should be excluded.
>
> However a FOLL_WRITE would trigger CoW... and then we'd be trivially OK.
>
> So yeah, ok perhaps I dismissed that a little too soon. I was concerned
> about some sort of egregious FOLL_FORCE case where somehow we'd end up with
> the page cache folio. But actually, that probably can't happen...
>
[...]
> +{
> + struct address_space *mapping;
> +
> + /*
> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> + * to disappear from under us, as well as preventing RCU grace periods
> + * from making progress (i.e. implying rcu_read_lock()).
> + *
> + * This means we can rely on the folio remaining stable for all
> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> + * and those that do not.
> + *
> + * We get the added benefit that given inodes, and thus address_space,
> + * objects are RCU freed, we can rely on the mapping remaining stable
> + * here with no risk of a truncation or similar race.
> + */
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * If no mapping can be found, this implies an anonymous or otherwise
> + * non-file backed folio so in this instance we permit the pin.
> + *
> + * shmem and hugetlb mappings do not require dirty-tracking so we
> + * explicitly whitelist these.
> + *
> + * Other non dirty-tracked folios will be picked up on the slow path.
> + */
> + mapping = folio_mapping(folio);
> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
"Folios in the swap cache return the swap mapping" -- you might disallow
pinning anonymous pages that are in the swap cache.
I recall that there are corner cases where we can end up with an anon
page that's mapped writable but still in the swap cache ... so you'd
fallback to the GUP slow path (acceptable for these corner cases, I
guess), however especially the comment is a bit misleading then.
So I'd suggest not dropping the folio_test_anon() check, or open-coding
it ... which will make this piece of code most certainly easier to get
when staring at folio_mapping(). Or to spell it out in the comment
(usually I prefer code over comments).
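Spelled out, the suggested variant might look like this (a sketch of the
suggestion, not posted code):

	/*
	 * Permit anonymous folios explicitly - this also covers anon folios
	 * in the swap cache, for which folio_mapping() returns the swap
	 * mapping rather than NULL.
	 */
	if (folio_test_anon(folio))
		return true;

	mapping = folio_mapping(folio);
	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);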
> +}
> +
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> */
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> + bool is_longterm = flags & FOLL_LONGTERM;
> +
> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> return NULL;
>
> @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> * right zone, so fail and let the caller fall back to the slow
> * path.
> */
> - if (unlikely((flags & FOLL_LONGTERM) &&
> - !is_longterm_pinnable_page(page)))
> + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> return NULL;
>
> /*
> @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> if (!folio)
> return NULL;
>
> + /*
> + * Can this folio be safely pinned? We need to perform this
> + * check after the folio is stabilised.
> + */
> + if ((flags & FOLL_WRITE) && is_longterm &&
> + !folio_longterm_write_pin_allowed(folio)) {
> + folio_put_refs(folio, refs);
> + return NULL;
> + }
So we perform this check before validating whether the PTE changed.
Hmm, naturally, I would have done it afterwards.
IIRC, without IPI syncs during TLB flush (i.e.,
CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
(1) We lookup the pte
(2) The page was unmapped and free
(3) The page gets reallocated and used
(4) We pin the page
(5) We dereference page->mapping
If we then dereference page->mapping, which whoever reallocated the page
may now be using for something completely different (not a pointer to
anything reasonable), I wonder if we might be in trouble.
Checking first, whether the PTE changed makes sure that what we pinned
and what we're looking at is what we expected.
... I can spot that the page_is_secretmem() check is also done before
that. But it at least makes sure that it's still an LRU page before
staring at the mapping (making it a little safer?).
BUT, I keep messing up this part of the story. Maybe it all works as
expected because we will be synchronizing RCU somehow before actually
freeing the page in the !IPI case. ... but I think that's only true for
page tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
>
> > [...]
> > + mapping = folio_mapping(folio);
> > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
>
> I recall that there are corner cases where we can end up with an anon page
> that's mapped writable but still in the swap cache ... so you'd fallback to
> the GUP slow path (acceptable for these corner cases, I guess), however
> especially the comment is a bit misleading then.
>
> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> ... which will make this piece of code most certainly easier to get when
> staring at folio_mapping(). Or to spell it out in the comment (usually I
> prefer code over comments).
So how stable is folio->mapping at this point? Can two subsequent reads
get different values? (eg. an actual mapping and NULL)
If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
the compiler from emitting the load multiple times.
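For instance (a sketch of the concern, not proposed kernel code):

	/*
	 * Snapshot the mapping once so the compiler cannot reload it and
	 * observe two different values (e.g. a real mapping, then NULL).
	 */
	struct address_space *mapping = READ_ONCE(folio->mapping);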
On 02.05.23 19:09, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 05:53:46PM +0100, Lorenzo Stoakes wrote:
>> On Tue, May 02, 2023 at 06:38:53PM +0200, David Hildenbrand wrote:
>>> On 02.05.23 18:34, Lorenzo Stoakes wrote:
[...]
>>>> +/*
>>>> + * Does this VMA require the underlying folios to have their dirty state
>>>> + * tracked?
>>>> + */
>>>> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
>>>> +{
>>>
>>> Sorry for not noticing this earlier, but ...
>>
>> pints_owed++
Having tired eyes and jumping back and forth between tasks really seems
to start getting expensive ;)
>>
>>>
>>> what about MAP_PRIVATE mappings? When we write, we populate an anon page,
>>> which will work as expected ... because we don't have to notify the fs?
>>>
>>> I think you really also want the "If it was private or non-writable, the
>>> write bit is already clear */" part as well and remove "false" in that case.
>>>
>>
>> Not sure a 'write bit is already clear' case is relevant to checking
>> whether a filesystem dirty tracks? That seems specific entirely to the page
>> table bits.
>>
>> That's why I didn't include it,
>>
>> A !VM_WRITE shouldn't be GUP-writable except for FOLL_FORCE, and that
>> surely could be problematic if VM_MAYWRITE later?
>>
>> Thinking about it though a !VM_SHARE should probably can be safely assumed
>> to not be dirty-trackable, so we probably do need to add a check for
>> !VM_SHARED -> !vma_needs_dirty_tracking
>>
>
> On second thoughts, we explicitly check FOLL_FORCE && !is_cow_mapping() in
> check_vma_flags() so that case cannot occur.
>
> So actually yes we should probably include this on the basis of that and
> the fact that a FOLL_WRITE operation will CoW the MAP_PRIVATE mapping.
>
Yes, we only allow to FOLL_FORCE write to (exclusive) anonymous pages
that are mapped read-only. If it's not that, we trigger a (fake) write
fault.
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
>
> > [...]
> > + /*
> > + * If no mapping can be found, this implies an anonymous or otherwise
> > + * non-file backed folio so in this instance we permit the pin.
> > + *
> > + * shmem and hugetlb mappings do not require dirty-tracking so we
> > + * explicitly whitelist these.
> > + *
> > + * Other non dirty-tracked folios will be picked up on the slow path.
> > + */
> > + mapping = folio_mapping(folio);
> > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
>
> I recall that there are corner cases where we can end up with an anon page
> that's mapped writable but still in the swap cache ... so you'd fallback to
> the GUP slow path (acceptable for these corner cases, I guess), however
> especially the comment is a bit misleading then.
How could that happen?
>
> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> ... which will make this piece of code most certainly easier to get when
> staring at folio_mapping(). Or to spell it out in the comment (usually I
> prefer code over comments).
I literally made this change based on your suggestion :) but perhaps I
misinterpreted what you meant.
I do spell it out in the comment that the page can be anonymous, but perhaps
explicitly checking the mapping flags is the way to go.
>
> > +}
> > +
> > /**
> > * try_grab_folio() - Attempt to get or pin a folio.
> > * @page: pointer to page to be grabbed
> > @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > */
> > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > {
> > + bool is_longterm = flags & FOLL_LONGTERM;
> > +
> > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > return NULL;
> > @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > * right zone, so fail and let the caller fall back to the slow
> > * path.
> > */
> > - if (unlikely((flags & FOLL_LONGTERM) &&
> > - !is_longterm_pinnable_page(page)))
> > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> > return NULL;
> > /*
> > @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > if (!folio)
> > return NULL;
> > + /*
> > + * Can this folio be safely pinned? We need to perform this
> > + * check after the folio is stabilised.
> > + */
> > + if ((flags & FOLL_WRITE) && is_longterm &&
> > + !folio_longterm_write_pin_allowed(folio)) {
> > + folio_put_refs(folio, refs);
> > + return NULL;
> > + }
>
> So we perform this change before validating whether the PTE changed.
>
> Hmm, naturally, I would have done it afterwards.
>
> IIRC, without IPI syncs during TLB flush (i.e.,
> CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> (1) We lookup the pte
> (2) The page was unmapped and free
> (3) The page gets reallocated and used
> (4) We pin the page
> (5) We dereference page->mapping
But we have an implied RCU lock from disabled IRQs, right? Unless that CONFIG
option does something odd (I've not really dug into its behaviour). It feels
like that would break GUP-fast as a whole.
>
> If we then de-reference page->mapping that gets used by whoever allocated it
> for something completely different (not a pointer to something reasonable),
> I wonder if we might be in trouble.
>
> Checking first, whether the PTE changed makes sure that what we pinned and
> what we're looking at is what we expected.
>
> ... I can spot that the page_is_secretmem() check is also done before that.
> But it at least makes sure that it's still an LRU page before staring at the
> mapping (making it a little safer?).
As do we :)
We also, via try_get_folio(), check to ensure that we aren't subject to a split.
>
> BUT, I keep messing up this part of the story. Maybe it all works as
> expected because we will be synchronizing RCU somehow before actually
> freeing the page in the !IPI case. ... but I think that's only true for page
> tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
My understanding based on what Peter said is that the IRQs being disabled should
prevent anything bad from happening here.
On Tue, May 02, 2023 at 07:22:32PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > [...]
> > > + mapping = folio_mapping(folio);
> > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> >
> > "Folios in the swap cache return the swap mapping" -- you might disallow
> > pinning anonymous pages that are in the swap cache.
> >
> > I recall that there are corner cases where we can end up with an anon page
> > that's mapped writable but still in the swap cache ... so you'd fallback to
> > the GUP slow path (acceptable for these corner cases, I guess), however
> > especially the comment is a bit misleading then.
> >
> > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > ... which will make this piece of code most certainly easier to get when
> > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > prefer code over comments).
>
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
>
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.
Yes, that actually feels like a bit of a flaw in folio_mapping(). I suppose
we run the risk of the mapping being reset (e.g. to NULL), even if any
mapping we do get is guaranteed to be safe to use due to the RCU freeing.
Based on David's feedback I think I'll recode this to something more direct
where we READ_ONCE() the mapping, then check the mapping flags for anon,
avoid the swap case and check for shmem.
On 02.05.23 19:22, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>> [...]
>>
>>> +{
>>> + struct address_space *mapping;
>>> +
>>> + /*
>>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> + * to disappear from under us, as well as preventing RCU grace periods
>>> + * from making progress (i.e. implying rcu_read_lock()).
>>> + *
>>> + * This means we can rely on the folio remaining stable for all
>>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> + * and those that do not.
>>> + *
>>> + * We get the added benefit that given inodes, and thus address_space,
>>> + * objects are RCU freed, we can rely on the mapping remaining stable
>>> + * here with no risk of a truncation or similar race.
>>> + */
>>> + lockdep_assert_irqs_disabled();
>>> +
>>> + /*
>>> + * If no mapping can be found, this implies an anonymous or otherwise
>>> + * non-file backed folio so in this instance we permit the pin.
>>> + *
>>> + * shmem and hugetlb mappings do not require dirty-tracking so we
>>> + * explicitly whitelist these.
>>> + *
>>> + * Other non dirty-tracked folios will be picked up on the slow path.
>>> + */
>>> + mapping = folio_mapping(folio);
>>> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>
>> "Folios in the swap cache return the swap mapping" -- you might disallow
>> pinning anonymous pages that are in the swap cache.
>>
>> I recall that there are corner cases where we can end up with an anon page
>> that's mapped writable but still in the swap cache ... so you'd fallback to
>> the GUP slow path (acceptable for these corner cases, I guess), however
>> especially the comment is a bit misleading then.
>>
>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>> ... which will make this piece of code most certainly easier to get when
>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>> prefer code over comments).
>
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
>
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.
I can only talk about anon pages in this specific call order here (check
first, then test if the PTE changed in the meantime): we don't care if
we get two different values. If we get a different value the second
time, surely we (temporarily) pinned an anon page that is no longer
mapped (freed in the meantime). But in that case (even if we read
garbage folio->mapping and made the wrong call here), we'll detect
afterwards that the PTE changed, and unpin what we (temporarily) pinned.
As folio_test_anon() only checks two bits in folio->mapping it's fine,
because we won't dereference garbage folio->mapping.
With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill
said it would be fairly stable, but I suspect that it could change
(especially if we call it before validating if the PTE changed as I
described further below).
Now, if we read folio->mapping after checking if the page we pinned is
still mapped (PTE unchanged), at least the page we pinned cannot be
reused in the meantime. I suspect that we can still read "NULL" on the
second read. But whatever we dereference from the first read should
still be valid, even if the second read would have returned NULL ("rcu
freeing").
--
Thanks,
David / dhildenb
On 02.05.23 19:31, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>> [...]
>>
>>> +{
>>> + struct address_space *mapping;
>>> +
>>> + /*
>>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> + * to disappear from under us, as well as preventing RCU grace periods
>>> + * from making progress (i.e. implying rcu_read_lock()).
>>> + *
>>> + * This means we can rely on the folio remaining stable for all
>>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> + * and those that do not.
>>> + *
>>> + * We get the added benefit that given inodes, and thus address_space,
>>> + * objects are RCU freed, we can rely on the mapping remaining stable
>>> + * here with no risk of a truncation or similar race.
>>> + */
>>> + lockdep_assert_irqs_disabled();
>>> +
>>> + /*
>>> + * If no mapping can be found, this implies an anonymous or otherwise
>>> + * non-file backed folio so in this instance we permit the pin.
>>> + *
>>> + * shmem and hugetlb mappings do not require dirty-tracking so we
>>> + * explicitly whitelist these.
>>> + *
>>> + * Other non dirty-tracked folios will be picked up on the slow path.
>>> + */
>>> + mapping = folio_mapping(folio);
>>> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>
>> "Folios in the swap cache return the swap mapping" -- you might disallow
>> pinning anonymous pages that are in the swap cache.
>>
>> I recall that there are corner cases where we can end up with an anon page
>> that's mapped writable but still in the swap cache ... so you'd fallback to
>> the GUP slow path (acceptable for these corner cases, I guess), however
>> especially the comment is a bit misleading then.
>
> How could that happen?
>
>>
>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>> ... which will make this piece of code most certainly easier to get when
>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>> prefer code over comments).
>
> I literally made this change based on your suggestion :) but perhaps I
> misinterpreted what you meant.
>
> I do spell it out in the comment that the page can be anonymous, But perhaps
> explicitly checking the mapping flags is the way to go.
>
>>
>>> +}
>>> +
>>> /**
>>> * try_grab_folio() - Attempt to get or pin a folio.
>>> * @page: pointer to page to be grabbed
>>> @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>>> */
>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> {
>>> + bool is_longterm = flags & FOLL_LONGTERM;
>>> +
>>> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>>> return NULL;
>>> @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> * right zone, so fail and let the caller fall back to the slow
>>> * path.
>>> */
>>> - if (unlikely((flags & FOLL_LONGTERM) &&
>>> - !is_longterm_pinnable_page(page)))
>>> + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>>> return NULL;
>>> /*
>>> @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>> if (!folio)
>>> return NULL;
>>> + /*
>>> + * Can this folio be safely pinned? We need to perform this
>>> + * check after the folio is stabilised.
>>> + */
>>> + if ((flags & FOLL_WRITE) && is_longterm &&
>>> + !folio_longterm_write_pin_allowed(folio)) {
>>> + folio_put_refs(folio, refs);
>>> + return NULL;
>>> + }
>>
>> So we perform this change before validating whether the PTE changed.
>>
>> Hmm, naturally, I would have done it afterwards.
>>
>> IIRC, without IPI syncs during TLB flush (i.e.,
>> CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
>> (1) We lookup the pte
>> (2) The page was unmapped and free
>> (3) The page gets reallocated and used
>> (4) We pin the page
>> (5) We dereference page->mapping
>
> But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
> option does something odd (I've not really dug into its behaviour). It feels
> like that would break GUP-fast as a whole.
>
>>
>> If we then de-reference page->mapping that gets used by whoever allocated it
>> for something completely different (not a pointer to something reasonable),
>> I wonder if we might be in trouble.
>>
>> Checking first, whether the PTE changed makes sure that what we pinned and
>> what we're looking at is what we expected.
>>
>> ... I can spot that the page_is_secretmem() check is also done before that.
>> But it at least makes sure that it's still an LRU page before staring at the
>> mapping (making it a little safer?).
>
> As do we :)
>
> We also check, via try_get_folio(), to ensure that we aren't subject to a split.
>
>>
>> BUT, I keep messing up this part of the story. Maybe it all works as
>> expected because we will be synchronizing RCU somehow before actually
>> freeing the page in the !IPI case. ... but I think that's only true for page
>> tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
> My understanding based on what Peter said is that the IRQs being disabled should
> prevent anything bad from happening here.
... only if we verified that the PTE didn't change IIUC. IRQs disabled
only protect you from the mapping getting freed and reused (because
mappings are freed via RCU IIUC).
But as far as I can tell, it doesn't protect you from the page itself
getting freed and reused, and whoever freed the page uses page->mapping
to store something completely different.
But, again, it's all complicated and confusing to me.
page_is_secretmem() also doesn't use a READ_ONCE() ...
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 07:38:27PM +0200, David Hildenbrand wrote:
> On 02.05.23 19:31, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > > +{
> > > > + struct address_space *mapping;
> > > > +
> > > > + /*
> > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > + * to disappear from under us, as well as preventing RCU grace periods
> > > > + * from making progress (i.e. implying rcu_read_lock()).
> > > > + *
> > > > + * This means we can rely on the folio remaining stable for all
> > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > + * and those that do not.
> > > > + *
> > > > + * We get the added benefit that given inodes, and thus address_space,
> > > > + * objects are RCU freed, we can rely on the mapping remaining stable
> > > > + * here with no risk of a truncation or similar race.
> > > > + */
> > > > + lockdep_assert_irqs_disabled();
> > > > +
> > > > + /*
> > > > + * If no mapping can be found, this implies an anonymous or otherwise
> > > > + * non-file backed folio so in this instance we permit the pin.
> > > > + *
> > > > + * shmem and hugetlb mappings do not require dirty-tracking so we
> > > > + * explicitly whitelist these.
> > > > + *
> > > > + * Other non dirty-tracked folios will be picked up on the slow path.
> > > > + */
> > > > + mapping = folio_mapping(folio);
> > > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > >
> > > "Folios in the swap cache return the swap mapping" -- you might disallow
> > > pinning anonymous pages that are in the swap cache.
> > >
> > > I recall that there are corner cases where we can end up with an anon page
> > > that's mapped writable but still in the swap cache ... so you'd fallback to
> > > the GUP slow path (acceptable for these corner cases, I guess), however
> > > especially the comment is a bit misleading then.
> >
> > How could that happen?
> >
> > >
> > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > > ... which will make this piece of code most certainly easier to get when
> > > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > > prefer code over comments).
> >
> > I literally made this change based on your suggestion :) but perhaps I
> > misinterpreted what you meant.
> >
> > I do spell it out in the comment that the page can be anonymous, But perhaps
> > explicitly checking the mapping flags is the way to go.
> >
> > >
> > > > +}
> > > > +
> > > > /**
> > > > * try_grab_folio() - Attempt to get or pin a folio.
> > > > * @page: pointer to page to be grabbed
> > > > @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > > */
> > > > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > > {
> > > > + bool is_longterm = flags & FOLL_LONGTERM;
> > > > +
> > > > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > > > return NULL;
> > > > @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > > * right zone, so fail and let the caller fall back to the slow
> > > > * path.
> > > > */
> > > > - if (unlikely((flags & FOLL_LONGTERM) &&
> > > > - !is_longterm_pinnable_page(page)))
> > > > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> > > > return NULL;
> > > > /*
> > > > @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > > if (!folio)
> > > > return NULL;
> > > > + /*
> > > > + * Can this folio be safely pinned? We need to perform this
> > > > + * check after the folio is stabilised.
> > > > + */
> > > > + if ((flags & FOLL_WRITE) && is_longterm &&
> > > > + !folio_longterm_write_pin_allowed(folio)) {
> > > > + folio_put_refs(folio, refs);
> > > > + return NULL;
> > > > + }
> > >
> > > So we perform this change before validating whether the PTE changed.
> > >
> > > Hmm, naturally, I would have done it afterwards.
> > >
> > > IIRC, without IPI syncs during TLB flush (i.e.,
> > > CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> > > (1) We lookup the pte
> > > (2) The page was unmapped and free
> > > (3) The page gets reallocated and used
> > > (4) We pin the page
> > > (5) We dereference page->mapping
> >
> > But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
> > option does something odd (I've not really dug into its behaviour). It feels
> > like that would break GUP-fast as a whole.
> >
> > >
> > > If we then de-reference page->mapping that gets used by whoever allocated it
> > > for something completely different (not a pointer to something reasonable),
> > > I wonder if we might be in trouble.
> > >
> > > Checking first, whether the PTE changed makes sure that what we pinned and
> > > what we're looking at is what we expected.
> > >
> > > ... I can spot that the page_is_secretmem() check is also done before that.
> > > But it at least makes sure that it's still an LRU page before staring at the
> > > mapping (making it a little safer?).
> >
> > As do we :)
> >
> > We also check, via try_get_folio(), to ensure that we aren't subject to a split.
> >
> > >
> > > BUT, I keep messing up this part of the story. Maybe it all works as
> > > expected because we will be synchronizing RCU somehow before actually
> > > freeing the page in the !IPI case. ... but I think that's only true for page
> > > tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> >
> > My understanding based on what Peter said is that the IRQs being disabled should
> > prevent anything bad from happening here.
>
>
> ... only if we verified that the PTE didn't change IIUC. IRQs disabled only
> protect you from the mapping getting freed and reused (because mappings are
> freed via RCU IIUC).
>
> But as far as I can tell, it doesn't protect you from the page itself
> getting freed and reused, and whoever freed the page uses page->mapping to
> store something completely different.
Ack, and we'd not have mapping->inode to save us in an anon case either.
I'd rather be as cautious as we can possibly be, so let's move this to after
the 'PTE is the same' check then; I will fix this on respin.
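Roughly this shape (an illustrative fragment of a gup_pte_range()-style
loop; pte/ptep/page/flags come from the surrounding context, and the exact
respin may differ):

	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;

	/* Bail if the PTE changed under us - what we pinned may be stale. */
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

	/*
	 * Only now, with the pin confirmed against an unchanged PTE, do we
	 * inspect folio->mapping for the dirty-tracking check.
	 */
	if ((flags & FOLL_WRITE) && (flags & FOLL_LONGTERM) &&
	    !folio_longterm_write_pin_allowed(folio)) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}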
>
> But, again, it's all complicated and confusing to me.
>
It's just a fiddly, complicated, delicate area I feel :) hence why I
endeavour to take on board the community's views on this series to ensure
we end up with the best possible implementation.
>
> page_is_secretmem() also doesn't use a READ_ONCE() ...
Perhaps one for a follow-up patch...
>
> --
> Thanks,
>
> David / dhildenb
>
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
> On 02.05.23 19:22, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > > +{
> > > > + struct address_space *mapping;
> > > > +
> > > > + /*
> > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > + * to disappear from under us, as well as preventing RCU grace periods
> > > > + * from making progress (i.e. implying rcu_read_lock()).
> > > > + *
> > > > + * This means we can rely on the folio remaining stable for all
> > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > + * and those that do not.
> > > > + *
> > > > + * We get the added benefit that given inodes, and thus address_space,
> > > > + * objects are RCU freed, we can rely on the mapping remaining stable
> > > > + * here with no risk of a truncation or similar race.
> > > > + */
> > > > + lockdep_assert_irqs_disabled();
> > > > +
> > > > + /*
> > > > + * If no mapping can be found, this implies an anonymous or otherwise
> > > > + * non-file backed folio so in this instance we permit the pin.
> > > > + *
> > > > + * shmem and hugetlb mappings do not require dirty-tracking so we
> > > > + * explicitly whitelist these.
> > > > + *
> > > > + * Other non dirty-tracked folios will be picked up on the slow path.
> > > > + */
> > > > + mapping = folio_mapping(folio);
> > > > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > >
> > > "Folios in the swap cache return the swap mapping" -- you might disallow
> > > pinning anonymous pages that are in the swap cache.
> > >
> > > I recall that there are corner cases where we can end up with an anon page
> > > that's mapped writable but still in the swap cache ... so you'd fallback to
> > > the GUP slow path (acceptable for these corner cases, I guess), however
> > > especially the comment is a bit misleading then.
> > >
> > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > > ... which will make this piece of code most certainly easier to get when
> > > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > > prefer code over comments).
> >
> > So how stable is folio->mapping at this point? Can two subsequent reads
> > get different values? (eg. an actual mapping and NULL)
> >
> > If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> > the compiler from emitting the load multiple times.
>
> I can only talk about anon pages in this specific call order here (check
> first, then test if the PTE changed in the meantime): we don't care if we
> get two different values. If we get a different value the second time,
> surely we (temporarily) pinned an anon page that is no longer mapped (freed
> in the meantime). But in that case (even if we read garbage folio->mapping
> and made the wrong call here), we'll detect afterwards that the PTE changed,
> and unpin what we (temporarily) pinned. As folio_test_anon() only checks two
> bits in folio->mapping it's fine, because we won't dereference garbage
> folio->mapping.
>
> With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said
> it would be fairly stable, but I suspect that it could change (especially if
> we call it before validating if the PTE changed as I described further
> below).
>
> Now, if we read folio->mapping after checking if the page we pinned is still
> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
> meantime. I suspect that we can still read "NULL" on the second read. But
> whatever we dereference from the first read should still be valid, even if
> the second read would have returned NULL ("rcu freeing").
>
On a specific point - if mapping turns out to be NULL after we confirm
stable PTE, I'd be inclined to reject and let the slow path take care of
it, would you agree that that's the correct approach?
I guess you could take that to mean that the page is no longer mapped and so
is safe, but it feels like refusing it would be the safer course.
> --
> Thanks,
>
> David / dhildenb
>
On 5/2/23 12:34 PM, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
> the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
> the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
> direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
> (though it does not have to).
>
> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> pin_user_pages_fast_only() does not exist, we can rely on a slightly
> imperfect whitelisting in the PUP-fast case and fall back to the slow case
> should this fail.
>
> v7:
> - Fixed very silly bug in writeable_file_mapping_allowed() inverting the
> logic.
> - Removed unnecessary RCU lock code and replaced with adaptation of Peter's
> idea.
> - Removed unnecessary open-coded folio_test_anon() in
> folio_longterm_write_pin_allowed() and restructured to generally permit
> NULL folio_mapping().
>
FWIW, I realize you are planning another respin, but I went and tried this
version out on s390 -- Now when using a memory backend file and vfio-pci on
s390 I see vfio_pin_pages_remote failing consistently. However, the
pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable
will still return positive.
On Tue, May 02, 2023 at 02:45:01PM -0400, Matthew Rosato wrote:
> On 5/2/23 12:34 PM, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
> >
> > The problem arises when, after an initial write to the folio, writeback
> > results in the folio being cleaned and then the caller, via the GUP
> > interface, writes to the folio again.
> >
> > As a result of the use of this secondary, direct, mapping to the folio no
> > write notify will occur, and if the caller does mark the folio dirty, this
> > will be done so unexpectedly.
> >
> > For example, consider the following scenario:-
> >
> > 1. A folio is written to via GUP which write-faults the memory, notifying
> > the file system and dirtying the folio.
> > 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > the PTE being marked read-only.
> > 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > direct mapping.
> > 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > (though it does not have to).
> >
> > This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> > pin_user_pages_fast_only() does not exist, we can rely on a slightly
> > imperfect whitelisting in the PUP-fast case and fall back to the slow case
> > should this fail.
> >
> > v7:
> > - Fixed very silly bug in writeable_file_mapping_allowed() inverting the
> > logic.
> > - Removed unnecessary RCU lock code and replaced with adaptation of Peter's
> > idea.
> > - Removed unnecessary open-coded folio_test_anon() in
> > folio_longterm_write_pin_allowed() and restructured to generally permit
> > NULL folio_mapping().
> >
>
> FWIW, I realize you are planning another respin, but I went and tried this
> version out on s390 -- Now when using a memory backend file and vfio-pci on
> s390 I see vfio_pin_pages_remote failing consistently. However, the
> pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable
> will still return positive.
>
Hey thanks very much for checking that :)
This version will unconditionally apply the restriction to non-FOLL_LONGTERM
pins by mistake (ugh), but vfio_pin_pages_remote() does seem to be setting
FOLL_LONGTERM anyway so this seems a legitimate test.
Interesting that the _fast() variant succeeds...
David, Jason et al. can speak more to the ins and outs of these
virtualisation cases which I am not so familiar with, but I wonder if we do
need a flag to provide an exception for VFIO.
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
> Now, if we read folio->mapping after checking if the page we pinned is still
> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
> meantime. I suspect that we can still read "NULL" on the second read. But
> whatever we dereference from the first read should still be valid, even if
> the second read would have returned NULL ("rcu freeing").
Right, but given it's the compiler adding loads we're not sure what if
anything it uses and it gets very hard to reason about the code.
This is where READ_ONCE() helps, we instruct the compiler to only do a
single load and we can still reason about the code.
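To illustrate the hazard (hypothetical code, not from the patch;
do_something() is a placeholder):

	/* Plain loads: the compiler may legitimately re-read folio->mapping. */
	if (folio->mapping && shmem_mapping(folio->mapping))
		do_something();	/* the second load may observe NULL or a new value */

	/* One load, one value to reason about: */
	struct address_space *mapping = READ_ONCE(folio->mapping);

	if (mapping && shmem_mapping(mapping))
		do_something();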
On 02.05.23 20:17, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
>> On 02.05.23 19:22, Peter Zijlstra wrote:
>>> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>>> +{
>>>>> + struct address_space *mapping;
>>>>> +
>>>>> + /*
>>>>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>>>> + * to disappear from under us, as well as preventing RCU grace periods
>>>>> + * from making progress (i.e. implying rcu_read_lock()).
>>>>> + *
>>>>> + * This means we can rely on the folio remaining stable for all
>>>>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>>>> + * and those that do not.
>>>>> + *
>>>>> + * We get the added benefit that given inodes, and thus address_space,
>>>>> + * objects are RCU freed, we can rely on the mapping remaining stable
>>>>> + * here with no risk of a truncation or similar race.
>>>>> + */
>>>>> + lockdep_assert_irqs_disabled();
>>>>> +
>>>>> + /*
>>>>> + * If no mapping can be found, this implies an anonymous or otherwise
>>>>> + * non-file backed folio so in this instance we permit the pin.
>>>>> + *
>>>>> + * shmem and hugetlb mappings do not require dirty-tracking so we
>>>>> + * explicitly whitelist these.
>>>>> + *
>>>>> + * Other non dirty-tracked folios will be picked up on the slow path.
>>>>> + */
>>>>> + mapping = folio_mapping(folio);
>>>>> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>>>
>>>> "Folios in the swap cache return the swap mapping" -- you might disallow
>>>> pinning anonymous pages that are in the swap cache.
>>>>
>>>> I recall that there are corner cases where we can end up with an anon page
>>>> that's mapped writable but still in the swap cache ... so you'd fallback to
>>>> the GUP slow path (acceptable for these corner cases, I guess), however
>>>> especially the comment is a bit misleading then.
>>>>
>>>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>>>> ... which will make this piece of code most certainly easier to get when
>>>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>>>> prefer code over comments).
>>>
>>> So how stable is folio->mapping at this point? Can two subsequent reads
>>> get different values? (eg. an actual mapping and NULL)
>>>
>>> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
>>> the compiler from emitting the load multiple times.
>>
>> I can only talk about anon pages in this specific call order here (check
>> first, then test if the PTE changed in the meantime): we don't care if we
>> get two different values. If we get a different value the second time,
>> surely we (temporarily) pinned an anon page that is no longer mapped (freed
>> in the meantime). But in that case (even if we read garbage folio->mapping
>> and made the wrong call here), we'll detect afterwards that the PTE changed,
>> and unpin what we (temporarily) pinned. As folio_test_anon() only checks two
>> bits in folio->mapping it's fine, because we won't dereference garbage
>> folio->mapping.
>>
>> With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said
>> it would be fairly stable, but I suspect that it could change (especially if
>> we call it before validating if the PTE changed as I described further
>> below).
>>
>> Now, if we read folio->mapping after checking if the page we pinned is still
>> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
>> meantime. I suspect that we can still read "NULL" on the second read. But
>> whatever we dereference from the first read should still be valid, even if
>> the second read would have returned NULL ("rcu freeing").
>>
>
> On a specific point - if mapping turns out to be NULL after we confirm
> stable PTE, I'd be inclined to reject and let the slow path take care of
> it, would you agree that that's the correct approach?
If it's not an anon page and the mapping is NULL, I'd say simply
fall back to the slow path.
--
Thanks,
David / dhildenb
On 02.05.23 20:59, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
>> Now, if we read folio->mapping after checking if the page we pinned is still
>> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
>> meantime. I suspect that we can still read "NULL" on the second read. But
>> whatever we dereference from the first read should still be valid, even if
>> the second read would have returned NULL ("rcu freeing").
>
> Right, but given it's the compiler adding loads we're not sure what if
> anything it uses and it gets very hard to reason about the code.
>
> This is where READ_ONCE() helps, we instruct the compiler to only do a
> single load and we can still reason about the code.
I completely agree, and I think we should fix that in
page_is_secretmem() as well.
--
Thanks,
David / dhildenb
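(A hedged sketch of the kind of fix being suggested for page_is_secretmem();
the shape here is an assumption - the real function also checks
PageCompound()/PageLRU() first before looking at the mapping:)

	/* Load page->mapping once so every test sees the same value. */
	struct address_space *mapping = READ_ONCE(page->mapping);

	/* Any tag bits mean anon/movable/KSM - cannot be secretmem. */
	if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
		return false;

	return mapping && mapping->a_ops == &secretmem_aops;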
On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
> On a specific point - if mapping turns out to be NULL after we confirm
> stable PTE, I'd be inclined to reject and let the slow path take care of
> it, would you agree that that's the correct approach?
I think in general if GUP fast detects any kind of race it should bail
to the slow path.
The races it tries to resolve itself should have really safe and
obvious solutions.
I think this comment is misleading:
> + /*
> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> + * to disappear from under us, as well as preventing RCU grace periods
> + * from making progress (i.e. implying rcu_read_lock()).
True, but that is not important here since we are not reading page
tables
> + * This means we can rely on the folio remaining stable for all
> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> + * and those that do not.
Not really clear. We have a valid folio refcount here, that is all.
> + * We get the added benefit that given inodes, and thus address_space,
> + * objects are RCU freed, we can rely on the mapping remaining stable
> + * here with no risk of a truncation or similar race.
Which is the real point:
1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock()
2) We have a valid ref on the folio due to normal GUP fast operation
Thus derefing struct folio is OK
3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd
It may be zero if we are racing a page free path
Can it be zero for other reasons?
If it can't be zero for any other reason then go to GUP slow and let
it sort it out
Otherwise you have to treat NULL as a success.
Really what you are trying to do here is remove the folio lock which
would normally protect folio->mapping. Ie this test really boils down
to just 'folio_get_mapping_a_ops_rcu() == shmem_aops'
The hugetlb test is done on a page flag which should be stable under
the pageref.
So.. Your function really ought to be doing this logic:
// Should be impossible for a slab page to be in a VMA
if (unlikely(folio_test_slab(folio)))
return do gup slow;
// Can a present PTE even be a swap cache?
if (unlikely(folio_test_swapcache(folio)))
return do gup slow;
if (folio_test_hugetlb(folio))
return safe for fast
// Safe without the folio lock ?
struct address_space *mapping = READ_ONCE(folio->mapping)
if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
return safe for fast
if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
return mapping->a_ops == shmem_aops;
// Depends on what mapping = NULL means
return do gup slow
Jason
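(Rendering the sketch above as compilable C for clarity - illustrative only;
the function name and the precise set of checks are assumptions, not the
final patch:)

	static bool folio_fast_pin_allowed(struct folio *folio)
	{
		struct address_space *mapping;
		unsigned long mapping_flags;

		/* Should be impossible for a slab page to be mapped into a VMA. */
		if (unlikely(folio_test_slab(folio)))
			return false;	/* fall back to GUP-slow */

		/* Swap-cache folios carry the swap mapping; let GUP-slow decide. */
		if (unlikely(folio_test_swapcache(folio)))
			return false;

		/* hugetlb needs no dirty tracking; the flag is stable under our ref. */
		if (folio_test_hugetlb(folio))
			return true;

		/* A single load of the unlocked field that we can reason about. */
		mapping = READ_ONCE(folio->mapping);
		mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;

		/* Anonymous folios are safe to write-pin long-term. */
		if (mapping_flags == PAGE_MAPPING_ANON)
			return true;

		/* Plain file mapping: permit shmem only, it needs no dirty tracking. */
		if (mapping_flags == 0 && mapping)
			return shmem_mapping(mapping);

		/* NULL (racing a free) or movable/KSM tag bits: go to GUP-slow. */
		return false;
	}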
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> + struct address_space *mapping;
> +
> + /*
> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> + * to disappear from under us, as well as preventing RCU grace periods
> + * from making progress (i.e. implying rcu_read_lock()).
> + *
> + * This means we can rely on the folio remaining stable for all
> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> + * and those that do not.
> + *
> + * We get the added benefit that given inodes, and thus address_space,
> + * objects are RCU freed, we can rely on the mapping remaining stable
> + * here with no risk of a truncation or similar race.
> + */
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * If no mapping can be found, this implies an anonymous or otherwise
> + * non-file backed folio so in this instance we permit the pin.
> + *
> + * shmem and hugetlb mappings do not require dirty-tracking so we
> + * explicitly whitelist these.
> + *
> + * Other non dirty-tracked folios will be picked up on the slow path.
> + */
> + mapping = folio_mapping(folio);
> + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> +}
BTW, try_grab_folio() is also called from follow_hugetlb_page(), which
is ordinary GUP and has interrupts enabled if I am not wrong.
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
>
> > On a specific point - if mapping turns out to be NULL after we confirm
> > stable PTE, I'd be inclined to reject and let the slow path take care of
> > it, would you agree that that's the correct approach?
>
> I think in general if GUP fast detects any kind of race it should bail
> to the slow path.
>
> The races it tries to resolve itself should have really safe and
> obvious solutions.
>
> I think this comment is misleading:
>
> > + /*
> > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > + * to disappear from under us, as well as preventing RCU grace periods
> > + * from making progress (i.e. implying rcu_read_lock()).
>
> True, but that is not important here since we are not reading page
> tables
>
> > + * This means we can rely on the folio remaining stable for all
> > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > + * and those that do not.
>
> Not really clear. We have a valid folio refcount here, that is all.
Some of this is a product of mixed signals from different commenters and
my being perhaps a little _too_ willing to just go with the flow.
With interrupts disabled and IPI blocked, plus the assurances that
interrupts being disabled implied the RCU version of page table
manipulation is also blocked, my understanding was that remapping in this
process to another page could not occur.
Of course the folio is 'stable' in the sense we have a refcount on it, but
it is unlocked so things can change.
I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
because in the IPI case it seems to me you couldn't even clear the PTE
entry before getting to the page table case.
Otherwise, I'm a bit uncertain actually as to how we can get to the point
where the folio->mapping is being manipulated. Is this why?
>
> > + * We get the added benefit that given inodes, and thus address_space,
> > + * objects are RCU freed, we can rely on the mapping remaining stable
> > + * here with no risk of a truncation or similar race.
>
> Which is the real point:
>
> 1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock()
> 2) We have a valid ref on the folio due to normal GUP fast operation
> Thus derefing struct folio is OK
> 3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd
> It may be zero if we are racing a page free path
> Can it be zero for other reasons?
Zero? You mean NULL?
In any case, I will try to clarify these. I do agree the _key_ point is
that we can rely on safely dereferencing the mapping, at least once
READ_ONCE()'d, as use-after-free or dereferencing garbage is the fear here.
>
> If it can't be zero for any other reason then go to GUP slow and let
> it sort it out
>
> Otherwise you have to treat NULL as a success.
>
Well, that was literally the question :) and I've got somewhat contradictory
feedback. My instinct aligns with yours: just fall back to the slow path, so
that's what I'll do. But I just wanted to confirm.
> Really what you are trying to do here is remove the folio lock which
> would normally protect folio->mapping. Ie this test really boils down
> to just 'folio_get_mapping_a_ops_rcu() == shmem_aops'
>
> The hugetlb test is done on a page flag which should be stable under
> the pageref.
>
> So.. Your function really ought to be doing this logic:
>
> // Should be impossible for a slab page to be in a VMA
> if (unlikely(folio_test_slab(folio)))
> return do gup slow;
>
> // Can a present PTE even be a swap cache?
> if (unlikely(folio_test_swapcache(folio)))
> return do gup slow;
>
> if (folio_test_hugetlb(folio))
> return safe for fast
>
> // Safe without the folio lock ?
> struct address_space *mapping = READ_ONCE(folio->mapping)
> if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
> return safe for fast
> if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
> return mapping->a_ops == shmem_aops;
>
> // Depends on what mapping = NULL means
> return do gup slow
>
Yeah, this is how I was planning to implement it, or something along these
lines. The only question was whether my own view aligned with others', to
avoid more spam :)
The READ_ONCE() approach is precisely how I wanted to do it in the first
instance, but I feared feedback about duplication and wondered if it made
much difference; now it's clear this is the right way.
> Jason
On 02.05.23 21:25, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
>>
>>> On a specific point - if mapping turns out to be NULL after we confirm
>>> stable PTE, I'd be inclined to reject and let the slow path take care of
>>> it, would you agree that that's the correct approach?
>>
>> I think in general if GUP fast detects any kind of race it should bail
>> to the slow path.
>>
>> The races it tries to resolve itself should have really safe and
>> obvious solutions.
>>
>> I think this comment is misleading:
>>
>>> + /*
>>> + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> + * to disappear from under us, as well as preventing RCU grace periods
>>> + * from making progress (i.e. implying rcu_read_lock()).
>>
>> True, but that is not important here since we are not reading page
>> tables
>>
>>> + * This means we can rely on the folio remaining stable for all
>>> + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> + * and those that do not.
>>
>> Not really clear. We have a valid folio refcount here, that is all.
>
> Some of this is a product of mixed signals from different commenters and
> my being perhaps a little _too_ willing to just go with the flow.
>
> With interrupts disabled and IPI blocked, plus the assurances that
> interrupts being disabled implied the RCU version of page table
> manipulation is also blocked, my understanding was that remapping in this
> process to another page could not occur.
>
> Of course the folio is 'stable' in the sense we have a refcount on it, but
> it is unlocked so things can change.
>
> I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
> because in the IPI case it seems to me you couldn't even clear the PTE
> entry before getting to the page table case.
>
> Otherwise, I'm a bit uncertain actually as to how we can get to the point
> where the folio->mapping is being manipulated. Is this why?
I'll just stress again that I think there are cases where we unmap and
free a page without synchronizing against GUP-fast using an IPI or RCU.
That's one of the reasons why we recheck if the PTE changed to back off,
so I've been told.
I'm happy if someone proves me wrong and a page we just (temporarily)
pinned cannot have been freed+reused in the meantime.
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote:
> On 02.05.23 21:25, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > On a specific point - if mapping turns out to be NULL after we confirm
> > > > stable PTE, I'd be inclined to reject and let the slow path take care of
> > > > it, would you agree that that's the correct approach?
> > >
> > > I think in general if GUP fast detects any kind of race it should bail
> > > to the slow path.
> > >
> > > The races it tries to resolve itself should have really safe and
> > > obvious solutions.
> > >
> > > I think this comment is misleading:
> > >
> > > > + /*
> > > > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > + * to disappear from under us, as well as preventing RCU grace periods
> > > > + * from making progress (i.e. implying rcu_read_lock()).
> > >
> > > True, but that is not important here since we are not reading page
> > > tables
> > >
> > > > + * This means we can rely on the folio remaining stable for all
> > > > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > + * and those that do not.
> > >
> > > Not really clear. We have a valid folio refcount here, that is all.
> >
> > Some of this is a product of mixed signals from different commenters and
> > my being perhaps a little _too_ willing to just go with the flow.
> >
> > With interrupts disabled and IPI blocked, plus the assurances that
> > interrupts being disabled implied the RCU version of page table
> > manipulation is also blocked, my understanding was that remapping in this
> > process to another page could not occur.
> >
> > Of course the folio is 'stable' in the sense we have a refcount on it, but
> > it is unlocked so things can change.
> >
> > I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
> > because in the IPI case it seems to me you couldn't even clear the PTE
> > entry before getting to the page table case.
> >
> > Otherwise, I'm a bit uncertain actually as to how we can get to the point
> > where the folio->mapping is being manipulated. Is this why?
>
> I'll just stress again that I think there are cases where we unmap and free
> a page without synchronizing against GUP-fast using an IPI or RCU.
OK, that explains why we need to be careful. Don't worry, I am going to move
the check to after we confirm the PTE entry hasn't changed.
>
> That's one of the reasons why we recheck if the PTE changed to back off, so
> I've been told.
>
> I'm happy if someone proves me wrong and a page we just (temporarily) pinned
> cannot have been freed+reused in the meantime.
Let's play it safe for now :)
>
> --
> Thanks,
>
> David / dhildenb
>
On Tue, May 02, 2023 at 09:17:53PM +0200, David Hildenbrand wrote:
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > + struct address_space *mapping;
> > +
> > + /*
> > + * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > + * to disappear from under us, as well as preventing RCU grace periods
> > + * from making progress (i.e. implying rcu_read_lock()).
> > + *
> > + * This means we can rely on the folio remaining stable for all
> > + * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > + * and those that do not.
> > + *
> > + * We get the added benefit that given inodes, and thus address_space,
> > + * objects are RCU freed, we can rely on the mapping remaining stable
> > + * here with no risk of a truncation or similar race.
> > + */
> > + lockdep_assert_irqs_disabled();
> > +
> > + /*
> > + * If no mapping can be found, this implies an anonymous or otherwise
> > + * non-file backed folio so in this instance we permit the pin.
> > + *
> > + * shmem and hugetlb mappings do not require dirty-tracking so we
> > + * explicitly whitelist these.
> > + *
> > + * Other non dirty-tracked folios will be picked up on the slow path.
> > + */
> > + mapping = folio_mapping(folio);
> > + return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > +}
>
> BTW, try_grab_folio() is also called from follow_hugetlb_page(), which is
> ordinary GUP and has interrupts enabled if I am not wrong.
It does hold the PTL though, so nothing can fiddle with the entry.
But that does suggest folio_test_hugetlb() should be put _before_ the irq
disabled assertion then :)
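i.e. something like (illustrative):

	/*
	 * follow_hugetlb_page() calls in with IRQs enabled but the PTL held,
	 * and the hugetlb flag is stable under our folio reference, so test
	 * it before asserting that IRQs are disabled.
	 */
	if (folio_test_hugetlb(folio))
		return true;

	lockdep_assert_irqs_disabled();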
>
> --
> Thanks,
>
> David / dhildenb
>
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote:
> I'll just stress again that I think there are cases where we unmap and free
> a page without synchronizing against GUP-fast using an IPI or RCU.
AFAIK this is true on ARM64 and other arches that do not use IPIs for
TLB shootdown.
Eg the broadcast TLBI described here:
https://developer.arm.com/documentation/101811/0102/Translation-Lookaside-Buffer-maintenance
TLB invalidation of remote CPUs Is done at the interconnect level and
does not trigger any interrupt.
So, arches that don't use IPIs have to RCU free the page tables to work
with GUP fast. They will set MMU_GATHER_RCU_TABLE_FREE. The
whole interrupt disable thing in GUP turns into nothing more than a
hacky RCU on those arches.
The ugly bit is the comment:
* We do not adopt an rcu_read_lock() here as we also want to block IPIs
* that come from THPs splitting.
Which, I think, can be summarized in today's code base as
serializing with split_huge_page_to_list().
I don't know this code well, but the comment there says "Only caller
must hold pin on the @page" which is only possible if all the PTEs
have been replaced with migration entries or otherwise before we get
here. So the IPI the comment talks about, I suppose, is from the
installation of the migration entry?
However gup_huge_pmd() does the usual read, ref, check pattern, and
split_huge_page_to_list() uses page_ref_freeze() which is cmpxchg
So the three races seem to resolve themselves without IPI magic
- GUP sees the THP folio before the migration entry and +1's the ref
  split_huge_page_to_list() bails because the ref is elevated
- GUP fails to +1 the ref because it is zero and bails,
split_huge_page_to_list() made it zero, so it runs
- GUP sees the THP folio, split_huge_page_to_list() ran to
completion, and then GUP +1's a 4k page. The recheck of pmd_val
will see the migration entry, or the new PTE table pointer, but
never the original THP folio. So GUP will bail. The speculative
ref on the 4k page is harmless.
I can't figure out what this 2014 comment means in today's code.
Or where ARM64 hid the "IPI broadcast on THP split" :)
See commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> That's one of the reasons why we recheck if the PTE changed to back off, so
> I've been told.
Yes, with no synchronizing point the refcount in GUP fast could be
taken on the page after it has been free'd and reallocated, but this
is only possible on the RCU architectures.
> I'm happy if someone proves me wrong and a page we just (temporarily) pinned
> cannot have been freed+reused in the meantime.
Sadly I think no.. We'd need to RCU free the page itself as well to
make this true.
Jason
On Tue, May 02, 2023 at 08:25:57PM +0100, Lorenzo Stoakes wrote:
> Otherwise, I'm a bit uncertain actually as to how we can get to the point
> where the folio->mapping is being manipulated. Is this why?
On RCU architectures another thread zap's the PTEs and proceeds to
teardown and free the page. Part of that is clearing folio->mapping
under the folio lock.
The IPI approach would not get there, but we can't think in terms of
IPI since we have RCU architectures.
> In any case, I will try to clarify these, I do agree the _key_ point is
> that we can rely on safely derefing the mapping, at least READ_ONCE()'d, as
> use-after-free or dereffing garbage is the fear here.
I didn't chase it down, but as long as folio->mapping is always set to
something that is RCU free'd then this would be OK.
> Well that was literally the question :) and I've got somewhat contradictory
> feedback. My instinct aligns with yours in that, just fallback to slow
> path, so that's what I'll do. But just wanted to confirm.
I don't know it well enough to say; it looks like folio->mapping is
almost always set to something, from what I can see.. at least a NULL
pointer or the anon bits set, for instance.
In any case falling back to slow is always safe; I'd just go to the
trouble of checking, with testing, that in normal process cases we
don't hit it.
> The READ_ONCE() approach is precisely how I wanted to do it in thet first
> instance, but feared feedback about duplication and wondered if it made
> much difference, but now it's clear this is ther ight way.
The duplication is bad, and maybe we need more functions to counter
it, but GUP needs to know some of the details a little more, eg a NULL
return from folio_mapping() could inidicate the anon bits being set
which should not flow down to slow.
Jason