Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct mapping to the folio, no
write notify will occur, and if the caller does mark the folio dirty, this
will be done unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
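The broken pattern is nothing exotic; a hedged sketch of a GUP caller hitting
steps 1-4 above might look like the following (illustrative kernel-style code
only, not taken from any real driver; user_addr, data and len are
hypothetical):

```c
struct page *pages[1];
void *kaddr;
int ret;

/* Step 1: write-faults the user mapping; the file system is
 * notified via write notify and the folio is dirtied. */
ret = pin_user_pages_fast(user_addr, 1,
			  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret != 1)
	return ret < 0 ? ret : -EFAULT;

/*
 * Step 2 may happen at any point here: writeback cleans the
 * folio and write-protects the PTE, entirely unaware of the pin.
 */

/* Step 3: write via the kernel's direct mapping - no fault, no
 * write notify, the file system never hears about it. */
kaddr = kmap_local_page(pages[0]);
memcpy(kaddr, data, len);
kunmap_local(kaddr);

/* Step 4: unpin and (optionally) dirty the now-clean folio,
 * which the file system does not expect. */
unpin_user_pages_dirty_lock(pages, 1, true);
```

A FOLL_WRITE pin taken with FOLL_LONGTERM, as above, is exactly the
combination this series disallows for dirty-tracked file mappings.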
This change updates both the slow and fast PUP FOLL_LONGTERM paths. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow path
should this fail.
v8:
- Fixed typo writeable -> writable.
- Fixed bug in writable_file_mapping_allowed() - must check combination of
FOLL_PIN AND FOLL_LONGTERM not either/or.
- Updated vma_needs_dirty_tracking() to include write/shared to account for
MAP_PRIVATE mappings.
- Move to open-coding the checks in folio_pin_allowed() so we can
  READ_ONCE() the mapping and avoid unexpected compiler loads. Rename to
  account for the fact we now check flags here.
- Disallow mapping == NULL or mapping & PAGE_MAPPING_FLAGS other than
anon. Defer to slow path.
- Perform GUP-fast check _after_ the lowest page table level is confirmed to
be stable.
- Updated comments and commit message for final patch as per Jason's
suggestions.
v7:
- Fixed very silly bug in writeable_file_mapping_allowed() inverting the
logic.
- Removed unnecessary RCU lock code and replaced with adaptation of Peter's
idea.
- Removed unnecessary open-coded folio_test_anon() in
folio_longterm_write_pin_allowed() and restructured to generally permit
NULL folio_mapping().
https://lore.kernel.org/all/[email protected]/
v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
correctly.
- Split patch series into 3 to make it more digestible.
https://lore.kernel.org/all/[email protected]/
v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/
v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
  reduce duplication, and use this in the GUP check. Note that both
  functions separately check vm_ops_needs_writenotify(), as the latter
  needs to test this before the vm_pgprot_modify() test; this results in
  vma_wants_writenotify() checking it twice, but the check is small enough
  that this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/
v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
vma_wants_writenotify() could not be used directly as it is very much focused
on determining if the PTE r/w should be set (e.g. assuming private mapping
does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/
v2:
- Add accidentally excluded ptrace_access_vm() use of
FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/
v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/
Lorenzo Stoakes (3):
mm/mmap: separate writenotify and dirty tracking logic
mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
mappings
mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
mappings
include/linux/mm.h | 1 +
mm/gup.c | 146 ++++++++++++++++++++++++++++++++++++++++++++-
mm/mmap.c | 53 ++++++++++++----
3 files changed, 186 insertions(+), 14 deletions(-)
--
2.40.1
vma_wants_writenotify() is specifically intended for setting PTE page table
flags, accounting for existing page table flag state and whether the
filesystem performs dirty tracking.
Separate out the notions of dirty tracking and PTE write notify checking in
order that we can invoke the dirty tracking check from elsewhere.
Note that this change introduces a very small duplicate check of the
separated out vm_ops_needs_writenotify() and vma_is_shared_writable()
functions. This is necessary to avoid making vma_needs_dirty_tracking()
needlessly complicated (e.g. passing flags or having it assume checks were
already performed). This is small enough that it doesn't seem too
egregious.
We check to ensure the mapping is shared writable, as in any other case a
GUP caller will be safe - MAP_PRIVATE mappings will be CoW'd and read-only
file-backed shared mappings are not permitted write access, even with
FOLL_FORCE.
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
include/linux/mm.h | 1 +
mm/mmap.c | 53 ++++++++++++++++++++++++++++++++++------------
2 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..7b1d4e7393ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index 5522130ae606..fa7442e44cc2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1475,6 +1475,42 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */
+/* Do VMA operations imply write notify is required? */
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+ return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+/* Is this VMA shared and writable? */
+static bool vma_is_shared_writable(struct vm_area_struct *vma)
+{
+ return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
+ (VM_WRITE | VM_SHARED);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+ /* Only shared, writable VMAs require dirty tracking. */
+ if (!vma_is_shared_writable(vma))
+ return false;
+
+ /* Does the filesystem need to be notified? */
+ if (vm_ops_needs_writenotify(vma->vm_ops))
+ return true;
+
+ /* Specialty mapping? */
+ if (vma->vm_flags & VM_PFNMAP)
+ return false;
+
+ /* Can the mapping track the dirty pages? */
+ return vma->vm_file && vma->vm_file->f_mapping &&
+ mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
/*
* Some shared mappings will want the pages marked read-only
* to track write events. If so, we'll downgrade vm_page_prot
@@ -1483,21 +1519,18 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
*/
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
{
- vm_flags_t vm_flags = vma->vm_flags;
- const struct vm_operations_struct *vm_ops = vma->vm_ops;
-
/* If it was private or non-writable, the write bit is already clear */
- if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
+ if (!vma_is_shared_writable(vma))
return 0;
/* The backer wishes to know when pages are first written to? */
- if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+ if (vm_ops_needs_writenotify(vma->vm_ops))
return 1;
/* The open routine did something to the protections that pgprot_modify
* won't preserve? */
if (pgprot_val(vm_page_prot) !=
- pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
+ pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags)))
return 0;
/*
@@ -1511,13 +1544,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
if (userfaultfd_wp(vma))
return 1;
- /* Specialty mapping? */
- if (vm_flags & VM_PFNMAP)
- return 0;
-
- /* Can the mapping track the dirty pages? */
- return vma->vm_file && vma->vm_file->f_mapping &&
- mapping_can_writeback(vma->vm_file->f_mapping);
+ return vma_needs_dirty_tracking(vma);
}
/*
--
2.40.1
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.
A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.
The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.
As a result of the use of this secondary, direct mapping to the folio, no
write notify will occur, and if the caller does mark the folio dirty, this
will be done unexpectedly.
For example, consider the following scenario:-
1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).
This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).
This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.
This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.
Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.
However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.
These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.
This patch changes only the slow-path GUP functions, a following patch
adapts the GUP-fast path along similar lines.
[1]: https://lore.kernel.org/linux-mm/[email protected]/
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
mm/gup.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..0ea9ebec9547 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,54 @@ static int faultin_page(struct vm_area_struct *vma,
return 0;
}
+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * does not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ * the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ * the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ * direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ * (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writable_file_mapping_allowed(struct vm_area_struct *vma,
+ unsigned long gup_flags)
+{
+ /*
+ * If we aren't pinning then no problematic write can occur. A long term
+ * pin is the most egregious case so this is the case we disallow.
+ */
+ if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) !=
+ (FOLL_PIN | FOLL_LONGTERM))
+ return true;
+
+ /*
+ * If the VMA does not require dirty tracking then no problematic write
+ * can occur either.
+ */
+ return !vma_needs_dirty_tracking(vma);
+}
+
static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
vm_flags_t vm_flags = vma->vm_flags;
int write = (gup_flags & FOLL_WRITE);
int foreign = (gup_flags & FOLL_REMOTE);
+ bool vma_anon = vma_is_anonymous(vma);
if (vm_flags & (VM_IO | VM_PFNMAP))
return -EFAULT;
- if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+ if ((gup_flags & FOLL_ANON) && !vma_anon)
return -EFAULT;
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1016,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
return -EFAULT;
if (write) {
+ if (!vma_anon &&
+ !writable_file_mapping_allowed(vma, gup_flags))
+ return -EFAULT;
+
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
--
2.40.1
Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.
This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.
We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous or belongs to a whitelisted filesystem - specifically
hugetlb and shmem mappings.
We take special care to ensure that both the folio and mapping are safe to
access when performing these checks and document folio_fast_pin_allowed()
accordingly.
It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast call, let alone in combination with
FOLL_LONGTERM, so we can always rely on the fact that if we fail to pin on
the fast path, the code will fall back to the slow path which can perform
the more thorough check.
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Kirill A . Shutemov <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
mm/gup.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 102 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index 0ea9ebec9547..1ab369b5d889 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
#include <linux/migrate.h>
#include <linux/mm_inline.h>
#include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -95,6 +96,83 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
return folio;
}
+/*
+ * Used in the GUP-fast path to determine whether a pin is permitted for a
+ * specific folio.
+ *
+ * This call assumes the caller has pinned the folio, that the lowest page table
+ * level still points to this folio, and that interrupts have been disabled.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases and if in doubt,
+ * fall back to the slow path.
+ */
+static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
+{
+ struct address_space *mapping;
+ unsigned long mapping_flags;
+
+ /*
+ * If we aren't pinning then no problematic write can occur. A long term
+ * pin is the most egregious case so this is the one we disallow.
+ */
+ if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+ (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+ return true;
+
+ /* The folio is pinned, so we can safely access folio fields. */
+
+ /* Neither of these should be possible, but check to be sure. */
+ if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
+ return false;
+
+ /* hugetlb mappings do not require dirty-tracking. */
+ if (folio_test_hugetlb(folio))
+ return true;
+
+ /*
+ * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
+ * cannot proceed, which means no actions performed under RCU can
+ * proceed either.
+ *
+ * inodes and thus their mappings are freed under RCU, which means the
+ * mapping cannot be freed beneath us and thus we can safely dereference
+ * it.
+ */
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * However, there may be operations which _alter_ the mapping, so ensure
+ * we read it once and only once.
+ */
+ mapping = READ_ONCE(folio->mapping);
+
+ /*
+ * The mapping may have been truncated, in any case we cannot determine
+ * if this mapping is safe - fall back to slow path to determine how to
+ * proceed.
+ */
+ if (!mapping)
+ return false;
+
+ /* Anonymous folios are fine, other non-file backed cases are not. */
+ mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
+ if (mapping_flags)
+ return mapping_flags == PAGE_MAPPING_ANON;
+
+ /*
+ * At this point, we know the mapping is non-null and points to an
+ * address_space object. The only remaining whitelisted file system is
+ * shmem.
+ */
+ return shmem_mapping(mapping);
+}
+
/**
* try_grab_folio() - Attempt to get or pin a folio.
* @page: pointer to page to be grabbed
@@ -2464,6 +2542,11 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
goto pte_unmap;
}
+ if (!folio_fast_pin_allowed(folio, flags)) {
+ gup_put_folio(folio, 1, flags);
+ goto pte_unmap;
+ }
+
if (!pte_write(pte) && gup_must_unshare(NULL, flags, page)) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
@@ -2656,6 +2739,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}
+ if (!folio_fast_pin_allowed(folio, flags)) {
+ gup_put_folio(folio, refs, flags);
+ return 0;
+ }
+
if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2722,6 +2810,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}
+ if (!folio_fast_pin_allowed(folio, flags)) {
+ gup_put_folio(folio, refs, flags);
+ return 0;
+ }
if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2762,6 +2854,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
}
+ if (!folio_fast_pin_allowed(folio, flags)) {
+ gup_put_folio(folio, refs, flags);
+ return 0;
+ }
+
if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2797,6 +2894,11 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
return 0;
}
+ if (!folio_fast_pin_allowed(folio, flags)) {
+ gup_put_folio(folio, refs, flags);
+ return 0;
+ }
+
*nr += refs;
folio_set_referenced(folio);
return 1;
--
2.40.1
On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
> the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
> the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
> direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
> (though it does not have to).
>
> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> pin_user_pages_fast_only() does not exist, we can rely on a slightly
> imperfect whitelisting in the PUP-fast case and fall back to the slow case
> should this fail.
>
> v8:
> - Fixed typo writeable -> writable.
> - Fixed bug in writable_file_mapping_allowed() - must check combination of
> FOLL_PIN AND FOLL_LONGTERM not either/or.
> - Updated vma_needs_dirty_tracking() to include write/shared to account for
> MAP_PRIVATE mappings.
> - Move to open-coding the checks in folio_pin_allowed() so we can
> READ_ONCE() the mapping and avoid unexpected compiler loads. Rename to
> account for fact we now check flags here.
> - Disallow mapping == NULL or mapping & PAGE_MAPPING_FLAGS other than
> anon. Defer to slow path.
> - Perform GUP-fast check _after_ the lowest page table level is confirmed to
> be stable.
> - Updated comments and commit message for final patch as per Jason's
> suggestions.
Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- this time both vfio_pin_pages_remote() (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) call in kvm_s390_pci_aif_enable() are being allowed (i.e. returning a positive pin count).
On Tue, 2 May 2023 23:51:35 +0100 Lorenzo Stoakes <[email protected]> wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous or belonging to a whitelisted filesystem - specifically
> hugetlb and shmem mappings.
>
> We take special care to ensure that both the folio and mapping are safe to
> access when performing these checks and document folio_fast_pin_allowed()
> accordingly.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
arm allnoconfig said
mm/gup.c:115:13: warning: 'folio_fast_pin_allowed' defined but not used [-Wunused-function]
115 | static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
| ^~~~~~~~~~~~~~~~~~~~~~
so I moved the definition inside CONFIG_ARCH_HAS_PTE_SPECIAL.
mm/gup.c | 154 ++++++++++++++++++++++++++---------------------------
1 file changed, 77 insertions(+), 77 deletions(-)
--- a/mm/gup.c~mm-gup-disallow-foll_longterm-gup-fast-writing-to-file-backed-mappings-fix
+++ a/mm/gup.c
@@ -96,83 +96,6 @@ retry:
return folio;
}
-/*
- * Used in the GUP-fast path to determine whether a pin is permitted for a
- * specific folio.
- *
- * This call assumes the caller has pinned the folio, that the lowest page table
- * level still points to this folio, and that interrupts have been disabled.
- *
- * Writing to pinned file-backed dirty tracked folios is inherently problematic
- * (see comment describing the writable_file_mapping_allowed() function). We
- * therefore try to avoid the most egregious case of a long-term mapping doing
- * so.
- *
- * This function cannot be as thorough as that one as the VMA is not available
- * in the fast path, so instead we whitelist known good cases and if in doubt,
- * fall back to the slow path.
- */
-static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
-{
- struct address_space *mapping;
- unsigned long mapping_flags;
-
- /*
- * If we aren't pinning then no problematic write can occur. A long term
- * pin is the most egregious case so this is the one we disallow.
- */
- if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
- (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
- return true;
-
- /* The folio is pinned, so we can safely access folio fields. */
-
- /* Neither of these should be possible, but check to be sure. */
- if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
- return false;
-
- /* hugetlb mappings do not require dirty-tracking. */
- if (folio_test_hugetlb(folio))
- return true;
-
- /*
- * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
- * cannot proceed, which means no actions performed under RCU can
- * proceed either.
- *
- * inodes and thus their mappings are freed under RCU, which means the
- * mapping cannot be freed beneath us and thus we can safely dereference
- * it.
- */
- lockdep_assert_irqs_disabled();
-
- /*
- * However, there may be operations which _alter_ the mapping, so ensure
- * we read it once and only once.
- */
- mapping = READ_ONCE(folio->mapping);
-
- /*
- * The mapping may have been truncated, in any case we cannot determine
- * if this mapping is safe - fall back to slow path to determine how to
- * proceed.
- */
- if (!mapping)
- return false;
-
- /* Anonymous folios are fine, other non-file backed cases are not. */
- mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
- if (mapping_flags)
- return mapping_flags == PAGE_MAPPING_ANON;
-
- /*
- * At this point, we know the mapping is non-null and points to an
- * address_space object. The only remaining whitelisted file system is
- * shmem.
- */
- return shmem_mapping(mapping);
-}
-
/**
* try_grab_folio() - Attempt to get or pin a folio.
* @page: pointer to page to be grabbed
@@ -2474,6 +2397,83 @@ static void __maybe_unused undo_dev_page
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
/*
+ * Used in the GUP-fast path to determine whether a pin is permitted for a
+ * specific folio.
+ *
+ * This call assumes the caller has pinned the folio, that the lowest page table
+ * level still points to this folio, and that interrupts have been disabled.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases and if in doubt,
+ * fall back to the slow path.
+ */
+static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
+{
+ struct address_space *mapping;
+ unsigned long mapping_flags;
+
+ /*
+ * If we aren't pinning then no problematic write can occur. A long term
+ * pin is the most egregious case so this is the one we disallow.
+ */
+ if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+ (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+ return true;
+
+ /* The folio is pinned, so we can safely access folio fields. */
+
+ /* Neither of these should be possible, but check to be sure. */
+ if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
+ return false;
+
+ /* hugetlb mappings do not require dirty-tracking. */
+ if (folio_test_hugetlb(folio))
+ return true;
+
+ /*
+ * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
+ * cannot proceed, which means no actions performed under RCU can
+ * proceed either.
+ *
+ * inodes and thus their mappings are freed under RCU, which means the
+ * mapping cannot be freed beneath us and thus we can safely dereference
+ * it.
+ */
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * However, there may be operations which _alter_ the mapping, so ensure
+ * we read it once and only once.
+ */
+ mapping = READ_ONCE(folio->mapping);
+
+ /*
+ * The mapping may have been truncated, in any case we cannot determine
+ * if this mapping is safe - fall back to slow path to determine how to
+ * proceed.
+ */
+ if (!mapping)
+ return false;
+
+ /* Anonymous folios are fine, other non-file backed cases are not. */
+ mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
+ if (mapping_flags)
+ return mapping_flags == PAGE_MAPPING_ANON;
+
+ /*
+ * At this point, we know the mapping is non-null and points to an
+ * address_space object. The only remaining whitelisted file system is
+ * shmem.
+ */
+ return shmem_mapping(mapping);
+}
+
+/*
* Fast-gup relies on pte change detection to avoid concurrent pgtable
* operations.
*
_
On 03.05.23 02:31, Matthew Rosato wrote:
> On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
>> Writing to file-backed mappings which require folio dirty tracking using
>> GUP is a fundamentally broken operation, as kernel write access to GUP
>> mappings do not adhere to the semantics expected by a file system.
>>
>> A GUP caller uses the direct mapping to access the folio, which does not
>> cause write notify to trigger, nor does it enforce that the caller marks
>> the folio dirty.
>>
>> The problem arises when, after an initial write to the folio, writeback
>> results in the folio being cleaned and then the caller, via the GUP
>> interface, writes to the folio again.
>>
>> As a result of the use of this secondary, direct, mapping to the folio no
>> write notify will occur, and if the caller does mark the folio dirty, this
>> will be done so unexpectedly.
>>
>> For example, consider the following scenario:-
>>
>> 1. A folio is written to via GUP which write-faults the memory, notifying
>> the file system and dirtying the folio.
>> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>> the PTE being marked read-only.
>> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>> direct mapping.
>> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>> (though it does not have to).
>>
>> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
>> pin_user_pages_fast_only() does not exist, we can rely on a slightly
>> imperfect whitelisting in the PUP-fast case and fall back to the slow case
>> should this fail.
>>
>> v8:
>> - Fixed typo writeable -> writable.
>> - Fixed bug in writable_file_mapping_allowed() - must check combination of
>> FOLL_PIN AND FOLL_LONGTERM not either/or.
>> - Updated vma_needs_dirty_tracking() to include write/shared to account for
>> MAP_PRIVATE mappings.
>> - Move to open-coding the checks in folio_pin_allowed() so we can
>> READ_ONCE() the mapping and avoid unexpected compiler loads. Rename to
>> account for fact we now check flags here.
>> - Disallow mapping == NULL or mapping & PAGE_MAPPING_FLAGS other than
>> anon. Defer to slow path.
>> - Perform GUP-fast check _after_ the lowest page table level is confirmed to
>> be stable.
>> - Updated comments and commit message for final patch as per Jason's
>> suggestions.
>
> Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- this time both vfio_pin_pages_remote() (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable() are being allowed (i.e. returning a positive pin count)
At least it's consistent now ;) And it might be working as expected ...
In v7:
* pin_user_pages_fast() succeeded
* vfio_pin_pages_remote() failed
But also in v7:
* GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
mappings
* Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
In v8:
* pin_user_pages_fast() succeeds
* vfio_pin_pages_remote() succeeds
But also in v8:
* GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
mappings
* Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
I have to speculate, but ... could it be that you are using a private
mapping?
In QEMU, unfortunately, the default for memory-backend-file is
"share=off" (private) ... for memory-backend-memfd it is "share=on"
(shared). The default is stupid ...
If you invoke QEMU manually, can you specify "share=on" for the
memory-backend-file? I thought libvirt would always default to
"share=on" for file mappings (everything else doesn't make much sense)
... but you might have to specify
<access mode="shared"/>
in addition to
<source type="file"/>
--
Thanks,
David / dhildenb
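[Editor's note: the four-step scenario quoted above can be expressed as a
toy state machine. This is illustrative userspace C, not kernel code; all
names are made up for the sketch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the broken write sequence described in the commit
 * message above. Not kernel code; names are illustrative. */
struct toy_folio {
	bool dirty;
	bool pte_writable;      /* userspace PTE write permission */
	int  writenotify_calls; /* times the filesystem was notified */
};

/* Step 1: a user write faults, notifying the filesystem and dirtying. */
static void user_write_fault(struct toy_folio *f)
{
	f->writenotify_calls++;
	f->pte_writable = true;
	f->dirty = true;
}

/* Step 2: writeback cleans the folio and write-protects the PTE. */
static void writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/* Step 3: a GUP caller writes via the kernel's direct map; no fault
 * occurs, so no write-notify happens regardless of the PTE state. */
static void gup_direct_write(struct toy_folio *f)
{
	(void)f; /* data changes; writenotify_calls does not */
}

/* Step 4: the GUP caller unpins and (optionally) dirties the folio. */
static void gup_unpin_dirty(struct toy_folio *f)
{
	f->dirty = true;
}
```

The end state is exactly the anomaly described: the folio is dirty again,
but the filesystem was only notified once, at the original fault.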
On Tue 02-05-23 23:51:35, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping; however, we can determine whether the folio
> is anonymous or belongs to a whitelisted filesystem - specifically
> hugetlb and shmem mappings.
>
> We take special care to ensure that both the folio and mapping are safe to
> access when performing these checks and document folio_fast_pin_allowed()
> accordingly.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Kirill A . Shutemov <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
The patch looks good to me now. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
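[Editor's note: the commit message above restricts only the most egregious
case, and the changelog notes the gate must test the full combination of
FOLL_PIN, FOLL_LONGTERM and FOLL_WRITE rather than any one flag. A toy
model of that gate, with made-up flag values:]

```c
#include <assert.h>

/* Model of the gating test in folio_fast_pin_allowed(): only the full
 * combination FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE is subject to the
 * extra checks. Flag values here are illustrative, not the kernel's. */
#define TOY_FOLL_WRITE    0x1u
#define TOY_FOLL_PIN      0x2u
#define TOY_FOLL_LONGTERM 0x4u

static int toy_needs_check(unsigned int flags)
{
	const unsigned int all =
		TOY_FOLL_PIN | TOY_FOLL_LONGTERM | TOY_FOLL_WRITE;

	/* All three must be present; anything less is allowed through. */
	return (flags & all) == all;
}
```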
> ---
> mm/gup.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 102 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 0ea9ebec9547..1ab369b5d889 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/mmu_context.h>
> #include <asm/tlbflush.h>
> @@ -95,6 +96,83 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> return folio;
> }
>
> +/*
> + * Used in the GUP-fast path to determine whether a pin is permitted for a
> + * specific folio.
> + *
> + * This call assumes the caller has pinned the folio, that the lowest page table
> + * level still points to this folio, and that interrupts have been disabled.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases and if in doubt,
> + * fall back to the slow path.
> + */
> +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> +{
> + struct address_space *mapping;
> + unsigned long mapping_flags;
> +
> + /*
> + * If we aren't pinning then no problematic write can occur. A long term
> + * pin is the most egregious case so this is the one we disallow.
> + */
> + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
> + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
> + return true;
> +
> + /* The folio is pinned, so we can safely access folio fields. */
> +
> + /* Neither of these should be possible, but check to be sure. */
> + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> + return false;
> +
> + /* hugetlb mappings do not require dirty-tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + /*
> + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> + * cannot proceed, which means no actions performed under RCU can
> + * proceed either.
> + *
> + * inodes and thus their mappings are freed under RCU, which means the
> + * mapping cannot be freed beneath us and thus we can safely dereference
> + * it.
> + */
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * However, there may be operations which _alter_ the mapping, so ensure
> + * we read it once and only once.
> + */
> + mapping = READ_ONCE(folio->mapping);
> +
> + /*
> + * The mapping may have been truncated, in any case we cannot determine
> + * if this mapping is safe - fall back to slow path to determine how to
> + * proceed.
> + */
> + if (!mapping)
> + return false;
> +
> + /* Anonymous folios are fine, other non-file backed cases are not. */
> + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> + if (mapping_flags)
> + return mapping_flags == PAGE_MAPPING_ANON;
> +
> + /*
> + * At this point, we know the mapping is non-null and points to an
> + * address_space object. The only remaining whitelisted file system is
> + * shmem.
> + */
> + return shmem_mapping(mapping);
> +}
> +
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -2464,6 +2542,11 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> goto pte_unmap;
> }
>
> + if (!folio_fast_pin_allowed(folio, flags)) {
> + gup_put_folio(folio, 1, flags);
> + goto pte_unmap;
> + }
> +
> if (!pte_write(pte) && gup_must_unshare(NULL, flags, page)) {
> gup_put_folio(folio, 1, flags);
> goto pte_unmap;
> @@ -2656,6 +2739,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> return 0;
> }
>
> + if (!folio_fast_pin_allowed(folio, flags)) {
> + gup_put_folio(folio, refs, flags);
> + return 0;
> + }
> +
> if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2722,6 +2810,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> return 0;
> }
>
> + if (!folio_fast_pin_allowed(folio, flags)) {
> + gup_put_folio(folio, refs, flags);
> + return 0;
> + }
> +
> if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2762,6 +2854,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> return 0;
> }
>
> + if (!folio_fast_pin_allowed(folio, flags)) {
> + gup_put_folio(folio, refs, flags);
> + return 0;
> + }
> +
> if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2797,6 +2894,11 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
> return 0;
> }
>
> + if (!folio_fast_pin_allowed(folio, flags)) {
> + gup_put_folio(folio, refs, flags);
> + return 0;
> + }
> +
> *nr += refs;
> folio_set_referenced(folio);
> return 1;
> --
> 2.40.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
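[Editor's note: the mapping-tag test at the end of folio_fast_pin_allowed()
above can be modeled in userspace. The tag values mirror the kernel's
PAGE_MAPPING_* constants at the time of this thread (ANON=0x1,
MOVABLE=0x2, KSM sets both); treat them as illustrative.]

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAPPING_ANON    0x1UL
#define TOY_MAPPING_MOVABLE 0x2UL
#define TOY_MAPPING_KSM     (TOY_MAPPING_ANON | TOY_MAPPING_MOVABLE)
#define TOY_MAPPING_FLAGS   (TOY_MAPPING_ANON | TOY_MAPPING_MOVABLE)

/* Returns nonzero if a fast-path pin may proceed, mirroring the
 * "mapping_flags == PAGE_MAPPING_ANON" logic in the patch. A NULL
 * (truncated) mapping defers to the slow path; a tagged pointer is
 * allowed only if it is plain anon (KSM folios fall back); an untagged
 * pointer is file-backed, where the real code goes on to check shmem. */
static int toy_pin_allowed(uintptr_t mapping)
{
	uintptr_t flags = mapping & TOY_MAPPING_FLAGS;

	if (!mapping)
		return 0;
	if (flags)
		return flags == TOY_MAPPING_ANON;
	return 1; /* stand-in for shmem_mapping(mapping) */
}
```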
On 5/3/23 3:08 AM, David Hildenbrand wrote:
> On 03.05.23 02:31, Matthew Rosato wrote:
>> [...]
>
> [...]
>
> I have to speculate, but ... could it be that you are using a private mapping?
>
> In QEMU, unfortunately, the default for memory-backend-file is "share=off" (private) ... for memory-backend-memfd it is "share=on" (shared). The default is stupid ...
>
> If you invoke QEMU manually, can you specify "share=on" for the memory-backend-file? I thought libvirt would always default to "share=on" for file mappings (everything else doesn't make much sense) ... but you might have to specify
> <access mode="shared"/>
> in addition to
> <source type="file"/>
>
Ah, there we go. Yes, I was using the default of share=off. When I instead specify share=on, now the pins will fail in both cases.
On 03.05.23 13:25, Matthew Rosato wrote:
> On 5/3/23 3:08 AM, David Hildenbrand wrote:
>> [...]
>
> Ah, there we go. Yes, I was using the default of share=off. When I instead specify share=on, now the pins will fail in both cases.
>
Out of curiosity, how does that manifest?
I assume the VM is successfully created and as Linux tries initializing
and using the device, we get a bunch of errors inside the VM, correct?
--
Thanks,
David / dhildenb
On 5/3/23 8:53 AM, David Hildenbrand wrote:
> On 03.05.23 13:25, Matthew Rosato wrote:
>> [...]
>
> Out of curiosity, how does that manifest?
>
> I assume the VM is successfully created and as Linux tries initializing and using the device, we get a bunch of errors inside the VM, correct?
>
Yes, that's correct.
Which error comes first (an attempt at mapping something via the type1 IOMMU or an attempt to register AEN) depends on the device type and the order of operations of the associated driver, but in either case you're going to see guest errors associated with that action. mlx5 and ism give up rather quickly and just fail their probe; nvme in the guest is persistent and keeps re-attempting to set up AEN by issuing the associated instruction, but the associated blockdev will never show up.
On 03.05.23 00:51, Lorenzo Stoakes wrote:
> vma_wants_writenotify() is specifically intended for setting PTE page table
> flags, accounting for existing page table flag state and whether the
> filesystem performs dirty tracking.
>
> Separate out the notions of dirty tracking and PTE write notify checking in
> order that we can invoke the dirty tracking check from elsewhere.
>
> Note that this change introduces a very small duplicate check of the
> separated out vm_ops_needs_writenotify() and vma_is_shared_writable()
> functions. This is necessary to avoid making vma_needs_dirty_tracking()
> needlessly complicated (e.g. passing flags or having it assume checks were
> already performed). This is small enough that it doesn't seem too
> egregious.
>
> We check to ensure the mapping is shared writable, as any GUP caller will
> be safe - MAP_PRIVATE mappings will be CoW'd and read-only file-backed
> shared mappings are not permitted access, even with FOLL_FORCE.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> Reviewed-by: Mika Penttilä <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
> include/linux/mm.h | 1 +
> mm/mmap.c | 53 ++++++++++++++++++++++++++++++++++------------
> 2 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..7b1d4e7393ef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> MM_CP_UFFD_WP_RESOLVE)
>
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5522130ae606..fa7442e44cc2 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1475,6 +1475,42 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
> }
> #endif /* __ARCH_WANT_SYS_OLD_MMAP */
>
> +/* Do VMA operations imply write notify is required? */
Nit: comment is superfluous, this is already self-documenting code.
> +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
> +{
> + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
> +}
> +
> +/* Is this VMA shared and writable? */
Nit: ditto
> +static bool vma_is_shared_writable(struct vm_area_struct *vma)
> +{
> + return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
> + (VM_WRITE | VM_SHARED);
> +}
> +
> +/*
> + * Does this VMA require the underlying folios to have their dirty state
> + * tracked?
> + */
Nit: ditto
> +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> +{
> + /* Only shared, writable VMAs require dirty tracking. */
> + if (!vma_is_shared_writable(vma))
> + return false;
> +
> + /* Does the filesystem need to be notified? */
> + if (vm_ops_needs_writenotify(vma->vm_ops))
> + return true;
> +
> + /* Specialty mapping? */
> + if (vma->vm_flags & VM_PFNMAP)
> + return false;
> +
> + /* Can the mapping track the dirty pages? */
> + return vma->vm_file && vma->vm_file->f_mapping &&
> + mapping_can_writeback(vma->vm_file->f_mapping);
> +}
> +
> /*
> * Some shared mappings will want the pages marked read-only
> * to track write events. If so, we'll downgrade vm_page_prot
> @@ -1483,21 +1519,18 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
> */
> int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> {
> - vm_flags_t vm_flags = vma->vm_flags;
> - const struct vm_operations_struct *vm_ops = vma->vm_ops;
> -
> /* If it was private or non-writable, the write bit is already clear */
> - if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> + if (!vma_is_shared_writable(vma))
> return 0;
>
> /* The backer wishes to know when pages are first written to? */
> - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> + if (vm_ops_needs_writenotify(vma->vm_ops))
> return 1;
>
> /* The open routine did something to the protections that pgprot_modify
> * won't preserve? */
> if (pgprot_val(vm_page_prot) !=
> - pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
> + pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags)))
> return 0;
>
> /*
> @@ -1511,13 +1544,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> if (userfaultfd_wp(vma))
> return 1;
>
> - /* Specialty mapping? */
> - if (vm_flags & VM_PFNMAP)
> - return 0;
> -
> - /* Can the mapping track the dirty pages? */
> - return vma->vm_file && vma->vm_file->f_mapping &&
> - mapping_can_writeback(vma->vm_file->f_mapping);
> + return vma_needs_dirty_tracking(vma);
> }
>
> /*
We now have duplicate vma_is_shared_writable() and
vm_ops_needs_writenotify() checks ...
Maybe move the VM_PFNMAP and "/* Can the mapping track the dirty pages?
*/" checks into a separate helper and call that from both
vma_wants_writenotify() and vma_needs_dirty_tracking()?
In any case
Acked-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
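[Editor's note: the decision chain of vma_needs_dirty_tracking() from the
patch quoted above can be modeled in userspace. Field names stand in for
the VMA/file state and are illustrative, not kernel types.]

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vma {
	bool write, shared;         /* VM_WRITE / VM_SHARED */
	bool has_mkwrite_ops;       /* page_mkwrite or pfn_mkwrite */
	bool pfnmap;                /* VM_PFNMAP */
	bool mapping_can_writeback; /* backing mapping supports writeback */
};

static bool toy_needs_dirty_tracking(const struct toy_vma *v)
{
	/* Only shared, writable VMAs require dirty tracking; private
	 * mappings are CoW'd and read-only shared mappings deny write. */
	if (!(v->write && v->shared))
		return false;
	/* Does the filesystem want write notification? */
	if (v->has_mkwrite_ops)
		return true;
	/* Specialty mapping? */
	if (v->pfnmap)
		return false;
	/* Can the mapping track the dirty pages? */
	return v->mapping_can_writeback;
}
```

A MAP_PRIVATE mapping fails the shared+writable test, which is why the
pins against QEMU's default share=off memory backend remained allowed in
the testing discussed earlier in the thread.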
On 03.05.23 00:51, Lorenzo Stoakes wrote:
> [...]
>
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
>
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
>
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
>
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
>
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
>
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
>
> This patch changes only the slow-path GUP functions, a following patch
> adapts the GUP-fast path along similar lines.
>
> [1]:https://lore.kernel.org/linux-mm/[email protected]/
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> Reviewed-by: Mika Penttilä <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
Acked-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
[...]
> +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> +{
> + struct address_space *mapping;
> + unsigned long mapping_flags;
> +
> + /*
> + * If we aren't pinning then no problematic write can occur. A long term
> + * pin is the most egregious case so this is the one we disallow.
> + */
> + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
> + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
> + return true;
> +
> + /* The folio is pinned, so we can safely access folio fields. */
> +
> + /* Neither of these should be possible, but check to be sure. */
You can easily have anon pages that are in the swapcache at this point
(especially because this function is called before our unsharing
checks), so the comment is misleading.
And there is nothing wrong with pinning an anon page that's still in
the swapcache. The following folio_test_anon() check will allow them.
I do agree regarding folio_test_slab(), though. Should we WARN in case
we would have one?
if (WARN_ON_ONCE(folio_test_slab(folio)))
return false;
> + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> + return false;
> +
> + /* hugetlb mappings do not require dirty-tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + /*
> + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> + * cannot proceed, which means no actions performed under RCU can
> + * proceed either.
> + *
> + * inodes and thus their mappings are freed under RCU, which means the
> + * mapping cannot be freed beneath us and thus we can safely dereference
> + * it.
> + */
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * However, there may be operations which _alter_ the mapping, so ensure
> + * we read it once and only once.
> + */
> + mapping = READ_ONCE(folio->mapping);
> +
> + /*
> + * The mapping may have been truncated, in any case we cannot determine
> + * if this mapping is safe - fall back to slow path to determine how to
> + * proceed.
> + */
> + if (!mapping)
> + return false;
> +
> + /* Anonymous folios are fine, other non-file backed cases are not. */
> + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> + if (mapping_flags)
> + return mapping_flags == PAGE_MAPPING_ANON;
KSM pages are also (shared) anonymous folios, and that check would fail
-- which is ok (the following unsharing check rejects long-term pinning
them), but a bit inconsistent with your comment and folio_test_anon().
It would be more consistent (with your comment and also the
folio_test_anon implementation) to have here:
return mapping_flags & PAGE_MAPPING_ANON;
> +
> + /*
> + * At this point, we know the mapping is non-null and points to an
> + * address_space object. The only remaining whitelisted file system is
> + * shmem.
> + */
> + return shmem_mapping(mapping);
> +}
> +
In general, LGTM
Acked-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On Tue, May 02, 2023 at 07:18:21PM -0700, Andrew Morton wrote:
> On Tue, 2 May 2023 23:51:35 +0100 Lorenzo Stoakes <[email protected]> wrote:
>
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
> >
> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping, however we can determine whether the folio
> > is anonymous or belonging to a whitelisted filesystem - specifically
> > hugetlb and shmem mappings.
> >
> > We take special care to ensure that both the folio and mapping are safe to
> > access when performing these checks and document folio_fast_pin_allowed()
> > accordingly.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> > always rely on the fact that if we fail to pin on the fast path, the code
> > will fall back to the slow path which can perform the more thorough check.
>
> arm allnoconfig said
>
> mm/gup.c:115:13: warning: 'folio_fast_pin_allowed' defined but not used [-Wunused-function]
> 115 | static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> | ^~~~~~~~~~~~~~~~~~~~~~
>
> so I moved the definition inside CONFIG_ARCH_HAS_PTE_SPECIAL.
>
>
>
> mm/gup.c | 154 ++++++++++++++++++++++++++---------------------------
> 1 file changed, 77 insertions(+), 77 deletions(-)
>
> --- a/mm/gup.c~mm-gup-disallow-foll_longterm-gup-fast-writing-to-file-backed-mappings-fix
> +++ a/mm/gup.c
> @@ -96,83 +96,6 @@ retry:
> return folio;
> }
>
> -/*
> - * Used in the GUP-fast path to determine whether a pin is permitted for a
> - * specific folio.
> - *
> - * This call assumes the caller has pinned the folio, that the lowest page table
> - * level still points to this folio, and that interrupts have been disabled.
> - *
> - * Writing to pinned file-backed dirty tracked folios is inherently problematic
> - * (see comment describing the writable_file_mapping_allowed() function). We
> - * therefore try to avoid the most egregious case of a long-term mapping doing
> - * so.
> - *
> - * This function cannot be as thorough as that one as the VMA is not available
> - * in the fast path, so instead we whitelist known good cases and if in doubt,
> - * fall back to the slow path.
> - */
> -static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> -{
> - struct address_space *mapping;
> - unsigned long mapping_flags;
> -
> - /*
> - * If we aren't pinning then no problematic write can occur. A long term
> - * pin is the most egregious case so this is the one we disallow.
> - */
> - if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
> - (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
> - return true;
> -
> - /* The folio is pinned, so we can safely access folio fields. */
> -
> - /* Neither of these should be possible, but check to be sure. */
> - if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> - return false;
> -
> - /* hugetlb mappings do not require dirty-tracking. */
> - if (folio_test_hugetlb(folio))
> - return true;
> -
> - /*
> - * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> - * cannot proceed, which means no actions performed under RCU can
> - * proceed either.
> - *
> - * inodes and thus their mappings are freed under RCU, which means the
> - * mapping cannot be freed beneath us and thus we can safely dereference
> - * it.
> - */
> - lockdep_assert_irqs_disabled();
> -
> - /*
> - * However, there may be operations which _alter_ the mapping, so ensure
> - * we read it once and only once.
> - */
> - mapping = READ_ONCE(folio->mapping);
> -
> - /*
> - * The mapping may have been truncated, in any case we cannot determine
> - * if this mapping is safe - fall back to slow path to determine how to
> - * proceed.
> - */
> - if (!mapping)
> - return false;
> -
> - /* Anonymous folios are fine, other non-file backed cases are not. */
> - mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> - if (mapping_flags)
> - return mapping_flags == PAGE_MAPPING_ANON;
> -
> - /*
> - * At this point, we know the mapping is non-null and points to an
> - * address_space object. The only remaining whitelisted file system is
> - * shmem.
> - */
> - return shmem_mapping(mapping);
> -}
> -
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -2474,6 +2397,83 @@ static void __maybe_unused undo_dev_page
>
> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> /*
> + * Used in the GUP-fast path to determine whether a pin is permitted for a
> + * specific folio.
> + *
> + * This call assumes the caller has pinned the folio, that the lowest page table
> + * level still points to this folio, and that interrupts have been disabled.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases and if in doubt,
> + * fall back to the slow path.
> + */
> +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> +{
> + struct address_space *mapping;
> + unsigned long mapping_flags;
> +
> + /*
> + * If we aren't pinning then no problematic write can occur. A long term
> + * pin is the most egregious case so this is the one we disallow.
> + */
> + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
> + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
> + return true;
> +
> + /* The folio is pinned, so we can safely access folio fields. */
> +
> + /* Neither of these should be possible, but check to be sure. */
> + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> + return false;
> +
> + /* hugetlb mappings do not require dirty-tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + /*
> + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> + * cannot proceed, which means no actions performed under RCU can
> + * proceed either.
> + *
> + * inodes and thus their mappings are freed under RCU, which means the
> + * mapping cannot be freed beneath us and thus we can safely dereference
> + * it.
> + */
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * However, there may be operations which _alter_ the mapping, so ensure
> + * we read it once and only once.
> + */
> + mapping = READ_ONCE(folio->mapping);
> +
> + /*
> + * The mapping may have been truncated, in any case we cannot determine
> + * if this mapping is safe - fall back to slow path to determine how to
> + * proceed.
> + */
> + if (!mapping)
> + return false;
> +
> + /* Anonymous folios are fine, other non-file backed cases are not. */
> + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> + if (mapping_flags)
> + return mapping_flags == PAGE_MAPPING_ANON;
> +
> + /*
> + * At this point, we know the mapping is non-null and points to an
> + * address_space object. The only remaining whitelisted file system is
> + * shmem.
> + */
> + return shmem_mapping(mapping);
> +}
> +
> +/*
> * Fast-gup relies on pte change detection to avoid concurrent pgtable
> * operations.
> *
> _
>
Ack, thanks for this. I think it doesn't quite cover all cases (kernel bot
moaning on -next for mips) and needs some more fiddly #ifdef stuff; I will
spin a v9 shortly anyway to fix this up and address a few other minor
things.
On Thu, May 04, 2023 at 05:04:34PM +0200, David Hildenbrand wrote:
> [...]
>
> > +static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
> > +{
> > + struct address_space *mapping;
> > + unsigned long mapping_flags;
> > +
> > + /*
> > + * If we aren't pinning then no problematic write can occur. A long term
> > + * pin is the most egregious case so this is the one we disallow.
> > + */
> > + if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
> > + (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
> > + return true;
> > +
> > + /* The folio is pinned, so we can safely access folio fields. */
> > +
> > + /* Neither of these should be possible, but check to be sure. */
>
> You can easily have anon pages that are in the swapcache at this point
> (especially because this function is called before our unsharing checks),
> so the comment is misleading.
Ack will update.
>
> And there is nothing wrong about pinning an anon page that's still in the
> swapcache. The following folio_test_anon() check will allow them.
>
> The check made sense in page_mapping(), but here it's not required.
Waaaaaaaaaait a second, you were saying before:-
"Folios in the swap cache return the swap mapping" -- you might disallow
pinning anonymous pages that are in the swap cache.
I recall that there are corner cases where we can end up with an anon
page that's mapped writable but still in the swap cache ... so you'd
fallback to the GUP slow path (acceptable for these corner cases, I
guess), however especially the comment is a bit misleading then.
So are we allowing or disallowing pinning anon swap cache pages? :P
I mean, the slow path would allow them if they are just marked anon, so I'm
inclined to allow them.
>
> I do agree regarding folio_test_slab(), though. Should we WARN in case we
> would have one?
>
> if (WARN_ON_ONCE(folio_test_slab(folio)))
> return false;
>
God help us if we have a slab page at this point, so agreed worth doing, it
would surely have to arise from some dreadful bug/memory corruption.
> > + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> > + return false;
> > +
> > + /* hugetlb mappings do not require dirty-tracking. */
> > + if (folio_test_hugetlb(folio))
> > + return true;
> > +
> > + /*
> > + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> > + * cannot proceed, which means no actions performed under RCU can
> > + * proceed either.
> > + *
> > + * inodes and thus their mappings are freed under RCU, which means the
> > + * mapping cannot be freed beneath us and thus we can safely dereference
> > + * it.
> > + */
> > + lockdep_assert_irqs_disabled();
> > +
> > + /*
> > + * However, there may be operations which _alter_ the mapping, so ensure
> > + * we read it once and only once.
> > + */
> > + mapping = READ_ONCE(folio->mapping);
> > +
> > + /*
> > + * The mapping may have been truncated, in any case we cannot determine
> > + * if this mapping is safe - fall back to slow path to determine how to
> > + * proceed.
> > + */
> > + if (!mapping)
> > + return false;
> > +
> > + /* Anonymous folios are fine, other non-file backed cases are not. */
> > + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> > + if (mapping_flags)
> > + return mapping_flags == PAGE_MAPPING_ANON;
>
> KSM pages are also (shared) anonymous folios, and that check would fail --
> which is ok (the following unsharing check rejects long-term pinning them),
> but a bit inconsistent with your comment and folio_test_anon().
>
> It would be more consistent (with your comment and also the folio_test_anon
> implementation) to have here:
>
> return mapping_flags & PAGE_MAPPING_ANON;
>
I explicitly excluded KSM out of fear that there could be some breakage given
they're wrprotect'd + expected to CoW though? But I guess you mean they'd get
picked up
by the unshare and so it doesn't matter + we wouldn't want to exclude an
PG_anon_exclusive case?
I'll make the change in any case given the unshare check!
I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge
page probably isn't going to be CoW'd :P
> > +
> > + /*
> > + * At this point, we know the mapping is non-null and points to an
> > + * address_space object. The only remaining whitelisted file system is
> > + * shmem.
> > + */
> > + return shmem_mapping(mapping);
> > +}
> > +
>
> In general, LGTM
>
> Acked-by: David Hildenbrand <[email protected]>
>
Thanks!
Will respin, addressing your comments and the issue the kernel bot picked
up with placement in the appropriate #ifdefs, and send out a v9 shortly.
> --
> Thanks,
>
> David / dhildenb
>
On Wed, May 03, 2023 at 04:31:36PM +0200, David Hildenbrand wrote:
> On 03.05.23 00:51, Lorenzo Stoakes wrote:
> > vma_wants_writenotify() is specifically intended for setting PTE page table
> > flags, accounting for existing page table flag state and whether the
> > filesystem performs dirty tracking.
> >
> > Separate out the notions of dirty tracking and PTE write notify checking in
> > order that we can invoke the dirty tracking check from elsewhere.
> >
> > Note that this change introduces a very small duplicate check of the
> > separated out vm_ops_needs_writenotify() and vma_is_shared_writable()
> > functions. This is necessary to avoid making vma_needs_dirty_tracking()
> > needlessly complicated (e.g. passing flags or having it assume checks were
> > already performed). This is small enough that it doesn't seem too
> > egregious.
> >
> > We check to ensure the mapping is shared writable, as any GUP caller will
> > be safe - MAP_PRIVATE mappings will be CoW'd and read-only file-backed
> > shared mappings are not permitted access, even with FOLL_FORCE.
> >
> > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > Reviewed-by: John Hubbard <[email protected]>
> > Reviewed-by: Mika Penttilä <[email protected]>
> > Reviewed-by: Jan Kara <[email protected]>
> > Reviewed-by: Jason Gunthorpe <[email protected]>
> > ---
> > include/linux/mm.h | 1 +
> > mm/mmap.c | 53 ++++++++++++++++++++++++++++++++++------------
> > 2 files changed, 41 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 27ce77080c79..7b1d4e7393ef 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> > MM_CP_UFFD_WP_RESOLVE)
> > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
> > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> > {
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 5522130ae606..fa7442e44cc2 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1475,6 +1475,42 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
> > }
> > #endif /* __ARCH_WANT_SYS_OLD_MMAP */
> > +/* Do VMA operations imply write notify is required? */
>
> Nit: comment is superfluous, this is already self-documenting code.
>
> > +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
> > +{
> > + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
> > +}
> > +
> > +/* Is this VMA shared and writable? */
>
> Nit: dito
>
> > +static bool vma_is_shared_writable(struct vm_area_struct *vma)
> > +{
> > + return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
> > + (VM_WRITE | VM_SHARED);
> > +}
> > +
> > +/*
> > + * Does this VMA require the underlying folios to have their dirty state
> > + * tracked?
> > + */
>
> Nit: dito
>
Ack, was just trying to follow the pattern of comments on these helpers, but
you're right, these aren't adding anything, will strip.
> > +bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
> > +{
> > + /* Only shared, writable VMAs require dirty tracking. */
> > + if (!vma_is_shared_writable(vma))
> > + return false;
> > +
> > + /* Does the filesystem need to be notified? */
> > + if (vm_ops_needs_writenotify(vma->vm_ops))
> > + return true;
> > +
> > + /* Specialty mapping? */
> > + if (vma->vm_flags & VM_PFNMAP)
> > + return false;
> > +
> > + /* Can the mapping track the dirty pages? */
> > + return vma->vm_file && vma->vm_file->f_mapping &&
> > + mapping_can_writeback(vma->vm_file->f_mapping);
> > +}
> > +
> > /*
> > * Some shared mappings will want the pages marked read-only
> > * to track write events. If so, we'll downgrade vm_page_prot
> > @@ -1483,21 +1519,18 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
> > */
> > int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> > {
> > - vm_flags_t vm_flags = vma->vm_flags;
> > - const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > -
> > /* If it was private or non-writable, the write bit is already clear */
> > - if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> > + if (!vma_is_shared_writable(vma))
> > return 0;
> > /* The backer wishes to know when pages are first written to? */
> > - if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> > + if (vm_ops_needs_writenotify(vma->vm_ops))
> > return 1;
> > /* The open routine did something to the protections that pgprot_modify
> > * won't preserve? */
> > if (pgprot_val(vm_page_prot) !=
> > - pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
> > + pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags)))
> > return 0;
> > /*
> > @@ -1511,13 +1544,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> > if (userfaultfd_wp(vma))
> > return 1;
> > - /* Specialty mapping? */
> > - if (vm_flags & VM_PFNMAP)
> > - return 0;
> > -
> > - /* Can the mapping track the dirty pages? */
> > - return vma->vm_file && vma->vm_file->f_mapping &&
> > - mapping_can_writeback(vma->vm_file->f_mapping);
> > + return vma_needs_dirty_tracking(vma);
> > }
> > /*
>
> We now have duplicate vma_is_shared_writable() and
> vm_ops_needs_writenotify() checks ...
>
Yes, this is noted in the commit message.
>
> Maybe move the VM_PFNMAP and "/* Can the mapping track the dirty pages? */"
> checks into a separate helper and call that from both,
> vma_wants_writenotify() and vma_needs_dirty_tracking() ?
I'll try to juggle it a bit more, the whole reason I'm doing these very
annoying duplications is because of the ordering and precedence of the
checks in both and wanting to avoid some hideous passing of flags or
splitting into too many bits or returning a non-bool value etc.
Will try to improve it in respin.
>
>
> In any case
>
> Acked-by: David Hildenbrand <[email protected]>
>
Thanks!
> --
> Thanks,
>
> David / dhildenb
>
>> And there is nothing wrong about pinning an anon page that's still in the
>> swapcache. The following folio_test_anon() check will allow them.
>>
>> The check made sense in page_mapping(), but here it's not required.
>
> Waaaaaaaaaait a second, you were saying before:-
>
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
>
> I recall that there are corner cases where we can end up with an anon
> page that's mapped writable but still in the swap cache ... so you'd
> fallback to the GUP slow path (acceptable for these corner cases, I
> guess), however especially the comment is a bit misleading then.
>
> So are we allowing or disallowing pinning anon swap cache pages? :P
If we have an exclusive anon page that's still in the swap cache, sure! :)
I think there are ways that can be done, and nothing would actually
break. (I even wrote selftests in the cow selftests for that to make
sure it works as expected)
>
> I mean slow path would allow them if they are just marked anon so I'm inclined
> to allow them.
Exactly my reasoning.
The less checks the better (especially if ordinary GUP just allows for
pinning it) :)
>
>>
>> I do agree regarding folio_test_slab(), though. Should we WARN in case we
>> would have one?
>>
>> if (WARN_ON_ONCE(folio_test_slab(folio)))
>> return false;
>>
>
> God help us if we have a slab page at this point, so agreed worth doing, it
> would surely have to arise from some dreadful bug/memory corruption.
>
Or some nasty race condition that we managed to ignore with rechecking
if the PTEs/PMDs changed :)
>>> + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
>>> + return false;
>>> +
>>> + /* hugetlb mappings do not require dirty-tracking. */
>>> + if (folio_test_hugetlb(folio))
>>> + return true;
>>> +
>>> + /*
>>> + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
>>> + * cannot proceed, which means no actions performed under RCU can
>>> + * proceed either.
>>> + *
>>> + * inodes and thus their mappings are freed under RCU, which means the
>>> + * mapping cannot be freed beneath us and thus we can safely dereference
>>> + * it.
>>> + */
>>> + lockdep_assert_irqs_disabled();
>>> +
>>> + /*
>>> + * However, there may be operations which _alter_ the mapping, so ensure
>>> + * we read it once and only once.
>>> + */
>>> + mapping = READ_ONCE(folio->mapping);
>>> +
>>> + /*
>>> + * The mapping may have been truncated, in any case we cannot determine
>>> + * if this mapping is safe - fall back to slow path to determine how to
>>> + * proceed.
>>> + */
>>> + if (!mapping)
>>> + return false;
>>> +
>>> + /* Anonymous folios are fine, other non-file backed cases are not. */
>>> + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
>>> + if (mapping_flags)
>>> + return mapping_flags == PAGE_MAPPING_ANON;
>>
>> KSM pages are also (shared) anonymous folios, and that check would fail --
>> which is ok (the following unsharing check rejects long-term pinning them),
>> but a bit inconsistent with your comment and folio_test_anon().
>>
>> It would be more consistent (with your comment and also the folio_test_anon
>> implementation) to have here:
>>
>> return mapping_flags & PAGE_MAPPING_ANON;
>>
>
> I explicitly excluded KSM out of fear that could be some breakage given they're
> wrprotect'd + expected to CoW though? But I guess you mean they'd get picked up
> by the unshare and so it doesn't matter + we wouldn't want to exclude an
> PG_anon_exclusive case?
Yes, unsharing handles that in the ordinary GUP and GUP-fast case. And
unsharing is neither GUP-fast nor even longterm specific (for anon pages).
Reason I'm bringing this up is that I think it's best if we let
folio_fast_pin_allowed() just check for what's absolutely GUP-fast specific.
>
> I'll make the change in any case given the unshare check!
>
> I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge
> page probably isn't going to be CoW'd :P
I spotted exactly the same thing and wondered about that (after all I
added all that unsharing logic ... so I should know). I'm sure there
must be a reason I didn't add it ;)
... probably we should just add it even though it might essentially be
dead code for now (at least the cow selftests would try with each and
every hugetlb size and eventually reveal the problem on whatever arch
ends up using that code ... ).
Do you want to send a patch to add unsharing to gup_huge_pgd() as well?
--
Thanks,
David / dhildenb
On Fri, May 05, 2023 at 04:17:38PM +0200, David Hildenbrand wrote:
> > > And there is nothing wrong about pinning an anon page that's still in the
> > > swapcache. The following folio_test_anon() check will allow them.
> > >
> > > The check made sense in page_mapping(), but here it's not required.
> >
> > Waaaaaaaaaait a second, you were saying before:-
> >
> > "Folios in the swap cache return the swap mapping" -- you might disallow
> > pinning anonymous pages that are in the swap cache.
> >
> > I recall that there are corner cases where we can end up with an anon
> > page that's mapped writable but still in the swap cache ... so you'd
> > fallback to the GUP slow path (acceptable for these corner cases, I
> > guess), however especially the comment is a bit misleading then.
> >
> > So are we allowing or disallowing pinning anon swap cache pages? :P
>
> If we have an exclusive anon page that's still in the swap cache, sure! :)
>
> I think there are ways that can be done, and nothing would actually break.
> (I even wrote selftests in the cow selftests for that to make sure it works
> as expected)
>
> >
> > I mean slow path would allow them if they are just marked anon so I'm inclined
> > to allow them.
>
> Exactly my reasoning.
>
> The less checks the better (especially if ordinary GUP just allows for
> pinning it) :)
Yeah, a lot of my decision making on this has been trying to be conservative
about what we filter for, but you get this weird inversion whereby if you're
too conservative about what you allow, you are actually being more
outlandish about what you disallow and vice-versa.
>
> >
> > >
> > > I do agree regarding folio_test_slab(), though. Should we WARN in case we
> > > would have one?
> > >
> > > if (WARN_ON_ONCE(folio_test_slab(folio)))
> > > return false;
> > >
> >
> > God help us if we have a slab page at this point, so agreed worth doing, it
> > would surely have to arise from some dreadful bug/memory corruption.
> >
>
> Or some nasty race condition that we managed to ignore with rechecking if
> the PTEs/PMDs changed :)
Well that should be sorted now :)
>
> > > > + if (unlikely(folio_test_slab(folio) || folio_test_swapcache(folio)))
> > > > + return false;
> > > > +
> > > > + /* hugetlb mappings do not require dirty-tracking. */
> > > > + if (folio_test_hugetlb(folio))
> > > > + return true;
> > > > +
> > > > + /*
> > > > + * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> > > > + * cannot proceed, which means no actions performed under RCU can
> > > > + * proceed either.
> > > > + *
> > > > + * inodes and thus their mappings are freed under RCU, which means the
> > > > + * mapping cannot be freed beneath us and thus we can safely dereference
> > > > + * it.
> > > > + */
> > > > + lockdep_assert_irqs_disabled();
> > > > +
> > > > + /*
> > > > + * However, there may be operations which _alter_ the mapping, so ensure
> > > > + * we read it once and only once.
> > > > + */
> > > > + mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > + /*
> > > > + * The mapping may have been truncated, in any case we cannot determine
> > > > + * if this mapping is safe - fall back to slow path to determine how to
> > > > + * proceed.
> > > > + */
> > > > + if (!mapping)
> > > > + return false;
> > > > +
> > > > + /* Anonymous folios are fine, other non-file backed cases are not. */
> > > > + mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
> > > > + if (mapping_flags)
> > > > + return mapping_flags == PAGE_MAPPING_ANON;
> > >
> > > KSM pages are also (shared) anonymous folios, and that check would fail --
> > > which is ok (the following unsharing check rejects long-term pinning them),
> > > but a bit inconsistent with your comment and folio_test_anon().
> > >
> > > It would be more consistent (with your comment and also the folio_test_anon
> > > implementation) to have here:
> > >
> > > return mapping_flags & PAGE_MAPPING_ANON;
> > >
> >
> > I explicitly excluded KSM out of fear that could be some breakage given they're
> > wrprotect'd + expected to CoW though? But I guess you mean they'd get picked up
> > by the unshare and so it doesn't matter + we wouldn't want to exclude an
> > PG_anon_exclusive case?
>
> Yes, unsharing handles that in the ordinary GUP and GUP-fast case. And
> unsharing is neither GUP-fast nor even longterm specific (for anon pages).
>
> Reason I'm bringing this up is that I think it's best if we let
> folio_fast_pin_allowed() just check for what's absolutely GUP-fast specific.
Ack, indeed it's a separate thing, see above for the contradictory nature
of wanting to be cautious but then accidentally making your change _more
radical_ than you intended...!
>
> >
> > I'll make the change in any case given the unshare check!
> >
> > I notice the gup_huge_pgd() doesn't do an unshare but I mean, a PGD-sized huge
> > page probably isn't going to be CoW'd :P
>
> I spotted exactly the same thing and wondered about that (after all I added
> all that unsharing logic ... so I should know). I'm sure there must be a
> reason I didn't add it ;)
>
> ... probably we should just add it even though it might essentially be dead
> code for now (at least the cow selftests would try with each and every
> hugetlb size and eventually reveal the problem on whatever arch ends up
> using that code ... ).
>
> Do you want to send a patch to add unsharing to gup_huge_pgd() as well?
>
Sure will do!
> --
> Thanks,
>
> David / dhildenb
>