2023-05-01 23:14:32

by Lorenzo Stoakes

[permalink] [raw]
Subject: [PATCH v6 0/3] mm/gup: disallow GUP writing to file-backed mappings by default

Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).

This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.

v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
correctly.
- Split patch series into 3 to make it more digestible.

v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/

v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
reduce duplication and update to use this in the GUP check. Note that
both separately check vm_ops_needs_writenotify() as the latter needs to
test this before the vm_pgprot_modify() test, resulting in
vma_wants_writenotify() checking this twice, however it is such a small
check this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/

v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
vma_wants_writenotify() could not be used directly as it is very much focused
on determining if the PTE r/w should be set (e.g. assuming private mapping
does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/

v2:
- Add accidentally excluded ptrace_access_vm() use of
FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/

v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/


Lorenzo Stoakes (3):
mm/mmap: separate writenotify and dirty tracking logic
mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
mappings
mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
mappings

include/linux/mm.h | 1 +
mm/gup.c | 128 +++++++++++++++++++++++++++++++++++++++++++--
mm/mmap.c | 36 +++++++++----
3 files changed, 153 insertions(+), 12 deletions(-)

--
2.40.1


2023-05-01 23:15:07

by Lorenzo Stoakes

[permalink] [raw]
Subject: [PATCH v6 1/3] mm/mmap: separate writenotify and dirty tracking logic

vma_wants_writenotify() is specifically intended for setting PTE page table
flags, accounting for existing PTE flag state and whether that might
already be read-only while mixing this check with a check whether the
filesystem performs dirty tracking.

Separate out the notions of dirty tracking and a PTE write notify checking
in order that we can invoke the dirty tracking check from elsewhere.

Note that this change introduces a very small duplicate check of the
separated out vm_ops_needs_writenotify(). This is necessary to avoid making
vma_needs_dirty_tracking() needlessly complicated (e.g. passing a
check_writenotify flag or having it assume this check was already
performed). This is such a small check that it doesn't seem too egregious
to do this.

Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
include/linux/mm.h | 1 +
mm/mmap.c | 36 +++++++++++++++++++++++++++---------
2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..7b1d4e7393ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)

+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index 5522130ae606..295c5f2e9bd9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1475,6 +1475,31 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */

+/* Do VMA operations imply write notify is required? */
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+ return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+ /* Does the filesystem need to be notified? */
+ if (vm_ops_needs_writenotify(vma->vm_ops))
+ return true;
+
+ /* Specialty mapping? */
+ if (vma->vm_flags & VM_PFNMAP)
+ return false;
+
+ /* Can the mapping track the dirty pages? */
+ return vma->vm_file && vma->vm_file->f_mapping &&
+ mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
/*
* Some shared mappings will want the pages marked read-only
* to track write events. If so, we'll downgrade vm_page_prot
@@ -1484,14 +1509,13 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
{
vm_flags_t vm_flags = vma->vm_flags;
- const struct vm_operations_struct *vm_ops = vma->vm_ops;

/* If it was private or non-writable, the write bit is already clear */
if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
return 0;

/* The backer wishes to know when pages are first written to? */
- if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+ if (vm_ops_needs_writenotify(vma->vm_ops))
return 1;

/* The open routine did something to the protections that pgprot_modify
@@ -1511,13 +1535,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
if (userfaultfd_wp(vma))
return 1;

- /* Specialty mapping? */
- if (vm_flags & VM_PFNMAP)
- return 0;
-
- /* Can the mapping track the dirty pages? */
- return vma->vm_file && vma->vm_file->f_mapping &&
- mapping_can_writeback(vma->vm_file->f_mapping);
+ return vma_needs_dirty_tracking(vma);
}

/*
--
2.40.1

2023-05-01 23:15:15

by Lorenzo Stoakes

[permalink] [raw]
Subject: [PATCH v6 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings

Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings do not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no
write notify will occur, and if the caller does mark the folio dirty, this
will be done so unexpectedly.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
(though it does not have to).

This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).

This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.

This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.

Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.

However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.

These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.

This patch changes only the slow-path GUP functions, a following patch
adapts the GUP-fast path along similar lines.

[1]:https://lore.kernel.org/linux-mm/[email protected]/

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Mika Penttilä <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..0f09dec0906c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
return 0;
}

+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * do not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ * the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ * the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ * direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ * (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
+ unsigned long gup_flags)
+{
+ /* If we aren't pinning then no problematic write can occur. */
+ if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
+ return true;
+
+ /* We limit this check to the most egregious case - a long term pin. */
+ if (!(gup_flags & FOLL_LONGTERM))
+ return true;
+
+ /* If the VMA requires dirty tracking then GUP will be problematic. */
+ return vma_needs_dirty_tracking(vma);
+}
+
static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
vm_flags_t vm_flags = vma->vm_flags;
int write = (gup_flags & FOLL_WRITE);
int foreign = (gup_flags & FOLL_REMOTE);
+ bool vma_anon = vma_is_anonymous(vma);

if (vm_flags & (VM_IO | VM_PFNMAP))
return -EFAULT;

- if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+ if ((gup_flags & FOLL_ANON) && !vma_anon)
return -EFAULT;

if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1013,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
return -EFAULT;

if (write) {
+ if (!vma_anon &&
+ !writeable_file_mapping_allowed(vma, gup_flags))
+ return -EFAULT;
+
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
--
2.40.1

2023-05-01 23:17:10

by Lorenzo Stoakes

[permalink] [raw]
Subject: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous and then whitelist known-good mappings - specifically hugetlb
and shmem mappings.

While we obtain a stable folio for this check, the mapping might not be, as
a truncate could nullify it at any time. Since doing so requires mappings
to be zapped, we can synchronise against a TLB shootdown operation.

For some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. However, other architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.

In these instances, we acquire an RCU lock while performing our checks. If
we cannot get a stable mapping, we fall back to the slow path, as otherwise
we'd have to walk the page tables again and it's simpler and more effective
to just fall back.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.

Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Kirill A . Shutemov <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0f09dec0906c..431618048a03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
#include <linux/migrate.h>
#include <linux/mm_inline.h>
#include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>

#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
return folio;
}

+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+static bool stabilise_mapping_rcu(struct folio *folio)
+{
+ struct address_space *mapping = READ_ONCE(folio->mapping);
+
+ rcu_read_lock();
+
+ return mapping == READ_ONCE(folio->mapping);
+}
+
+static void unlock_rcu(void)
+{
+ rcu_read_unlock();
+}
+#else
+static bool stabilise_mapping_rcu(struct folio *)
+{
+ return true;
+}
+
+static void unlock_rcu(void)
+{
+}
+#endif
+
+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ *
+ * The folio is stable, but the mapping might not be. When truncating for
+ * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
+ * so we are safe from an IPI, but some architectures use an RCU lock for this
+ * operation, so we acquire an RCU lock to ensure the mapping is stable.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+ bool ret;
+
+ /* hugetlb mappings do not require dirty tracking. */
+ if (folio_test_hugetlb(folio))
+ return true;
+
+ if (stabilise_mapping_rcu(folio)) {
+ struct address_space *mapping = folio_mapping(folio);
+
+ /*
+ * Neither anonymous nor shmem-backed folios require
+ * dirty tracking.
+ */
+ ret = folio_test_anon(folio) ||
+ (mapping && shmem_mapping(mapping));
+ } else {
+ /* If the mapping is unstable, fallback to the slow path. */
+ ret = false;
+ }
+
+ unlock_rcu();
+
+ return ret;
+}
+
/**
* try_grab_folio() - Attempt to get or pin a folio.
* @page: pointer to page to be grabbed
@@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
*/
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
+ bool is_longterm = flags & FOLL_LONGTERM;
+
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return NULL;

@@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
* right zone, so fail and let the caller fall back to the slow
* path.
*/
- if (unlikely((flags & FOLL_LONGTERM) &&
- !is_longterm_pinnable_page(page)))
+ if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
return NULL;

/*
@@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
if (!folio)
return NULL;

+ /*
+ * Can this folio be safely pinned? We need to perform this
+ * check after the folio is stabilised.
+ */
+ if ((flags & FOLL_WRITE) && is_longterm &&
+ !folio_longterm_write_pin_allowed(folio)) {
+ folio_put_refs(folio, refs);
+ return NULL;
+ }
+
/*
* When pinning a large folio, use an exact count to track it.
*
--
2.40.1

2023-05-02 00:00:37

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/1/23 16:11, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
>
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
>
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
>
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Kirill A . Shutemov <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 85 insertions(+), 2 deletions(-)
>

Hi Lorenzo,

I am unable to find anything wrong with this patch, despite poring
over it and fretting over IPI vs. RCU cases. :)

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA


> diff --git a/mm/gup.c b/mm/gup.c
> index 0f09dec0906c..431618048a03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/mmu_context.h>
> #include <asm/tlbflush.h>
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> return folio;
> }
>
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> + struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> + rcu_read_lock();
> +
> + return mapping == READ_ONCE(folio->mapping);
> +}
> +
> +static void unlock_rcu(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> + return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif
> +
> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> + bool ret;
> +
> + /* hugetlb mappings do not require dirty tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + if (stabilise_mapping_rcu(folio)) {
> + struct address_space *mapping = folio_mapping(folio);
> +
> + /*
> + * Neither anonymous nor shmem-backed folios require
> + * dirty tracking.
> + */
> + ret = folio_test_anon(folio) ||
> + (mapping && shmem_mapping(mapping));
> + } else {
> + /* If the mapping is unstable, fallback to the slow path. */
> + ret = false;
> + }
> +
> + unlock_rcu();
> +
> + return ret;
> +}
> +
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> */
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> + bool is_longterm = flags & FOLL_LONGTERM;
> +
> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> return NULL;
>
> @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> * right zone, so fail and let the caller fall back to the slow
> * path.
> */
> - if (unlikely((flags & FOLL_LONGTERM) &&
> - !is_longterm_pinnable_page(page)))
> + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> return NULL;
>
> /*
> @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> if (!folio)
> return NULL;
>
> + /*
> + * Can this folio be safely pinned? We need to perform this
> + * check after the folio is stabilised.
> + */
> + if ((flags & FOLL_WRITE) && is_longterm &&
> + !folio_longterm_write_pin_allowed(folio)) {
> + folio_put_refs(folio, refs);
> + return NULL;
> + }
> +
> /*
> * When pinning a large folio, use an exact count to track it.
> *


2023-05-02 03:38:52

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Hi Lorenzo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-mmap-separate-writenotify-and-dirty-tracking-logic/20230502-071520
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/dee4f4ad6532b0f94d073da263526de334d5d7e0.1682981880.git.lstoakes%40gmail.com
patch subject: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
config: arm-randconfig-r013-20230502 (https://download.01.org/0day-ci/archive/20230502/[email protected]/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project b1465cd49efcbc114a75220b153f5a055ce7911f)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/intel-lab-lkp/linux/commit/602bb9fab888bd9176b8eeb80a0da68499c343ed
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Lorenzo-Stoakes/mm-mmap-separate-writenotify-and-dirty-tracking-logic/20230502-071520
git checkout 602bb9fab888bd9176b8eeb80a0da68499c343ed
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> mm/gup.c:114:49: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions]
static bool stabilise_mapping_rcu(struct folio *)
^
1 warning generated.


vim +114 mm/gup.c

108
109 static void unlock_rcu(void)
110 {
111 rcu_read_unlock();
112 }
113 #else
> 114 static bool stabilise_mapping_rcu(struct folio *)
115 {
116 return true;
117 }
118

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-05-02 07:56:29

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 11:33:29AM +0800, kernel test robot wrote:
> Hi Lorenzo,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-mmap-separate-writenotify-and-dirty-tracking-logic/20230502-071520
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/dee4f4ad6532b0f94d073da263526de334d5d7e0.1682981880.git.lstoakes%40gmail.com
> patch subject: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
> config: arm-randconfig-r013-20230502 (https://download.01.org/0day-ci/archive/20230502/[email protected]/config)
> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project b1465cd49efcbc114a75220b153f5a055ce7911f)
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install arm cross compiling tool for clang build
> # apt-get install binutils-arm-linux-gnueabi
> # https://github.com/intel-lab-lkp/linux/commit/602bb9fab888bd9176b8eeb80a0da68499c343ed
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Lorenzo-Stoakes/mm-mmap-separate-writenotify-and-dirty-tracking-logic/20230502-071520
> git checkout 602bb9fab888bd9176b8eeb80a0da68499c343ed
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm olddefconfig
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
> >> mm/gup.c:114:49: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions]
> static bool stabilise_mapping_rcu(struct folio *)
> ^
> 1 warning generated.
>
>
> vim +114 mm/gup.c
>
> 108
> 109 static void unlock_rcu(void)
> 110 {
> 111 rcu_read_unlock();
> 112 }
> 113 #else
> > 114 static bool stabilise_mapping_rcu(struct folio *)
> 115 {
> 116 return true;
> 117 }
> 118
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests

Am slightly surprised that's a C2x extension :)

-fix patch attached

----8<----
From 829cb87ddd84e4496310b12d26e65de11d5b9e53 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <[email protected]>
Date: Tue, 2 May 2023 08:40:48 +0100
Subject: [PATCH] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings

Fix compiler warning.

Signed-off-by: Lorenzo Stoakes <[email protected]>
---
mm/gup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 431618048a03..ffc3caf120fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -111,7 +111,7 @@ static void unlock_rcu(void)
rcu_read_unlock();
}
#else
-static bool stabilise_mapping_rcu(struct folio *)
+static bool stabilise_mapping_rcu(struct folio *folio)
{
return true;
}
--
2.40.1

2023-05-02 11:17:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> return folio;
> }
>
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> + struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> + rcu_read_lock();
> +
> + return mapping == READ_ONCE(folio->mapping);

This doesn't make sense; why bother reading the same thing twice?

Who cares if the thing changes from before; what you care about is that
the value you see has stable storage, this doesn't help with that.

> +}
> +
> +static void unlock_rcu(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> + return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif

Anyway, this all can go away. RCU can't progress while you have
interrupts disabled anyway.

> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> + bool ret;
> +
> + /* hugetlb mappings do not require dirty tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +

This:

> + if (stabilise_mapping_rcu(folio)) {
> + struct address_space *mapping = folio_mapping(folio);

And this is 3rd read of folio->mapping, just for giggles?

> +
> + /*
> + * Neither anonymous nor shmem-backed folios require
> + * dirty tracking.
> + */
> + ret = folio_test_anon(folio) ||
> + (mapping && shmem_mapping(mapping));
> + } else {
> + /* If the mapping is unstable, fallback to the slow path. */
> + ret = false;
> + }
> +
> + unlock_rcu();
> +
> + return ret;

then becomes:


if (folio_test_anon(folio))
return true;

/*
* Having IRQs disabled (as per GUP-fast) also inhibits RCU
* grace periods from making progress, IOW. they imply
* rcu_read_lock().
*/
lockdep_assert_irqs_disabled();

/*
* Inodes and thus address_space are RCU freed and thus safe to
* access at this point.
*/
mapping = folio_mapping(folio);
if (mapping && shmem_mapping(mapping))
return true;

return false;

> +}

2023-05-02 11:22:35

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue 02-05-23 00:11:49, Lorenzo Stoakes wrote:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.
>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
>
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
>
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
>
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Kirill A . Shutemov <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 0f09dec0906c..431618048a03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/mmu_context.h>
> #include <asm/tlbflush.h>
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> return folio;
> }
>
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> + struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> + rcu_read_lock();
> +
> + return mapping == READ_ONCE(folio->mapping);
> +}
> +
> +static void unlock_rcu(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> + return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif

So I wonder is this complexity worth it over just using rcu_read_lock()
unconditionally? It is just a compilation barrier AFAIK...

Also stabilise_mapping_rcu() seems to be a bit of a misnomer since the
mapping is not stable after the function is called. Also the return value
seems a bit pointless to me since we have to check folio_mapping() for
being != NULL anyway. All in all I'd say that:

struct address_space *mapping;

/* Make sure mapping cannot be freed under our hands */
rcu_read_lock();
mapping = folio_mapping(folio);
ret = folio_test_anon(folio) || (mapping && shmem_mapping(mapping));
rcu_read_unlock();

looks more comprehensible...

Honza

> +
> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> + bool ret;
> +
> + /* hugetlb mappings do not require dirty tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + if (stabilise_mapping_rcu(folio)) {
> + struct address_space *mapping = folio_mapping(folio);
> +
> + /*
> + * Neither anonymous nor shmem-backed folios require
> + * dirty tracking.
> + */
> + ret = folio_test_anon(folio) ||
> + (mapping && shmem_mapping(mapping));
> + } else {
> + /* If the mapping is unstable, fallback to the slow path. */
> + ret = false;
> + }
> +
> + unlock_rcu();
> +
> + return ret;
> +}
> +
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> */
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> + bool is_longterm = flags & FOLL_LONGTERM;
> +
> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> return NULL;
>
> @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> * right zone, so fail and let the caller fall back to the slow
> * path.
> */
> - if (unlikely((flags & FOLL_LONGTERM) &&
> - !is_longterm_pinnable_page(page)))
> + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> return NULL;
>
> /*
> @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> if (!folio)
> return NULL;
>
> + /*
> + * Can this folio be safely pinned? We need to perform this
> + * check after the folio is stabilised.
> + */
> + if ((flags & FOLL_WRITE) && is_longterm &&
> + !folio_longterm_write_pin_allowed(folio)) {
> + folio_put_refs(folio, refs);
> + return NULL;
> + }
> +
> /*
> * When pinning a large folio, use an exact count to track it.
> *
> --
> 2.40.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-05-02 11:26:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue 02-05-23 13:13:34, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > +
> > + /*
> > + * Neither anonymous nor shmem-backed folios require
> > + * dirty tracking.
> > + */
> > + ret = folio_test_anon(folio) ||
> > + (mapping && shmem_mapping(mapping));
> > + } else {
> > + /* If the mapping is unstable, fallback to the slow path. */
> > + ret = false;
> > + }
> > +
> > + unlock_rcu();
> > +
> > + return ret;
>
> then becomes:
>
>
> if (folio_test_anon(folio))
> return true;
>
> /*
> * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> * grace periods from making progress, IOW. they imply
> * rcu_read_lock().
> */
> lockdep_assert_irqs_disabled();
>
> /*
> * Inodes and thus address_space are RCU freed and thus safe to
> * access at this point.
> */
> mapping = folio_mapping(folio);
> if (mapping && shmem_mapping(mapping))
> return true;
>
> return false;

Yeah, that's even better.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-05-02 11:27:31

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > return folio;
> > }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > + rcu_read_lock();
> > +
> > + return mapping == READ_ONCE(folio->mapping);
>
> This doesn't make sense; why bother reading the same thing twice?

The intent is to see whether the folio->mapping has been truncated from
underneath us, as per the futex code that Kirill referred to which does
something similar [1].

>
> Who cares if the thing changes from before; what you care about is that
> the value you see has stable storage, this doesn't help with that.
>
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > + rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > + return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
>
> Anyway, this all can go away. RCU can't progress while you have
> interrupts disabled anyway.

There seems to be other code in the kernel that assumes that this is not
the case, i.e. the futex code, though not sure if that's being run with
IRQs disabled... if not and it's absolutely certain that we need no special
handling for the RCU case, then happy days and more than glad to remove
this bit.

I'm far from an expert on RCU (I need to gain a better understanding of it)
so I'm deferring how best to proceed on _this part_ to the community.

>
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > + bool ret;
> > +
> > + /* hugetlb mappings do not require dirty tracking. */
> > + if (folio_test_hugetlb(folio))
> > + return true;
> > +
>
> This:
>
> > + if (stabilise_mapping_rcu(folio)) {
> > + struct address_space *mapping = folio_mapping(folio);
>
> And this is 3rd read of folio->mapping, just for giggles?

I like to giggle :)

Actually this is to handle the various cases in which the mapping might not
be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
have a helper exposed for a check. Given previous review about duplication
I felt best to reuse this even though it does access again... yes I felt
weird about doing that.

>
> > +
> > + /*
> > + * Neither anonymous nor shmem-backed folios require
> > + * dirty tracking.
> > + */
> > + ret = folio_test_anon(folio) ||
> > + (mapping && shmem_mapping(mapping));
> > + } else {
> > + /* If the mapping is unstable, fallback to the slow path. */
> > + ret = false;
> > + }
> > +
> > + unlock_rcu();
> > +
> > + return ret;
>
> then becomes:
>
>
> if (folio_test_anon(folio))
> return true;

This relies on the mapping so belongs below the lockdep assert imo.

>
> /*
> * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> * grace periods from making progress, IOW. they imply
> * rcu_read_lock().
> */
> lockdep_assert_irqs_disabled();
>
> /*
> * Inodes and thus address_space are RCU freed and thus safe to
> * access at this point.
> */
> mapping = folio_mapping(folio);
> if (mapping && shmem_mapping(mapping))
> return true;
>
> return false;
>
> > +}

I'm more than happy to do this (I'd rather drop the RCU bits if possible)
but need to be sure it's safe.

2023-05-02 11:30:49

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > return folio;
> > > }
> > >
> > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > +{
> > > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > > +
> > > + rcu_read_lock();
> > > +
> > > + return mapping == READ_ONCE(folio->mapping);
> >
> > This doesn't make sense; why bother reading the same thing twice?
>
> The intent is to see whether the folio->mapping has been truncated from
> underneath us, as per the futex code that Kirill referred to which does
> something similar [1].
>
> >
> > Who cares if the thing changes from before; what you care about is that
> > the value you see has stable storage, this doesn't help with that.
> >
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > + rcu_read_unlock();
> > > +}
> > > +#else
> > > +static bool stabilise_mapping_rcu(struct folio *)
> > > +{
> > > + return true;
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +}
> > > +#endif
> >
> > Anyway, this all can go away. RCU can't progress while you have
> > interrupts disabled anyway.
>
> There seems to be other code in the kernel that assumes that this is not
> the case, i.e. the futex code, though not sure if that's being run with
> IRQs disabled... if not and it's absolutely certain that we need no special
> handling for the RCU case, then happy days and more than glad to remove
> this bit.
>
> I'm far from an expert on RCU (I need to gain a better understanding of it)
> so I'm deferring how best to proceed on _this part_ to the community.
>
> >
> > > +/*
> > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > + *
> > > + * This assumes the folio is stable and pinned.
> > > + *
> > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > + * so.
> > > + *
> > > + * This function cannot be as thorough as that one as the VMA is not available
> > > + * in the fast path, so instead we whitelist known good cases.
> > > + *
> > > + * The folio is stable, but the mapping might not be. When truncating for
> > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > + */
> > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > +{
> > > + bool ret;
> > > +
> > > + /* hugetlb mappings do not require dirty tracking. */
> > > + if (folio_test_hugetlb(folio))
> > > + return true;
> > > +
> >
> > This:
> >
> > > + if (stabilise_mapping_rcu(folio)) {
> > > + struct address_space *mapping = folio_mapping(folio);
> >
> > And this is 3rd read of folio->mapping, just for giggles?
>
> I like to giggle :)
>
> Actually this is to handle the various cases in which the mapping might not
> be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> have a helper exposed for a check. Given previous review about duplication
> I felt best to reuse this even though it does access again... yes I felt
> weird about doing that.
>
> >
> > > +
> > > + /*
> > > + * Neither anonymous nor shmem-backed folios require
> > > + * dirty tracking.
> > > + */
> > > + ret = folio_test_anon(folio) ||
> > > + (mapping && shmem_mapping(mapping));
> > > + } else {
> > > + /* If the mapping is unstable, fallback to the slow path. */
> > > + ret = false;
> > > + }
> > > +
> > > + unlock_rcu();
> > > +
> > > + return ret;
> >
> > then becomes:
> >
> >
> > if (folio_test_anon(folio))
> > return true;
>
> This relies on the mapping so belongs below the lockdep assert imo.
>
> >
> > /*
> > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > * grace periods from making progress, IOW. they imply
> > * rcu_read_lock().
> > */
> > lockdep_assert_irqs_disabled();
> >
> > /*
> > * Inodes and thus address_space are RCU freed and thus safe to
> > * access at this point.
> > */
> > mapping = folio_mapping(folio);
> > if (mapping && shmem_mapping(mapping))
> > return true;
> >
> > return false;
> >
> > > +}
>
> I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> but need to be sure it's safe.

Sorry forgot to include the [1]

[1]:https://lore.kernel.org/all/[email protected]/

2023-05-02 12:12:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > return folio;
> > > }
> > >
> > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > +{
> > > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > > +
> > > + rcu_read_lock();
> > > +
> > > + return mapping == READ_ONCE(folio->mapping);
> >
> > This doesn't make sense; why bother reading the same thing twice?
>
> The intent is to see whether the folio->mapping has been truncated from
> underneath us, as per the futex code that Kirill referred to which does
> something similar [1].

Yeah, but per that 3rd load you got nothing here. Also that futex code
did the early load to deal with the !mapping case, but you're not doing
that.

> > Who cares if the thing changes from before; what you care about is that
> > the value you see has stable storage, this doesn't help with that.
> >
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > + rcu_read_unlock();
> > > +}
> > > +#else
> > > +static bool stabilise_mapping_rcu(struct folio *)
> > > +{
> > > + return true;
> > > +}
> > > +
> > > +static void unlock_rcu(void)
> > > +{
> > > +}
> > > +#endif
> >
> > Anyway, this all can go away. RCU can't progress while you have
> > interrupts disabled anyway.
>
> There seems to be other code in the kernel that assumes that this is not
> the case,

Yeah, so Paul went back on forth on that a bit. It used to be true in
the good old days when everything was simple. Then Paul made things
complicated by separating out sched-RCU bh-RCU and 'regular' RCU
flavours.

At that point disabling IRQs would only (officially) inhibit sched and
bh RCU flavours, but not the regular RCU.

But then some years ago Linus convinced Paul that having all these
separate RCU flavours with separate QS rules was a big pain in the
backside and Paul munged them all together again.

So now, anything that inhibits any of the RCU flavours inhibits them
all. So disabling IRQs is sufficient.

> i.e. the futex code, though not sure if that's being run with
> IRQs disabled...

That futex code runs in preemptible context, per the lock_page() that
can sleep etc.. :-)

> > > +/*
> > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > + *
> > > + * This assumes the folio is stable and pinned.
> > > + *
> > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > + * so.
> > > + *
> > > + * This function cannot be as thorough as that one as the VMA is not available
> > > + * in the fast path, so instead we whitelist known good cases.
> > > + *
> > > + * The folio is stable, but the mapping might not be. When truncating for
> > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > + */
> > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > +{
> > > + bool ret;
> > > +
> > > + /* hugetlb mappings do not require dirty tracking. */
> > > + if (folio_test_hugetlb(folio))
> > > + return true;
> > > +
> >
> > This:
> >
> > > + if (stabilise_mapping_rcu(folio)) {
> > > + struct address_space *mapping = folio_mapping(folio);
> >
> > And this is 3rd read of folio->mapping, just for giggles?
>
> I like to giggle :)
>
> Actually this is to handle the various cases in which the mapping might not
> be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> have a helper exposed for a check. Given previous review about duplication
> I felt best to reuse this even though it does access again... yes I felt
> weird about doing that.

Right, I had a peek inside folio_mapping(), but the point is that this
3rd load might see yet *another* value of mapping from the prior two
loads, rendering them somewhat worthless.

> > > +
> > > + /*
> > > + * Neither anonymous nor shmem-backed folios require
> > > + * dirty tracking.
> > > + */
> > > + ret = folio_test_anon(folio) ||
> > > + (mapping && shmem_mapping(mapping));
> > > + } else {
> > > + /* If the mapping is unstable, fallback to the slow path. */
> > > + ret = false;
> > > + }
> > > +
> > > + unlock_rcu();
> > > +
> > > + return ret;
> >
> > then becomes:
> >
> >
> > if (folio_test_anon(folio))
> > return true;
>
> This relies on the mapping so belongs below the lockdep assert imo.

Oh, right you are.

> >
> > /*
> > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > * grace periods from making progress, IOW. they imply
> > * rcu_read_lock().
> > */
> > lockdep_assert_irqs_disabled();
> >
> > /*
> > * Inodes and thus address_space are RCU freed and thus safe to
> > * access at this point.
> > */
> > mapping = folio_mapping(folio);
> > if (mapping && shmem_mapping(mapping))
> > return true;
> >
> > return false;
> >
> > > +}
>
> I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> but need to be sure it's safe.

GUP-fast as a whole relies on it :-)

2023-05-02 12:39:16

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > > return folio;
> > > > }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> >
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
>
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
>

OK I drafted a response three times then deleted which shows you how this
stuff messes with your mind :)

I realise now that literally it is checking whether the previous !mapping
case and lack of action taken on that was valid for futex, rendering this
pointless for the logic here.

We do check !mapping later but obviously with the 'stable' mapping whose
relation to pre-rcu lock is irrelevant.

Thanks for patiently explaining this :) RCU remains an area I need to take
a closer look at generally.

> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > + rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > + return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> >
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
>
> Yeah, so Paul went back on forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.
>
> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.
>
> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.
>
> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.
>
> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
>
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)

OK I am actually really happy to hear this because this means I can go
simplify this code significantly!

>
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > + bool ret;
> > > > +
> > > > + /* hugetlb mappings do not require dirty tracking. */
> > > > + if (folio_test_hugetlb(folio))
> > > > + return true;
> > > > +
> > >
> > > This:
> > >
> > > > + if (stabilise_mapping_rcu(folio)) {
> > > > + struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> >
> > I like to giggle :)
> >
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
>
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
>
> > > > +
> > > > + /*
> > > > + * Neither anonymous nor shmem-backed folios require
> > > > + * dirty tracking.
> > > > + */
> > > > + ret = folio_test_anon(folio) ||
> > > > + (mapping && shmem_mapping(mapping));
> > > > + } else {
> > > > + /* If the mapping is unstable, fallback to the slow path. */
> > > > + ret = false;
> > > > + }
> > > > +
> > > > + unlock_rcu();
> > > > +
> > > > + return ret;
> > >
> > > then becomes:
> > >
> > >
> > > if (folio_test_anon(folio))
> > > return true;
> >
> > This relies on the mapping so belongs below the lockdep assert imo.
>
> Oh, right you are.
>
> > >
> > > /*
> > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > * grace periods from making progress, IOW. they imply
> > > * rcu_read_lock().
> > > */
> > > lockdep_assert_irqs_disabled();
> > >
> > > /*
> > > * Inodes and thus address_space are RCU freed and thus safe to
> > > * access at this point.
> > > */
> > > mapping = folio_mapping(folio);
> > > if (mapping && shmem_mapping(mapping))
> > > return true;
> > >
> > > return false;
> > >
> > > > +}
> >
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
>
> GUP-fast as a whole relies on it :-)

Indeed, the only question was what happened with
CONFIG_MMU_GATHER_RCU_TABLE_FREE arches which appeared to require special
handling, but I'm very happy to hear they don't!

Will respin along the lines of your suggestion.

2023-05-02 12:47:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:

> > >
> > >
> > > if (folio_test_anon(folio))
> > > return true;
> >
> > This relies on the mapping so belongs below the lockdep assert imo.
>
> Oh, right you are.
>
> > >
> > > /*
> > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > * grace periods from making progress, IOW. they imply
> > > * rcu_read_lock().
> > > */
> > > lockdep_assert_irqs_disabled();
> > >
> > > /*
> > > * Inodes and thus address_space are RCU freed and thus safe to
> > > * access at this point.
> > > */
> > > mapping = folio_mapping(folio);
> > > if (mapping && shmem_mapping(mapping))
> > > return true;
> > >
> > > return false;
> > >
> > > > +}

So arguably you should do *one* READ_ONCE() load of mapping and
consistently use that, this means open-coding both folio_test_anon() and
folio_mapping().

2023-05-02 12:49:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 14:40, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
>
>>>>
>>>>
>>>> if (folio_test_anon(folio))
>>>> return true;
>>>
>>> This relies on the mapping so belongs below the lockdep assert imo.
>>
>> Oh, right you are.
>>
>>>>
>>>> /*
>>>> * Having IRQs disabled (as per GUP-fast) also inhibits RCU
>>>> * grace periods from making progress, IOW. they imply
>>>> * rcu_read_lock().
>>>> */
>>>> lockdep_assert_irqs_disabled();
>>>>
>>>> /*
>>>> * Inodes and thus address_space are RCU freed and thus safe to
>>>> * access at this point.
>>>> */
>>>> mapping = folio_mapping(folio);
>>>> if (mapping && shmem_mapping(mapping))
>>>> return true;
>>>>
>>>> return false;
>>>>
>>>>> +}
>
> So arguably you should do *one* READ_ONCE() load of mapping and
> consistently use that, this means open-coding both folio_test_anon() and
> folio_mapping().

Open-coding folio_test_anon() should not be required. We only care about
PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick
around until the anon page was freed.

@Lorenzo, you might also want to special-case hugetlb directly using
folio_test_hugetlb().

--
Thanks,

David / dhildenb

2023-05-02 12:50:22

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> as we cannot rule out folios being cleaned and then a GUP user writing to
> them again and possibly marking them dirty unexpectedly.
>
> This is especially egregious for long-term mappings (as indicated by the
> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> we have already done in the slow path.

Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
It would no longer work with file backed memory, correct?

See
arch/s390/kvm/pci.c

kvm_s390_pci_aif_enable
which does have
FOLL_WRITE | FOLL_LONGTERM
to

>
> We have access to less information in the fast path as we cannot examine
> the VMA containing the mapping, however we can determine whether the folio
> is anonymous and then whitelist known-good mappings - specifically hugetlb
> and shmem mappings.
>
> While we obtain a stable folio for this check, the mapping might not be, as
> a truncate could nullify it at any time. Since doing so requires mappings
> to be zapped, we can synchronise against a TLB shootdown operation.
>
> For some architectures TLB shootdown is synchronised by IPI, against which
> we are protected as the GUP-fast operation is performed with interrupts
> disabled. However, other architectures which specify
> CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
>
> In these instances, we acquire an RCU lock while performing our checks. If
> we cannot get a stable mapping, we fall back to the slow path, as otherwise
> we'd have to walk the page tables again and it's simpler and more effective
> to just fall back.
>
> It's important to note that there are no APIs allowing users to specify
> FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> always rely on the fact that if we fail to pin on the fast path, the code
> will fall back to the slow path which can perform the more thorough check.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Kirill A . Shutemov <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 0f09dec0906c..431618048a03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -18,6 +18,7 @@
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> #include <linux/sched/mm.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/mmu_context.h>
> #include <asm/tlbflush.h>
> @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> return folio;
> }
>
> +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +static bool stabilise_mapping_rcu(struct folio *folio)
> +{
> + struct address_space *mapping = READ_ONCE(folio->mapping);
> +
> + rcu_read_lock();
> +
> + return mapping == READ_ONCE(folio->mapping);
> +}
> +
> +static void unlock_rcu(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static bool stabilise_mapping_rcu(struct folio *)
> +{
> + return true;
> +}
> +
> +static void unlock_rcu(void)
> +{
> +}
> +#endif
> +
> +/*
> + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> + * FOLL_WRITE pin is permitted for a specific folio.
> + *
> + * This assumes the folio is stable and pinned.
> + *
> + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> + * (see comment describing the writeable_file_mapping_allowed() function). We
> + * therefore try to avoid the most egregious case of a long-term mapping doing
> + * so.
> + *
> + * This function cannot be as thorough as that one as the VMA is not available
> + * in the fast path, so instead we whitelist known good cases.
> + *
> + * The folio is stable, but the mapping might not be. When truncating for
> + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> + * so we are safe from an IPI, but some architectures use an RCU lock for this
> + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> + */
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> + bool ret;
> +
> + /* hugetlb mappings do not require dirty tracking. */
> + if (folio_test_hugetlb(folio))
> + return true;
> +
> + if (stabilise_mapping_rcu(folio)) {
> + struct address_space *mapping = folio_mapping(folio);
> +
> + /*
> + * Neither anonymous nor shmem-backed folios require
> + * dirty tracking.
> + */
> + ret = folio_test_anon(folio) ||
> + (mapping && shmem_mapping(mapping));
> + } else {
> + /* If the mapping is unstable, fallback to the slow path. */
> + ret = false;
> + }
> +
> + unlock_rcu();
> +
> + return ret;
> +}
> +
> /**
> * try_grab_folio() - Attempt to get or pin a folio.
> * @page: pointer to page to be grabbed
> @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> */
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> + bool is_longterm = flags & FOLL_LONGTERM;
> +
> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> return NULL;
>
> @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> * right zone, so fail and let the caller fall back to the slow
> * path.
> */
> - if (unlikely((flags & FOLL_LONGTERM) &&
> - !is_longterm_pinnable_page(page)))
> + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> return NULL;
>
> /*
> @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> if (!folio)
> return NULL;
>
> + /*
> + * Can this folio be safely pinned? We need to perform this
> + * check after the folio is stabilised.
> + */
> + if ((flags & FOLL_WRITE) && is_longterm &&
> + !folio_longterm_write_pin_allowed(folio)) {
> + folio_put_refs(folio, refs);
> + return NULL;
> + }
> +
> /*
> * When pinning a large folio, use an exact count to track it.
> *

2023-05-02 13:03:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 14:52, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote:
>> On 02.05.23 14:40, Peter Zijlstra wrote:
>>> On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
>>>
>>>>>>
>>>>>>
>>>>>> if (folio_test_anon(folio))
>>>>>> return true;
>>>>>
>>>>> This relies on the mapping so belongs below the lockdep assert imo.
>>>>
>>>> Oh, right you are.
>>>>
>>>>>>
>>>>>> /*
>>>>>> * Having IRQs disabled (as per GUP-fast) also inhibits RCU
>>>>>> * grace periods from making progress, IOW. they imply
>>>>>> * rcu_read_lock().
>>>>>> */
>>>>>> lockdep_assert_irqs_disabled();
>>>>>>
>>>>>> /*
>>>>>> * Inodes and thus address_space are RCU freed and thus safe to
>>>>>> * access at this point.
>>>>>> */
>>>>>> mapping = folio_mapping(folio);
>>>>>> if (mapping && shmem_mapping(mapping))
>>>>>> return true;
>>>>>>
>>>>>> return false;
>>>>>>
>>>>>>> +}
>>>
>>> So arguably you should do *one* READ_ONCE() load of mapping and
>>> consistently use that, this means open-coding both folio_test_anon() and
>>> folio_mapping().
>>
>> Open-coding folio_test_anon() should not be required. We only care about
>> PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around
>> until the anon page was freed.
>>
>
> Ack, good point!
>
>> @Lorenzo, you might also want to special-case hugetlb directly using
>> folio_test_hugetlb().
>>
>
> I already am :) I guess you mean when I respin along these lines? Will port
> that across to.

Ha, I only stared at that reply, good! :)

--
Thanks,

David / dhildenb

2023-05-02 13:03:03

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
>
> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
> It would no longer work with file backed memory, correct?
>
> See
> arch/s390/kvm/pci.c
>
> kvm_s390_pci_aif_enable
> which does have
> FOLL_WRITE | FOLL_LONGTERM
> to
>

Does this memory map a dirty-tracked file? It's kind of hard to dig into where
the address originates from without going through a ton of code. In worst case
if the fast code doesn't find a whitelist it'll fall back to slow path which
explicitly checks for dirty-tracked filesystem.

We can reintroduce a flag to permit exceptions if this is really broken, are you
able to test? I don't have an s390 sat around :)

> >
> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping, however we can determine whether the folio
> > is anonymous and then whitelist known-good mappings - specifically hugetlb
> > and shmem mappings.
> >
> > While we obtain a stable folio for this check, the mapping might not be, as
> > a truncate could nullify it at any time. Since doing so requires mappings
> > to be zapped, we can synchronise against a TLB shootdown operation.
> >
> > For some architectures TLB shootdown is synchronised by IPI, against which
> > we are protected as the GUP-fast operation is performed with interrupts
> > disabled. However, other architectures which specify
> > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> >
> > In these instances, we acquire an RCU lock while performing our checks. If
> > we cannot get a stable mapping, we fall back to the slow path, as otherwise
> > we'd have to walk the page tables again and it's simpler and more effective
> > to just fall back.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> > always rely on the fact that if we fail to pin on the fast path, the code
> > will fall back to the slow path which can perform the more thorough check.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Suggested-by: Kirill A . Shutemov <[email protected]>
> > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > ---
> > mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 85 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0f09dec0906c..431618048a03 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -18,6 +18,7 @@
> > #include <linux/migrate.h>
> > #include <linux/mm_inline.h>
> > #include <linux/sched/mm.h>
> > +#include <linux/shmem_fs.h>
> > #include <asm/mmu_context.h>
> > #include <asm/tlbflush.h>
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > return folio;
> > }
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > + rcu_read_lock();
> > +
> > + return mapping == READ_ONCE(folio->mapping);
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > + rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > + return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
> > +
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > + bool ret;
> > +
> > + /* hugetlb mappings do not require dirty tracking. */
> > + if (folio_test_hugetlb(folio))
> > + return true;
> > +
> > + if (stabilise_mapping_rcu(folio)) {
> > + struct address_space *mapping = folio_mapping(folio);
> > +
> > + /*
> > + * Neither anonymous nor shmem-backed folios require
> > + * dirty tracking.
> > + */
> > + ret = folio_test_anon(folio) ||
> > + (mapping && shmem_mapping(mapping));
> > + } else {
> > + /* If the mapping is unstable, fallback to the slow path. */
> > + ret = false;
> > + }
> > +
> > + unlock_rcu();
> > +
> > + return ret;
> > +}
> > +
> > /**
> > * try_grab_folio() - Attempt to get or pin a folio.
> > * @page: pointer to page to be grabbed
> > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > */
> > struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > {
> > + bool is_longterm = flags & FOLL_LONGTERM;
> > +
> > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > return NULL;
> > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > * right zone, so fail and let the caller fall back to the slow
> > * path.
> > */
> > - if (unlikely((flags & FOLL_LONGTERM) &&
> > - !is_longterm_pinnable_page(page)))
> > + if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> > return NULL;
> > /*
> > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > if (!folio)
> > return NULL;
> > + /*
> > + * Can this folio be safely pinned? We need to perform this
> > + * check after the folio is stabilised.
> > + */
> > + if ((flags & FOLL_WRITE) && is_longterm &&
> > + !folio_longterm_write_pin_allowed(folio)) {
> > + folio_put_refs(folio, refs);
> > + return NULL;
> > + }
> > +
> > /*
> > * When pinning a large folio, use an exact count to track it.
> > *

2023-05-02 13:04:57

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 02:47:22PM +0200, David Hildenbrand wrote:
> On 02.05.23 14:40, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> >
> > > > >
> > > > >
> > > > > if (folio_test_anon(folio))
> > > > > return true;
> > > >
> > > > This relies on the mapping so belongs below the lockdep assert imo.
> > >
> > > Oh, right you are.
> > >
> > > > >
> > > > > /*
> > > > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > > > * grace periods from making progress, IOW. they imply
> > > > > * rcu_read_lock().
> > > > > */
> > > > > lockdep_assert_irqs_disabled();
> > > > >
> > > > > /*
> > > > > * Inodes and thus address_space are RCU freed and thus safe to
> > > > > * access at this point.
> > > > > */
> > > > > mapping = folio_mapping(folio);
> > > > > if (mapping && shmem_mapping(mapping))
> > > > > return true;
> > > > >
> > > > > return false;
> > > > >
> > > > > > +}
> >
> > So arguably you should do *one* READ_ONCE() load of mapping and
> > consistently use that, this means open-coding both folio_test_anon() and
> > folio_mapping().
>
> Open-coding folio_test_anon() should not be required. We only care about
> PAGE_MAPPING_FLAGS stored alongside folio->mapping, that will stick around
> until the anon page was freed.
>

Ack, good point!

> @Lorenzo, you might also want to special-case hugetlb directly using
> folio_test_hugetlb().
>

I already am :) I guess you mean when I respin along these lines? Will port
that across to.

> --
> Thanks,
>
> David / dhildenb
>

2023-05-02 13:05:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 01:54:41PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> > Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> > > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > > as we cannot rule out folios being cleaned and then a GUP user writing to
> > > them again and possibly marking them dirty unexpectedly.
> > >
> > > This is especially egregious for long-term mappings (as indicated by the
> > > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > > we have already done in the slow path.
> >
> > Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
> > It would no longer work with file backed memory, correct?
> >
> > See
> > arch/s390/kvm/pci.c
> >
> > kvm_s390_pci_aif_enable
> > which does have
> > FOLL_WRITE | FOLL_LONGTERM
> > to
> >
>
> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
> the address originates from without going through a ton of code. In worst case
> if the fast code doesn't find a whitelist it'll fall back to slow path which
> explicitly checks for dirty-tracked filesystem.

This looks like the same stuff David was talking about, a qemu guest
with VFIO backed by a filesystem file..

Broadly though, arch kvm code should not call pin_user_pages().

Either it is KVM focused and it should use the shadow table and it's
existing mmu_notifier synchronization scheme

Or it is VFIO focused so it should use mdev/etc and have an unmap call
back.

I'm not really sure what this is for though..

Jason

2023-05-02 13:09:49

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings



Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>> them again and possibly marking them dirty unexpectedly.
>>>
>>> This is especially egregious for long-term mappings (as indicated by the
>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>> we have already done in the slow path.
>>
>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>> It would no longer work with file backed memory, correct?
>>
>> See
>> arch/s390/kvm/pci.c
>>
>> kvm_s390_pci_aif_enable
>> which does have
>> FOLL_WRITE | FOLL_LONGTERM
>> to
>>
>
> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
> the address originates from without going through a ton of code. In worst case
> if the fast code doesn't find a whitelist it'll fall back to slow path which
> explicitly checks for dirty-tracked filesystem.

It does pin from whatever QEMU uses as backing for the guest.
>
> We can reintroduce a flag to permit exceptions if this is really broken, are you
> able to test? I don't have an s390 sat around :)

Matt (Rosato on cc) probably can. In the end, it would mean having
<memoryBacking>
<source type="file"/>
</memoryBacking>

In libvirt I guess.

2023-05-02 13:20:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
\> > We can reintroduce a flag to permit exceptions if this is really broken, are you
> > able to test? I don't have an s390 sat around :)
>
> Matt (Rosato on cc) probably can. In the end, it would mean having
> <memoryBacking>
> <source type="file"/>
> </memoryBacking>

This s390 code is the least of the problems, after this series VFIO
won't startup at all with this configuration.

Jason

2023-05-02 13:38:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 02:08:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 12:25:54PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 01:13:34PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 02, 2023 at 12:11:49AM +0100, Lorenzo Stoakes wrote:
> > > > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > > return folio;
> > > > }
> > > >
> > > > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +static bool stabilise_mapping_rcu(struct folio *folio)
> > > > +{
> > > > + struct address_space *mapping = READ_ONCE(folio->mapping);
> > > > +
> > > > + rcu_read_lock();
> > > > +
> > > > + return mapping == READ_ONCE(folio->mapping);
> > >
> > > This doesn't make sense; why bother reading the same thing twice?
> >
> > The intent is to see whether the folio->mapping has been truncated from
> > underneath us, as per the futex code that Kirill referred to which does
> > something similar [1].
>
> Yeah, but per that 3rd load you got nothing here. Also that futex code
> did the early load to deal with the !mapping case, but you're not doing
> that.
>
> > > Who cares if the thing changes from before; what you care about is that
> > > the value you see has stable storage, this doesn't help with that.
> > >
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > + rcu_read_unlock();
> > > > +}
> > > > +#else
> > > > +static bool stabilise_mapping_rcu(struct folio *)
> > > > +{
> > > > + return true;
> > > > +}
> > > > +
> > > > +static void unlock_rcu(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Anyway, this all can go away. RCU can't progress while you have
> > > interrupts disabled anyway.
> >
> > There seems to be other code in the kernel that assumes that this is not
> > the case,
>
> Yeah, so Paul went back on forth on that a bit. It used to be true in
> the good old days when everything was simple. Then Paul made things
> complicated by separating out sched-RCU bh-RCU and 'regular' RCU
> flavours.

Almost. ;-)

The way I made things complicated was instead by creating preemptible RCU
for the real-time effort. The original non-preemptible RCU was still
required for a number of use cases (for example, waiting for hardware
interrupt handlers), so it had to stay. Separately, network-based DoS
attacks necessitated adding RCU bh.

> At that point disabling IRQs would only (officially) inhibit sched and
> bh RCU flavours, but not the regular RCU.

Quite right.

> But then some years ago Linus convinced Paul that having all these
> separate RCU flavours with separate QS rules was a big pain in the
> backside and Paul munged them all together again.

What happened was that someone used one flavor of RCU reader and a
different flavor of RCU updater, creating an exploitable bug.

http://www2.rdrop.com/~paulmck/RCU/cve.2019.01.23e.pdf
https://www.youtube.com/watch?v=hZX1aokdNiY

And Linus asked that this bug be ruled out, so...

> So now, anything that inhibits any of the RCU flavours inhibits them
> all. So disabling IRQs is sufficient.

...for v4.20 and later, exactly.

Thanx, Paul

> > i.e. the futex code, though not sure if that's being run with
> > IRQs disabled...
>
> That futex code runs in preemptible context, per the lock_page() that
> can sleep etc.. :-)
>
> > > > +/*
> > > > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > > > + * FOLL_WRITE pin is permitted for a specific folio.
> > > > + *
> > > > + * This assumes the folio is stable and pinned.
> > > > + *
> > > > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > > > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > > > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > > > + * so.
> > > > + *
> > > > + * This function cannot be as thorough as that one as the VMA is not available
> > > > + * in the fast path, so instead we whitelist known good cases.
> > > > + *
> > > > + * The folio is stable, but the mapping might not be. When truncating for
> > > > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > > > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > > > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > > > + */
> > > > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > > > +{
> > > > + bool ret;
> > > > +
> > > > + /* hugetlb mappings do not require dirty tracking. */
> > > > + if (folio_test_hugetlb(folio))
> > > > + return true;
> > > > +
> > >
> > > This:
> > >
> > > > + if (stabilise_mapping_rcu(folio)) {
> > > > + struct address_space *mapping = folio_mapping(folio);
> > >
> > > And this is 3rd read of folio->mapping, just for giggles?
> >
> > I like to giggle :)
> >
> > Actually this is to handle the various cases in which the mapping might not
> > be what we want (i.e. have PAGE_MAPPING_FLAGS set) which doesn't appear to
> > have a helper exposed for a check. Given previous review about duplication
> > I felt best to reuse this even though it does access again... yes I felt
> > weird about doing that.
>
> Right, I had a peek inside folio_mapping(), but the point is that this
> 3rd load might see yet *another* value of mapping from the prior two
> loads, rendering them somewhat worthless.
>
> > > > +
> > > > + /*
> > > > + * Neither anonymous nor shmem-backed folios require
> > > > + * dirty tracking.
> > > > + */
> > > > + ret = folio_test_anon(folio) ||
> > > > + (mapping && shmem_mapping(mapping));
> > > > + } else {
> > > > + /* If the mapping is unstable, fallback to the slow path. */
> > > > + ret = false;
> > > > + }
> > > > +
> > > > + unlock_rcu();
> > > > +
> > > > + return ret;
> > >
> > > then becomes:
> > >
> > >
> > > if (folio_test_anon(folio))
> > > return true;
> >
> > This relies on the mapping so belongs below the lockdep assert imo.
>
> Oh, right you are.
>
> > >
> > > /*
> > > * Having IRQs disabled (as per GUP-fast) also inhibits RCU
> > > * grace periods from making progress, IOW. they imply
> > > * rcu_read_lock().
> > > */
> > > lockdep_assert_irqs_disabled();
> > >
> > > /*
> > > * Inodes and thus address_space are RCU freed and thus safe to
> > > * access at this point.
> > > */
> > > mapping = folio_mapping(folio);
> > > if (mapping && shmem_mapping(mapping))
> > > return true;
> > >
> > > return false;
> > >
> > > > +}
> >
> > I'm more than happy to do this (I'd rather drop the RCU bits if possible)
> > but need to be sure it's safe.
>
> GUP-fast as a whole relies on it :-)

2023-05-02 13:38:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:10, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>> able to test? I don't have an s390 sat around :)
>>
>> Matt (Rosato on cc) probably can. In the end, it would mean having
>> <memoryBacking>
>> <source type="file"/>
>> </memoryBacking>
>
> This s390 code is the least of the problems, after this series VFIO
> won't startup at all with this configuration.

Good question if the domain would fail to start. I recall that IOMMUs
for zPCI are special on s390x. [1]

Well, zPCI is special. I cannot immediately tell when we would trigger
long-term pinning.

[1] https://www.mail-archive.com/[email protected]/msg875728.html

--
Thanks,

David / dhildenb

2023-05-02 13:39:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:10, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
> > \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
> > > > able to test? I don't have an s390 sat around :)
> > >
> > > Matt (Rosato on cc) probably can. In the end, it would mean having
> > > <memoryBacking>
> > > <source type="file"/>
> > > </memoryBacking>
> >
> > This s390 code is the least of the problems, after this series VFIO
> > won't startup at all with this configuration.
>
> Good question if the domain would fail to start. I recall that IOMMUs for
> zPCI are special on s390x. [1]

Not upstream they aren't.

> Well, zPCI is special. I cannot immediately tell when we would trigger
> long-term pinning.

zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
and the normal pin_user_pages() path.

> [1] https://www.mail-archive.com/[email protected]/msg875728.html

AFIACT this is talking about the nested IOMMU translation stuff.
RPCIT is the hypercall to invalidate the nested IOMMU table. I expect s390 is
going to have another go at implementing this using iommufd.

The stuff in that series like KVM_S390_ZPCIOP_REG_IOAT didn't make it
upstream.

Jason

2023-05-02 13:43:33

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:36, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>> able to test? I don't have an s390 sat around :)
>>>>
>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>> <memoryBacking>
>>>> <source type="file"/>
>>>> </memoryBacking>
>>>
>>> This s390 code is the least of the problems, after this series VFIO
>>> won't startup at all with this configuration.
>>
>> Good question if the domain would fail to start. I recall that IOMMUs for
>> zPCI are special on s390x. [1]
>
> Not upstream they aren't.
>
>> Well, zPCI is special. I cannot immediately tell when we would trigger
>> long-term pinning.
>
> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
> and the normal pin_user_pages() path.


@Christian, Matthew: would we pin all guest memory when starting the
domain (IIRC, like on x86-64) and fail early, or only when the guest
issues rpcit instructions to map individual pages?

--
Thanks,

David / dhildenb

2023-05-02 13:53:10

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 9:39 AM, David Hildenbrand wrote:
> On 02.05.23 15:36, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>>> able to test? I don't have an s390 sat around :)
>>>>>
>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>>     <memoryBacking>
>>>>>       <source type="file"/>
>>>>>     </memoryBacking>
>>>>
>>>> This s390 code is the least of the problems, after this series VFIO
>>>> won't startup at all with this configuration.
>>>
>>> Good question if the domain would fail to start. I recall that IOMMUs for
>>> zPCI are special on s390x. [1]
>>
>> Not upstream they aren't.
>>
>>> Well, zPCI is special. I cannot immediately tell when we would trigger
>>> long-term pinning.
>>
>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
>> and the normal pin_user_pages() path.
>
>
> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
>

Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.

However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain. I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks. I can CC you if you'd like.

2023-05-02 13:54:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:43, Matthew Rosato wrote:
> On 5/2/23 9:39 AM, David Hildenbrand wrote:
>> On 02.05.23 15:36, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:28:40PM +0200, David Hildenbrand wrote:
>>>> On 02.05.23 15:10, Jason Gunthorpe wrote:
>>>>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>>>>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>>>>> able to test? I don't have an s390 sat around :)
>>>>>>
>>>>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>>>>     <memoryBacking>
>>>>>>       <source type="file"/>
>>>>>>     </memoryBacking>
>>>>>
>>>>> This s390 code is the least of the problems, after this series VFIO
>>>>> won't startup at all with this configuration.
>>>>
>>>> Good question if the domain would fail to start. I recall that IOMMUs for
>>>> zPCI are special on s390x. [1]
>>>
>>> Not upstream they aren't.
>>>
>>>> Well, zPCI is special. I cannot immediately tell when we would trigger
>>>> long-term pinning.
>>>
>>> zPCI uses the standard IOMMU stuff, so it uses a normal VFIO container
>>> and the normal pin_user_pages() path.
>>
>>
>> @Christian, Matthew: would we pin all guest memory when starting the domain (IIRC, like on x86-64) and fail early, or only when the guest issues rpcit instructions to map individual pages?
>>
>
> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>

Okay, so IIRC we'll fail starting the domain early, that's good. And if
we pin all guest memory (instead of small pieces dynamically), there is
little existing use for file-backed RAM in such zPCI configurations
(because memory cannot be reclaimed either way if it's all pinned), so
likely there are no real existing users.

> However, per Jason's prior suggestion, the initial implementation for s390 nesting via iommufd will pin all of guest memory when starting the domain. I have something already working via iommufd built on top of the nesting infrastructure patches and QEMU iommufd series that are floating around; needs some cleanup, hoping to send an RFC in the coming weeks. I can CC you if you'd like.

Yes, please.

--
Thanks,

David / dhildenb

2023-05-02 13:55:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>
> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> pin all guest memory (instead of small pieces dynamically), there is little
> existing use for file-backed RAM in such zPCI configurations (because memory
> cannot be reclaimed either way if it's all pinned), so likely there are no
> real existing users.

Right, this is VFIO, the physical HW can't tolerate not having pinned
memory, so something somewhere is always pinning it.

Which, again, makes it weird/wrong that this KVM code is pinning it
again :\

Jason

2023-05-02 14:08:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:50, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>
>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>> pin all guest memory (instead of small pieces dynamically), there is little
>> existing use for file-backed RAM in such zPCI configurations (because memory
>> cannot be reclaimed either way if it's all pinned), so likely there are no
>> real existing users.
>
> Right, this is VFIO, the physical HW can't tolerate not having pinned
> memory, so something somewhere is always pinning it.
>
> Which, again, makes it weird/wrong that this KVM code is pinning it
> again :\

IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's
for passthrough of (adapter) interrupts.

I have to speculate, but I guess for hardware to forward interrupts to
the VM, it has to pin the special guest memory page that will receive
the indications, to then configure (interrupt) hardware to target the
interrupt indications to that special guest page (using a host physical
address).

--
Thanks,

David / dhildenb

2023-05-02 14:11:35

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>
>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>> pin all guest memory (instead of small pieces dynamically), there is little
>> existing use for file-backed RAM in such zPCI configurations (because memory
>> cannot be reclaimed either way if it's all pinned), so likely there are no
>> real existing users.
>
> Right, this is VFIO, the physical HW can't tolerate not having pinned
> memory, so something somewhere is always pinning it.

I might have mis-explained above.

With iommufd nesting, we will pin everything upfront as a starting point.

The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.

>
> Which, again, makes it weird/wrong that this KVM code is pinning it
> again :\
>
> Jason

2023-05-02 14:11:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:50, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > >
> > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > > pin all guest memory (instead of small pieces dynamically), there is little
> > > existing use for file-backed RAM in such zPCI configurations (because memory
> > > cannot be reclaimed either way if it's all pinned), so likely there are no
> > > real existing users.
> >
> > Right, this is VFIO, the physical HW can't tolerate not having pinned
> > memory, so something somewhere is always pinning it.
> >
> > Which, again, makes it weird/wrong that this KVM code is pinning it
> > again :\
>
> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> passthrough of (adapter) interrupts.
>
> I have to speculate, but I guess for hardware to forward interrupts to the
> VM, it has to pin the special guest memory page that will receive the
> indications, to then configure (interrupt) hardware to target the interrupt
> indications to that special guest page (using a host physical address).

Either the emulated access is "CPU" based happening through the KVM
page table so it should use mmu_notifier locking.

Or it is "DMA" and should go through an IOVA through iommufd pinning
and locking.

There is no other ground, nothing in KVM should be inventing its own
access methodology.

Jason

2023-05-02 14:20:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 16:04, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:50, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>
>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>> real existing users.
>>>
>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>> memory, so something somewhere is always pinning it.
>>>
>>> Which, again, makes it weird/wrong that this KVM code is pinning it
>>> again :\
>>
>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
>> passthrough of (adapter) interrupts.
>>
>> I have to speculate, but I guess for hardware to forward interrupts to the
>> VM, it has to pin the special guest memory page that will receive the
>> indications, to then configure (interrupt) hardware to target the interrupt
>> indications to that special guest page (using a host physical address).
>
> Either the emulated access is "CPU" based happening through the KVM
> page table so it should use mmu_notifier locking.
>
> Or it is "DMA" and should go through an IOVA through iommufd pinning
> and locking.
>
> There is no other ground, nothing in KVM should be inventing its own
> access methodology.

I might be wrong, but this seems to be a bit different.

It cannot tolerate page faults (needs a host physical address), so
memory notifiers don't really apply. (as a side note, KVM on s390x does
not use mmu notifiers as we know them)

It's kind-of like DMA, but it's not really DMA. It's the CPU delivering
interrupts for a specific device. So we're configuring the interrupt
controller I guess to target a guest memory page.

But I have way too little knowledge about zPCI and the code in question
here. And if it could be converted to iommufd (and if that's really the
right mechanism to use here).

Hopefully Matthew knows the details and if this really needs to be
special :)

--
Thanks,

David / dhildenb

2023-05-02 14:50:43

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 9:28 AM, David Hildenbrand wrote:
> On 02.05.23 15:10, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:04:27PM +0200, Christian Borntraeger wrote:
>> \> > We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>> able to test? I don't have an s390 sat around :)
>>>
>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>    <memoryBacking>
>>>      <source type="file"/>
>>>    </memoryBacking>
>>
>> This s390 code is the least of the problems, after this series VFIO
>> won't startup at all with this configuration.
>
> Good question if the domain would fail to start. I recall that IOMMUs for zPCI are special on s390x. [1]

Eh, well what you referenced there never merged. We will eventually do something like this, but via iommufd. But right now s390 is still using vfio type1 iommu

>
> Well, zPCI is special. I cannot immediately tell when we would trigger long-term pinning.

Enabling a zPCI device in a QEMU guest when using zPCI interpretation facilities

>
> [1] https://www.mail-archive.com/[email protected]/msg875728.html
>

2023-05-02 14:59:00

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 9:04 AM, Christian Borntraeger wrote:
>
>
> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>> them again and possibly marking them dirty unexpectedly.
>>>>
>>>> This is especially egregious for long-term mappings (as indicated by the
>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>> we have already done in the slow path.
>>>
>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>> It would no longer work with file backed memory, correct?
>>>
>>> See
>>> arch/s390/kvm/pci.c
>>>
>>> kvm_s390_pci_aif_enable
>>> which does have
>>> FOLL_WRITE | FOLL_LONGTERM
>>> to
>>>
>>
>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>> the address originates from without going through a ton of code. In worst case
>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>> explicitly checks for dirty-tracked filesystem.
>
> It does pin from whatever QEMU uses as backing for the guest.
>>
>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>> able to test? I don't have an s390 sat around :)
>
> Matt (Rosato on cc) probably can. In the end, it would mean having
>   <memoryBacking>
>     <source type="file"/>
>   </memoryBacking>
>
> In libvirt I guess.

I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.

2023-05-02 15:06:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:35, Matthew Rosato wrote:
> On 5/2/23 9:04 AM, Christian Borntraeger wrote:
>>
>>
>> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>>> them again and possibly marking them dirty unexpectedly.
>>>>>
>>>>> This is especially egregious for long-term mappings (as indicated by the
>>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>>> we have already done in the slow path.
>>>>
>>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>>> It would no longer work with file backed memory, correct?
>>>>
>>>> See
>>>> arch/s390/kvm/pci.c
>>>>
>>>> kvm_s390_pci_aif_enable
>>>> which does have
>>>> FOLL_WRITE | FOLL_LONGTERM
>>>> to
>>>>
>>>
>>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>>> the address originates from without going through a ton of code. In worst case
>>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>>> explicitly checks for dirty-tracked filesystem.
>>
>> It does pin from whatever QEMU uses as backing for the guest.
>>>
>>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>>> able to test? I don't have an s390 sat around :)
>>
>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>   <memoryBacking>
>>     <source type="file"/>
>>   </memoryBacking>
>>
>> In libvirt I guess.
>
> I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
>

That's ... unexpected. :)

Either this series doesn't work as expected or you end up using a
filesystem that is still compatible. But I guess most applicable
filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and
should, therefore, disallow long-term pinning with this series.

--
Thanks,

David / dhildenb

2023-05-02 15:06:37

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 10:15 AM, David Hildenbrand wrote:
> On 02.05.23 16:04, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
>>> On 02.05.23 15:50, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>>
>>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>>> real existing users.
>>>>
>>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>>> memory, so something somewhere is always pinning it.
>>>>
>>>> Which, again, makes it weird/wrong that this KVM code is pinning it
>>>> again :\
>>>
>>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
>>> passthrough of (adapter) interrupts.
>>>
>>> I have to speculate, but I guess for hardware to forward interrupts to the
>>> VM, it has to pin the special guest memory page that will receive the
>>> indications, to then configure (interrupt) hardware to target the interrupt
>>> indications to that special guest page (using a host physical address).
>>
>> Either the emulated access is "CPU" based happening through the KVM
>> page table so it should use mmu_notifier locking.
>>
>> Or it is "DMA" and should go through an IOVA through iommufd pinning
>> and locking.
>>
>> There is no other ground, nothing in KVM should be inventing its own
>> access methodology.
>
> I might be wrong, but this seems to be a bit different.
>
> It cannot tolerate page faults (needs a host physical address), so memory notifiers don't really apply. (as a side note, KVM on s390x does not use mmu notifiers as we know them)

The host physical address is one shared between underlying firmware and the host kvm. Either might make changes to the referenced page and then issue an alert to the guest via a mechanism called GISA, giving impetus to the guest to look at that page and process the event. As you say, firmware can't tolerate the page being unavailable; it's expecting that once we feed it that location it's always available until we remove it (kvm_s390_pci_aif_disable).

>
> It's kind-of like DMA, but it's not really DMA.  It's the CPU delivering interrupts for a specific device. So we're configuring the interrupt controller I guess to target a guest memory page.
>
> But I have way too little knowledge about zPCI and the code in question here. And if it could be converted to iommufd (and if that's really the right mechanism to use here).
>
> Hopefully Matthew knows the details and if this really needs to be special :)

I think I need to have a look at mmu_notifiers to understand that better, but in the end firmware still needs a reliable page to deliver events to.


2023-05-02 15:08:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings

On 02.05.23 01:11, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
> the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
> the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
> direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
> (though it does not have to).
>
> This results in both data being written to a folio without writenotify, and
> the folio being dirtied unexpectedly (if the caller decides to do so).
>
> This issue was first reported by Jan Kara [1] in 2018, where the problem
> resulted in file system crashes.
>
> This is only relevant when the mappings are file-backed and the underlying
> file system requires folio dirty tracking. File systems which do not, such
> as shmem or hugetlb, are not at risk and therefore can be written to
> without issue.
>
> Unfortunately this limitation of GUP has been present for some time and
> requires future rework of the GUP API in order to provide correct write
> access to such mappings.
>
> However, for the time being we introduce this check to prevent the most
> egregious case of this occurring, use of the FOLL_LONGTERM pin.
>
> These mappings are considerably more likely to be written to after
> folios are cleaned and thus simply must not be permitted to do so.
>
> This patch changes only the slow-path GUP functions, a following patch
> adapts the GUP-fast path along similar lines.
>
> [1]:https://lore.kernel.org/linux-mm/[email protected]/
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> Reviewed-by: Mika Penttilä <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> ---
> mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index ff689c88a357..0f09dec0906c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
> return 0;
> }
>
> +/*
> + * Writing to file-backed mappings which require folio dirty tracking using GUP
> + * is a fundamentally broken operation, as kernel write access to GUP mappings
> + * do not adhere to the semantics expected by a file system.
> + *
> + * Consider the following scenario:-
> + *
> + * 1. A folio is written to via GUP which write-faults the memory, notifying
> + * the file system and dirtying the folio.
> + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> + * the PTE being marked read-only.
> + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> + * direct mapping.
> + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> + * (though it does not have to).
> + *
> + * This results in both data being written to a folio without writenotify, and
> + * the folio being dirtied unexpectedly (if the caller decides to do so).
> + */
> +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> + unsigned long gup_flags)
> +{
> + /* If we aren't pinning then no problematic write can occur. */
> + if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> + return true;

I think we should really not look at FOLL_GET here. Just check for
FOLL_PIN (as said, even FOLL_LONGTERM would be sufficient, but I
understand the reasoning to keep it, although I would drop it :P ). It
also better matches your comment regarding pinning ...

See the comment in is_valid_gup_args() regarding "LONGTERM can only be
specified when pinning". (well, there we also check that FOLL_PIN has to
be set ... ;) )

> +
> + /* We limit this check to the most egregious case - a long term pin. */
> + if (!(gup_flags & FOLL_LONGTERM))
> + return true;
> +
> + /* If the VMA requires dirty tracking then GUP will be problematic. */
> + return vma_needs_dirty_tracking(vma);


... should that be "!vma_needs_dirty_tracking(vma)" ?

If the fs needs dirty tracking, it should be disallowed.

Maybe that explains why it's still working for Matthew in his s390x
test. ... or I am too tired and messed up :)

--
Thanks,

David / dhildenb

2023-05-02 15:11:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 15:56, Matthew Rosato wrote:
> On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>
>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>> pin all guest memory (instead of small pieces dynamically), there is little
>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>> real existing users.
>>
>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>> memory, so something somewhere is always pinning it.
>
> I might have mis-explained above.
>
> With iommufd nesting, we will pin everything upfront as a starting point.
>
> The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.

... so, after the domain started successfully on the libvirt/QEMU side ? :/

It would be great to confirm that. There might be a BUG in patch #2 (see
my reply to patch #2) that might not allow you to reproduce it right now.

--
Thanks,

David / dhildenb

2023-05-02 15:21:20

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings

On Tue, May 02, 2023 at 05:04:02PM +0200, David Hildenbrand wrote:
> On 02.05.23 01:11, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
> >
> > The problem arises when, after an initial write to the folio, writeback
> > results in the folio being cleaned and then the caller, via the GUP
> > interface, writes to the folio again.
> >
> > As a result of the use of this secondary, direct, mapping to the folio no
> > write notify will occur, and if the caller does mark the folio dirty, this
> > will be done so unexpectedly.
> >
> > For example, consider the following scenario:-
> >
> > 1. A folio is written to via GUP which write-faults the memory, notifying
> > the file system and dirtying the folio.
> > 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > the PTE being marked read-only.
> > 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > direct mapping.
> > 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > (though it does not have to).
> >
> > This results in both data being written to a folio without writenotify, and
> > the folio being dirtied unexpectedly (if the caller decides to do so).
> >
> > This issue was first reported by Jan Kara [1] in 2018, where the problem
> > resulted in file system crashes.
> >
> > This is only relevant when the mappings are file-backed and the underlying
> > file system requires folio dirty tracking. File systems which do not, such
> > as shmem or hugetlb, are not at risk and therefore can be written to
> > without issue.
> >
> > Unfortunately this limitation of GUP has been present for some time and
> > requires future rework of the GUP API in order to provide correct write
> > access to such mappings.
> >
> > However, for the time being we introduce this check to prevent the most
> > egregious case of this occurring, use of the FOLL_LONGTERM pin.
> >
> > These mappings are considerably more likely to be written to after
> > folios are cleaned and thus simply must not be permitted to do so.
> >
> > This patch changes only the slow-path GUP functions, a following patch
> > adapts the GUP-fast path along similar lines.
> >
> > [1]:https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Suggested-by: Jason Gunthorpe <[email protected]>
> > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > Reviewed-by: John Hubbard <[email protected]>
> > Reviewed-by: Mika Penttil? <[email protected]>
> > Reviewed-by: Jan Kara <[email protected]>
> > Reviewed-by: Jason Gunthorpe <[email protected]>
> > ---
> > mm/gup.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index ff689c88a357..0f09dec0906c 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -959,16 +959,51 @@ static int faultin_page(struct vm_area_struct *vma,
> > return 0;
> > }
> > +/*
> > + * Writing to file-backed mappings which require folio dirty tracking using GUP
> > + * is a fundamentally broken operation, as kernel write access to GUP mappings
> > + * do not adhere to the semantics expected by a file system.
> > + *
> > + * Consider the following scenario:-
> > + *
> > + * 1. A folio is written to via GUP which write-faults the memory, notifying
> > + * the file system and dirtying the folio.
> > + * 2. Later, writeback is triggered, resulting in the folio being cleaned and
> > + * the PTE being marked read-only.
> > + * 3. The GUP caller writes to the folio, as it is mapped read/write via the
> > + * direct mapping.
> > + * 4. The GUP caller, now done with the page, unpins it and sets it dirty
> > + * (though it does not have to).
> > + *
> > + * This results in both data being written to a folio without writenotify, and
> > + * the folio being dirtied unexpectedly (if the caller decides to do so).
> > + */
> > +static bool writeable_file_mapping_allowed(struct vm_area_struct *vma,
> > + unsigned long gup_flags)
> > +{
> > + /* If we aren't pinning then no problematic write can occur. */
> > + if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > + return true;
>
> I think we should really not look at FOLL_GET here. Just check for FOLL_PIN
> (as said, even FOLL_LONGTERM would be sufficient, but I understand the
> reasoning to keep it, although I would drop it :P ). It also better matches
> your comment regarding pinning ...
>
> See the comment in is_valid_gup_args() regarding "LONGTERM can only be
> specified when pinning". (well, there we also check that FOLL_PIN has to be
> set ... ;) )

I think I will finally give in, in penance for the very silly mistake I made
below...

>
> > +
> > + /* We limit this check to the most egregious case - a long term pin. */
> > + if (!(gup_flags & FOLL_LONGTERM))
> > + return true;
> > +
> > + /* If the VMA requires dirty tracking then GUP will be problematic. */
> > + return vma_needs_dirty_tracking(vma);
>
>
> ... should that be "!vma_needs_dirty_tracking(vma)" ?
>
> If the fs needs dirty tracking, it should be disallowed.
>
> Maybe that explains why it's still working for Matthew in his s390x test.
> ... or I am too tired and messed up :)
>

No, no it was I who was too tired it seems! You're correct, this is wrong,
will respin with fix :))

> --
> Thanks,
>
> David / dhildenb
>

2023-05-02 15:21:20

by Lorenzo Stoakes

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote:
> On 02.05.23 15:56, Matthew Rosato wrote:
> > On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
> > > On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > > > > Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > > >
> > > > Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > > > pin all guest memory (instead of small pieces dynamically), there is little
> > > > existing use for file-backed RAM in such zPCI configurations (because memory
> > > > cannot be reclaimed either way if it's all pinned), so likely there are no
> > > > real existing users.
> > >
> > > Right, this is VFIO, the physical HW can't tolerate not having pinned
> > > memory, so something somewhere is always pinning it.
> >
> > I might have mis-explained above.
> >
> > With iommufd nesting, we will pin everything upfront as a starting point.
> >
> > The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.
>
> ... so, after the domain started successfully on the libvirt/QEMU side ? :/
>
> It would be great to confirm that. There might be a BUG in patch #2 (see my
> reply to patch #2) that might not allow you to reproduce it right now.
>

Yes apologies - thank you VERY much for doing this Matthew, but apologies, I
made rather a clanger in patch 2 which would mean fast patch degrading to slow
path would pass even for file-backed.

Will respin a v7 + cc you on that, if you could be so kind as to test that?

> --
> Thanks,
>
> David / dhildenb
>

2023-05-02 15:22:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote:
> On 5/2/23 10:15 AM, David Hildenbrand wrote:
> > On 02.05.23 16:04, Jason Gunthorpe wrote:
> >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> >>> On 02.05.23 15:50, Jason Gunthorpe wrote:
> >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> >>>>>
> >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> >>>>> pin all guest memory (instead of small pieces dynamically), there is little
> >>>>> existing use for file-backed RAM in such zPCI configurations (because memory
> >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
> >>>>> real existing users.
> >>>>
> >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
> >>>> memory, so something somewhere is always pinning it.
> >>>>
> >>>> Which, again, makes it weird/wrong that this KVM code is pinning it
> >>>> again :\
> >>>
> >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> >>> passthrough of (adapter) interrupts.
> >>>
> >>> I have to speculate, but I guess for hardware to forward interrupts to the
> >>> VM, it has to pin the special guest memory page that will receive the
> >>> indications, to then configure (interrupt) hardware to target the interrupt
> >>> indications to that special guest page (using a host physical address).
> >>
> >> Either the emulated access is "CPU" based happening through the KVM
> >> page table so it should use mmu_notifier locking.
> >>
> >> Or it is "DMA" and should go through an IOVA through iommufd pinning
> >> and locking.
> >>
> >> There is no other ground, nothing in KVM should be inventing its own
> >> access methodology.
> >
> > I might be wrong, but this seems to be a bit different.
> >
> > It cannot tolerate page faults (needs a host physical address), so
> > memory notifiers don't really apply. (as a side note, KVM on s390x
> > does not use mmu notifiers as we know them)
>
> The host physical address is one shared between underlying firmware
> and the host kvm. Either might make changes to the referenced page
> and then issue an alert to the guest via a mechanism called GISA,
> giving impetus to the guest to look at that page and process the
> event. As you say, firmware can't tolerate the page being
> unavailable; it's expecting that once we feed it that location it's
> always available until we remove it (kvm_s390_pci_aif_disable).

That is a CPU access delegated to the FW without any locking scheme to
make it safe with KVM :\

It would have been better if FW could inject it through the kvm page
tables so it has some coherency.

Otherwise you have to call this "DMA", I think.

How does s390 avoid mmu notifiers without having lots of problems?? It
is not really optional to hook the invalidations if you need to build
a shadow page table..

Jason

2023-05-02 15:22:46

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 10:57 AM, David Hildenbrand wrote:
> On 02.05.23 15:35, Matthew Rosato wrote:
>> On 5/2/23 9:04 AM, Christian Borntraeger wrote:
>>>
>>>
>>> Am 02.05.23 um 14:54 schrieb Lorenzo Stoakes:
>>>> On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
>>>>> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
>>>>>> Writing to file-backed dirty-tracked mappings via GUP is inherently broken
>>>>>> as we cannot rule out folios being cleaned and then a GUP user writing to
>>>>>> them again and possibly marking them dirty unexpectedly.
>>>>>>
>>>>>> This is especially egregious for long-term mappings (as indicated by the
>>>>>> use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
>>>>>> we have already done in the slow path.
>>>>>
>>>>> Hmm, does this interfer with KVM on s390 and PCI interpretion of interrupt delivery?
>>>>> It would no longer work with file backed memory, correct?
>>>>>
>>>>> See
>>>>> arch/s390/kvm/pci.c
>>>>>
>>>>> kvm_s390_pci_aif_enable
>>>>> which does have
>>>>> FOLL_WRITE | FOLL_LONGTERM
>>>>> to
>>>>>
>>>>
>>>> Does this memory map a dirty-tracked file? It's kind of hard to dig into where
>>>> the address originates from without going through a ton of code. In worst case
>>>> if the fast code doesn't find a whitelist it'll fall back to slow path which
>>>> explicitly checks for dirty-tracked filesystem.
>>>
>>> It does pin from whatever QEMU uses as backing for the guest.
>>>>
>>>> We can reintroduce a flag to permit exceptions if this is really broken, are you
>>>> able to test? I don't have an s390 sat around :)
>>>
>>> Matt (Rosato on cc) probably can. In the end, it would mean having
>>>    <memoryBacking>
>>>      <source type="file"/>
>>>    </memoryBacking>
>>>
>>> In libvirt I guess.
>>
>> I am running with this series applied using a QEMU guest with memory-backend-file (using the above libvirt snippet) for a few different PCI device types and AEN forwarding (e.g. what is setup in kvm_s390_pci_aif_enable) is still working.
>>
>
> That's ... unexpected. :)
>
> Either this series doesn't work as expected or you end up using a filesystem that is still compatible. But I guess most applicable filesystems (ext4, btrfs, xfs) all have a page_mkwrite callback and should, therefore, disallow long-term pinning with this series.
>

The memory backend file is on ext4 in my tests. A quick trace shows that pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable is still returning positive implying pages are being pinned.

2023-05-02 15:23:15

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 11:19 AM, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 05:09:06PM +0200, David Hildenbrand wrote:
>> On 02.05.23 15:56, Matthew Rosato wrote:
>>> On 5/2/23 9:50 AM, Jason Gunthorpe wrote:
>>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
>>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
>>>>>
>>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
>>>>> pin all guest memory (instead of small pieces dynamically), there is little
>>>>> existing use for file-backed RAM in such zPCI configurations (because memory
>>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
>>>>> real existing users.
>>>>
>>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
>>>> memory, so something somewhere is always pinning it.
>>>
>>> I might have mis-explained above.
>>>
>>> With iommufd nesting, we will pin everything upfront as a starting point.
>>>
>>> The current usage of vfio type1 iommu for s390 does not pin the entirety of guest memory upfront, it happens as guest RPCITs occur / type1 mappings are made.
>>
>> ... so, after the domain started successfully on the libvirt/QEMU side ? :/
>>
>> It would be great to confirm that. There might be a BUG in patch #2 (see my
>> reply to patch #2) that might not allow you to reproduce it right now.
>>
>
> Yes apologies - thank you VERY much for doing this Matthew, but apologies, I
> made rather a clanger in patch 2 which would mean fast patch degrading to slow
> path would pass even for file-backed.
>
> Will respin a v7 + cc you on that, if you could be so kind as to test that?
>

Sure, will do!

2023-05-02 15:46:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 12:20:46PM -0300, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 10:54:35AM -0400, Matthew Rosato wrote:
> > On 5/2/23 10:15 AM, David Hildenbrand wrote:
> > > On 02.05.23 16:04, Jason Gunthorpe wrote:
> > >> On Tue, May 02, 2023 at 03:57:30PM +0200, David Hildenbrand wrote:
> > >>> On 02.05.23 15:50, Jason Gunthorpe wrote:
> > >>>> On Tue, May 02, 2023 at 03:47:43PM +0200, David Hildenbrand wrote:
> > >>>>>> Eventually we want to implement a mechanism where we can dynamically pin in response to RPCIT.
> > >>>>>
> > >>>>> Okay, so IIRC we'll fail starting the domain early, that's good. And if we
> > >>>>> pin all guest memory (instead of small pieces dynamically), there is little
> > >>>>> existing use for file-backed RAM in such zPCI configurations (because memory
> > >>>>> cannot be reclaimed either way if it's all pinned), so likely there are no
> > >>>>> real existing users.
> > >>>>
> > >>>> Right, this is VFIO, the physical HW can't tolerate not having pinned
> > >>>> memory, so something somewhere is always pinning it.
> > >>>>
> > >>>> Which, again, makes it weird/wrong that this KVM code is pinning it
> > >>>> again :\
> > >>>
> > >>> IIUC, that pinning is not for ordinary IOMMU / KVM memory access. It's for
> > >>> passthrough of (adapter) interrupts.
> > >>>
> > >>> I have to speculate, but I guess for hardware to forward interrupts to the
> > >>> VM, it has to pin the special guest memory page that will receive the
> > >>> indications, to then configure (interrupt) hardware to target the interrupt
> > >>> indications to that special guest page (using a host physical address).
> > >>
> > >> Either the emulated access is "CPU" based happening through the KVM
> > >> page table so it should use mmu_notifier locking.
> > >>
> > >> Or it is "DMA" and should go through an IOVA through iommufd pinning
> > >> and locking.
> > >>
> > >> There is no other ground, nothing in KVM should be inventing its own
> > >> access methodology.
> > >
> > > I might be wrong, but this seems to be a bit different.
> > >
> > > It cannot tolerate page faults (needs a host physical address), so
> > > memory notifiers don't really apply. (as a side note, KVM on s390x
> > > does not use mmu notifiers as we know them)
> >
> > The host physical address is one shared between underlying firmware
> > and the host kvm. Either might make changes to the referenced page
> > and then issue an alert to the guest via a mechanism called GISA,
> > giving impetus to the guest to look at that page and process the
> > event. As you say, firmware can't tolerate the page being
> > unavailable; it's expecting that once we feed it that location it's
> > always available until we remove it (kvm_s390_pci_aif_disable).
>
> That is a CPU access delegated to the FW without any locking scheme to
> make it safe with KVM :\
>
> It would have been better if FW could inject it through the kvm page
> tables so it has some coherency.
>
> Otherwise you have to call this "DMA", I think.
>
> How does s390 avoid mmu notifiers without having lots of problems?? It
> is not really optional to hook the invalidations if you need to build
> a shadow page table..

Totally no idea on s390 details, but.. per my read above, if the firmware
needs to make sure the page is always available (so no way to fault it in
on demand), which means a longterm pinning seems appropriate here.

Then if pinned a must, there's no need for mmu notifiers (as the page will
simply not be invalidated anyway)?

Thanks,

--
Peter Xu

2023-05-02 15:48:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
> > How does s390 avoid mmu notifiers without having lots of problems?? It
> > is not really optional to hook the invalidations if you need to build
> > a shadow page table..
>
> Totally no idea on s390 details, but.. per my read above, if the firmware
> needs to make sure the page is always available (so no way to fault it in
> on demand), which means a longterm pinning seems appropriate here.
>
> Then if pinned a must, there's no need for mmu notifiers (as the page will
> simply not be invalidated anyway)?

And what if someone deliberately changes the mapping? memory hotplug
in the VM, or whatever?

We have ways to deal with this class of stuff in kvm and in
iommufd/vfio - this does not.

Jason

2023-05-02 15:51:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 17:36, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
>>> How does s390 avoid mmu notifiers without having lots of problems?? It
>>> is not really optional to hook the invalidations if you need to build
>>> a shadow page table..
>>
>> Totally no idea on s390 details, but.. per my read above, if the firmware
>> needs to make sure the page is always available (so no way to fault it in
>> on demand), which means a longterm pinning seems appropriate here.
>>
>> Then if pinned a must, there's no need for mmu notifiers (as the page will
>> simply not be invalidated anyway)?
>
> And what if someone deliberately changes the mapping? memory hotplug
> in the VM, or whatever?

Besides s390 not supporting memory hotplug in VMs (yet): if the guest
wants a different guest physical address, I guess that's the problem of
the guest, and it can update it:

KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via
s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.

If the hypervisor changes the mapping, it's just the same thing as
mixing e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it.
And if you do it, you get to keep the mess you created for your VM.

Linux will make sure to not change the mapping: for example, page
migration of a pinned page will fail.

But maybe I am missing something important here.

--
Thanks,

David / dhildenb

2023-05-02 16:18:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote:
> On 02.05.23 17:36, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
> > > > How does s390 avoid mmu notifiers without having lots of problems?? It
> > > > is not really optional to hook the invalidations if you need to build
> > > > a shadow page table..
> > >
> > > Totally no idea on s390 details, but.. per my read above, if the firmware
> > > needs to make sure the page is always available (so no way to fault it in
> > > on demand), which means a longterm pinning seems appropriate here.
> > >
> > > Then if pinned a must, there's no need for mmu notifiers (as the page will
> > > simply not be invalidated anyway)?
> >
> > And what if someone deliberately changes the mapping? memory hotplug
> > in the VM, or whatever?
>
> Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants
> a different guest physical address, I guess that's the problem of the guest,
> and it can update it:
>
> KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via
> s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.
>
> If the hypervisor changes the mapping, it's just the same thing as mixing
> e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do
> it, you get to keep the mess you created for your VM.
>
> Linux will make sure to not change the mapping: for example, page migration
> of a pinned page will fail.
>
> But maybe I am missing something important here.

It missses the general architectural point why we have all these
shootdown mechanims in other places - plares are not supposed to make
these kinds of assumptions. When the userspace unplugs the memory from
KVM or unmaps it from VFIO it is not still being accessed by the
kernel.

Functional bug or not, it is inconsistent with how this is designed to
work.

Jason

2023-05-02 16:23:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:

> > It missses the general architectural point why we have all these
> > shootdown mechanims in other places - plares are not supposed to make
> > these kinds of assumptions. When the userspace unplugs the memory from
> > KVM or unmaps it from VFIO it is not still being accessed by the
> > kernel.
>
> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
> virtio-balloon) in QEMU with vfio.

That is different, VFIO has it's own contract how it consumes the
memory from the MM and VFIO breaks all this stuff.

But when you tell VFIO to unmap the memory it doesn't keep accessing
it in the background like this does.

Jason

2023-05-02 16:36:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 18:06, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 05:45:40PM +0200, David Hildenbrand wrote:
>> On 02.05.23 17:36, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 11:32:57AM -0400, Peter Xu wrote:
>>>>> How does s390 avoid mmu notifiers without having lots of problems?? It
>>>>> is not really optional to hook the invalidations if you need to build
>>>>> a shadow page table..
>>>>
>>>> Totally no idea on s390 details, but.. per my read above, if the firmware
>>>> needs to make sure the page is always available (so no way to fault it in
>>>> on demand), which means a longterm pinning seems appropriate here.
>>>>
>>>> Then if pinned a must, there's no need for mmu notifiers (as the page will
>>>> simply not be invalidated anyway)?
>>>
>>> And what if someone deliberately changes the mapping? memory hotplug
>>> in the VM, or whatever?
>>
>> Besides s390 not supporting memory hotplug in VMs (yet): if the guest wants
>> a different guest physical address, I guess that's the problem of the guest,
>> and it can update it:
>>
>> KVM_S390_ZPCIOP_REG_AEN is triggered from QEMU via
>> s390_pci_kvm_aif_enable(), triggered by the guest via a special instruction.
>>
>> If the hypervisor changes the mapping, it's just the same thing as mixing
>> e.g. MADV_DONTNEED with longterm pinning in vfio: don't do it. And if you do
>> it, you get to keep the mess you created for your VM.
>>
>> Linux will make sure to not change the mapping: for example, page migration
>> of a pinned page will fail.
>>
>> But maybe I am missing something important here.
>
> It missses the general architectural point why we have all these
> shootdown mechanims in other places - plares are not supposed to make
> these kinds of assumptions. When the userspace unplugs the memory from
> KVM or unmaps it from VFIO it is not still being accessed by the
> kernel.

Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED
(e.g., virtio-balloon) in QEMU with vfio.

>
> Functional bug or not, it is inconsistent with how this is designed to
> work.

Sorry to say, I *really* don't see how that is supposed to work with a
page that *cannot* be faulted back in on demand.

--
Thanks,

David / dhildenb

2023-05-02 16:39:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 18:19, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
>
>>> It missses the general architectural point why we have all these
>>> shootdown mechanims in other places - plares are not supposed to make
>>> these kinds of assumptions. When the userspace unplugs the memory from
>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>> kernel.
>>
>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>> virtio-balloon) in QEMU with vfio.
>
> That is different, VFIO has it's own contract how it consumes the
> memory from the MM and VFIO breaks all this stuff.
>
> But when you tell VFIO to unmap the memory it doesn't keep accessing
> it in the background like this does.

To me, this is similar to when QEMU (user space) triggers
KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
page (1) When triggered by the guest explicitly (2) when resetting the
VM (3) when resetting the virtual PCI device / configuration.

Interrupt gets unregistered from HW (which stops using the page), the
pages get unpinned. Pages get no longer used.

I guess I am still missing (a) how this is fundamentally different (b)
how it could be done differently.

I'd really be happy to learn how a better approach would look like that
does not use longterm pinnings.

I don't see an easy way to not use longterm pinnings. When using mmu
notifiers and getting notified about unmapping of a page (for whatever
reason ... migration, swapout, unmap), you'd have to disable aif. But
when to reenable it (maybe there would be a way)? Also, I'm not sure if
this could even be visible by the guest, if it's suddenly no longer enabled.

Something for the s390x people to explore ... if HW would be providing a
way to deal with that somehow.

--
Thanks,

David / dhildenb

2023-05-02 17:57:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
> On 02.05.23 18:19, Jason Gunthorpe wrote:
> > On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
> >
> > > > It missses the general architectural point why we have all these
> > > > shootdown mechanims in other places - plares are not supposed to make
> > > > these kinds of assumptions. When the userspace unplugs the memory from
> > > > KVM or unmaps it from VFIO it is not still being accessed by the
> > > > kernel.
> > >
> > > Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
> > > munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
> > > virtio-balloon) in QEMU with vfio.
> >
> > That is different, VFIO has it's own contract how it consumes the
> > memory from the MM and VFIO breaks all this stuff.
> >
> > But when you tell VFIO to unmap the memory it doesn't keep accessing
> > it in the background like this does.
>
> To me, this is similar to when QEMU (user space) triggers
> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
> page (1) When triggered by the guest explicitly (2) when resetting the VM
> (3) when resetting the virtual PCI device / configuration.
>
> Interrupt gets unregistered from HW (which stops using the page), the pages
> get unpinned. Pages get no longer used.
>
> I guess I am still missing (a) how this is fundamentally different (b) how
> it could be done differently.

It uses an address that is already scoped within the KVM memory map
and uses KVM's gpa_to_gfn() to translate it to some pinnable page

It is not some independent thing like VFIO, it is explicitly scoped
within the existing KVM structure and it does not follow any mutations
that are done to the gpa map through the usual KVM APIs.

> I'd really be happy to learn how a better approach would look like that does
> not use longterm pinnings.

Sounds like the FW sadly needs pinnings. This is why I said it looks
like DMA. If possible it would be better to get the pinning through
VFIO, eg as a mdev

Otherwise, it would have been cleaner if this was divorced from KVM
and took in a direct user pointer, then maybe you could make the
argument is its own thing with its own lifetime rules. (then you are
kind of making your own mdev)

Or, perhaps, this is really part of some radical "irqfd" that we've
been on and off talking about specifically to get this area of
interrupt bypass uAPI'd properly..

Jason

2023-05-02 18:02:11

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 5/2/23 1:46 PM, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
>> On 02.05.23 18:19, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
>>>
>>>>> It missses the general architectural point why we have all these
>>>>> shootdown mechanims in other places - plares are not supposed to make
>>>>> these kinds of assumptions. When the userspace unplugs the memory from
>>>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>>>> kernel.
>>>>
>>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>>>> virtio-balloon) in QEMU with vfio.
>>>
>>> That is different, VFIO has it's own contract how it consumes the
>>> memory from the MM and VFIO breaks all this stuff.
>>>
>>> But when you tell VFIO to unmap the memory it doesn't keep accessing
>>> it in the background like this does.
>>
>> To me, this is similar to when QEMU (user space) triggers
>> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
>> page (1) When triggered by the guest explicitly (2) when resetting the VM
>> (3) when resetting the virtual PCI device / configuration.
>>
>> Interrupt gets unregistered from HW (which stops using the page), the pages
>> get unpinned. Pages get no longer used.
>>
>> I guess I am still missing (a) how this is fundamentally different (b) how
>> it could be done differently.
>
> It uses an address that is already scoped within the KVM memory map
> and uses KVM's gpa_to_gfn() to translate it to some pinnable page
>
> It is not some independent thing like VFIO, it is explicitly scoped
> within the existing KVM structure and it does not follow any mutations
> that are done to the gpa map through the usual KVM APIs.
>
>> I'd really be happy to learn how a better approach would look like that does
>> not use longterm pinnings.
>
> Sounds like the FW sadly needs pinnings. This is why I said it looks
> like DMA. If possible it would be better to get the pinning through
> VFIO, eg as a mdev

Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change.

>
> Otherwise, it would have been cleaner if this was divorced from KVM
> and took in a direct user pointer, then maybe you could make the
> argument is its own thing with its own lifetime rules. (then you are
> kind of making your own mdev)

Problem there is that firmware needs both the location (where to indicate the event) and the identity of the KVM instance (what guest to ship the GISA alert to) so I don't think we can completely divorce them.

>
> Or, perhaps, this is really part of some radical "irqfd" that we've
> been on and off talking about specifically to get this area of
> interrupt bypass uAPI'd properly..

I investigated that at one point and could not seem to get it to fit; I'll have another look there.



2023-05-02 18:11:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On Tue, May 02, 2023 at 01:59:09PM -0400, Matthew Rosato wrote:

> >> I'd really be happy to learn how a better approach would look like that does
> >> not use longterm pinnings.
> >
> > Sounds like the FW sadly needs pinnings. This is why I said it looks
> > like DMA. If possible it would be better to get the pinning through
> > VFIO, eg as a mdev
>
> Hrm, these are today handled as a vfio-pci device (e.g. no mdev) so that would be a pretty significant change.

I was thinking more of having your vIRQ controller as a vfio-mdev, so
qemu would have to open the irq controller as well as the kvm and the
PCI device

> > Otherwise, it would have been cleaner if this was divorced from KVM
> > and took in a direct user pointer, then maybe you could make the
> > argument is its own thing with its own lifetime rules. (then you are
> > kind of making your own mdev)
>
> Problem there is that firmware needs both the location (where to
> indicate the event) and the identity of the KVM instance (what guest
> to ship the GISA alert to) so I don't think we can completely
> divorce them.

I should have said "kvm page table", being in the KVM ioctls is sort
of OK - but I'm really skeptical that KVM is a good place to put HW
device wrappers.

> > Or, perhaps, this is really part of some radical "irqfd" that we've
> > been on and off talking about specifically to get this area of
> > interrupt bypass uAPI'd properly..
>
> I investigated that at one point and could not seem to get it to
> fit; I'll have another look there.

It is a long journey, but other arches have problems here too

Jason

2023-05-02 19:26:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

On 02.05.23 19:46, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
>> On 02.05.23 18:19, Jason Gunthorpe wrote:
>>> On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
>>>
>>>>> It missses the general architectural point why we have all these
>>>>> shootdown mechanims in other places - plares are not supposed to make
>>>>> these kinds of assumptions. When the userspace unplugs the memory from
>>>>> KVM or unmaps it from VFIO it is not still being accessed by the
>>>>> kernel.
>>>>
>>>> Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
>>>> munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
>>>> virtio-balloon) in QEMU with vfio.
>>>
>>> That is different, VFIO has it's own contract how it consumes the
>>> memory from the MM and VFIO breaks all this stuff.
>>>
>>> But when you tell VFIO to unmap the memory it doesn't keep accessing
>>> it in the background like this does.
>>
>> To me, this is similar to when QEMU (user space) triggers
>> KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
>> page (1) When triggered by the guest explicitly (2) when resetting the VM
>> (3) when resetting the virtual PCI device / configuration.
>>
>> Interrupt gets unregistered from HW (which stops using the page), the pages
>> get unpinned. Pages get no longer used.
>>
>> I guess I am still missing (a) how this is fundamentally different (b) how
>> it could be done differently.
>
> It uses an address that is already scoped within the KVM memory map
> and uses KVM's gpa_to_gfn() to translate it to some pinnable page
>
> It is not some independent thing like VFIO, it is explicitly scoped
> within the existing KVM structure and it does not follow any mutations
> that are done to the gpa map through the usual KVM APIs.

Right, it consumes guest physical addresses that are translated via the KVM memslots.
Agreed that it does not (and possibly cannot easily) update the hardware when the KVM
mapping (memslots) would ever change.

I guess it's also not documented that this is not supported.

>
>> I'd really be happy to learn how a better approach would look like that does
>> not use longterm pinnings.
>
> Sounds like the FW sadly needs pinnings. This is why I said it looks
> like DMA. If possible it would be better to get the pinning through
> VFIO, eg as a mdev
>
> Otherwise, it would have been cleaner if this was divorced from KVM
> and took in a direct user pointer, then maybe you could make the
> argument is its own thing with its own lifetime rules. (then you are
> kind of making your own mdev)

It would be cleaner if user space would translate the GPA to a HVA and provid
that, agreed ...

>
> Or, perhaps, this is really part of some radical "irqfd" that we've
> been on and off talking about specifically to get this area of
> interrupt bypass uAPI'd properly..

Most probably. It's one of these very special cases ... thankfully:

$ git grep -i longterm | grep kvm
arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, pages);
arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM,


--
Thanks,

David / dhildenb