2024-03-14 16:13:28

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

Derrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
that in a better way, similar to how ordinary GUP handles it.

Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
VMA handling into __get_user_pages(), and make faultin_page_range()
call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
avoid the now-useless madvise VMA walk, because __get_user_pages() will
perform the VMA lookup either way.

I briefly played with handling the FOLL_MADV_POPULATE checks in
__get_user_pages() a bit differently, integrating them with existing
handling, but it ended up looking worse. So I decided to keep it simple.

Likely, we need better selftests, but the reproducer from Darrick might
be a bit hard to convert into a simple selftest.

Note that using mlock() in Darricks reproducer results in a similar
endless retry. Likely, that is not what we want, and we should handle
VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
However, similarly using __get_user_pages_locked() might be more
complicated, because of the advanced VMA handling in
populate_vma_page_range().

Further, most populate_vma_page_range() callers simply ignore the return
values, so it's unclear in which cases we expect to just silently fail, or
where we'd want to retry+fail or endlessly retry instead.

Cc: Andrew Morton <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Hugh Dickins <[email protected]>

David Hildenbrand (2):
mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
properly
mm/madvise: don't perform madvise VMA walk for
MADV_POPULATE_(READ|WRITE)

mm/gup.c | 54 ++++++++++++++++++++++++++++++---------------------
mm/internal.h | 10 ++++++----
mm/madvise.c | 43 +++++++++++++---------------------------
3 files changed, 52 insertions(+), 55 deletions(-)


base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c
--
2.43.2



2024-03-14 16:13:41

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 2/2] mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)

We changed faultin_page_range() to no longer consume a VMA, because
faultin_page_range() might internally release the mm lock to lookup
the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
independent of that, __get_user_pages() will always lookup the VMA
itself.

Now that we let __get_user_pages() just handle VMA checks in a way that
is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
is just overhead. So let's just call madvise_populate()
on the full range instead.

There is one change in behavior: madvise_walk_vmas() would skip any VMA
holes, and if everything succeeded, it would return -ENOMEM after
processing all VMAs.

However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
notice any difference: -ENOMEM might either indicate that there were VMA
holes or that populating page tables failed because there was not enough
memory. So it's unlikely that user space will notice the difference, and
that special handling likely only makes sense for some other madvise()
actions.

Further, we'd already fail with -ENOMEM early in the past if looking up the
VMA after dropping the MM lock failed because of concurrent VMA
modifications. So let's just keep it simple and avoid the madvise VMA
walk, and consistently fail early if we find a VMA hole.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/madvise.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1a073fcc4c0c0..a2dd70c4a2e6b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -901,26 +901,19 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}

-static long madvise_populate(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- int behavior)
+static long madvise_populate(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int behavior)
{
const bool write = behavior == MADV_POPULATE_WRITE;
- struct mm_struct *mm = vma->vm_mm;
int locked = 1;
long pages;

- *prev = vma;
-
while (start < end) {
/* Populate (prefault) page tables readable/writable. */
pages = faultin_page_range(mm, start, end, write, &locked);
if (!locked) {
mmap_read_lock(mm);
locked = 1;
- *prev = NULL;
- vma = NULL;
}
if (pages < 0) {
switch (pages) {
@@ -1021,9 +1014,6 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- return madvise_populate(vma, prev, start, end, behavior);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1425,8 +1415,16 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
end = start + len;

blk_start_plug(&plug);
- error = madvise_walk_vmas(mm, start, end, behavior,
- madvise_vma_behavior);
+ switch (behavior) {
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ error = madvise_populate(mm, start, end, behavior);
+ break;
+ default:
+ error = madvise_walk_vmas(mm, start, end, behavior,
+ madvise_vma_behavior);
+ break;
+ }
blk_finish_plug(&plug);
if (write)
mmap_write_unlock(mm);
--
2.43.2


2024-03-14 16:13:50

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 1/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

While the madvise() call can be interrupted by a signal, this is not
the desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should
behave like page faults in that case: fail and not retry forever.

A reproducer can be found at [1].

The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper
way: it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again,
never setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().

__get_user_pages_locked() does what we want, but duplicating that logic
in faultin_vma_page_range() feels wrong.

So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault
handler return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point,
propagating -EFAULT from faultin_page() to __get_user_pages(), all the way
to madvise_populate().

But, there is an issue: __get_user_pages_locked() will end up re-taking the
MM lock and then __get_user_pages() will do another VMA lookup. In the
meantime, the VMA layout could have changed and we'd fail with different
error codes than we'd want to.

As __get_user_pages() will currently do a new VMA lookup either way,
let it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.

With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.

[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/

Reported-by: Darrick J. Wong <[email protected]>
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 54 ++++++++++++++++++++++++++++++---------------------
mm/internal.h | 10 ++++++----
mm/madvise.c | 17 ++--------------
3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..f6d55635742f5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1206,6 +1206,22 @@ static long __get_user_pages(struct mm_struct *mm,

/* first iteration or cross vma bound */
if (!vma || start >= vma->vm_end) {
+ /*
+ * MADV_POPULATE_(READ|WRITE) wants to handle VMA
+ * lookups+error reporting differently.
+ */
+ if (gup_flags & FOLL_MADV_POPULATE) {
+ vma = vma_lookup(mm, start);
+ if (!vma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ if (check_vma_flags(vma, gup_flags)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ goto retry;
+ }
vma = gup_vma_lookup(mm, start);
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
@@ -1683,35 +1699,35 @@ long populate_vma_page_range(struct vm_area_struct *vma,
}

/*
- * faultin_vma_page_range() - populate (prefault) page tables inside the
- * given VMA range readable/writable
+ * faultin_page_range() - populate (prefault) page tables inside the
+ * given range readable/writable
*
* This takes care of mlocking the pages, too, if VM_LOCKED is set.
*
- * @vma: target vma
+ * @mm: the mm to populate page tables in
* @start: start address
* @end: end address
* @write: whether to prefault readable or writable
* @locked: whether the mmap_lock is still held
*
- * Returns either number of processed pages in the vma, or a negative error
- * code on error (see __get_user_pages()).
+ * Returns either number of processed pages in the MM, or a negative error
+ * code on error (see __get_user_pages()). Note that this function reports
+ * errors related to VMAs, such as incompatible mappings, as expected by
+ * MADV_POPULATE_(READ|WRITE).
*
- * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
- * covered by the VMA. If it's released, *@locked will be set to 0.
+ * The range must be page-aligned.
+ *
+ * mm->mmap_lock must be held. If it's released, *@locked will be set to 0.
*/
-long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, bool write, int *locked)
+long faultin_page_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool write, int *locked)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int gup_flags;
long ret;

VM_BUG_ON(!PAGE_ALIGNED(start));
VM_BUG_ON(!PAGE_ALIGNED(end));
- VM_BUG_ON_VMA(start < vma->vm_start, vma);
- VM_BUG_ON_VMA(end > vma->vm_end, vma);
mmap_assert_locked(mm);

/*
@@ -1723,19 +1739,13 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
* a poisoned page.
* !FOLL_FORCE: Require proper access permissions.
*/
- gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE;
+ gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE |
+ FOLL_MADV_POPULATE;
if (write)
gup_flags |= FOLL_WRITE;

- /*
- * We want to report -EINVAL instead of -EFAULT for any permission
- * problems or incompatible mappings.
- */
- if (check_vma_flags(vma, gup_flags))
- return -EINVAL;
-
- ret = __get_user_pages(mm, start, nr_pages, gup_flags,
- NULL, locked);
+ ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+ gup_flags);
lru_add_drain();
return ret;
}
diff --git a/mm/internal.h b/mm/internal.h
index d1c69119b24fb..a57dd5156cf84 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -686,9 +686,8 @@ struct anon_vma *folio_anon_vma(struct folio *folio);
void unmap_mapping_folio(struct folio *folio);
extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *locked);
-extern long faultin_vma_page_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- bool write, int *locked);
+extern long faultin_page_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool write, int *locked);
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);

@@ -1127,10 +1126,13 @@ enum {
FOLL_FAST_ONLY = 1 << 20,
/* allow unlocking the mmap lock */
FOLL_UNLOCKABLE = 1 << 21,
+ /* VMA lookup+checks compatible with MADV_POPULATE_(READ|WRITE) */
+ FOLL_MADV_POPULATE = 1 << 22,
};

#define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
- FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+ FOLL_FAST_ONLY | FOLL_UNLOCKABLE | \
+ FOLL_MADV_POPULATE)

/*
* Indicates for which pages that are write-protected in the page table,
diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158c..1a073fcc4c0c0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,27 +908,14 @@ static long madvise_populate(struct vm_area_struct *vma,
{
const bool write = behavior == MADV_POPULATE_WRITE;
struct mm_struct *mm = vma->vm_mm;
- unsigned long tmp_end;
int locked = 1;
long pages;

*prev = vma;

while (start < end) {
- /*
- * We might have temporarily dropped the lock. For example,
- * our VMA might have been split.
- */
- if (!vma || start >= vma->vm_end) {
- vma = vma_lookup(mm, start);
- if (!vma)
- return -ENOMEM;
- }
-
- tmp_end = min_t(unsigned long, end, vma->vm_end);
/* Populate (prefault) page tables readable/writable. */
- pages = faultin_vma_page_range(vma, start, tmp_end, write,
- &locked);
+ pages = faultin_page_range(mm, start, end, write, &locked);
if (!locked) {
mmap_read_lock(mm);
locked = 1;
@@ -949,7 +936,7 @@ static long madvise_populate(struct vm_area_struct *vma,
pr_warn_once("%s: unhandled return value: %ld\n",
__func__, pages);
fallthrough;
- case -ENOMEM:
+ case -ENOMEM: /* No VMA or out of memory. */
return -ENOMEM;
}
}
--
2.43.2


2024-03-15 02:25:51

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

On Thu, Mar 14, 2024 at 05:12:58PM +0100, David Hildenbrand wrote:
> Derrick reports that in some cases where pread() would fail with -EIO and
> mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
> MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
>
> It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
> that in a better way, similar to how ordinary GUP handles it.
>
> Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
> VMA handling into __get_user_pages(), and make faultin_page_range()
> call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
> avoid the now-useless madvise VMA walk, because __get_user_pages() will
> perform the VMA lookup either way.
>
> I briefly played with handling the FOLL_MADV_POPULATE checks in
> __get_user_pages() a bit differently, integrating them with existing
> handling, but it ended up looking worse. So I decided to keep it simple.
>
> Likely, we need better selftests, but the reproducer from Darrick might
> be a bit hard to convert into a simple selftest.

No worries, I can convert my reproducer into an fstest. I actually had
no idea that there were so many madvise flags, it's tempting to wire up
fsx and fsstress so that the long soak group tests will exercise them.

> Note that using mlock() in Darricks reproducer results in a similar
> endless retry. Likely, that is not what we want, and we should handle
> VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
> However, similarly using __get_user_pages_locked() might be more
> complicated, because of the advanced VMA handling in
> populate_vma_page_range().
>
> Further, most populate_vma_page_range() callers simply ignore the return
> values, so it's unclear in which cases we expect to just silently fail, or
> where we'd want to retry+fail or endlessly retry instead.

With this patchset applied, my reproducer no longer gets stuck in an
infinite loop. I'll throw this at fstests overnight and see if anything
else falls out. Thank you!

--D

> Cc: Andrew Morton <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Cc: John Hubbard <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Hugh Dickins <[email protected]>
>
> David Hildenbrand (2):
> mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
> properly
> mm/madvise: don't perform madvise VMA walk for
> MADV_POPULATE_(READ|WRITE)
>
> mm/gup.c | 54 ++++++++++++++++++++++++++++++---------------------
> mm/internal.h | 10 ++++++----
> mm/madvise.c | 43 +++++++++++++---------------------------
> 3 files changed, 52 insertions(+), 55 deletions(-)
>
>
> base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c
> --
> 2.43.2
>

2024-03-17 16:50:25

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly

On Thu, Mar 14, 2024 at 05:12:58PM +0100, David Hildenbrand wrote:
> Derrick reports that in some cases where pread() would fail with -EIO and
> mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
> MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
>
> It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle
> that in a better way, similar to how ordinary GUP handles it.
>
> Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE)
> VMA handling into __get_user_pages(), and make faultin_page_range()
> call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further,
> avoid the now-useless madvise VMA walk, because __get_user_pages() will
> perform the VMA lookup either way.
>
> I briefly played with handling the FOLL_MADV_POPULATE checks in
> __get_user_pages() a bit differently, integrating them with existing
> handling, but it ended up looking worse. So I decided to keep it simple.
>
> Likely, we need better selftests, but the reproducer from Darrick might
> be a bit hard to convert into a simple selftest.
>
> Note that using mlock() in Darricks reproducer results in a similar
> endless retry. Likely, that is not what we want, and we should handle
> VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well.
> However, similarly using __get_user_pages_locked() might be more
> complicated, because of the advanced VMA handling in
> populate_vma_page_range().
>
> Further, most populate_vma_page_range() callers simply ignore the return
> values, so it's unclear in which cases we expect to just silently fail, or
> where we'd want to retry+fail or endlessly retry instead.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Cc: John Hubbard <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Hugh Dickins <[email protected]>

After a few days I haven't seen any problems, so
Tested-by: Darrick J. Wong <[email protected]>

--D

>
> David Hildenbrand (2):
> mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY
> properly
> mm/madvise: don't perform madvise VMA walk for
> MADV_POPULATE_(READ|WRITE)
>
> mm/gup.c | 54 ++++++++++++++++++++++++++++++---------------------
> mm/internal.h | 10 ++++++----
> mm/madvise.c | 43 +++++++++++++---------------------------
> 3 files changed, 52 insertions(+), 55 deletions(-)
>
>
> base-commit: f48159f866f422371bb1aad10eb4d05b29ca4d8c
> --
> 2.43.2
>

2024-03-17 16:52:11

by Darrick J. Wong

[permalink] [raw]
Subject: [RFC PATCH] xfs_io: add linux madvise advice codes

From: Darrick J. Wong <[email protected]>

Add all the Linux-specific madvise codes. We're going to need
MADV_POPULATE_READ for a regression test.

Signed-off-by: Darrick J. Wong <[email protected]>
---
configure.ac | 1
include/builddefs.in | 1
io/Makefile | 4 ++
io/madvise.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
m4/package_libcdev.m4 | 17 ++++++++
5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 3786e44db6fd..723bdca506d1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -187,6 +187,7 @@ AC_CONFIG_SYSTEMD_SYSTEM_UNIT_DIR
AC_CONFIG_CROND_DIR
AC_CONFIG_UDEV_DIR
AC_HAVE_BLKID_TOPO
+AC_HAVE_KERNEL_MADVISE_FLAGS

if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
AC_PACKAGE_CHECK_UBSAN
diff --git a/include/builddefs.in b/include/builddefs.in
index 07428206da45..a04f3e70f19d 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -193,6 +193,7 @@ HAVE_O_TMPFILE = @have_o_tmpfile@
HAVE_MKOSTEMP_CLOEXEC = @have_mkostemp_cloexec@
USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@
HAVE_FSVERITY_DESCR = @have_fsverity_descr@
+HAVE_KERNEL_MADVISE = @have_kernel_madvise@

GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall -Werror -Wextra -Wno-unused-parameter
# -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/io/Makefile b/io/Makefile
index 6f903e3df9a7..ce39fda0e82a 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -84,6 +84,10 @@ ifeq ($(HAVE_GETFSMAP),yes)
CFILES += fsmap.c
endif

+ifeq ($(HAVE_KERNEL_MADVISE),yes)
+LCFLAGS += -DHAVE_KERNEL_MADVISE
+endif
+
default: depend $(LTCOMMAND)

include $(BUILDRULES)
diff --git a/io/madvise.c b/io/madvise.c
index 6e9c5b121d72..081666f403bb 100644
--- a/io/madvise.c
+++ b/io/madvise.c
@@ -9,6 +9,9 @@
#include <sys/mman.h>
#include "init.h"
#include "io.h"
+#ifdef HAVE_KERNEL_MADVISE
+# include <asm/mman.h>
+#endif

static cmdinfo_t madvise_cmd;

@@ -26,6 +29,47 @@ madvise_help(void)
" -r -- expect random page references (POSIX_MADV_RANDOM)\n"
" -s -- expect sequential page references (POSIX_MADV_SEQUENTIAL)\n"
" -w -- will need these pages (POSIX_MADV_WILLNEED) [*]\n"
+"\n"
+"The following Linux-specific advise values are available:\n"
+#ifdef MADV_COLLAPSE
+" -c -- try to collapse range into transparent hugepages (MADV_COLLAPSE)\n"
+#endif
+#ifdef MADV_COLD
+" -D -- deactivate the range (MADV_COLD)\n"
+#endif
+#ifdef MADV_FREE
+" -f -- free the range (MADV_FREE)\n"
+#endif
+#ifdef MADV_NOHUGEPAGE
+" -h -- disable transparent hugepages (MADV_NOHUGEPAGE)\n"
+#endif
+#ifdef MADV_HUGEPAGE
+" -H -- enable transparent hugepages (MADV_HUGEPAGE)\n"
+#endif
+#ifdef MADV_MERGEABLE
+" -m -- mark the range mergeable (MADV_MERGEABLE)\n"
+#endif
+#ifdef MADV_UNMERGEABLE
+" -M -- mark the range unmergeable (MADV_UNMERGEABLE)\n"
+#endif
+#ifdef MADV_SOFT_OFFLINE
+" -o -- mark the range offline (MADV_SOFT_OFFLINE)\n"
+#endif
+#ifdef MADV_REMOVE
+" -p -- punch a hole in the file (MADV_REMOVE)\n"
+#endif
+#ifdef MADV_HWPOISON
+" -P -- poison the page cache (MADV_HWPOISON)\n"
+#endif
+#ifdef MADV_POPULATE_READ
+" -R -- prefault in the range for read (MADV_POPULATE_READ)\n"
+#endif
+#ifdef MADV_POPULATE_WRITE
+" -W -- prefault in the range for write (MADV_POPULATE_WRITE)\n"
+#endif
+#ifdef MADV_PAGEOUT
+" -X -- reclaim the range (MADV_PAGEOUT)\n"
+#endif
" Notes:\n"
" NORMAL sets the default readahead setting on the file.\n"
" RANDOM sets the readahead setting on the file to zero.\n"
@@ -45,20 +89,85 @@ madvise_f(
int advise = MADV_NORMAL, c;
size_t blocksize, sectsize;

- while ((c = getopt(argc, argv, "drsw")) != EOF) {
+ while ((c = getopt(argc, argv, "cdDfhHmMopPrRswWX")) != EOF) {
switch (c) {
+#ifdef MADV_COLLAPSE
+ case 'c': /* collapse to thp */
+ advise = MADV_COLLAPSE;
+ break;
+#endif
case 'd': /* Don't need these pages */
advise = MADV_DONTNEED;
break;
+#ifdef MADV_COLD
+ case 'D': /* make more likely to be reclaimed */
+ advise = MADV_COLD;
+ break;
+#endif
+#ifdef MADV_FREE
+ case 'f': /* page range out of memory */
+ advise = MADV_FREE;
+ break;
+#endif
+#ifdef MADV_HUGEPAGE
+ case 'h': /* enable thp memory */
+ advise = MADV_HUGEPAGE;
+ break;
+#endif
+#ifdef MADV_NOHUGEPAGE
+ case 'H': /* disable thp memory */
+ advise = MADV_NOHUGEPAGE;
+ break;
+#endif
+#ifdef MADV_MERGEABLE
+ case 'm': /* enable merging */
+ advise = MADV_MERGEABLE;
+ break;
+#endif
+#ifdef MADV_UNMERGEABLE
+ case 'M': /* disable merging */
+ advise = MADV_UNMERGEABLE;
+ break;
+#endif
+#ifdef MADV_SOFT_OFFLINE
+ case 'o': /* offline */
+ advise = MADV_SOFT_OFFLINE;
+ break;
+#endif
+#ifdef MADV_REMOVE
+ case 'p': /* punch hole */
+ advise = MADV_REMOVE;
+ break;
+#endif
+#ifdef MADV_HWPOISON
+ case 'P': /* poison */
+ advise = MADV_HWPOISON;
+ break;
+#endif
case 'r': /* Expect random page references */
advise = MADV_RANDOM;
break;
+#ifdef MADV_POPULATE_READ
+ case 'R': /* fault in pages for read */
+ advise = MADV_POPULATE_READ;
+ break;
+#endif
case 's': /* Expect sequential page references */
advise = MADV_SEQUENTIAL;
break;
case 'w': /* Will need these pages */
advise = MADV_WILLNEED;
break;
+#ifdef MADV_POPULATE_WRITE
+ case 'W': /* fault in pages for write */
+ advise = MADV_POPULATE_WRITE;
+ break;
+#endif
+#ifdef MADV_PAGEOUT
+ case 'X': /* reclaim memory */
+ advise = MADV_PAGEOUT;
+ break;
+#endif
default:
exitcode = 1;
return command_usage(&madvise_cmd);
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 84f288dfcfdb..064d050b2b55 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -322,3 +322,20 @@ struct fsverity_descriptor m = { };
AC_SUBST(have_fsverity_descr)
])

+#
+# Check if asm/mman.h can be included
+#
+AC_DEFUN([AC_HAVE_KERNEL_MADVISE_FLAGS],
+ [ AC_MSG_CHECKING([for kernel madvise flags in asm/mman.h ])
+ AC_COMPILE_IFELSE(
+ [ AC_LANG_PROGRAM([[
+#include <asm/mman.h>
+ ]], [[
+int moo = MADV_COLLAPSE;
+ ]])
+ ], have_kernel_madvise=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+ AC_SUBST(have_kernel_madvise)
+ ])
+

2024-03-17 16:53:43

by Darrick J. Wong

[permalink] [raw]
Subject: [RFC PATCH] fstests: test MADV_POPULATE_READ with IO errors

From: Darrick J. Wong <[email protected]>

This is a regression test for "mm/madvise: make
MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly".

Cc: David Hildenbrand <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
---
tests/generic/1835 | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
tests/generic/1835.out | 4 +++
2 files changed, 69 insertions(+)
create mode 100755 tests/generic/1835
create mode 100644 tests/generic/1835.out

diff --git a/tests/generic/1835 b/tests/generic/1835
new file mode 100755
index 0000000000..07479ab712
--- /dev/null
+++ b/tests/generic/1835
@@ -0,0 +1,65 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024 Oracle. All Rights Reserved.
+#
+# FS QA Test 1835
+#
+# This is a regression test for a kernel hang that I saw when creating a memory
+# mapping, injecting EIO errors on the block device, and invoking
+# MADV_POPULATE_READ on the mapping to fault in the pages.
+#
+. ./common/preamble
+_begin_fstest auto rw
+
+# Override the default cleanup function.
+_cleanup()
+{
+ cd /
+ rm -f $tmp.*
+ _dmerror_unmount
+ _dmerror_cleanup
+}
+
+# Import common functions.
+. ./common/dmerror
+
+_fixed_by_kernel_commit XXXXXXXXXXXX \
+ "mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly"
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_require_xfs_io_command madvise -R
+_require_scratch
+_require_dm_target error
+_require_command "$TIMEOUT_PROG" "timeout"
+
+_scratch_mkfs >> $seqres.full 2>&1
+_dmerror_init
+
+filesz=2m
+
+# Create a file that we'll read, then cycle mount to zap pagecache
+_dmerror_mount
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $filesz" "$SCRATCH_MNT/a" >> $seqres.full
+_dmerror_unmount
+_dmerror_mount
+
+# Try to read the file data in a regular fashion just to prove that it works.
+echo read with no errors
+timeout -s KILL 10s $XFS_IO_PROG -c "mmap -r 0 $filesz" -c "madvise -R 0 $filesz" "$SCRATCH_MNT/a"
+_dmerror_unmount
+_dmerror_mount
+
+# Load file metadata and induce EIO errors on read. Try to provoke the kernel;
+# kill the process after 10s so we can clean up.
+stat "$SCRATCH_MNT/a" >> $seqres.full
+echo read with IO errors
+_dmerror_load_error_table
+timeout -s KILL 10s $XFS_IO_PROG -c "mmap -r 0 $filesz" -c "madvise -R 0 $filesz" "$SCRATCH_MNT/a"
+_dmerror_load_working_table
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1835.out b/tests/generic/1835.out
new file mode 100644
index 0000000000..1b03586e8c
--- /dev/null
+++ b/tests/generic/1835.out
@@ -0,0 +1,4 @@
+QA output created by 1835
+read with no errors
+read with IO errors
+madvise: Bad address

2024-03-17 21:14:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH] xfs_io: add linux madvise advice codes

> +#
> +# Check if asm/mman.h can be included
> +#
> +AC_DEFUN([AC_HAVE_KERNEL_MADVISE_FLAGS],
> + [ AC_MSG_CHECKING([for kernel madvise flags in asm/mman.h ])
> + AC_COMPILE_IFELSE(
> + [ AC_LANG_PROGRAM([[
> +#include <asm/mman.h>
> + ]], [[
> +int moo = MADV_COLLAPSE;
> + ]])
> + ], have_kernel_madvise=yes
> + AC_MSG_RESULT(yes),
> + AC_MSG_RESULT(no))
> + AC_SUBST(have_kernel_madvise)
> + ])
> +

I don't think we really need this check, as madvise and asm/mman.h
have been around forever. We can probably also drop most of the
actual flag idefs, probably for everything older than MADV_WIPEONFORK.

The rest looks good to me.

2024-03-17 21:14:59

by Christoph Hellwig

[permalink] [raw]

2024-03-19 09:00:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH] fstests: test MADV_POPULATE_READ with IO errors

On 17.03.24 17:53, Darrick J. Wong wrote:
> From: Darrick J. Wong <[email protected]>
>
> This is a regression test for "mm/madvise: make
> MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly".
>
> Cc: David Hildenbrand <[email protected]>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---

Thanks for including this test, very helpful!

It's my first time reading fstests code, so I cannot give any feedback
that would be of a lot of value. Having that said, nothing jumped at me :)

--
Cheers,

David / dhildenb