LinuxLists.cc - [PATCH v4 00/10] userfaultfd: add minor fault handling for shmem

2021-04-20 22:10:09

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 00/10] userfaultfd: add minor fault handling for shmem

Base
====

This series is based on (and therefore should apply cleanly to) the tag
"v5.12-rc7-mmots-2021-04-11-20-49", additionally with Peter's selftest cleanup
series applied first:

https://lore.kernel.org/patchwork/cover/1412450/

Changelog
=========

v3->v4:
- Fix handling of the shmem private mcopy case. Previously, I had (incorrectly)
assumed that !vma_is_anonymous() was equivalent to "the page will be in the
page cache". But, in this case we have an optimization where we allocate a new
*anonymous* page. So, use a new "bool page_in_cache" instead, which checks if
page->mapping is set. Correct several places with this new check. [Hugh]
- Fix calling mm_counter() before page_add_..._rmap(). [Hugh]
- When modifying shmem_mcopy_atomic_pte() to use the new install_pte() helper,
just use lru_cache_add_inactive_or_unevictable(), no need to branch and maybe
use lru_cache_add(). [Hugh]
- De-pluralize mcopy_atomic_install_pte(s). [Hugh]
- Make "writable" a bool, and initialize consistently. [Hugh]

v2->v3:
- Picked up {Reviewed,Acked}-by's.
- Reorder commits: introduce CONTINUE before MINOR registration. [Hugh, Peter]
- Don't try to {unlock,put}_page an xarray value in shmem_getpage_gfp. [Hugh]
- Move enum mcopy_atomic_mode forward declare out of CONFIG_HUGETLB_PAGE. [Hugh]
- Keep mistakenly removed UFFD_USER_MODE_ONLY in selftest. [Peter]
- Cleanup context management in self test (make clear implicit, remove unneeded
return values now that we have err()). [Peter]
- Correct dst_pte argument to dst_pmd in shmem_mcopy_atomic_pte macro. [Hugh]
- Mention the new shmem support feature in documentation. [Hugh]

v1->v2:
- Pick up Reviewed-by's.
- Don't swapin page when a minor fault occurs. Notice that it needs to be
swapped in, and just immediately fire the minor fault. Let a future CONTINUE
deal with swapping in the page. [Peter]
- Clarify comment about i_size checks in mm/userfaultfd.c. [Peter]
- Only forward declare once (out of #ifdef) in hugetlb.h. [Peter]

Changes since [2]:
- Squash the fixes ([2]) in with the original series ([1]). This makes reviewing
easier, as we no longer have to sift through deltas undoing what we had done
before. [Hugh, Peter]
- Modify shmem_mcopy_atomic_pte() to use the new mcopy_atomic_install_ptes()
helper, reducing code duplication. [Hugh]
- Properly trigger handle_userfault() in the shmem_swapin_page() case. [Hugh]
- Use shmem_getpage() instead of find_lock_page() to lookup the existing page in
for continue. This properly deals with swapped-out pages. [Hugh]
- Unconditionally pte_mkdirty() for anon memory (as before). [Peter]
- Don't include userfaultfd_k.h in either hugetlb.h or shmem_fs.h. [Hugh]
- Add comment for UFFD_FEATURE_MINOR_SHMEM (to match _HUGETLBFS). [Hugh]
- Fix some small cleanup issues (parens, reworded conditionals, reduced plumbing
of some parameters, simplify labels/gotos, ...). [Hugh, Peter]

Overview
========

See the series which added minor faults for hugetlbfs [3] for a detailed
overview of minor fault handling in general. This series adds the same support
for shmem-backed areas.

This series is structured as follows:

- Commits 1 and 2 are cleanups.
- Commits 3 and 4 implement the new feature (minor fault handling for shmem).
- Commits 5, 6, 7, 8 update the userfaultfd selftest to exercise the feature.
- Commit 9 is one final cleanup, modifying an existing code path to re-use a new
helper we've introduced. We rely on the selftest to show that this change
doesn't break anything.
- Commit 10 is a small documentation update to reflect the new changes.

Use Case
========

In some cases it is useful to have VM memory backed by tmpfs instead of
hugetlbfs. So, this feature will be used to support the same VM live migration
use case described in my original series.

Additionally, Android folks (Lokesh Gidra <[email protected]>) hope to
optimize the Android Runtime garbage collector using this feature:

"The plan is to use userfaultfd for concurrently compacting the heap. With
this feature, the heap can be shared-mapped at another location where the
GC-thread(s) could continue the compaction operation without the need to
invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads
get faults on the heap, UFFDIO_CONTINUE can be used to resume execution.
Furthermore, this feature enables updating references in the 'non-moving'
portion of the heap efficiently. Without this feature, uneccessary page
copying (ioctl(UFFDIO_COPY)) would be required."

[1] https://lore.kernel.org/patchwork/cover/1388144/
[2] https://lore.kernel.org/patchwork/patch/1408161/
[3] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t

Axel Rasmussen (10):
userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h
userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
userfaultfd/shmem: support minor fault registration for shmem
userfaultfd/selftests: use memfd_create for shmem test type
userfaultfd/selftests: create alias mappings in the shmem test
userfaultfd/selftests: reinitialize test context in each test
userfaultfd/selftests: exercise minor fault handling shmem support
userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_pte()
userfaultfd: update documentation to mention shmem minor faults

Documentation/admin-guide/mm/userfaultfd.rst | 3 +-
fs/userfaultfd.c | 6 +-
include/linux/hugetlb.h | 4 +-
include/linux/shmem_fs.h | 17 +-
include/linux/userfaultfd_k.h | 5 +
include/uapi/linux/userfaultfd.h | 7 +-
mm/hugetlb.c | 1 +
mm/memory.c | 8 +-
mm/shmem.c | 115 +++-----
mm/userfaultfd.c | 175 ++++++++----
tools/testing/selftests/vm/userfaultfd.c | 274 ++++++++++++-------
11 files changed, 364 insertions(+), 251 deletions(-)

--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:10:13

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 07/10] userfaultfd/selftests: reinitialize test context in each test

Currently, the context (fds, mmap-ed areas, etc.) are global. Each test
mutates this state in some way, in some cases really "clobbering it"
(e.g., the events test mremap-ing area_dst over the top of area_src, or
the minor faults tests overwriting the count_verify values in the test
areas). We run the tests in a particular order, each test is careful to
make the right assumptions about its starting state, etc.

But, this is fragile. It's better for a test's success or failure to not
depend on what some other prior test case did to the global state.

To that end, clear and reinitialize the test context at the start of
each test case, so whatever prior test cases did doesn't affect future
tests.

This is particularly relevant to this series because the events test's
mremap of area_dst screws up assumptions the minor fault test was
relying on. This wasn't a problem for hugetlb, as we don't mremap in
that case.

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 215 ++++++++++++-----------
1 file changed, 116 insertions(+), 99 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 1f65c4ab7994..3fbc69f513dc 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -89,7 +89,8 @@ static int shm_fd;
static int huge_fd;
static char *huge_fd_off0;
static unsigned long long *count_verify;
-static int uffd, uffd_flags, finished, *pipefd;
+static int uffd = -1;
+static int uffd_flags, finished, *pipefd;
static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
static char *zeropage;
pthread_attr_t attr;
@@ -342,6 +343,111 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = {

static struct uffd_test_ops *uffd_test_ops;

+static void userfaultfd_open(uint64_t *features)
+{
+ struct uffdio_api uffdio_api;
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
+ if (uffd < 0)
+ err("userfaultfd syscall not available in this kernel");
+ uffd_flags = fcntl(uffd, F_GETFD, NULL);
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = *features;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+ err("UFFDIO_API failed.\nPlease make sure to "
+ "run with either root or ptrace capability.");
+ if (uffdio_api.api != UFFD_API)
+ err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api);
+
+ *features = uffdio_api.features;
+}
+
+static inline void munmap_area(void **area)
+{
+ if (*area)
+ if (munmap(*area, nr_pages * page_size))
+ err("munmap");
+
+ *area = NULL;
+}
+
+static void uffd_test_ctx_clear(void)
+{
+ size_t i;
+
+ if (pipefd) {
+ for (i = 0; i < nr_cpus * 2; ++i) {
+ if (close(pipefd[i]))
+ err("close pipefd");
+ }
+ free(pipefd);
+ pipefd = NULL;
+ }
+
+ if (count_verify) {
+ free(count_verify);
+ count_verify = NULL;
+ }
+
+ if (uffd != -1) {
+ if (close(uffd))
+ err("close uffd");
+ uffd = -1;
+ }
+
+ huge_fd_off0 = NULL;
+ munmap_area((void **)&area_src);
+ munmap_area((void **)&area_src_alias);
+ munmap_area((void **)&area_dst);
+ munmap_area((void **)&area_dst_alias);
+}
+
+static void uffd_test_ctx_init_ext(uint64_t *features)
+{
+ unsigned long nr, cpu;
+
+ uffd_test_ctx_clear();
+
+ uffd_test_ops->allocate_area((void **)&area_src);
+ uffd_test_ops->allocate_area((void **)&area_dst);
+
+ uffd_test_ops->release_pages(area_src);
+ uffd_test_ops->release_pages(area_dst);
+
+ userfaultfd_open(features);
+
+ count_verify = malloc(nr_pages * sizeof(unsigned long long));
+ if (!count_verify)
+ err("count_verify");
+
+ for (nr = 0; nr < nr_pages; nr++) {
+ *area_mutex(area_src, nr) =
+ (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER;
+ count_verify[nr] = *area_count(area_src, nr) = 1;
+ /*
+ * In the transition between 255 to 256, powerpc will
+ * read out of order in my_bcmp and see both bytes as
+ * zero, so leave a placeholder below always non-zero
+ * after the count, to avoid my_bcmp to trigger false
+ * positives.
+ */
+ *(area_count(area_src, nr) + 1) = 1;
+ }
+
+ pipefd = malloc(sizeof(int) * nr_cpus * 2);
+ if (!pipefd)
+ err("pipefd");
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK))
+ err("pipe");
+}
+
+static inline void uffd_test_ctx_init(uint64_t features)
+{
+ uffd_test_ctx_init_ext(&features);
+}
+
static int my_bcmp(char *str1, char *str2, size_t n)
{
unsigned long i;
@@ -726,40 +832,6 @@ static int stress(struct uffd_stats *uffd_stats)
return 0;
}

-static int userfaultfd_open_ext(uint64_t *features)
-{
- struct uffdio_api uffdio_api;
-
- uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
- if (uffd < 0) {
- fprintf(stderr,
- "userfaultfd syscall not available in this kernel\n");
- return 1;
- }
- uffd_flags = fcntl(uffd, F_GETFD, NULL);
-
- uffdio_api.api = UFFD_API;
- uffdio_api.features = *features;
- if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
- fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to "
- "run with either root or ptrace capability.\n");
- return 1;
- }
- if (uffdio_api.api != UFFD_API) {
- fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n",
- (uint64_t)uffdio_api.api);
- return 1;
- }
-
- *features = uffdio_api.features;
- return 0;
-}
-
-static int userfaultfd_open(uint64_t features)
-{
- return userfaultfd_open_ext(&features);
-}
-
sigjmp_buf jbuf, *sigbuf;

static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -868,6 +940,8 @@ static int faulting_process(int signal_test)
MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
if (area_dst == MAP_FAILED)
err("mremap");
+ /* Reset area_src since we just clobbered it */
+ area_src = NULL;

for (; nr < nr_pages; nr++) {
count = *area_count(area_dst, nr);
@@ -961,10 +1035,8 @@ static int userfaultfd_zeropage_test(void)
printf("testing UFFDIO_ZEROPAGE: ");
fflush(stdout);

- uffd_test_ops->release_pages(area_dst);
+ uffd_test_ctx_init(0);

- if (userfaultfd_open(0))
- return 1;
uffdio_register.range.start = (unsigned long) area_dst;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
@@ -981,7 +1053,6 @@ static int userfaultfd_zeropage_test(void)
if (my_bcmp(area_dst, zeropage, page_size))
err("zeropage is not zero");

- close(uffd);
printf("done.\n");
return 0;
}
@@ -999,12 +1070,10 @@ static int userfaultfd_events_test(void)
printf("testing events (fork, remap, remove): ");
fflush(stdout);

- uffd_test_ops->release_pages(area_dst);
-
features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
UFFD_FEATURE_EVENT_REMOVE;
- if (userfaultfd_open(features))
- return 1;
+ uffd_test_ctx_init(features);
+
fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);

uffdio_register.range.start = (unsigned long) area_dst;
@@ -1037,8 +1106,6 @@ static int userfaultfd_events_test(void)
if (pthread_join(uffd_mon, NULL))
return 1;

- close(uffd);
-
uffd_stats_report(&stats, 1);

return stats.missing_faults != nr_pages;
@@ -1058,11 +1125,9 @@ static int userfaultfd_sig_test(void)
printf("testing signal delivery: ");
fflush(stdout);

- uffd_test_ops->release_pages(area_dst);
-
features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
- if (userfaultfd_open(features))
- return 1;
+ uffd_test_ctx_init(features);
+
fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);

uffdio_register.range.start = (unsigned long) area_dst;
@@ -1103,7 +1168,6 @@ static int userfaultfd_sig_test(void)
printf("done.\n");
if (userfaults)
err("Signal test failed, userfaults: %ld", userfaults);
- close(uffd);

return userfaults != 0;
}
@@ -1126,10 +1190,7 @@ static int userfaultfd_minor_test(void)
printf("testing minor faults: ");
fflush(stdout);

- uffd_test_ops->release_pages(area_dst);
-
- if (userfaultfd_open_ext(&features))
- return 1;
+ uffd_test_ctx_init_ext(&features);
/* If kernel reports the feature isn't supported, skip the test. */
if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
printf("skipping test due to lack of feature support\n");
@@ -1183,8 +1244,6 @@ static int userfaultfd_minor_test(void)
if (pthread_join(uffd_mon, NULL))
return 1;

- close(uffd);
-
uffd_stats_report(&stats, 1);

return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
@@ -1196,50 +1255,9 @@ static int userfaultfd_stress(void)
char *tmp_area;
unsigned long nr;
struct uffdio_register uffdio_register;
- unsigned long cpu;
struct uffd_stats uffd_stats[nr_cpus];

- uffd_test_ops->allocate_area((void **)&area_src);
- if (!area_src)
- return 1;
- uffd_test_ops->allocate_area((void **)&area_dst);
- if (!area_dst)
- return 1;
-
- if (userfaultfd_open(0))
- return 1;
-
- count_verify = malloc(nr_pages * sizeof(unsigned long long));
- if (!count_verify) {
- perror("count_verify");
- return 1;
- }
-
- for (nr = 0; nr < nr_pages; nr++) {
- *area_mutex(area_src, nr) = (pthread_mutex_t)
- PTHREAD_MUTEX_INITIALIZER;
- count_verify[nr] = *area_count(area_src, nr) = 1;
- /*
- * In the transition between 255 to 256, powerpc will
- * read out of order in my_bcmp and see both bytes as
- * zero, so leave a placeholder below always non-zero
- * after the count, to avoid my_bcmp to trigger false
- * positives.
- */
- *(area_count(area_src, nr) + 1) = 1;
- }
-
- pipefd = malloc(sizeof(int) * nr_cpus * 2);
- if (!pipefd) {
- perror("pipefd");
- return 1;
- }
- for (cpu = 0; cpu < nr_cpus; cpu++) {
- if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) {
- perror("pipe");
- return 1;
- }
- }
+ uffd_test_ctx_init(0);

if (posix_memalign(&area, page_size, page_size))
err("out of memory");
@@ -1360,7 +1378,6 @@ static int userfaultfd_stress(void)
uffd_stats_report(uffd_stats, nr_cpus);
}

- close(uffd);
return userfaultfd_zeropage_test() || userfaultfd_sig_test()
|| userfaultfd_events_test() || userfaultfd_minor_test();
}
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:11:27

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 10/10] userfaultfd: update documentation to mention shmem minor faults

Generally, the documentation we wrote for hugetlbfs-based minor faults
still all applies. The only missing piece is to mention the new feature
flag which indicates that the kernel supports this for shmem as well.

Signed-off-by: Axel Rasmussen <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 3aa38e8b8361..6528036093e1 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -77,7 +77,8 @@ events, except page fault notifications, may be generated:

- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
- areas.
+ areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
+ support for shmem virtual memory areas.

The userland application should set the feature flags it intends to use
when invoking the ``UFFDIO_API`` ioctl, to request that those features be
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:12:04

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 05/10] userfaultfd/selftests: use memfd_create for shmem test type

This is a preparatory commit. In the future, we want to be able to setup
alias mappings for area_src and area_dst in the shmem test, like we do
in the hugetlb_shared test. With a VMA obtained via
mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this.

So, mmap() with an fd, so we can create alias mappings. Use memfd_create
instead of actually passing in a tmpfs path like hugetlb does, since
it's more convenient / simpler to run, and works just as well.

Future commits will:

1. Setup the alias mappings.
2. Extend our tests to actually take advantage of this, to test new
userfaultfd behavior being introduced in this series.

Also, a small fix in the area we're changing: when the hugetlb setup
fails in main(), pass in the right argv[] so we actually print out the
hugetlb file path.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 6339aeaeeff8..fc40831f818f 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -85,6 +85,7 @@ static bool test_uffdio_wp = false;
static bool test_uffdio_minor = false;

static bool map_shared;
+static int shm_fd;
static int huge_fd;
static char *huge_fd_off0;
static unsigned long long *count_verify;
@@ -277,8 +278,11 @@ static void shmem_release_pages(char *rel_area)

static void shmem_allocate_area(void **alloc_area)
{
+ unsigned long offset =
+ alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+
*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+ MAP_SHARED, shm_fd, offset);
if (*alloc_area == MAP_FAILED)
err("mmap of memfd failed");
}
@@ -1448,6 +1452,16 @@ int main(int argc, char **argv)
err("Open of %s failed", argv[4]);
if (ftruncate(huge_fd, 0))
err("ftruncate %s to size 0 failed", argv[4]);
+ } else if (test_type == TEST_SHMEM) {
+ shm_fd = memfd_create(argv[0], 0);
+ if (shm_fd < 0)
+ err("memfd_create");
+ if (ftruncate(shm_fd, nr_pages * page_size * 2))
+ err("ftruncate");
+ if (fallocate(shm_fd,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
+ nr_pages * page_size * 2))
+ err("fallocate");
}
printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
nr_pages, nr_pages_per_cpu);
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:12:14

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

With this change, userspace can resolve a minor fault within a
shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
match those for hugetlbfs - we look up the existing page in the page
cache, and install a PTE for it.

This commit introduces a new helper: mcopy_atomic_install_pte.

Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
shmem.c? The existing userfault implementation only relies on shmem.c
for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
shmem in one place, regardless of shared/private (to reduce code
duplication).

Why add a new mcopy_atomic_install_pte helper? A problem we have with
continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
*close* to what we want, but not exactly. We do want to setup the PTEs
in a CONTINUE operation, but we don't want to e.g. allocate a new page,
charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
we have the problem stated above: shmem_mcopy_atomic_pte() and
mcopy_atomic_pte() both handle one-half of the problem (shared /
private) continue cares about. So, introduce mcontinue_atomic_pte(), to
handle all of the shmem continue cases. Introduce the helper so it
doesn't duplicate code with mcopy_atomic_pte().

In a future commit, shmem_mcopy_atomic_pte() will also be modified to
use this new helper. However, since this is a bigger refactor, it seems
most clear to do it as a separate change.

Signed-off-by: Axel Rasmussen <[email protected]>
---
mm/userfaultfd.c | 172 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 127 insertions(+), 45 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 23fa2583bbd1..51d8c0127161 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -48,6 +48,83 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
return dst_vma;
}

+/*
+ * Install PTEs, to map dst_addr (within dst_vma) to page.
+ *
+ * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
+ * whether or not dst_vma is VM_SHARED. It also handles the more general
+ * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
+ * backed, or not).
+ *
+ * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
+ * shmem_mcopy_atomic_pte instead.
+ */
+static int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, struct page *page,
+ bool newly_allocated, bool wp_copy)
+{
+ int ret;
+ pte_t _dst_pte, *dst_pte;
+ bool writable = dst_vma->vm_flags & VM_WRITE;
+ bool vm_shared = dst_vma->vm_flags & VM_SHARED;
+ bool page_in_cache = page->mapping;
+ spinlock_t *ptl;
+ struct inode *inode;
+ pgoff_t offset, max_off;
+
+ _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+ if (page_in_cache && !vm_shared)
+ writable = false;
+ if (writable || !page_in_cache)
+ _dst_pte = pte_mkdirty(_dst_pte);
+ if (writable) {
+ if (wp_copy)
+ _dst_pte = pte_mkuffd_wp(_dst_pte);
+ else
+ _dst_pte = pte_mkwrite(_dst_pte);
+ }
+
+ dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+ if (vma_is_shmem(dst_vma)) {
+ /* serialize against truncate with the page table lock */
+ inode = dst_vma->vm_file->f_inode;
+ offset = linear_page_index(dst_vma, dst_addr);
+ max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ ret = -EFAULT;
+ if (unlikely(offset >= max_off))
+ goto out_unlock;
+ }
+
+ ret = -EEXIST;
+ if (!pte_none(*dst_pte))
+ goto out_unlock;
+
+ if (page_in_cache)
+ page_add_file_rmap(page, false);
+ else
+ page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+
+ /*
+ * Must happen after rmap, as mm_counter() checks mapping (via
+ * PageAnon()), which is set by __page_set_anon_rmap().
+ */
+ inc_mm_counter(dst_mm, mm_counter(page));
+
+ if (newly_allocated)
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
+
+ set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(dst_vma, dst_addr, dst_pte);
+ ret = 0;
+out_unlock:
+ pte_unmap_unlock(dst_pte, ptl);
+ return ret;
+}
+
static int mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
@@ -56,13 +133,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
struct page **pagep,
bool wp_copy)
{
- pte_t _dst_pte, *dst_pte;
- spinlock_t *ptl;
void *page_kaddr;
int ret;
struct page *page;
- pgoff_t offset, max_off;
- struct inode *inode;

if (!*pagep) {
ret = -ENOMEM;
@@ -99,43 +172,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
goto out_release;

- _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
- if (dst_vma->vm_flags & VM_WRITE) {
- if (wp_copy)
- _dst_pte = pte_mkuffd_wp(_dst_pte);
- else
- _dst_pte = pte_mkwrite(_dst_pte);
- }
-
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- if (dst_vma->vm_file) {
- /* the shmem MAP_PRIVATE case requires checking the i_size */
- inode = dst_vma->vm_file->f_inode;
- offset = linear_page_index(dst_vma, dst_addr);
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- ret = -EFAULT;
- if (unlikely(offset >= max_off))
- goto out_release_uncharge_unlock;
- }
- ret = -EEXIST;
- if (!pte_none(*dst_pte))
- goto out_release_uncharge_unlock;
-
- inc_mm_counter(dst_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
-
- pte_unmap_unlock(dst_pte, ptl);
- ret = 0;
+ ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+ page, true, wp_copy);
+ if (ret)
+ goto out_release;
out:
return ret;
-out_release_uncharge_unlock:
- pte_unmap_unlock(dst_pte, ptl);
out_release:
put_page(page);
goto out;
@@ -176,6 +218,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
return ret;
}

+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
+ pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr,
+ bool wp_copy)
+{
+ struct inode *inode = file_inode(dst_vma->vm_file);
+ pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+ if (ret)
+ goto out;
+ if (!page) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+ page, false, wp_copy);
+ if (ret)
+ goto out_release;
+
+ unlock_page(page);
+ ret = 0;
+out:
+ return ret;
+out_release:
+ unlock_page(page);
+ put_page(page);
+ goto out;
+}
+
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
@@ -415,11 +492,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
struct page **page,
- bool zeropage,
+ enum mcopy_atomic_mode mode,
bool wp_copy)
{
ssize_t err;

+ if (mode == MCOPY_ATOMIC_CONTINUE) {
+ return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+ wp_copy);
+ }
+
/*
* The normal page fault path for a shmem will invoke the
* fault, fill the hole in the file and COW it right away. The
@@ -431,7 +513,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
* and not in the radix tree.
*/
if (!(dst_vma->vm_flags & VM_SHARED)) {
- if (!zeropage)
+ if (mode == MCOPY_ATOMIC_NORMAL)
err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, page,
wp_copy);
@@ -441,7 +523,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
} else {
VM_WARN_ON_ONCE(wp_copy);
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
- dst_addr, src_addr, zeropage,
+ dst_addr, src_addr,
+ mode != MCOPY_ATOMIC_NORMAL,
page);
}

@@ -463,7 +546,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
long copied;
struct page *page;
bool wp_copy;
- bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);

/*
* Sanitize the command parameters:
@@ -526,7 +608,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
- if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+ if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
goto out_unlock;

/*
@@ -574,7 +656,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
BUG_ON(pmd_trans_huge(*dst_pmd));

err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- src_addr, &page, zeropage, wp_copy);
+ src_addr, &page, mcopy_mode, wp_copy);
cond_resched();

if (unlikely(err == -ENOENT)) {
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:12:28

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 08/10] userfaultfd/selftests: exercise minor fault handling shmem support

Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the
test slightly to pass in / check for the right feature flags.

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 29 ++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 3fbc69f513dc..a7ecc9993439 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -474,6 +474,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
static void continue_range(int ufd, __u64 start, __u64 len)
{
struct uffdio_continue req;
+ int ret;

req.range.start = start;
req.range.len = len;
@@ -482,6 +483,17 @@ static void continue_range(int ufd, __u64 start, __u64 len)
if (ioctl(ufd, UFFDIO_CONTINUE, &req))
err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
(uint64_t)start);
+
+ /*
+ * Error handling within the kernel for continue is subtly different
+ * from copy or zeropage, so it may be a source of bugs. Trigger an
+ * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+ */
+ req.mapped = 0;
+ ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+ if (ret >= 0 || req.mapped != -EEXIST)
+ err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+ ret, (int64_t) req.mapped);
}

static void *locking_thread(void *arg)
@@ -1182,7 +1194,7 @@ static int userfaultfd_minor_test(void)
void *expected_page;
char c;
struct uffd_stats stats = { 0 };
- uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS;
+ uint64_t req_features, features_out;

if (!test_uffdio_minor)
return 0;
@@ -1190,9 +1202,17 @@ static int userfaultfd_minor_test(void)
printf("testing minor faults: ");
fflush(stdout);

- uffd_test_ctx_init_ext(&features);
- /* If kernel reports the feature isn't supported, skip the test. */
- if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
+ if (test_type == TEST_HUGETLB)
+ req_features = UFFD_FEATURE_MINOR_HUGETLBFS;
+ else if (test_type == TEST_SHMEM)
+ req_features = UFFD_FEATURE_MINOR_SHMEM;
+ else
+ return 1;
+
+ features_out = req_features;
+ uffd_test_ctx_init_ext(&features_out);
+ /* If kernel reports required features aren't supported, skip test. */
+ if ((features_out & req_features) != req_features) {
printf("skipping test due to lack of feature support\n");
fflush(stdout);
return 0;
@@ -1426,6 +1446,7 @@ static void set_test_type(const char *type)
map_shared = true;
test_type = TEST_SHMEM;
uffd_test_ops = &shmem_uffd_test_ops;
+ test_uffdio_minor = true;
} else {
err("Unknown test type: %s", type);
}
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:12:30

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 06/10] userfaultfd/selftests: create alias mappings in the shmem test

Previously, we just allocated two shm areas: area_src and area_dst. With
this commit, change this so we also allocate area_src_alias, and
area_dst_alias.

area_*_alias and area_* (respectively) point to the same underlying
physical pages, but are different VMAs. In a future commit in this
series, we'll leverage this setup to exercise minor fault handling
support for shmem, just like we do in the hugetlb_shared test.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index fc40831f818f..1f65c4ab7994 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -278,13 +278,29 @@ static void shmem_release_pages(char *rel_area)

static void shmem_allocate_area(void **alloc_area)
{
- unsigned long offset =
- alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+ void *area_alias = NULL;
+ bool is_src = alloc_area == (void **)&area_src;
+ unsigned long offset = is_src ? 0 : nr_pages * page_size;

*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
MAP_SHARED, shm_fd, offset);
if (*alloc_area == MAP_FAILED)
err("mmap of memfd failed");
+
+ area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, shm_fd, offset);
+ if (area_alias == MAP_FAILED)
+ err("mmap of memfd alias failed");
+
+ if (is_src)
+ area_src_alias = area_alias;
+ else
+ area_dst_alias = area_alias;
+}
+
+static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset)
+{
+ *start = (unsigned long)area_dst_alias + offset;
}

struct uffd_test_ops {
@@ -314,7 +330,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
.expected_ioctls = SHMEM_EXPECTED_IOCTLS,
.allocate_area = shmem_allocate_area,
.release_pages = shmem_release_pages,
- .alias_mapping = noop_alias_mapping,
+ .alias_mapping = shmem_alias_mapping,
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
--
2.31.1.368.gbe11c130af-goog

2021-04-20 22:12:58

by Axel Rasmussen

[permalink] [raw]

Subject: [PATCH v4 09/10] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_pte()

In a previous commit, we added the mcopy_atomic_install_pte() helper.
This helper does the job of setting up PTEs for an existing page, to map
it into a given VMA. It deals with both the anon and shmem cases, as
well as the shared and private cases.

In other words, shmem_mcopy_atomic_pte() duplicates a case it already
handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
directly, to reduce code duplication.

This requires that we refactor shmem_mcopy_atomic_pte() a bit:

Instead of doing accounting (shmem_recalc_inode() et al) part-way
through the PTE setup, do it beforehand. This frees up
mcopy_atomic_install_pte() from having to care about this accounting,
but it does mean we need to clean it up if we get a failure afterwards
(shmem_uncharge()).

We can *almost* use shmem_charge() to do this, reducing code
duplication. But, it does `inode->i_mapping->nrpages++`, which would
double-count since shmem_add_to_page_cache() also does this.

Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/userfaultfd_k.h | 5 ++++
mm/shmem.c | 53 ++++++++---------------------------
mm/userfaultfd.c | 17 ++++-------
3 files changed, 22 insertions(+), 53 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 794d1538b8ba..39c094cc6641 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
MCOPY_ATOMIC_CONTINUE,
};

+extern int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, struct page *page,
+ bool newly_allocated, bool wp_copy);
+
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
bool *mmap_changing, __u64 mode);
diff --git a/mm/shmem.c b/mm/shmem.c
index 30c0bb501dc9..9bfa80fcd414 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2378,10 +2378,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct address_space *mapping = inode->i_mapping;
gfp_t gfp = mapping_gfp_mask(mapping);
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- spinlock_t *ptl;
void *page_kaddr;
struct page *page;
- pte_t _dst_pte, *dst_pte;
int ret;
pgoff_t max_off;

@@ -2391,8 +2389,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,

if (!*pagep) {
page = shmem_alloc_page(gfp, info, pgoff);
- if (!page)
- goto out_unacct_blocks;
+ if (!page) {
+ shmem_inode_unacct_blocks(inode, 1);
+ goto out;
+ }

if (!zeropage) { /* COPY */
page_kaddr = kmap_atomic(page);
@@ -2432,59 +2432,28 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (ret)
goto out_release;

- _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- if (dst_vma->vm_flags & VM_WRITE)
- _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
- else {
- /*
- * We don't set the pte dirty if the vma has no
- * VM_WRITE permission, so mark the page dirty or it
- * could be freed from under us. We could do it
- * unconditionally before unlock_page(), but doing it
- * only if VM_WRITE is not set is faster.
- */
- set_page_dirty(page);
- }
-
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-
- ret = -EFAULT;
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(pgoff >= max_off))
- goto out_release_unlock;
-
- ret = -EEXIST;
- if (!pte_none(*dst_pte))
- goto out_release_unlock;
-
- lru_cache_add(page);
-
spin_lock_irq(&info->lock);
info->alloced++;
inode->i_blocks += BLOCKS_PER_PAGE;
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);

- inc_mm_counter(dst_mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+ page, true, false);
+ if (ret)
+ goto out_release_uncharge;

- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
- pte_unmap_unlock(dst_pte, ptl);
+ SetPageDirty(page);
unlock_page(page);
ret = 0;
out:
return ret;
-out_release_unlock:
- pte_unmap_unlock(dst_pte, ptl);
- ClearPageDirty(page);
+out_release_uncharge:
delete_from_page_cache(page);
+ shmem_uncharge(inode, 1);
out_release:
unlock_page(page);
put_page(page);
-out_unacct_blocks:
- shmem_inode_unacct_blocks(inode, 1);
goto out;
}
#endif /* CONFIG_USERFAULTFD */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 51d8c0127161..3a9ddbb2dbbd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
/*
* Install PTEs, to map dst_addr (within dst_vma) to page.
*
- * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- * whether or not dst_vma is VM_SHARED. It also handles the more general
- * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- * backed, or not).
- *
- * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- * shmem_mcopy_atomic_pte instead.
+ * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
+ * and anon, and for both shared and private VMAs.
*/
-static int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr, struct page *page,
- bool newly_allocated, bool wp_copy)
+int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, struct page *page,
+ bool newly_allocated, bool wp_copy)
{
int ret;
pte_t _dst_pte, *dst_pte;
--
2.31.1.368.gbe11c130af-goog

2021-04-21 01:04:42

[permalink] [raw]

Subject: Re: [PATCH v4 09/10] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_pte()

On Tue, Apr 20, 2021 at 03:08:03PM -0700, Axel Rasmussen wrote:
> In a previous commit, we added the mcopy_atomic_install_pte() helper.
> This helper does the job of setting up PTEs for an existing page, to map
> it into a given VMA. It deals with both the anon and shmem cases, as
> well as the shared and private cases.
>
> In other words, shmem_mcopy_atomic_pte() duplicates a case it already
> handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
> directly, to reduce code duplication.
>
> This requires that we refactor shmem_mcopy_atomic_pte() a bit:
>
> Instead of doing accounting (shmem_recalc_inode() et al) part-way
> through the PTE setup, do it beforehand. This frees up
> mcopy_atomic_install_pte() from having to care about this accounting,
> but it does mean we need to clean it up if we get a failure afterwards
> (shmem_uncharge()).
>
> We can *almost* use shmem_charge() to do this, reducing code
> duplication. But, it does `inode->i_mapping->nrpages++`, which would
> double-count since shmem_add_to_page_cache() also does this.

Missing to mention the lru_cache_add() replacement comment as Hugh commented on
this?

>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> include/linux/userfaultfd_k.h | 5 ++++
> mm/shmem.c | 53 ++++++++---------------------------
> mm/userfaultfd.c | 17 ++++-------
> 3 files changed, 22 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 794d1538b8ba..39c094cc6641 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
> MCOPY_ATOMIC_CONTINUE,
> };
>
> +extern int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr, struct page *page,
> + bool newly_allocated, bool wp_copy);
> +
> extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> unsigned long src_start, unsigned long len,
> bool *mmap_changing, __u64 mode);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 30c0bb501dc9..9bfa80fcd414 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2378,10 +2378,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> struct address_space *mapping = inode->i_mapping;
> gfp_t gfp = mapping_gfp_mask(mapping);
> pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> - spinlock_t *ptl;
> void *page_kaddr;
> struct page *page;
> - pte_t _dst_pte, *dst_pte;
> int ret;
> pgoff_t max_off;
>
> @@ -2391,8 +2389,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>
> if (!*pagep) {
> page = shmem_alloc_page(gfp, info, pgoff);
> - if (!page)
> - goto out_unacct_blocks;
> + if (!page) {
> + shmem_inode_unacct_blocks(inode, 1);
> + goto out;
> + }
>
> if (!zeropage) { /* COPY */
> page_kaddr = kmap_atomic(page);
> @@ -2432,59 +2432,28 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> if (ret)
> goto out_release;
>
> - _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> - if (dst_vma->vm_flags & VM_WRITE)
> - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> - else {
> - /*
> - * We don't set the pte dirty if the vma has no
> - * VM_WRITE permission, so mark the page dirty or it
> - * could be freed from under us. We could do it
> - * unconditionally before unlock_page(), but doing it
> - * only if VM_WRITE is not set is faster.
> - */
> - set_page_dirty(page);
> - }
> -
> - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> -
> - ret = -EFAULT;
> - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> - if (unlikely(pgoff >= max_off))
> - goto out_release_unlock;
> -
> - ret = -EEXIST;
> - if (!pte_none(*dst_pte))
> - goto out_release_unlock;
> -
> - lru_cache_add(page);
> -
> spin_lock_irq(&info->lock);
> info->alloced++;
> inode->i_blocks += BLOCKS_PER_PAGE;
> shmem_recalc_inode(inode);
> spin_unlock_irq(&info->lock);
>
> - inc_mm_counter(dst_mm, mm_counter_file(page));
> - page_add_file_rmap(page, false);
> - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> + ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> + page, true, false);
> + if (ret)
> + goto out_release_uncharge;
>
> - /* No need to invalidate - it was non-present before */
> - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> - pte_unmap_unlock(dst_pte, ptl);
> + SetPageDirty(page);
> unlock_page(page);
> ret = 0;
> out:
> return ret;
> -out_release_unlock:
> - pte_unmap_unlock(dst_pte, ptl);
> - ClearPageDirty(page);
> +out_release_uncharge:
> delete_from_page_cache(page);
> + shmem_uncharge(inode, 1);
> out_release:
> unlock_page(page);
> put_page(page);
> -out_unacct_blocks:

Will all the "goto out_release" miss one call to shmem_inode_unacct_blocks()?

> - shmem_inode_unacct_blocks(inode, 1);
> goto out;
> }
> #endif /* CONFIG_USERFAULTFD */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 51d8c0127161..3a9ddbb2dbbd 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
> /*
> * Install PTEs, to map dst_addr (within dst_vma) to page.
> *
> - * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> - * whether or not dst_vma is VM_SHARED. It also handles the more general
> - * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> - * backed, or not).
> - *
> - * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> - * shmem_mcopy_atomic_pte instead.
> + * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
> + * and anon, and for both shared and private VMAs.
> */
> -static int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr, struct page *page,
> - bool newly_allocated, bool wp_copy)
> +int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr, struct page *page,
> + bool newly_allocated, bool wp_copy)
> {
> int ret;
> pte_t _dst_pte, *dst_pte;
> --
> 2.31.1.368.gbe11c130af-goog
>

--
Peter Xu

2021-04-22 20:23:55

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, Apr 20, 2021 at 3:08 PM Axel Rasmussen <[email protected]> wrote:
>
> With this change, userspace can resolve a minor fault within a
> shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
> match those for hugetlbfs - we look up the existing page in the page
> cache, and install a PTE for it.
>
> This commit introduces a new helper: mcopy_atomic_install_pte.
>
> Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
> shmem.c? The existing userfault implementation only relies on shmem.c
> for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
> fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
> shmem in one place, regardless of shared/private (to reduce code
> duplication).
>
> Why add a new mcopy_atomic_install_pte helper? A problem we have with
> continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
> *close* to what we want, but not exactly. We do want to setup the PTEs
> in a CONTINUE operation, but we don't want to e.g. allocate a new page,
> charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
> we have the problem stated above: shmem_mcopy_atomic_pte() and
> mcopy_atomic_pte() both handle one-half of the problem (shared /
> private) continue cares about. So, introduce mcontinue_atomic_pte(), to
> handle all of the shmem continue cases. Introduce the helper so it
> doesn't duplicate code with mcopy_atomic_pte().
>
> In a future commit, shmem_mcopy_atomic_pte() will also be modified to
> use this new helper. However, since this is a bigger refactor, it seems
> most clear to do it as a separate change.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> mm/userfaultfd.c | 172 ++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 127 insertions(+), 45 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 23fa2583bbd1..51d8c0127161 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -48,6 +48,83 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
> return dst_vma;
> }
>
> +/*
> + * Install PTEs, to map dst_addr (within dst_vma) to page.
> + *
> + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> + * whether or not dst_vma is VM_SHARED. It also handles the more general
> + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> + * backed, or not).
> + *
> + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> + * shmem_mcopy_atomic_pte instead.
> + */
> +static int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr, struct page *page,
> + bool newly_allocated, bool wp_copy)
> +{
> + int ret;
> + pte_t _dst_pte, *dst_pte;
> + bool writable = dst_vma->vm_flags & VM_WRITE;
> + bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> + bool page_in_cache = page->mapping;
> + spinlock_t *ptl;
> + struct inode *inode;
> + pgoff_t offset, max_off;
> +
> + _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> + if (page_in_cache && !vm_shared)
> + writable = false;
> + if (writable || !page_in_cache)
> + _dst_pte = pte_mkdirty(_dst_pte);
> + if (writable) {
> + if (wp_copy)
> + _dst_pte = pte_mkuffd_wp(_dst_pte);
> + else
> + _dst_pte = pte_mkwrite(_dst_pte);
> + }
> +
> + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> +
> + if (vma_is_shmem(dst_vma)) {
> + /* serialize against truncate with the page table lock */
> + inode = dst_vma->vm_file->f_inode;
> + offset = linear_page_index(dst_vma, dst_addr);
> + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> + ret = -EFAULT;
> + if (unlikely(offset >= max_off))
> + goto out_unlock;
> + }
> +
> + ret = -EEXIST;
> + if (!pte_none(*dst_pte))
> + goto out_unlock;
> +
> + if (page_in_cache)
> + page_add_file_rmap(page, false);
> + else
> + page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
> +
> + /*
> + * Must happen after rmap, as mm_counter() checks mapping (via
> + * PageAnon()), which is set by __page_set_anon_rmap().
> + */
> + inc_mm_counter(dst_mm, mm_counter(page));

Actually, I've noticed that this is still slightly incorrect.

As Hugh pointed out, this works for the anon case, because
page_add_new_anon_rmap() sets page->mapping.

But for the page_in_cache case, it doesn't work: unlike its anon
counterpart, page_add_file_rmap() *does not* set page->mapping. So, I
think this line needs to become:

inc_mm_counter(dst_mm, page_in_cache ? mm_counter_file(page) :
mm_counter(page));

I'll include this fix in a v5 as well.

> +
> + if (newly_allocated)
> + lru_cache_add_inactive_or_unevictable(page, dst_vma);
> +
> + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(dst_vma, dst_addr, dst_pte);
> + ret = 0;
> +out_unlock:
> + pte_unmap_unlock(dst_pte, ptl);
> + return ret;
> +}
> +
> static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> pmd_t *dst_pmd,
> struct vm_area_struct *dst_vma,
> @@ -56,13 +133,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> struct page **pagep,
> bool wp_copy)
> {
> - pte_t _dst_pte, *dst_pte;
> - spinlock_t *ptl;
> void *page_kaddr;
> int ret;
> struct page *page;
> - pgoff_t offset, max_off;
> - struct inode *inode;
>
> if (!*pagep) {
> ret = -ENOMEM;
> @@ -99,43 +172,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
> goto out_release;
>
> - _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
> - if (dst_vma->vm_flags & VM_WRITE) {
> - if (wp_copy)
> - _dst_pte = pte_mkuffd_wp(_dst_pte);
> - else
> - _dst_pte = pte_mkwrite(_dst_pte);
> - }
> -
> - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> - if (dst_vma->vm_file) {
> - /* the shmem MAP_PRIVATE case requires checking the i_size */
> - inode = dst_vma->vm_file->f_inode;
> - offset = linear_page_index(dst_vma, dst_addr);
> - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> - ret = -EFAULT;
> - if (unlikely(offset >= max_off))
> - goto out_release_uncharge_unlock;
> - }
> - ret = -EEXIST;
> - if (!pte_none(*dst_pte))
> - goto out_release_uncharge_unlock;
> -
> - inc_mm_counter(dst_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
> - lru_cache_add_inactive_or_unevictable(page, dst_vma);
> -
> - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> -
> - /* No need to invalidate - it was non-present before */
> - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> -
> - pte_unmap_unlock(dst_pte, ptl);
> - ret = 0;
> + ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> + page, true, wp_copy);
> + if (ret)
> + goto out_release;
> out:
> return ret;
> -out_release_uncharge_unlock:
> - pte_unmap_unlock(dst_pte, ptl);
> out_release:
> put_page(page);
> goto out;
> @@ -176,6 +218,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
> return ret;
> }
>
> +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> + pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr,
> + bool wp_copy)
> +{
> + struct inode *inode = file_inode(dst_vma->vm_file);
> + pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> + struct page *page;
> + int ret;
> +
> + ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
> + if (ret)
> + goto out;
> + if (!page) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> + page, false, wp_copy);
> + if (ret)
> + goto out_release;
> +
> + unlock_page(page);
> + ret = 0;
> +out:
> + return ret;
> +out_release:
> + unlock_page(page);
> + put_page(page);
> + goto out;
> +}
> +
> static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> {
> pgd_t *pgd;
> @@ -415,11 +492,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> unsigned long dst_addr,
> unsigned long src_addr,
> struct page **page,
> - bool zeropage,
> + enum mcopy_atomic_mode mode,
> bool wp_copy)
> {
> ssize_t err;
>
> + if (mode == MCOPY_ATOMIC_CONTINUE) {
> + return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> + wp_copy);
> + }
> +
> /*
> * The normal page fault path for a shmem will invoke the
> * fault, fill the hole in the file and COW it right away. The
> @@ -431,7 +513,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> * and not in the radix tree.
> */
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> - if (!zeropage)
> + if (mode == MCOPY_ATOMIC_NORMAL)
> err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> dst_addr, src_addr, page,
> wp_copy);
> @@ -441,7 +523,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> } else {
> VM_WARN_ON_ONCE(wp_copy);
> err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> - dst_addr, src_addr, zeropage,
> + dst_addr, src_addr,
> + mode != MCOPY_ATOMIC_NORMAL,
> page);
> }
>
> @@ -463,7 +546,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> long copied;
> struct page *page;
> bool wp_copy;
> - bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
>
> /*
> * Sanitize the command parameters:
> @@ -526,7 +608,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>
> if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> goto out_unlock;
> - if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
> + if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
> goto out_unlock;
>
> /*
> @@ -574,7 +656,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> BUG_ON(pmd_trans_huge(*dst_pmd));
>
> err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> - src_addr, &page, zeropage, wp_copy);
> + src_addr, &page, mcopy_mode, wp_copy);
> cond_resched();
>
> if (unlikely(err == -ENOENT)) {
> --
> 2.31.1.368.gbe11c130af-goog
>

2021-04-22 21:20:25

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

Axel,

On Thu, Apr 22, 2021 at 01:22:02PM -0700, Axel Rasmussen wrote:
> > + if (page_in_cache)
> > + page_add_file_rmap(page, false);
> > + else
> > + page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
> > +
> > + /*
> > + * Must happen after rmap, as mm_counter() checks mapping (via
> > + * PageAnon()), which is set by __page_set_anon_rmap().
> > + */
> > + inc_mm_counter(dst_mm, mm_counter(page));
>
> Actually, I've noticed that this is still slightly incorrect.
>
> As Hugh pointed out, this works for the anon case, because
> page_add_new_anon_rmap() sets page->mapping.
>
> But for the page_in_cache case, it doesn't work: unlike its anon
> counterpart, page_add_file_rmap() *does not* set page->mapping.

If it's already in the page cache, shouldn't it be set already in e.g. one
previous call to shmem_add_to_page_cache()? Thanks,

--
Peter Xu

2021-04-22 22:06:44

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Thu, Apr 22, 2021 at 2:18 PM Peter Xu <[email protected]> wrote:
>
> Axel,
>
> On Thu, Apr 22, 2021 at 01:22:02PM -0700, Axel Rasmussen wrote:
> > > + if (page_in_cache)
> > > + page_add_file_rmap(page, false);
> > > + else
> > > + page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
> > > +
> > > + /*
> > > + * Must happen after rmap, as mm_counter() checks mapping (via
> > > + * PageAnon()), which is set by __page_set_anon_rmap().
> > > + */
> > > + inc_mm_counter(dst_mm, mm_counter(page));
> >
> > Actually, I've noticed that this is still slightly incorrect.
> >
> > As Hugh pointed out, this works for the anon case, because
> > page_add_new_anon_rmap() sets page->mapping.
> >
> > But for the page_in_cache case, it doesn't work: unlike its anon
> > counterpart, page_add_file_rmap() *does not* set page->mapping.
>
> If it's already in the page cache, shouldn't it be set already in e.g. one
> previous call to shmem_add_to_page_cache()? Thanks,

Ah, of course. Sorry for the noise. This should have been obvious to
me from how page_in_cache is defined.

I had run into the same "Bad rss-counter state" warning while applying
my patches to an earlier kernel version, and got concerned about this
line after looking at page_add_file_rmap().

But, you're right that this ought to work, and indeed I can't
reproduce the warning when the patches are based on the mm snapshot
mentioned in the cover letter. So, it seems the problem lies with this
other unrelated merge I'm doing, not the series itself. :)

>
> --
> Peter Xu
>

2021-04-27 02:23:25

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, 20 Apr 2021, Axel Rasmussen wrote:

> With this change, userspace can resolve a minor fault within a
> shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
> match those for hugetlbfs - we look up the existing page in the page
> cache, and install a PTE for it.
>
> This commit introduces a new helper: mcopy_atomic_install_pte.
>
> Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
> shmem.c? The existing userfault implementation only relies on shmem.c
> for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
> fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
> shmem in one place, regardless of shared/private (to reduce code
> duplication).
>
> Why add a new mcopy_atomic_install_pte helper? A problem we have with
> continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
> *close* to what we want, but not exactly. We do want to setup the PTEs
> in a CONTINUE operation, but we don't want to e.g. allocate a new page,
> charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
> we have the problem stated above: shmem_mcopy_atomic_pte() and
> mcopy_atomic_pte() both handle one-half of the problem (shared /
> private) continue cares about. So, introduce mcontinue_atomic_pte(), to
> handle all of the shmem continue cases. Introduce the helper so it
> doesn't duplicate code with mcopy_atomic_pte().
>
> In a future commit, shmem_mcopy_atomic_pte() will also be modified to
> use this new helper. However, since this is a bigger refactor, it seems
> most clear to do it as a separate change.
>
> Signed-off-by: Axel Rasmussen <[email protected]>

If this "03/10" had been numbered 04/10, I would have said
Acked-by: Hugh Dickins <[email protected]>

But I find this new ordering incomprehensible - I'm surprised that it
even builds this way around (if it does): this patch is so much about
what has been enabled in "04/10" (references to UFFDIO_CONTINUE shmem
VMAs etc).

Does Peter still think this way round is better? If he does, then we
shall have to compromise by asking you just to squash the two together.

> ---
> mm/userfaultfd.c | 172 ++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 127 insertions(+), 45 deletions(-)

2021-04-27 02:28:17

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH v4 10/10] userfaultfd: update documentation to mention shmem minor faults

On Tue, 20 Apr 2021, Axel Rasmussen wrote:

> Generally, the documentation we wrote for hugetlbfs-based minor faults
> still all applies. The only missing piece is to mention the new feature
> flag which indicates that the kernel supports this for shmem as well.
>
> Signed-off-by: Axel Rasmussen <[email protected]>

Acked-by: Hugh Dickins <[email protected]>

(but could be folded into the one which defines UFFD_FEATURE_MINOR_SHMEM)

> ---
> Documentation/admin-guide/mm/userfaultfd.rst | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)

2021-04-27 03:00:27

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH v4 09/10] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_pte()

On Tue, 20 Apr 2021, Peter Xu wrote:
> On Tue, Apr 20, 2021 at 03:08:03PM -0700, Axel Rasmussen wrote:
> > In a previous commit, we added the mcopy_atomic_install_pte() helper.
> > This helper does the job of setting up PTEs for an existing page, to map
> > it into a given VMA. It deals with both the anon and shmem cases, as
> > well as the shared and private cases.
> >
> > In other words, shmem_mcopy_atomic_pte() duplicates a case it already
> > handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
> > directly, to reduce code duplication.
> >
> > This requires that we refactor shmem_mcopy_atomic_pte() a bit:
> >
> > Instead of doing accounting (shmem_recalc_inode() et al) part-way
> > through the PTE setup, do it beforehand. This frees up
> > mcopy_atomic_install_pte() from having to care about this accounting,
> > but it does mean we need to clean it up if we get a failure afterwards
> > (shmem_uncharge()).
> >
> > We can *almost* use shmem_charge() to do this, reducing code
> > duplication. But, it does `inode->i_mapping->nrpages++`, which would
> > double-count since shmem_add_to_page_cache() also does this.
>
> Missing to mention the lru_cache_add() replacement comment as Hugh commented on
> this?

Yes, please add that mention.

And personally, I'd much prefer this and the Doc commit,
the non-selftest commits, to be moved up before the selftests.

The selftest situation is confusing at present, since Peter's series
got dropped as [to-be-updated] from mmotm, so we're waiting for those
updates to be added back to mmotm before this series can go in.

(But akpm will be busy with merge window stuff at present:
now is not the time to be adding or re-adding a series.)

>
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>

Peter raised a good point below, so no Ack from me yet.

> > ---
> > include/linux/userfaultfd_k.h | 5 ++++
> > mm/shmem.c | 53 ++++++++---------------------------
> > mm/userfaultfd.c | 17 ++++-------
> > 3 files changed, 22 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 794d1538b8ba..39c094cc6641 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
> > MCOPY_ATOMIC_CONTINUE,
> > };
> >
> > +extern int mcopy_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > + struct vm_area_struct *dst_vma,
> > + unsigned long dst_addr, struct page *page,
> > + bool newly_allocated, bool wp_copy);
> > +
> > extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> > unsigned long src_start, unsigned long len,
> > bool *mmap_changing, __u64 mode);
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 30c0bb501dc9..9bfa80fcd414 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2378,10 +2378,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > struct address_space *mapping = inode->i_mapping;
> > gfp_t gfp = mapping_gfp_mask(mapping);
> > pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > - spinlock_t *ptl;
> > void *page_kaddr;
> > struct page *page;
> > - pte_t _dst_pte, *dst_pte;
> > int ret;
> > pgoff_t max_off;
> >
> > @@ -2391,8 +2389,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >
> > if (!*pagep) {
> > page = shmem_alloc_page(gfp, info, pgoff);
> > - if (!page)
> > - goto out_unacct_blocks;
> > + if (!page) {
> > + shmem_inode_unacct_blocks(inode, 1);
> > + goto out;
> > + }
> >
> > if (!zeropage) { /* COPY */
> > page_kaddr = kmap_atomic(page);
> > @@ -2432,59 +2432,28 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > if (ret)
> > goto out_release;
> >
> > - _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > - if (dst_vma->vm_flags & VM_WRITE)
> > - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > - else {
> > - /*
> > - * We don't set the pte dirty if the vma has no
> > - * VM_WRITE permission, so mark the page dirty or it
> > - * could be freed from under us. We could do it
> > - * unconditionally before unlock_page(), but doing it
> > - * only if VM_WRITE is not set is faster.
> > - */
> > - set_page_dirty(page);
> > - }
> > -
> > - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > -
> > - ret = -EFAULT;
> > - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > - if (unlikely(pgoff >= max_off))
> > - goto out_release_unlock;
> > -
> > - ret = -EEXIST;
> > - if (!pte_none(*dst_pte))
> > - goto out_release_unlock;
> > -
> > - lru_cache_add(page);
> > -
> > spin_lock_irq(&info->lock);
> > info->alloced++;
> > inode->i_blocks += BLOCKS_PER_PAGE;
> > shmem_recalc_inode(inode);
> > spin_unlock_irq(&info->lock);

I think it's best to move that info->locked block down to above the
new SetPageDirty(), where we know we have succeeded, because...

> >
> > - inc_mm_counter(dst_mm, mm_counter_file(page));
> > - page_add_file_rmap(page, false);
> > - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> > + ret = mcopy_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> > + page, true, false);
> > + if (ret)
> > + goto out_release_uncharge;
> >
> > - /* No need to invalidate - it was non-present before */
> > - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > - pte_unmap_unlock(dst_pte, ptl);
> > + SetPageDirty(page);
> > unlock_page(page);
> > ret = 0;
> > out:
> > return ret;
> > -out_release_unlock:
> > - pte_unmap_unlock(dst_pte, ptl);
> > - ClearPageDirty(page);
> > +out_release_uncharge:
> > delete_from_page_cache(page);
> > + shmem_uncharge(inode, 1);
> > out_release:
> > unlock_page(page);
> > put_page(page);
> > -out_unacct_blocks:
>
> Will all the "goto out_release" miss one call to shmem_inode_unacct_blocks()?

Good catch, yes.

Which is why I suggest above to move the info->locked block down:
then you can forget about using shmem_uncharge(), and just use
shmem_inode_unacct_blocks() directly in all cases that need it here.

(But I notice the commit message refers to shmem_uncharge(), so that
will need adjusting. shmem_uncharge() is more of a hack than a proper
interface, introduced to manage certain THP split/collapse cases.)

>
> > - shmem_inode_unacct_blocks(inode, 1);
> > goto out;
> > }
> > #endif /* CONFIG_USERFAULTFD */

2021-04-27 16:16:52

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Mon, Apr 26, 2021 at 07:19:58PM -0700, Hugh Dickins wrote:
> On Tue, 20 Apr 2021, Axel Rasmussen wrote:
>
> > With this change, userspace can resolve a minor fault within a
> > shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
> > match those for hugetlbfs - we look up the existing page in the page
> > cache, and install a PTE for it.
> >
> > This commit introduces a new helper: mcopy_atomic_install_pte.
> >
> > Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
> > shmem.c? The existing userfault implementation only relies on shmem.c
> > for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
> > fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
> > shmem in one place, regardless of shared/private (to reduce code
> > duplication).
> >
> > Why add a new mcopy_atomic_install_pte helper? A problem we have with
> > continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
> > *close* to what we want, but not exactly. We do want to setup the PTEs
> > in a CONTINUE operation, but we don't want to e.g. allocate a new page,
> > charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
> > we have the problem stated above: shmem_mcopy_atomic_pte() and
> > mcopy_atomic_pte() both handle one-half of the problem (shared /
> > private) continue cares about. So, introduce mcontinue_atomic_pte(), to
> > handle all of the shmem continue cases. Introduce the helper so it
> > doesn't duplicate code with mcopy_atomic_pte().
> >
> > In a future commit, shmem_mcopy_atomic_pte() will also be modified to
> > use this new helper. However, since this is a bigger refactor, it seems
> > most clear to do it as a separate change.
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>
>
> If this "03/10" had been numbered 04/10, I would have said
> Acked-by: Hugh Dickins <[email protected]>
>
> But I find this new ordering incomprehensible - I'm surprised that it
> even builds this way around (if it does): this patch is so much about
> what has been enabled in "04/10" (references to UFFDIO_CONTINUE shmem
> VMAs etc).
>
> Does Peter still think this way round is better? If he does, then we
> shall have to compromise by asking you just to squash the two together.

Hi, Hugh, Axel,

I have no strong opinion. To me, UFFDIO_CONTINUE can be introduced earlier like
this. As long as we don't enable the feature (which is done in the next patch),
no one will be able to call it, then it looks clean. Merging them also looks
good to me.

Thanks,

--
Peter Xu

2021-04-27 17:00:16

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

I'd prefer to keep them separate, as they are not tiny patches (they
are roughly +200/-150 each). And, they really are quite independent -
at least in the sense that I can reorder them via rebase with no
conflicts, and the code builds at each commit in either orientation. I
think this implies they're easier to review separately, rather than
squashed.

I don't have a strong feeling about the order. I slightly prefer
swapping them compared to this v4 series: first introduce minor
faults, then introduce CONTINUE.

Since Peter also has no strong opinion, and Hugh it sounds like you
prefer it the other way around, I'll swap them as we had in some
previous version of this series: first introduce minor faults, then
introduce CONTINUE.

On Tue, Apr 27, 2021 at 8:54 AM Peter Xu <[email protected]> wrote:
>
> On Mon, Apr 26, 2021 at 07:19:58PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Apr 2021, Axel Rasmussen wrote:
> >
> > > With this change, userspace can resolve a minor fault within a
> > > shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
> > > match those for hugetlbfs - we look up the existing page in the page
> > > cache, and install a PTE for it.
> > >
> > > This commit introduces a new helper: mcopy_atomic_install_pte.
> > >
> > > Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
> > > shmem.c? The existing userfault implementation only relies on shmem.c
> > > for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
> > > fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
> > > shmem in one place, regardless of shared/private (to reduce code
> > > duplication).
> > >
> > > Why add a new mcopy_atomic_install_pte helper? A problem we have with
> > > continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
> > > *close* to what we want, but not exactly. We do want to setup the PTEs
> > > in a CONTINUE operation, but we don't want to e.g. allocate a new page,
> > > charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
> > > we have the problem stated above: shmem_mcopy_atomic_pte() and
> > > mcopy_atomic_pte() both handle one-half of the problem (shared /
> > > private) continue cares about. So, introduce mcontinue_atomic_pte(), to
> > > handle all of the shmem continue cases. Introduce the helper so it
> > > doesn't duplicate code with mcopy_atomic_pte().
> > >
> > > In a future commit, shmem_mcopy_atomic_pte() will also be modified to
> > > use this new helper. However, since this is a bigger refactor, it seems
> > > most clear to do it as a separate change.
> > >
> > > Signed-off-by: Axel Rasmussen <[email protected]>
> >
> > If this "03/10" had been numbered 04/10, I would have said
> > Acked-by: Hugh Dickins <[email protected]>
> >
> > But I find this new ordering incomprehensible - I'm surprised that it
> > even builds this way around (if it does): this patch is so much about
> > what has been enabled in "04/10" (references to UFFDIO_CONTINUE shmem
> > VMAs etc).
> >
> > Does Peter still think this way round is better? If he does, then we
> > shall have to compromise by asking you just to squash the two together.
>
> Hi, Hugh, Axel,
>
> I have no strong opinion. To me, UFFDIO_CONTINUE can be introduced earlier like
> this. As long as we don't enable the feature (which is done in the next patch),
> no one will be able to call it, then it looks clean. Merging them also looks
> good to me.
>
> Thanks,
>
> --
> Peter Xu
>

2021-04-27 18:07:13

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, Apr 27, 2021 at 09:57:16AM -0700, Axel Rasmussen wrote:
> I'd prefer to keep them separate, as they are not tiny patches (they
> are roughly +200/-150 each). And, they really are quite independent -
> at least in the sense that I can reorder them via rebase with no
> conflicts, and the code builds at each commit in either orientation. I
> think this implies they're easier to review separately, rather than
> squashed.
>
> I don't have a strong feeling about the order. I slightly prefer
> swapping them compared to this v4 series: first introduce minor
> faults, then introduce CONTINUE.
>
> Since Peter also has no strong opinion, and Hugh it sounds like you
> prefer it the other way around, I'll swap them as we had in some
> previous version of this series: first introduce minor faults, then
> introduce CONTINUE.

Yes I have no strong opinion, but that's probably the least I prefer. :-)

Because you'll declare UFFD_FEATURE_MINOR_SHMEM and enable this feature without
the feature being completely implemented (without UFFDIO_CONTINUE, it's not
complete since no one will be able to resolve that minor fault).

Not a big deal anyway, but since we're at it... Basically I think three things
to do for minor shmem support:

(1) UFFDIO_CONTINUE (resolving path)
(2) Handle fault path for shmem minor fault (faulting path)
(3) Enablement of UFFD_FEATURE_MINOR_SHMEM (from which point, user can detect
and enable it)

I have no preference on how you'd like to merge these steps (right now you did
1 first, then 2+3 later; or as Hugh suggested do 1+2+3 together), but I'd still
hope item 3 should always be the last, if possible...

Thanks,

--
Peter Xu

2021-04-27 20:32:23

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, Apr 27, 2021 at 11:03 AM Peter Xu <[email protected]> wrote:
>
> On Tue, Apr 27, 2021 at 09:57:16AM -0700, Axel Rasmussen wrote:
> > I'd prefer to keep them separate, as they are not tiny patches (they
> > are roughly +200/-150 each). And, they really are quite independent -
> > at least in the sense that I can reorder them via rebase with no
> > conflicts, and the code builds at each commit in either orientation. I
> > think this implies they're easier to review separately, rather than
> > squashed.
> >
> > I don't have a strong feeling about the order. I slightly prefer
> > swapping them compared to this v4 series: first introduce minor
> > faults, then introduce CONTINUE.
> >
> > Since Peter also has no strong opinion, and Hugh it sounds like you
> > prefer it the other way around, I'll swap them as we had in some
> > previous version of this series: first introduce minor faults, then
> > introduce CONTINUE.
>
> Yes I have no strong opinion, but that's probably the least I prefer. :-)
>
> Because you'll declare UFFD_FEATURE_MINOR_SHMEM and enable this feature without
> the feature being completely implemented (without UFFDIO_CONTINUE, it's not
> complete since no one will be able to resolve that minor fault).
>
> Not a big deal anyway, but since we're at it... Basically I think three things
> to do for minor shmem support:
>
> (1) UFFDIO_CONTINUE (resolving path)
> (2) Handle fault path for shmem minor fault (faulting path)
> (3) Enablement of UFFD_FEATURE_MINOR_SHMEM (from which point, user can detect
> and enable it)
>
> I have no preference on how you'd like to merge these steps (right now you did
> 1 first, then 2+3 later; or as Hugh suggested do 1+2+3 together), but I'd still
> hope item 3 should always be the last, if possible...

In that case, I'll split the patch which adds the faulting path in
two: add the faulting path hook and registration mode, and then in a
separate commit advertise the feature flag as available.

Then I'll order them like so, which I think is the order Hugh finds
more natural:
1. MInor fault registration / faulting path
2. CONTINUE ioctl to resolve the faults
3. Advertise the feature as supported

Sound okay?

>
> Thanks,
>
> --
> Peter Xu
>

2021-04-27 20:44:06

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, Apr 27, 2021 at 01:29:14PM -0700, Axel Rasmussen wrote:
> On Tue, Apr 27, 2021 at 11:03 AM Peter Xu <[email protected]> wrote:
> >
> > On Tue, Apr 27, 2021 at 09:57:16AM -0700, Axel Rasmussen wrote:
> > > I'd prefer to keep them separate, as they are not tiny patches (they
> > > are roughly +200/-150 each). And, they really are quite independent -
> > > at least in the sense that I can reorder them via rebase with no
> > > conflicts, and the code builds at each commit in either orientation. I
> > > think this implies they're easier to review separately, rather than
> > > squashed.
> > >
> > > I don't have a strong feeling about the order. I slightly prefer
> > > swapping them compared to this v4 series: first introduce minor
> > > faults, then introduce CONTINUE.
> > >
> > > Since Peter also has no strong opinion, and Hugh it sounds like you
> > > prefer it the other way around, I'll swap them as we had in some
> > > previous version of this series: first introduce minor faults, then
> > > introduce CONTINUE.
> >
> > Yes I have no strong opinion, but that's probably the least I prefer. :-)
> >
> > Because you'll declare UFFD_FEATURE_MINOR_SHMEM and enable this feature without
> > the feature being completely implemented (without UFFDIO_CONTINUE, it's not
> > complete since no one will be able to resolve that minor fault).
> >
> > Not a big deal anyway, but since we're at it... Basically I think three things
> > to do for minor shmem support:
> >
> > (1) UFFDIO_CONTINUE (resolving path)
> > (2) Handle fault path for shmem minor fault (faulting path)
> > (3) Enablement of UFFD_FEATURE_MINOR_SHMEM (from which point, user can detect
> > and enable it)
> >
> > I have no preference on how you'd like to merge these steps (right now you did
> > 1 first, then 2+3 later; or as Hugh suggested do 1+2+3 together), but I'd still
> > hope item 3 should always be the last, if possible...
>
> In that case, I'll split the patch which adds the faulting path in
> two: add the faulting path hook and registration mode, and then in a
> separate commit advertise the feature flag as available.
>
> Then I'll order them like so, which I think is the order Hugh finds
> more natural:
> 1. MInor fault registration / faulting path
> 2. CONTINUE ioctl to resolve the faults
> 3. Advertise the feature as supported
>
> Sound okay?

Good to me, thanks Axel.

--
Peter Xu

2021-04-27 21:01:08

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH v4 03/10] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

On Tue, Apr 27, 2021 at 1:42 PM Peter Xu <[email protected]> wrote:
> On Tue, Apr 27, 2021 at 01:29:14PM -0700, Axel Rasmussen wrote:
> > On Tue, Apr 27, 2021 at 11:03 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Tue, Apr 27, 2021 at 09:57:16AM -0700, Axel Rasmussen wrote:
> > > > I'd prefer to keep them separate, as they are not tiny patches (they
> > > > are roughly +200/-150 each). And, they really are quite independent -
> > > > at least in the sense that I can reorder them via rebase with no
> > > > conflicts, and the code builds at each commit in either orientation. I
> > > > think this implies they're easier to review separately, rather than
> > > > squashed.
> > > >
> > > > I don't have a strong feeling about the order. I slightly prefer
> > > > swapping them compared to this v4 series: first introduce minor
> > > > faults, then introduce CONTINUE.
> > > >
> > > > Since Peter also has no strong opinion, and Hugh it sounds like you
> > > > prefer it the other way around, I'll swap them as we had in some
> > > > previous version of this series: first introduce minor faults, then
> > > > introduce CONTINUE.
> > >
> > > Yes I have no strong opinion, but that's probably the least I prefer. :-)
> > >
> > > Because you'll declare UFFD_FEATURE_MINOR_SHMEM and enable this feature without
> > > the feature being completely implemented (without UFFDIO_CONTINUE, it's not
> > > complete since no one will be able to resolve that minor fault).
> > >
> > > Not a big deal anyway, but since we're at it... Basically I think three things
> > > to do for minor shmem support:
> > >
> > > (1) UFFDIO_CONTINUE (resolving path)
> > > (2) Handle fault path for shmem minor fault (faulting path)
> > > (3) Enablement of UFFD_FEATURE_MINOR_SHMEM (from which point, user can detect
> > > and enable it)
> > >
> > > I have no preference on how you'd like to merge these steps (right now you did
> > > 1 first, then 2+3 later; or as Hugh suggested do 1+2+3 together), but I'd still
> > > hope item 3 should always be the last, if possible...
> >
> > In that case, I'll split the patch which adds the faulting path in
> > two: add the faulting path hook and registration mode, and then in a
> > separate commit advertise the feature flag as available.
> >
> > Then I'll order them like so, which I think is the order Hugh finds
> > more natural:
> > 1. MInor fault registration / faulting path
> > 2. CONTINUE ioctl to resolve the faults
> > 3. Advertise the feature as supported
> >
> > Sound okay?
>
> Good to me, thanks Axel.

Okay.