2022-04-22 23:20:02

by Axel Rasmussen

Subject: [PATCH v2 0/6] userfaultfd: add /dev/userfaultfd for fine grained access control

This series is based on torvalds/master, but additionally the run_vmtests.sh
changes assume my refactor [1] has been applied first.

The series is split up like so:
- Patch 1 is a simple fixup which we should take in any case (even by itself).
- Patches 2-4 add the feature, basic support for it to the selftest, and docs.
- Patches 5-6 make the selftest configurable, so you can test one or the other
instead of always both. If we decide this is overcomplicated, we could just
drop these two patches and take the rest of the series.

[1]: https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

Changelog:
v1->v2:
- Add documentation update.
- Test *both* userfaultfd(2) and /dev/userfaultfd via the selftest.

Axel Rasmussen (6):
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
userfaultfd: add /dev/userfaultfd for fine grained access control
userfaultfd: selftests: modify selftest to use /dev/userfaultfd
userfaultfd: update documentation to describe /dev/userfaultfd
userfaultfd: selftests: make /dev/userfaultfd testing configurable
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh

Documentation/admin-guide/mm/userfaultfd.rst | 38 +++++++++-
Documentation/admin-guide/sysctl/vm.rst | 3 +
fs/userfaultfd.c | 79 ++++++++++++++++----
include/uapi/linux/userfaultfd.h | 4 +
tools/testing/selftests/vm/run_vmtests.sh | 11 ++-
tools/testing/selftests/vm/userfaultfd.c | 60 +++++++++++++--
6 files changed, 170 insertions(+), 25 deletions(-)

--
2.36.0.rc2.479.g8af0fa9b8e-goog


2022-04-22 23:20:08

by Axel Rasmussen

Subject: [PATCH v2 4/6] userfaultfd: update documentation to describe /dev/userfaultfd

Explain the different ways to create a new userfaultfd, and how access
control works for each way.

Signed-off-by: Axel Rasmussen <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
Documentation/admin-guide/sysctl/vm.rst | 3 ++
2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 6528036093e1..4c079b5377d4 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======

-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.

The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@@ -39,7 +42,7 @@ Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.

-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@@ -50,6 +53,37 @@ is a corner case that would currently return ``-EBUSY``).
API
===

+Creating a userfaultfd
+----------------------
+
+There are two mechanisms to create a userfaultfd. There are various ways to
+restrict this too, since userfaultfds which handle kernel page faults have
+historically been a useful tool for exploiting the kernel.
+
+The first is the userfaultfd(2) syscall. Access to this is controlled in several
+ways:
+
+- By default, the userfaultfd will be able to handle kernel page faults. This
+ can be disabled by passing in UFFD_USER_MODE_ONLY.
+
+- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
+ CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
+
+- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
+ use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
+
+Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
+issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is
+controlled via normal filesystem permissions (user/group/mode for example) - no
+additional permission (capability/sysctl) is needed to be able to handle kernel
+faults this way. This is useful because it allows e.g. a specific user or group
+to be able to create kernel-fault-handling userfaultfds, without allowing it
+more broadly, or granting more privileges in addition to that particular ability
+(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
+
+Initializing up a userfaultfd
+------------------------
+
When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f4804ce37c58..8682d5fbc8ea 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -880,6 +880,9 @@ calls without any restrictions.

The default value is 0.

+An alternative to this sysctl / the userfaultfd(2) syscall is to create
+userfaultfds via /dev/userfaultfd. See
+Documentation/admin-guide/mm/userfaultfd.rst.

user_reserve_kbytes
===================
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 23:20:21

by Axel Rasmussen

Subject: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
of time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl
must be configured so that any unprivileged user can do it.

In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:

- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.

- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".

This is all a long-winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional
permissions at the same time.

To achieve this, add a /dev/userfaultfd misc device. This device
provides an alternative to the userfaultfd(2) syscall for the creation
of new userfaultfds. The idea is, any userfaultfds created this way will
be able to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 79 ++++++++++++++++++++++++++------
include/uapi/linux/userfaultfd.h | 4 ++
2 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..16d7573ab41a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,7 @@
#include <linux/ioctl.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/miscdevice.h>

int sysctl_unprivileged_userfaultfd __read_mostly;

@@ -65,6 +66,8 @@ struct userfaultfd_ctx {
unsigned int flags;
/* features requested from the userspace */
unsigned int features;
+ /* whether or not to handle kernel faults */
+ bool handle_kernel_faults;
/* released */
bool released;
/* memory mappings are changing because of non-cooperative event */
@@ -410,13 +413,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)

if (ctx->features & UFFD_FEATURE_SIGBUS)
goto out;
- if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
- ctx->flags & UFFD_USER_MODE_ONLY) {
- printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd "
- "sysctl knob to 1 if kernel faults must be handled "
- "without obtaining CAP_SYS_PTRACE capability\n");
+ if (!(vmf->flags & FAULT_FLAG_USER) && !ctx->handle_kernel_faults)
goto out;
- }

/*
* If it's already released don't get it. This avoids to loop
@@ -2064,19 +2062,33 @@ static void init_once_userfaultfd_ctx(void *mem)
seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
}

-SYSCALL_DEFINE1(userfaultfd, int, flags)
+static inline bool userfaultfd_allowed(bool is_syscall, int flags)
+{
+ bool kernel_faults = !(flags & UFFD_USER_MODE_ONLY);
+ bool allow_unprivileged = sysctl_unprivileged_userfaultfd;
+
+ /* userfaultfd(2) access is controlled by sysctl + capability. */
+ if (is_syscall && kernel_faults) {
+ if (!allow_unprivileged && !capable(CAP_SYS_PTRACE))
+ return false;
+ }
+
+ /*
+ * For /dev/userfaultfd, access is to be controlled using e.g.
+ * permissions on the device node. We assume this is correctly
+ * configured by userspace, so we simply allow access here.
+ */
+
+ return true;
+}
+
+static int new_userfaultfd(bool is_syscall, int flags)
{
struct userfaultfd_ctx *ctx;
int fd;

- if (!sysctl_unprivileged_userfaultfd &&
- (flags & UFFD_USER_MODE_ONLY) == 0 &&
- !capable(CAP_SYS_PTRACE)) {
- printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd "
- "sysctl knob to 1 if kernel faults must be handled "
- "without obtaining CAP_SYS_PTRACE capability\n");
+ if (!userfaultfd_allowed(is_syscall, flags))
return -EPERM;
- }

BUG_ON(!current->mm);

@@ -2095,6 +2107,11 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
refcount_set(&ctx->refcount, 1);
ctx->flags = flags;
ctx->features = 0;
+ /*
+ * If UFFD_USER_MODE_ONLY is not set, then userfaultfd_allowed() above
+ * decided that kernel faults were allowed and should be handled.
+ */
+ ctx->handle_kernel_faults = !(flags & UFFD_USER_MODE_ONLY);
ctx->released = false;
atomic_set(&ctx->mmap_changing, 0);
ctx->mm = current->mm;
@@ -2110,8 +2127,42 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
return fd;
}

+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+ return new_userfaultfd(true, flags);
+}
+
+static int userfaultfd_dev_open(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
+{
+ if (cmd != USERFAULTFD_IOC_NEW)
+ return -EINVAL;
+
+ return new_userfaultfd(false, flags);
+}
+
+static const struct file_operations userfaultfd_dev_fops = {
+ .open = userfaultfd_dev_open,
+ .unlocked_ioctl = userfaultfd_dev_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
+ .owner = THIS_MODULE,
+ .llseek = noop_llseek,
+};
+
+static struct miscdevice userfaultfd_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "userfaultfd",
+ .fops = &userfaultfd_dev_fops
+};
+
static int __init userfaultfd_init(void)
{
+ WARN_ON(misc_register(&userfaultfd_misc));
+
userfaultfd_ctx_cachep = kmem_cache_create("userfaultfd_ctx_cache",
sizeof(struct userfaultfd_ctx),
0,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ef739054cb1c..032a35b3bbd2 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -12,6 +12,10 @@

#include <linux/types.h>

+/* ioctls for /dev/userfaultfd */
+#define USERFAULTFD_IOC 0xAA
+#define USERFAULTFD_IOC_NEW _IOWR(USERFAULTFD_IOC, 0x00, int)
+
/*
* If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
* UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR. In
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 23:20:42

by Axel Rasmussen

Subject: [PATCH v2 1/6] selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh

This not being included was just a simple oversight. There are certain
features (like minor fault support) which are only enabled on shared
mappings, so without including hugetlb_shared we actually lose a
significant amount of test coverage.

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/run_vmtests.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index a2302b5faaf2..5065dbd89bdb 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -121,9 +121,11 @@ run_test ./gup_test -a
run_test ./gup_test -ct -F 0x1 0 19 0x1000

run_test ./userfaultfd anon 20 16
-# Test requires source and destination huge pages. Size of source
-# (half_ufd_size_MB) is passed as argument to test.
+# Hugetlb tests require source and destination huge pages. Pass in half the
+# size ($half_ufd_size_MB), which is used for *each*.
run_test ./userfaultfd hugetlb "$half_ufd_size_MB" 32
+run_test ./userfaultfd hugetlb_shared "$half_ufd_size_MB" 32 "$mnt"/uffd-test
+rm -f "$mnt"/uffd-test
run_test ./userfaultfd shmem 20 16

#cleanup
--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 23:20:50

by Axel Rasmussen

Subject: [PATCH v2 5/6] userfaultfd: selftests: make /dev/userfaultfd testing configurable

Instead of always testing both userfaultfd(2) and /dev/userfaultfd,
let the user choose which to test.

As with other test features, change the behavior based on a new
command line flag. Introduce the idea of "test mods", which are
generic (not specific to a test type) modifications to the behavior of
the test. This is sort of borrowed from this RFC patch series [1], but
simplified a bit.

The benefit is, in "typical" configurations this test is somewhat slow
(say, 30sec or something). Testing both clearly doubles it, so it may
not always be desirable, as users are likely to use one or the other,
but never both, in the "real world".

[1]: https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 41 +++++++++++++++++-------
1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 12ae742a9981..274522704e40 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -142,8 +142,17 @@ static void usage(void)
{
fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
"[hugetlbfs_file]\n\n");
+
fprintf(stderr, "Supported <test type>: anon, hugetlb, "
"hugetlb_shared, shmem\n\n");
+
+ fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
+ "Supported mods:\n");
+ fprintf(stderr, "\tdev - Use /dev/userfaultfd instead of userfaultfd(2)\n");
+ fprintf(stderr, "\nExample test mod usage:\n");
+ fprintf(stderr, "# Run anonymous memory test with /dev/userfaultfd:\n");
+ fprintf(stderr, "./userfaultfd anon:dev 100 99999\n\n");
+
fprintf(stderr, "Examples:\n\n");
fprintf(stderr, "%s", examples);
exit(1);
@@ -1610,8 +1619,6 @@ unsigned long default_huge_page_size(void)

static void set_test_type(const char *type)
{
- uint64_t features = UFFD_API_FEATURES;
-
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
@@ -1631,10 +1638,28 @@ static void set_test_type(const char *type)
test_type = TEST_SHMEM;
uffd_test_ops = &shmem_uffd_test_ops;
test_uffdio_minor = true;
- } else {
- err("Unknown test type: %s", type);
+ }
+}
+
+static void parse_test_type_arg(const char *raw_type)
+{
+ char *buf = strdup(raw_type);
+ uint64_t features = UFFD_API_FEATURES;
+
+ while (buf) {
+ const char *token = strsep(&buf, ":");
+
+ if (!test_type)
+ set_test_type(token);
+ else if (!strcmp(token, "dev"))
+ test_dev_userfaultfd = true;
+ else
+ err("unrecognized test mod '%s'", token);
}

+ if (!test_type)
+ err("failed to parse test type argument: '%s'", raw_type);
+
if (test_type == TEST_HUGETLB)
page_size = default_huge_page_size();
else
@@ -1681,7 +1706,7 @@ int main(int argc, char **argv)
err("failed to arm SIGALRM");
alarm(ALARM_INTERVAL_SECS);

- set_test_type(argv[1]);
+ parse_test_type_arg(argv[1]);

nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
@@ -1719,12 +1744,6 @@ int main(int argc, char **argv)
}
printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
nr_pages, nr_pages_per_cpu);
-
- test_dev_userfaultfd = false;
- if (userfaultfd_stress())
- return 1;
-
- test_dev_userfaultfd = true;
return userfaultfd_stress();
}

--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-22 23:21:06

by Axel Rasmussen

Subject: [PATCH v2 3/6] userfaultfd: selftests: modify selftest to use /dev/userfaultfd

We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
working into the future, so just run the test twice, using each
interface.

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 31 ++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92a4516f8f0d..12ae742a9981 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -77,6 +77,9 @@ static int bounces;
#define TEST_SHMEM 3
static int test_type;

+/* test using /dev/userfaultfd, instead of userfaultfd(2) */
+static bool test_dev_userfaultfd;
+
/* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
#define ALARM_INTERVAL_SECS 10
static volatile bool test_uffdio_copy_eexist = true;
@@ -383,13 +386,31 @@ static void assert_expected_ioctls_present(uint64_t mode, uint64_t ioctls)
}
}

+static void __userfaultfd_open_dev(void)
+{
+ int fd;
+
+ uffd = -1;
+ fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ return;
+
+ uffd = ioctl(fd, USERFAULTFD_IOC_NEW,
+ O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
+ close(fd);
+}
+
static void userfaultfd_open(uint64_t *features)
{
struct uffdio_api uffdio_api;

- uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
+ if (test_dev_userfaultfd)
+ __userfaultfd_open_dev();
+ else
+ uffd = syscall(__NR_userfaultfd,
+ O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
if (uffd < 0)
- err("userfaultfd syscall not available in this kernel");
+ err("creating userfaultfd failed");
uffd_flags = fcntl(uffd, F_GETFD, NULL);

uffdio_api.api = UFFD_API;
@@ -1698,6 +1719,12 @@ int main(int argc, char **argv)
}
printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
nr_pages, nr_pages_per_cpu);
+
+ test_dev_userfaultfd = false;
+ if (userfaultfd_stress())
+ return 1;
+
+ test_dev_userfaultfd = true;
return userfaultfd_stress();
}

--
2.36.0.rc2.479.g8af0fa9b8e-goog

2022-04-26 04:20:38

by Dmitry V. Levin

Subject: Re: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

On Fri, Apr 22, 2022 at 02:29:41PM -0700, Axel Rasmussen wrote:
[...]
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -12,6 +12,10 @@
>
> #include <linux/types.h>
>
> +/* ioctls for /dev/userfaultfd */
> +#define USERFAULTFD_IOC 0xAA
> +#define USERFAULTFD_IOC_NEW _IOWR(USERFAULTFD_IOC, 0x00, int)

Why is this new ioctl defined using _IOWR()? Since it neither reads from
user memory nor writes into user memory, it should rather be defined using
_IO(), shouldn't it?
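
A minimal userspace sketch of the difference, assuming the Linux ioctl
encoding macros from <linux/ioctl.h>. The EXAMPLE_* names and the
describe_ioctl() helper are hypothetical; only the 0xAA magic and the 0x00
command number come from the patch:

```c
#include <stdio.h>
#include <linux/ioctl.h>	/* _IO, _IOWR, _IOC_DIR, _IOC_SIZE, _IOC_NR */

/* Magic number taken from the patch; the macro names are illustrative. */
#define EXAMPLE_IOC		0xAA

/* The v2 definition and the _IO() form suggested in this review. */
#define EXAMPLE_IOC_NEW_IOWR	_IOWR(EXAMPLE_IOC, 0x00, int)
#define EXAMPLE_IOC_NEW_IO	_IO(EXAMPLE_IOC, 0x00)

/* Print the direction, size and command-number fields of an ioctl number. */
static void describe_ioctl(const char *name, unsigned long cmd)
{
	printf("%-6s dir=%lu size=%lu nr=%lu\n", name,
	       (unsigned long)_IOC_DIR(cmd),
	       (unsigned long)_IOC_SIZE(cmd),
	       (unsigned long)_IOC_NR(cmd));
}
```

Passing both numbers to describe_ioctl() shows that the _IO() form carries no
transfer direction and a size of zero, while the _IOWR() form advertises a
read/write transfer of sizeof(int) bytes that the handler never performs.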


--
ldv

2022-04-27 06:45:19

by Peter Xu

Subject: Re: [PATCH v2 1/6] selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh

On Fri, Apr 22, 2022 at 02:29:40PM -0700, Axel Rasmussen wrote:
> This not being included was just a simple oversight. There are certain
> features (like minor fault support) which are only enabled on shared
> mappings, so without including hugetlb_shared we actually lose a
> significant amount of test coverage.
>
> Signed-off-by: Axel Rasmussen <[email protected]>

Reviewed-by: Peter Xu <[email protected]>

--
Peter Xu

2022-04-27 09:28:01

by Axel Rasmussen

Subject: Re: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

On Tue, Apr 26, 2022 at 1:33 PM Peter Xu <[email protected]> wrote:
>
> Axel,
>
> On Fri, Apr 22, 2022 at 02:29:41PM -0700, Axel Rasmussen wrote:
> > @@ -65,6 +66,8 @@ struct userfaultfd_ctx {
> > unsigned int flags;
> > /* features requested from the userspace */
> > unsigned int features;
> > + /* whether or not to handle kernel faults */
> > + bool handle_kernel_faults;
>
> Could you help explain why we need this bool? I failed to figure out
> myself on the difference against "!(ctx->flags & UFFD_USER_MODE_ONLY)".

Ah, yeah you're right, we can get rid of it and just rely on
UFFD_USER_MODE_ONLY.

Just to add context, in a previous version I never sent out, I had:

ctx->handle_kernel_faults = userfaultfd_allowed(...);

That's wrong for other reasons, but if we were going to do that we'd
have to store the result, since it's a function not just of the flags,
but also of the method used to create the userfaultfd. I changed this
without also dropping the boolean, which can now be cleaned up. I'll
include this change in a v3.
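
To make the distinction concrete, here is a rough userspace model of the two
predicates, assuming the logic of userfaultfd_allowed() as posted in this v2.
The model_* names are hypothetical, and the sysctl and capability checks are
replaced by plain boolean parameters:

```c
#include <stdbool.h>

/* Same value as UFFD_USER_MODE_ONLY in include/uapi/linux/userfaultfd.h. */
#define MODEL_USER_MODE_ONLY	1

/*
 * Model of userfaultfd_allowed(): the answer depends on how the userfaultfd
 * is being created, not only on the flags.
 */
static bool model_allowed(bool is_syscall, int flags,
			  bool sysctl_unprivileged, bool has_cap_sys_ptrace)
{
	bool kernel_faults = !(flags & MODEL_USER_MODE_ONLY);

	/* userfaultfd(2): kernel faults need the sysctl or the capability. */
	if (is_syscall && kernel_faults)
		return sysctl_unprivileged || has_cap_sys_ptrace;

	/* /dev/userfaultfd: access was already decided by file permissions. */
	return true;
}

/*
 * Once creation has been allowed, whether kernel faults are handled is a
 * function of the flags alone, so no extra bool needs to be stored.
 */
static bool model_handles_kernel_faults(int flags)
{
	return !(flags & MODEL_USER_MODE_ONLY);
}
```

model_allowed() needs the is_syscall argument, whereas
model_handles_kernel_faults() can be recomputed from ctx->flags at any time.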

>
> Thanks,
>
> --
> Peter Xu
>

2022-04-27 10:18:41

by Shuah Khan

Subject: Re: [PATCH v2 1/6] selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh

On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> This not being included was just a simple oversight. There are certain
> features (like minor fault support) which are only enabled on shared
> mappings, so without including hugetlb_shared we actually lose a
> significant amount of test coverage.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> tools/testing/selftests/vm/run_vmtests.sh | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
> index a2302b5faaf2..5065dbd89bdb 100755
> --- a/tools/testing/selftests/vm/run_vmtests.sh
> +++ b/tools/testing/selftests/vm/run_vmtests.sh
> @@ -121,9 +121,11 @@ run_test ./gup_test -a
> run_test ./gup_test -ct -F 0x1 0 19 0x1000
>
> run_test ./userfaultfd anon 20 16
> -# Test requires source and destination huge pages. Size of source
> -# (half_ufd_size_MB) is passed as argument to test.
> +# Hugetlb tests require source and destination huge pages. Pass in half the
> +# size ($half_ufd_size_MB), which is used for *each*.
> run_test ./userfaultfd hugetlb "$half_ufd_size_MB" 32
> +run_test ./userfaultfd hugetlb_shared "$half_ufd_size_MB" 32 "$mnt"/uffd-test
> +rm -f "$mnt"/uffd-test
> run_test ./userfaultfd shmem 20 16
>
> #cleanup
>

Looks good to me.

Reviewed-by: Shuah Khan <[email protected]>

thanks,
-- Shuah

2022-04-27 10:25:37

by Arnd Bergmann

Subject: Re: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

On Tue, Apr 26, 2022 at 6:00 PM Axel Rasmussen <[email protected]> wrote:
>
> You're right, [1] says _IO is appropriate for ioctls which only take
> an integer argument. I'll send a v3 with this fix, although I might
> wait a bit for any other review comments before doing so. Thanks for
> taking a look!

If there are no other command codes, you could also set .compat_ioctl
to the same function pointer as .unlocked_ioctl, the compat_ptr_ioctl
conversion is only needed when there are commands that take a pointer.

Arnd

2022-04-27 10:30:29

by Shuah Khan

Subject: Re: [PATCH v2 5/6] userfaultfd: selftests: make /dev/userfaultfd testing configurable

On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> Instead of always testing both userfaultfd(2) and /dev/userfaultfd,
> let the user choose which to test.
>
> As with other test features, change the behavior based on a new
> command line flag. Introduce the idea of "test mods", which are
> generic (not specific to a test type) modifications to the behavior of
> the test. This is sort of borrowed from this RFC patch series [1], but
> simplified a bit.
>
> The benefit is, in "typical" configurations this test is somewhat slow
> (say, 30sec or something). Testing both clearly doubles it, so it may
> not always be desirable, as users are likely to use one or the other,
> but never both, in the "real world".
>
> [1]: https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> tools/testing/selftests/vm/userfaultfd.c | 41 +++++++++++++++++-------
> 1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 12ae742a9981..274522704e40 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -142,8 +142,17 @@ static void usage(void)
> {
> fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
> "[hugetlbfs_file]\n\n");
> +

Remove the extra blank line here.

> fprintf(stderr, "Supported <test type>: anon, hugetlb, "
> "hugetlb_shared, shmem\n\n");
> +

Remove the extra blank line here.

> + fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
> + "Supported mods:\n");
> + fprintf(stderr, "\tdev - Use /dev/userfaultfd instead of userfaultfd(2)\n");
> + fprintf(stderr, "\nExample test mod usage:\n");
> + fprintf(stderr, "# Run anonymous memory test with /dev/userfaultfd:\n");
> + fprintf(stderr, "./userfaultfd anon:dev 100 99999\n\n");
> +
> fprintf(stderr, "Examples:\n\n");
> fprintf(stderr, "%s", examples);

Update examples above with new test cases if any.

> exit(1);
> @@ -1610,8 +1619,6 @@ unsigned long default_huge_page_size(void)
>
> static void set_test_type(const char *type)
> {
> - uint64_t features = UFFD_API_FEATURES;
> -
> if (!strcmp(type, "anon")) {
> test_type = TEST_ANON;
> uffd_test_ops = &anon_uffd_test_ops;
> @@ -1631,10 +1638,28 @@ static void set_test_type(const char *type)
> test_type = TEST_SHMEM;
> uffd_test_ops = &shmem_uffd_test_ops;
> test_uffdio_minor = true;
> - } else {
> - err("Unknown test type: %s", type);
> + }

At this point, it might be much easier and more maintainable to use
getopt instead of hand-parsing the options.

> +}
> +
> +static void parse_test_type_arg(const char *raw_type)
> +{
> + char *buf = strdup(raw_type);
> + uint64_t features = UFFD_API_FEATURES;
> +
> + while (buf) {
> + const char *token = strsep(&buf, ":");
> +
> + if (!test_type)
> + set_test_type(token);
> + else if (!strcmp(token, "dev"))
> + test_dev_userfaultfd = true;
> + else
> + err("unrecognized test mod '%s'", token);
> }
>
> + if (!test_type)
> + err("failed to parse test type argument: '%s'", raw_type);
> +
> if (test_type == TEST_HUGETLB)
> page_size = default_huge_page_size();
> else
> @@ -1681,7 +1706,7 @@ int main(int argc, char **argv)
> err("failed to arm SIGALRM");
> alarm(ALARM_INTERVAL_SECS);
>
> - set_test_type(argv[1]);
> + parse_test_type_arg(argv[1]);
>
> nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
> nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
> @@ -1719,12 +1744,6 @@ int main(int argc, char **argv)
> }
> printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
> nr_pages, nr_pages_per_cpu);
> -
> - test_dev_userfaultfd = false;
> - if (userfaultfd_stress())
> - return 1;
> -
> - test_dev_userfaultfd = true;
> return userfaultfd_stress();
> }
>
>

Same comments as before on fail vs. skip conditions to watch out
for and report them correctly.

thanks,
-- Shuah

2022-04-27 10:41:32

by Shuah Khan

Subject: Re: [PATCH v2 4/6] userfaultfd: update documentation to describe /dev/userfaultfd

On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> Explain the different ways to create a new userfaultfd, and how access
> control works for each way.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
> Documentation/admin-guide/sysctl/vm.rst | 3 ++
> 2 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 6528036093e1..4c079b5377d4 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
> Design
> ======
>
> -Userfaults are delivered and resolved through the ``userfaultfd`` syscall.

Please keep this sentence in there and rephrase it to indicate how it was
done in the past.

Also explain here why this new approach is better than the syscall approach
before getting into the below details.

> +Userspace creates a new userfaultfd, initializes it, and registers one or more
> +regions of virtual memory with it. Then, any page faults which occur within the
> +region(s) result in a message being delivered to the userfaultfd, notifying
> +userspace of the fault.
>
> The ``userfaultfd`` (aside from registering and unregistering virtual
> memory ranges) provides two primary functionalities:
> @@ -39,7 +42,7 @@ Vmas are not suitable for page- (or hugepage) granular fault tracking
> when dealing with virtual address spaces that could span
> Terabytes. Too many vmas would be needed for that.
>
> -The ``userfaultfd`` once opened by invoking the syscall, can also be
> +The ``userfaultfd``, once created, can also be

This sentence is too short and would look odd. Combine the sentences
so it renders well in the generated doc.

> passed using unix domain sockets to a manager process, so the same
> manager process could handle the userfaults of a multitude of
> different processes without them being aware about what is going on
> @@ -50,6 +53,37 @@ is a corner case that would currently return ``-EBUSY``).
> API
> ===
>
> +Creating a userfaultfd
> +----------------------
> +
> +There are two mechanisms to create a userfaultfd. There are various ways to
> +restrict this too, since userfaultfds which handle kernel page faults have
> +historically been a useful tool for exploiting the kernel.
> +
> +The first is the userfaultfd(2) syscall. Access to this is controlled in several
> +ways:
> +
> +- By default, the userfaultfd will be able to handle kernel page faults. This
> + can be disabled by passing in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
> + CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
> +
> +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
> + use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
> +
> +Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
> +issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is

New ioctl? I thought we are moving away from using ioctls?

> +controlled via normal filesystem permissions (user/group/mode for example) - no
> +additional permission (capability/sysctl) is needed to be able to handle kernel
> +faults this way. This is useful because it allows e.g. a specific user or group
> +to be able to create kernel-fault-handling userfaultfds, without allowing it
> +more broadly, or granting more privileges in addition to that particular ability
> +(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
> +
> +Initializing up a userfaultfd
> +------------------------
> +

This will very likely generate a doc warning; extend the dashes to the
entire length of the subtitle.

> When first opened the ``userfaultfd`` must be enabled invoking the
> ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
> a later API version) which will specify the ``read/POLLIN`` protocol
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index f4804ce37c58..8682d5fbc8ea 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -880,6 +880,9 @@ calls without any restrictions.
>
> The default value is 0.
>
> +An alternative to this sysctl / the userfaultfd(2) syscall is to create
> +userfaultfds via /dev/userfaultfd. See
> +Documentation/admin-guide/mm/userfaultfd.rst.
>
> user_reserve_kbytes
> ===================
>

thanks,
-- Shuah

2022-04-27 10:47:19

by Shuah Khan

Subject: Re: [PATCH v2 3/6] userfaultfd: selftests: modify selftest to use /dev/userfaultfd

On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
> working into the future, so just run the test twice, using each
> interface.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> tools/testing/selftests/vm/userfaultfd.c | 31 ++++++++++++++++++++++--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 92a4516f8f0d..12ae742a9981 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -77,6 +77,9 @@ static int bounces;
> #define TEST_SHMEM 3
> static int test_type;
>
> +/* test using /dev/userfaultfd, instead of userfaultfd(2) */
> +static bool test_dev_userfaultfd;
> +
> /* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
> #define ALARM_INTERVAL_SECS 10
> static volatile bool test_uffdio_copy_eexist = true;
> @@ -383,13 +386,31 @@ static void assert_expected_ioctls_present(uint64_t mode, uint64_t ioctls)
> }
> }
>
> +static void __userfaultfd_open_dev(void)
> +{
> + int fd;
> +
> + uffd = -1;
> + fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> + if (fd < 0)
> + return;
> +
> + uffd = ioctl(fd, USERFAULTFD_IOC_NEW,
> + O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> + close(fd);
> +}
> +
> static void userfaultfd_open(uint64_t *features)
> {
> struct uffdio_api uffdio_api;
>
> - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> + if (test_dev_userfaultfd)
> + __userfaultfd_open_dev();
> + else
> + uffd = syscall(__NR_userfaultfd,
> + O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> if (uffd < 0)
> - err("userfaultfd syscall not available in this kernel");
> + err("creating userfaultfd failed");

This isn't an error as in test failure. This will be a skip because of
unmet dependencies. Also if this test requires root access, please check
for that and make that a skip as well.

> uffd_flags = fcntl(uffd, F_GETFD, NULL);
>
> uffdio_api.api = UFFD_API;
> @@ -1698,6 +1719,12 @@ int main(int argc, char **argv)
> }
> printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
> nr_pages, nr_pages_per_cpu);
> +
> + test_dev_userfaultfd = false;
> + if (userfaultfd_stress())
> + return 1;
> +
> + test_dev_userfaultfd = true;
> return userfaultfd_stress();
> }
>
>

thanks,
-- Shuah

2022-04-27 11:05:20

by Axel Rasmussen

Subject: Re: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

You're right, [1] says _IO is appropriate for ioctls which only take
an integer argument. I'll send a v3 with this fix, although I might
wait a bit for any other review comments before doing so. Thanks for
taking a look!

[1]: https://www.kernel.org/doc/html/latest/driver-api/ioctl.html

On Mon, Apr 25, 2022 at 1:32 PM Dmitry V. Levin <[email protected]> wrote:
>
> On Fri, Apr 22, 2022 at 02:29:41PM -0700, Axel Rasmussen wrote:
> [...]
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -12,6 +12,10 @@
> >
> > #include <linux/types.h>
> >
> > +/* ioctls for /dev/userfaultfd */
> > +#define USERFAULTFD_IOC 0xAA
> > +#define USERFAULTFD_IOC_NEW _IOWR(USERFAULTFD_IOC, 0x00, int)
>
> Why this new ioctl is defined using _IOWR()? Since it neither reads from
> user memory nor writes into user memory, it should rather be defined using
> _IO(), shouldn't it?
>
>
> --
> ldv

2022-04-27 11:26:29

by Peter Xu

Subject: Re: [PATCH v2 2/6] userfaultfd: add /dev/userfaultfd for fine grained access control

Axel,

On Fri, Apr 22, 2022 at 02:29:41PM -0700, Axel Rasmussen wrote:
> @@ -65,6 +66,8 @@ struct userfaultfd_ctx {
> unsigned int flags;
> /* features requested from the userspace */
> unsigned int features;
> + /* whether or not to handle kernel faults */
> + bool handle_kernel_faults;

Could you help explain why we need this bool? I couldn't figure out
myself how it differs from "!(ctx->flags & UFFD_USER_MODE_ONLY)".

Thanks,

--
Peter Xu

2022-05-20 08:42:22

by Axel Rasmussen

Subject: Re: [PATCH v2 3/6] userfaultfd: selftests: modify selftest to use /dev/userfaultfd

On Tue, Apr 26, 2022 at 9:16 AM Shuah Khan <[email protected]> wrote:
>
> On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> > We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
> > working into the future, so just run the test twice, using each
> > interface.
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>
> > ---
> > tools/testing/selftests/vm/userfaultfd.c | 31 ++++++++++++++++++++++--
> > 1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > index 92a4516f8f0d..12ae742a9981 100644
> > --- a/tools/testing/selftests/vm/userfaultfd.c
> > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > @@ -77,6 +77,9 @@ static int bounces;
> > #define TEST_SHMEM 3
> > static int test_type;
> >
> > +/* test using /dev/userfaultfd, instead of userfaultfd(2) */
> > +static bool test_dev_userfaultfd;
> > +
> > /* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
> > #define ALARM_INTERVAL_SECS 10
> > static volatile bool test_uffdio_copy_eexist = true;
> > @@ -383,13 +386,31 @@ static void assert_expected_ioctls_present(uint64_t mode, uint64_t ioctls)
> > }
> > }
> >
> > +static void __userfaultfd_open_dev(void)
> > +{
> > + int fd;
> > +
> > + uffd = -1;
> > + fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
> > + if (fd < 0)
> > + return;
> > +
> > + uffd = ioctl(fd, USERFAULTFD_IOC_NEW,
> > + O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> > + close(fd);
> > +}
> > +
> > static void userfaultfd_open(uint64_t *features)
> > {
> > struct uffdio_api uffdio_api;
> >
> > - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> > + if (test_dev_userfaultfd)
> > + __userfaultfd_open_dev();
> > + else
> > + uffd = syscall(__NR_userfaultfd,
> > + O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> > if (uffd < 0)
> > - err("userfaultfd syscall not available in this kernel");
> > + err("creating userfaultfd failed");
>
> This isn't an error as in test failure. This will be a skip because of
> unmet dependencies. Also if this test requires root access, please check
> for that and make that a skip as well.

Testing with the userfaultfd syscall doesn't require any special
permissions (root or otherwise).

But testing with /dev/userfaultfd will require access to that device
node, which is root:root by default, but the system administrator may
have changed this. In general I think this will only fail due to a)
lack of kernel support or b) lack of permissions though, so always
exiting with KSFT_SKIP here seems reasonable. I'll make that change in
v3.

>
> > uffd_flags = fcntl(uffd, F_GETFD, NULL);
> >
> > uffdio_api.api = UFFD_API;
> > @@ -1698,6 +1719,12 @@ int main(int argc, char **argv)
> > }
> > printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
> > nr_pages, nr_pages_per_cpu);
> > +
> > + test_dev_userfaultfd = false;
> > + if (userfaultfd_stress())
> > + return 1;
> > +
> > + test_dev_userfaultfd = true;
> > return userfaultfd_stress();
> > }
> >
> >
>
> thanks,
> -- Shuah

2022-05-21 13:17:43

by Axel Rasmussen

Subject: Re: [PATCH v2 5/6] userfaultfd: selftests: make /dev/userfaultfd testing configurable

On Tue, Apr 26, 2022 at 9:56 AM Shuah Khan <[email protected]> wrote:
>
> On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> > Instead of always testing both userfaultfd(2) and /dev/userfaultfd,
> > let the user choose which to test.
> >
> > As with other test features, change the behavior based on a new
> > command line flag. Introduce the idea of "test mods", which are
> > generic (not specific to a test type) modifications to the behavior of
> > the test. This is sort of borrowed from this RFC patch series [1], but
> > simplified a bit.
> >
> > The benefit is, in "typical" configurations this test is somewhat slow
> > (say, 30sec or something). Testing both clearly doubles it, so it may
> > not always be desirable, as users are likely to use one or the other,
> > but never both, in the "real world".
> >
> > [1]: https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>
> > ---
> > tools/testing/selftests/vm/userfaultfd.c | 41 +++++++++++++++++-------
> > 1 file changed, 30 insertions(+), 11 deletions(-)
> >
> > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > index 12ae742a9981..274522704e40 100644
> > --- a/tools/testing/selftests/vm/userfaultfd.c
> > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > @@ -142,8 +142,17 @@ static void usage(void)
> > {
> > fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
> > "[hugetlbfs_file]\n\n");
> > +
>
> Remove the extra blank line here.
>
> > fprintf(stderr, "Supported <test type>: anon, hugetlb, "
> > "hugetlb_shared, shmem\n\n");
> > +
>
> Remove the extra blank line here.
>
> > + fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
> > + "Supported mods:\n");
> > + fprintf(stderr, "\tdev - Use /dev/userfaultfd instead of userfaultfd(2)\n");
> > + fprintf(stderr, "\nExample test mod usage:\n");
> > + fprintf(stderr, "# Run anonymous memory test with /dev/userfaultfd:\n");
> > + fprintf(stderr, "./userfaultfd anon:dev 100 99999\n\n");
> > +
> > fprintf(stderr, "Examples:\n\n");
> > fprintf(stderr, "%s", examples);
>
> Update examples above with new test cases if any.

Will fix the above comments in v3.

>
> > exit(1);
> > @@ -1610,8 +1619,6 @@ unsigned long default_huge_page_size(void)
> >
> > static void set_test_type(const char *type)
> > {
> > - uint64_t features = UFFD_API_FEATURES;
> > -
> > if (!strcmp(type, "anon")) {
> > test_type = TEST_ANON;
> > uffd_test_ops = &anon_uffd_test_ops;
> > @@ -1631,10 +1638,28 @@ static void set_test_type(const char *type)
> > test_type = TEST_SHMEM;
> > uffd_test_ops = &shmem_uffd_test_ops;
> > test_uffdio_minor = true;
> > - } else {
> > - err("Unknown test type: %s", type);
> > + }
>
> At this point, it might make it so much easier and maintainable if
> we were to use getopt instead of parsing options.

Agreed, I'd like that as well. But, since it's a bigger refactor that
affects all test types, I think it may be cleaner to leave it for a
follow-up series.

>
> > +}
> > +
> > +static void parse_test_type_arg(const char *raw_type)
> > +{
> > + char *buf = strdup(raw_type);
> > + uint64_t features = UFFD_API_FEATURES;
> > +
> > + while (buf) {
> > + const char *token = strsep(&buf, ":");
> > +
> > + if (!test_type)
> > + set_test_type(token);
> > + else if (!strcmp(token, "dev"))
> > + test_dev_userfaultfd = true;
> > + else
> > + err("unrecognized test mod '%s'", token);
> > }
> >
> > + if (!test_type)
> > + err("failed to parse test type argument: '%s'", raw_type);
> > +
> > if (test_type == TEST_HUGETLB)
> > page_size = default_huge_page_size();
> > else
> > @@ -1681,7 +1706,7 @@ int main(int argc, char **argv)
> > err("failed to arm SIGALRM");
> > alarm(ALARM_INTERVAL_SECS);
> >
> > - set_test_type(argv[1]);
> > + parse_test_type_arg(argv[1]);
> >
> > nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
> > nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
> > @@ -1719,12 +1744,6 @@ int main(int argc, char **argv)
> > }
> > printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
> > nr_pages, nr_pages_per_cpu);
> > -
> > - test_dev_userfaultfd = false;
> > - if (userfaultfd_stress())
> > - return 1;
> > -
> > - test_dev_userfaultfd = true;
> > return userfaultfd_stress();
> > }
> >
> >
>
> Same comments as before on fail vs. skip conditions to watch out
> for and report them correctly.

I think in v3 things will be correct. Basically, in the skip cases we
just exit(KSFT_SKIP) directly, instead of relying on the return value
here. I'll take a pass and double check though before sending v3.

>
> thanks,
> -- Shuah
>

2022-05-23 07:04:39

by Axel Rasmussen

Subject: Re: [PATCH v2 4/6] userfaultfd: update documentation to describe /dev/userfaultfd

On Tue, Apr 26, 2022 at 9:46 AM Shuah Khan <[email protected]> wrote:
>
> On 4/22/22 3:29 PM, Axel Rasmussen wrote:
> > Explain the different ways to create a new userfaultfd, and how access
> > control works for each way.
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>
> > ---
> > Documentation/admin-guide/mm/userfaultfd.rst | 38 ++++++++++++++++++--
> > Documentation/admin-guide/sysctl/vm.rst | 3 ++
> > 2 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> > index 6528036093e1..4c079b5377d4 100644
> > --- a/Documentation/admin-guide/mm/userfaultfd.rst
> > +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> > @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
> > Design
> > ======
> >
> > -Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
>
> Please keep this sentence in there and rephrase it to indicate how it was
> done in the past.
>
> Also explain here why this new approach is better than the syscall approach
> before getting into the below details.

Hmm, so I think the old sentence was already incorrect. Notifications
of *the faults* aren't delivered and resolved through the syscall.
Rather, the syscall just gives you a file descriptor, and then
notification / resolution of faults happens through the file
descriptor, not through the syscall. So I think it needs to be
reworded in any case.

I think the overall structure of the doc as-is makes the most sense as
well - first explain how this will be used at a very high level, and
then go into the details (first how to create a userfaultfd, then how
to use it).

So, in the end I reworded the "Creating a userfaultfd" section, to
cover the two things you mentioned:

- Which is the "older" way and which is the "newer" way
- What the benefit of the newer way is

Hopefully this addresses the comment? I can tweak it more if needed.
In any case, thanks for taking a look at this series!

>
> > +Userspace creates a new userfaultfd, initializes it, and registers one or more
> > +regions of virtual memory with it. Then, any page faults which occur within the
> > +region(s) result in a message being delivered to the userfaultfd, notifying
> > +userspace of the fault.
> >
> > The ``userfaultfd`` (aside from registering and unregistering virtual
> > memory ranges) provides two primary functionalities:
> > @@ -39,7 +42,7 @@ Vmas are not suitable for page- (or hugepage) granular fault tracking
> > when dealing with virtual address spaces that could span
> > Terabytes. Too many vmas would be needed for that.
> >
> > -The ``userfaultfd`` once opened by invoking the syscall, can also be
> > +The ``userfaultfd``, once created, can also be
>
> This is sentence is too short and would look odd. Combine the sentences
> so it renders well in the generated doc.

Not 100% sure I understood the concern, but I do think it makes sense
to move "Vmas are not suitable ..." up into the same paragraph with
the other sentence about scalability. I'll do this in v3 as it looks a
bit nicer. This leaves the "The userfaultfd, once created, ..." part
alone, though. I think s/once opened by invoking the syscall/once
created/ is correct, since there are now various ways to create it. I
also think that second comma technically should have been there even
in the previous version.

>
> > passed using unix domain sockets to a manager process, so the same
> > manager process could handle the userfaults of a multitude of
> > different processes without them being aware about what is going on
> > @@ -50,6 +53,37 @@ is a corner case that would currently return ``-EBUSY``).
> > API
> > ===
> >
> > +Creating a userfaultfd
> > +----------------------
> > +
> > +There are two mechanisms to create a userfaultfd. There are various ways to
> > +restrict this too, since userfaultfds which handle kernel page faults have
> > +historically been a useful tool for exploiting the kernel.
> > +
> > +The first is the userfaultfd(2) syscall. Access to this is controlled in several
> > +ways:
> > +
> > +- By default, the userfaultfd will be able to handle kernel page faults. This
> > + can be disabled by passing in UFFD_USER_MODE_ONLY.
> > +
> > +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
> > + CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
> > +
> > +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
> > + use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
> > +
> > +Alternatively, userfaultfds can be created by opening /dev/userfaultfd, and
> > +issuing a USERFAULTFD_IOC_NEW ioctl to this device. Access to this device is
>
> New ioctl? I thought we are moving away from using ioctls?

Hmm, looking at the alternatives [1], I'm not sure I see a viable one:

We could have defined a new "userfaultfdfs" filesystem, but it seems
to me to be overkill for this feature.

We could have used a syscall instead and supported fine-grained access
control with a new capability, but this approach was rejected [2]
generally because we prefer to avoid adding capabilities, and this new
capability's scope (just userfaultfd) was considered too narrow.

So, I'm not sure of another better way to do this. I suppose one could
argue that the dislike of ioctls outweighs the usefulness of this
feature, but to me at least the tradeoff seems worth it. :)

[1]: https://www.kernel.org/doc/html/latest/driver-api/ioctl.html#alternatives-to-ioctl
[2]: https://lkml.org/lkml/2022/2/24/1012

>
> > +controlled via normal filesystem permissions (user/group/mode for example) - no
> > +additional permission (capability/sysctl) is needed to be able to handle kernel
> > +faults this way. This is useful because it allows e.g. a specific user or group
> > +to be able to create kernel-fault-handling userfaultfds, without allowing it
> > +more broadly, or granting more privileges in addition to that particular ability
> > +(CAP_SYS_PTRACE). In other words, it allows permissions to be minimized.
> > +
> > +Initializing up a userfaultfd
> > +------------------------
> > +
>
> This will very likely generate a doc warning - extend the dashes to
> the full length of the subtitle.

I'll fix this in v3.

>
> > When first opened the ``userfaultfd`` must be enabled invoking the
> > ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
> > a later API version) which will specify the ``read/POLLIN`` protocol
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index f4804ce37c58..8682d5fbc8ea 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -880,6 +880,9 @@ calls without any restrictions.
> >
> > The default value is 0.
> >
> > +An alternative to this sysctl / the userfaultfd(2) syscall is to create
> > +userfaultfds via /dev/userfaultfd. See
> > +Documentation/admin-guide/mm/userfaultfd.rst.
> >
> > user_reserve_kbytes
> > ===================
> >
>
> thanks,
> -- Shuah