Hello Andrea, Mike, and all,
Mike: thanks for the page that you sent. I've reworked it
a bit, and also added a lot of further information,
and an example program. In the process, I split the page
into two pieces, with one piece describing the userfaultfd()
system call and the other describing the ioctl() operations.
I'd like to get review input, especially from you and
Andrea, but also anyone else, for the current version
of this page, which includes a few FIXMEs to be sorted.
I've shown the rendered version of the page below.
The groff source is attached, and can also be found
at the branch here:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
The new ioctl_userfaultfd(2) page follows this mail.
Cheers,
Michael
USERFAULTFD(2) Linux Programmer's Manual USERFAULTFD(2)
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Need to describe close(2) semantics for userfaultfd│
│file descriptor: what happens when the userfaultfd │
│FD is closed? │
│ │
└─────────────────────────────────────────────────────┘
NAME
userfaultfd - create a file descriptor for handling page faults
in user space
SYNOPSIS
#include <sys/types.h>
#include <linux/userfaultfd.h>
int userfaultfd(int flags);
Note: There is no glibc wrapper for this system call; see
NOTES.
DESCRIPTION
userfaultfd() creates a new userfaultfd object that can be used
for delegation of page-fault handling to a user-space applica‐
tion, and returns a file descriptor that refers to the new
object. The new userfaultfd object is configured using
ioctl(2).
Once the userfaultfd object is configured, the application can
use read(2) to receive userfaultfd notifications. The reads
from userfaultfd may be blocking or non-blocking, depending on
the value of flags used for the creation of the userfaultfd or
subsequent calls to fcntl(2).
The following values may be bitwise ORed in flags to change the
behavior of userfaultfd():
O_CLOEXEC
Enable the close-on-exec flag for the new userfaultfd
file descriptor. See the description of the O_CLOEXEC
flag in open(2).
O_NONBLOCK
Enable non-blocking operation for the userfaultfd
object. See the description of the O_NONBLOCK flag in
open(2).
Usage
The userfaultfd mechanism is designed to allow a thread in a
multithreaded program to perform user-space paging for the
other threads in the process. When a page fault occurs for one
of the regions registered to the userfaultfd object, the fault‐
ing thread is put to sleep and an event is generated that can
be read via the userfaultfd file descriptor. The fault-han‐
dling thread reads events from this file descriptor and ser‐
vices them using the operations described in ioctl_user‐
faultfd(2). When servicing the page fault events, the fault-
handling thread can trigger a wake-up for the sleeping thread.
Userfaultfd operation
After the userfaultfd object is created with userfaultfd(), the
application must enable it using the UFFDIO_API ioctl(2) opera‐
tion. This operation allows a handshake between the kernel and
user space to determine the API version and supported features.
This operation must be performed before any of the other
ioctl(2) operations described below (or those operations fail
with the EINVAL error).
After a successful UFFDIO_API operation, the application then
registers memory address ranges using the UFFDIO_REGISTER
ioctl(2) operation. After successful completion of a UFF‐
DIO_REGISTER operation, a page fault occurring in the requested
memory range, and satisfying the mode defined at the registra‐
tion time, will be forwarded by the kernel to the user-space
application. The application can then use the UFFDIO_COPY or
UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.
Details of the various ioctl(2) operations can be found in
ioctl_userfaultfd(2).
Currently, userfaultfd can be used only with anonymous private
memory mappings.
Reading from the userfaultfd structure
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│are the details below correct? │
└─────────────────────────────────────────────────────┘
Each read(2) from the userfaultfd file descriptor returns one
or more uffd_msg structures, each of which describes a page-
fault event:
struct uffd_msg {
__u8 event; /* Type of event */
...
union {
struct {
__u64 flags; /* Flags describing fault */
__u64 address; /* Faulting address */
} pagefault;
...
} arg;
/* Padding fields omitted */
} __packed;
If multiple events are available and the supplied buffer is
large enough, read(2) returns as many events as will fit in the
supplied buffer. If the buffer supplied to read(2) is smaller
than the size of the uffd_msg structure, the read(2) fails with
the error EINVAL.
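As a rough sketch of the batching behavior described above, the
following helper reads several events with a single read(2). The
names read_uffd_events() and MAX_EVENTS are invented for this
illustration and are not part of the userfaultfd API.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_EVENTS 16   /* Arbitrary batch size chosen for this sketch */

/* Read up to 'max' events from 'fd' into 'msgs'.
   Returns the number of complete uffd_msg structures read,
   0 on end-of-file, or -1 on error (errno is set by read(2)). */
static ssize_t
read_uffd_events(int fd, struct uffd_msg *msgs, size_t max)
{
    ssize_t nread;

    assert(msgs != NULL && max > 0);

    /* The buffer holds at least one uffd_msg, so the "buffer too
       small" EINVAL case described above cannot be triggered here */
    nread = read(fd, msgs, max * sizeof(struct uffd_msg));
    if (nread <= 0)
        return nread;

    return nread / (ssize_t) sizeof(struct uffd_msg);
}
```

A caller would declare an array of MAX_EVENTS uffd_msg structures,
pass it to this helper, and then loop over the returned number of
entries.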
The fields set in the uffd_msg structure are as follows:
event The type of event. Currently, only one value can appear
in this field: UFFD_EVENT_PAGEFAULT, which indicates a
page-fault event.
address
The address that triggered the page fault.
flags A bit mask of flags that describe the event. For
UFFD_EVENT_PAGEFAULT, the following flag may appear:
UFFD_PAGEFAULT_FLAG_WRITE
If the address is in a range that was registered
with the UFFDIO_REGISTER_MODE_MISSING flag (see
ioctl_userfaultfd(2)) and this flag is set, this is
a write fault; otherwise it is a read fault.
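The flag test described above could be written along the following
lines. This is only a sketch; is_write_fault() is a hypothetical
helper name, not something provided by the kernel headers.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if 'msg' describes a page-fault event caused by a
   write access, false for a read fault or any other event type */
static bool
is_write_fault(const struct uffd_msg *msg)
{
    assert(msg != NULL);

    return msg->event == UFFD_EVENT_PAGEFAULT &&
           (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
}
```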
A read(2) on a userfaultfd file descriptor can fail with the
following errors:
EINVAL The userfaultfd object has not yet been enabled using
the UFFDIO_API ioctl(2) operation.
The userfaultfd file descriptor can be monitored with poll(2),
select(2), and epoll(7). When events are available, the file
descriptor is reported as readable.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│But, it seems, the object must be created with │
│O_NONBLOCK. What is the rationale for this require‐ │
│ment? Something needs to be said in this manual │
│page. │
└─────────────────────────────────────────────────────┘
RETURN VALUE
On success, userfaultfd() returns a new file descriptor that
refers to the userfaultfd object. On error, -1 is returned,
and errno is set appropriately.
ERRORS
EINVAL An unsupported value was specified in flags.
EMFILE The per-process limit on the number of open file
descriptors has been reached.
ENFILE The system-wide limit on the total number of open files
has been reached.
ENOMEM Insufficient kernel memory was available.
VERSIONS
The userfaultfd() system call first appeared in Linux 4.3.
CONFORMING TO
userfaultfd() is Linux-specific and should not be used in pro‐
grams intended to be portable.
NOTES
Glibc does not provide a wrapper for this system call; call it
using syscall(2).
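A minimal local wrapper might look like the following. The name
uffd_create() is made up for this sketch, and the code assumes that
the installed headers define __NR_userfaultfd.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* O_CLOEXEC, O_NONBLOCK */
#include <sys/syscall.h>    /* __NR_userfaultfd */
#include <unistd.h>         /* syscall() */

/* Hypothetical stand-in for the missing glibc wrapper.
   Returns a new file descriptor, or -1 with errno set. */
static int
uffd_create(int flags)
{
    return (int) syscall(__NR_userfaultfd, flags);
}
```

A typical call would be uffd_create(O_CLOEXEC | O_NONBLOCK), followed
by a check for a return value of -1.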
The userfaultfd mechanism can be used as an alternative to tra‐
ditional user-space paging techniques based on the use of the
SIGSEGV signal and mmap(2). It can also be used to implement
lazy restore for checkpoint/restore mechanisms, as well as
post-copy migration to allow (nearly) uninterrupted execution
when transferring virtual machines from one host to another.
EXAMPLE
The program below demonstrates the use of the userfaultfd
mechanism. The program uses two threads, one of which acts as
the page-fault handler for the process, servicing faults on the
pages of a demand-zero paged region created using mmap(2).
The program takes one command-line argument, which is the num‐
ber of pages that will be created in a mapping whose page
faults will be handled via userfaultfd. After creating a user‐
faultfd object, the program then creates an anonymous private
mapping of the specified size and registers the address range
of that mapping using the UFFDIO_REGISTER ioctl(2) operation.
The program then creates a second thread that will perform the
task of handling page faults.
The main thread then walks through the pages of the mapping
fetching bytes from successive pages. Because the pages have
not yet been accessed, the first access of a byte in each page
will trigger a page-fault event on the userfaultfd file
descriptor.
Each of the page-fault events is handled by the second thread,
which sits in a loop processing input from the userfaultfd file
descriptor. In each loop iteration, the second thread first
calls poll(2) to check the state of the file descriptor, and
then reads an event from the file descriptor. All such events
should be UFFD_EVENT_PAGEFAULT events, which the thread handles
by copying a page of data into the faulting region using the
UFFDIO_COPY ioctl(2) operation.
The following is an example of what we see when running the
program:
$ ./userfaultfd_demo 3
Address returned by mmap() = 0x7fd30106c000
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106c00f in main(): A
Read address 0x7fd30106c40f in main(): A
Read address 0x7fd30106c80f in main(): A
Read address 0x7fd30106cc0f in main(): A
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106d00f in main(): B
Read address 0x7fd30106d40f in main(): B
Read address 0x7fd30106d80f in main(): B
Read address 0x7fd30106dc0f in main(): B
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106e00f in main(): C
Read address 0x7fd30106e40f in main(): C
Read address 0x7fd30106e80f in main(): C
Read address 0x7fd30106ec0f in main(): C
Program source
/* userfaultfd_demo.c
Licensed under the GNU General Public License version 2 or later.
*/
#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
static int page_size;
static void *
fault_handler_thread(void *arg)
{
static struct uffd_msg msg; /* Data read from userfaultfd */
static int fault_cnt = 0; /* Number of faults so far handled */
long uffd; /* userfaultfd file descriptor */
static char *page = NULL;
struct uffdio_copy uffdio_copy;
ssize_t nread;
uffd = (long) arg;
/* Create a page that will be copied into the faulting region */
if (page == NULL) {
page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (page == MAP_FAILED)
errExit("mmap");
}
/* Loop, handling incoming events on the userfaultfd
file descriptor */
for (;;) {
/* See what poll() tells us about the userfaultfd */
struct pollfd pollfd;
int nready;
pollfd.fd = uffd;
pollfd.events = POLLIN;
nready = poll(&pollfd, 1, -1);
if (nready == -1)
errExit("poll");
printf("\nfault_handler_thread():\n");
printf(" poll() returns: nready = %d; "
"POLLIN = %d; POLLERR = %d\n", nready,
(pollfd.revents & POLLIN) != 0,
(pollfd.revents & POLLERR) != 0);
/* Read an event from the userfaultfd */
nread = read(uffd, &msg, sizeof(msg));
if (nread == 0) {
printf("EOF on userfaultfd!\n");
exit(EXIT_FAILURE);
}
if (nread == -1)
errExit("read");
/* We expect only one kind of event; verify that assumption */
if (msg.event != UFFD_EVENT_PAGEFAULT) {
fprintf(stderr, "Unexpected event on userfaultfd\n");
exit(EXIT_FAILURE);
}
/* Display info about the page-fault event */
printf(" UFFD_EVENT_PAGEFAULT event: ");
printf("flags = %llx; ", msg.arg.pagefault.flags);
printf("address = %llx\n", msg.arg.pagefault.address);
/* Copy the page pointed to by 'page' into the faulting
region. Vary the contents that are copied in, so that it
is more obvious that each fault is handled separately. */
memset(page, 'A' + fault_cnt % 20, page_size);
fault_cnt++;
uffdio_copy.src = (unsigned long) page;
/* We need to handle page faults in units of pages(!).
So, round faulting address down to page boundary */
uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
~(page_size - 1);
uffdio_copy.len = page_size;
uffdio_copy.mode = 0;
uffdio_copy.copy = 0;
if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
errExit("ioctl-UFFDIO_COPY");
printf(" (uffdio_copy.copy returned %lld)\n",
uffdio_copy.copy);
}
}
int
main(int argc, char *argv[])
{
long uffd; /* userfaultfd file descriptor */
char *addr; /* Start of region handled by userfaultfd */
unsigned long len; /* Length of region handled by userfaultfd */
pthread_t thr; /* ID of thread that handles page faults */
struct uffdio_api uffdio_api;
struct uffdio_register uffdio_register;
int s;
if (argc != 2) {
fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
exit(EXIT_FAILURE);
}
page_size = sysconf(_SC_PAGE_SIZE);
len = strtoul(argv[1], NULL, 0) * page_size;
/* Create and enable userfaultfd object */
uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
if (uffd == -1)
errExit("userfaultfd");
uffdio_api.api = UFFD_API;
uffdio_api.features = 0;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
errExit("ioctl-UFFDIO_API");
/* Create a private anonymous mapping. The memory will be
demand-zero paged--that is, not yet allocated. When we
actually touch the memory, it will be allocated via
the userfaultfd. */
addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (addr == MAP_FAILED)
errExit("mmap");
printf("Address returned by mmap() = %p\n", addr);
/* Register the memory range of the mapping we just created for
handling by the userfaultfd object. In mode, we request to track
missing pages (i.e., pages that have not yet been faulted in). */
uffdio_register.range.start = (unsigned long) addr;
uffdio_register.range.len = len;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
errExit("ioctl-UFFDIO_REGISTER");
/* Create a thread that will process the userfaultfd events */
s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
if (s != 0) {
errno = s;
errExit("pthread_create");
}
/* Main thread now touches memory in the mapping, touching
locations 1024 bytes apart. This will trigger userfaultfd
events for all pages in the region. */
unsigned long l;
l = 0xf; /* Ensure that faulting address is not on a page
boundary, in order to test that we correctly
handle that case in fault_handler_thread() */
while (l < len) {
char c = addr[l];
printf("Read address %p in main(): ", addr + l);
printf("%c\n", c);
l += 1024;
usleep(100000); /* Slow things down a little */
}
exit(EXIT_SUCCESS);
}
SEE ALSO
fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)
Documentation/vm/userfaultfd.txt in the Linux kernel source
tree
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello Andrea, Mike, and all,
Mike: here's the split out page that describes the
userfaultfd ioctl() operations.
I'd like to get review input, especially from you and
Andrea, but also anyone else, for the current version
of this page, which includes quite a few FIXMEs to be
sorted.
I've shown the rendered version of the page below.
The groff source is attached, and can also be found
at the branch here:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
The new ioctl_userfaultfd(2) page follows this mail.
Cheers,
Michael
NAME
ioctl_userfaultfd - create a file descriptor for handling page faults in
user space
SYNOPSIS
#include <sys/ioctl.h>
int ioctl(int fd, int cmd, ...);
DESCRIPTION
Various ioctl(2) operations can be performed on a userfaultfd object
(created by a call to userfaultfd(2)) using calls of the form:
ioctl(fd, cmd, argp);
In the above, fd is a file descriptor referring to a userfaultfd
object, cmd is one of the commands listed below, and argp is a pointer
to a data structure that is specific to cmd.
The various ioctl(2) operations are described below. The UFFDIO_API,
UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
userfaultfd behavior. These operations allow the caller to choose what
features will be enabled and what kinds of events will be delivered to
the application. The remaining operations are range operations. These
operations enable the calling application to resolve page-fault events
in a consistent way.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Above: What does "consistent" mean? │
│ │
└─────────────────────────────────────────────────────┘
UFFDIO_API
(Since Linux 4.3.) Enable operation of the userfaultfd and perform API
handshake. The argp argument is a pointer to a uffdio_api structure,
defined as:
struct uffdio_api {
__u64 api; /* Requested API version (input) */
__u64 features; /* Must be zero */
__u64 ioctls; /* Available ioctl() operations (output) */
};
The api field denotes the API version requested by the application.
Before the call, the features field must be initialized to zero.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Above: Why must the 'features' field be initialized │
│to zero? │
└─────────────────────────────────────────────────────┘
The kernel verifies that it can support the requested API version, and
sets the features and ioctls fields to bit masks representing all the
available features and the generic ioctl(2) operations available. Cur‐
rently, zero (i.e., no feature bits) is placed in the features field.
The returned ioctls field can contain the following bits:
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│This user-space API seems not fully polished. Why │
│are there not constants defined for each of the bit- │
│mask values listed below? │
└─────────────────────────────────────────────────────┘
1 << _UFFDIO_API
The UFFDIO_API operation is supported.
1 << _UFFDIO_REGISTER
The UFFDIO_REGISTER operation is supported.
1 << _UFFDIO_UNREGISTER
The UFFDIO_UNREGISTER operation is supported.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Is the above description of the 'ioctls' field cor‐ │
│rect? Does more need to be said? │
│ │
└─────────────────────────────────────────────────────┘
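Pending dedicated constants, a test of the returned ioctls mask could
be sketched as below. The helper name uffd_ioctl_supported() is an
assumption of this example. The shift is done in 64 bits because
_UFFDIO_API is defined as 0x3F in <linux/userfaultfd.h>, which is too
large for a shift of a plain int.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>

/* Return true if the bit for operation number 'nr' (one of the
   _UFFDIO_* values) is set in the 'ioctls' mask returned by the
   kernel */
static bool
uffd_ioctl_supported(__u64 ioctls, unsigned int nr)
{
    return nr < 64 && (ioctls & ((__u64) 1 << nr)) != 0;
}
```

After a successful UFFDIO_API call, a caller could check, for
example, uffd_ioctl_supported(uffdio_api.ioctls, _UFFDIO_REGISTER).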
This ioctl(2) operation returns 0 on success. On error, -1 is returned
and errno is set to indicate the cause of the error. Possible errors
include:
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Is the following error list correct? │
│ │
└─────────────────────────────────────────────────────┘
EINVAL The userfaultfd has already been enabled by a previous UFF‐
DIO_API operation.
EINVAL The API version requested in the api field is not supported by
this kernel, or the features field was not zero.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│In the above error case, the returned 'uffdio_api' │
│structure is zeroed out. Why is this done? This │
│should be explained in the manual page. │
│ │
└─────────────────────────────────────────────────────┘
UFFDIO_REGISTER
(Since Linux 4.3.) Register a memory address range with the user‐
faultfd object. The argp argument is a pointer to a uffdio_register
structure, defined as:
struct uffdio_range {
__u64 start; /* Start of range */
__u64 len; /* Length of range (bytes) */
};
struct uffdio_register {
struct uffdio_range range;
__u64 mode; /* Desired mode of operation (input) */
__u64 ioctls; /* Available ioctl() operations (output) */
};
The range field defines a memory range starting at start and continuing
for len bytes that should be handled by the userfaultfd.
The mode field defines the mode of operation desired for this memory
region. The following values may be bitwise ORed to set the user‐
faultfd mode for the specified range:
UFFDIO_REGISTER_MODE_MISSING
Track page faults on missing pages.
UFFDIO_REGISTER_MODE_WP
Track page faults on write-protected pages.
Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
If the operation is successful, the kernel modifies the ioctls bit-mask
field to indicate which ioctl(2) operations are available for the spec‐
ified range. This returned bit mask is as for UFFDIO_API.
This ioctl(2) operation returns 0 on success. On error, -1 is returned
and errno is set to indicate the cause of the error. Possible errors
include:
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Is the following error list correct? │
│ │
└─────────────────────────────────────────────────────┘
EBUSY A mapping in the specified range is registered with another
userfaultfd object.
EINVAL An invalid or unsupported bit was specified in the mode field;
or the mode field was zero.
EINVAL There is no mapping in the specified address range.
EINVAL range.start or range.len is not a multiple of the system page
size; or, range.len is zero; or these fields are otherwise
invalid.
EINVAL There was an incompatible mapping in the specified address range.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Above: What does "incompatible" mean? │
│ │
└─────────────────────────────────────────────────────┘
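The alignment and length conditions listed above can be checked in
user space before calling ioctl(2). The following is a sketch only:
uffd_range_is_valid() is a hypothetical helper, and it covers just the
page-size and zero-length rules, not the "otherwise invalid" cases,
which only the kernel can decide.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if 'start' and 'len' are both multiples of
   'page_size' and 'len' is nonzero */
static bool
uffd_range_is_valid(unsigned long long start, unsigned long long len,
                    unsigned long long page_size)
{
    unsigned long long mask;

    /* The mask trick below requires a power-of-two page size */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    mask = page_size - 1;
    return len != 0 && (start & mask) == 0 && (len & mask) == 0;
}
```

The page size would normally be obtained with
sysconf(_SC_PAGE_SIZE), as in the example program in userfaultfd(2).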
UFFDIO_UNREGISTER
(Since Linux 4.3.) Unregister a memory address range from userfaultfd.
The address range to unregister is specified in the uffdio_range struc‐
ture pointed to by argp.
This ioctl(2) operation returns 0 on success. On error, -1 is returned
and errno is set to indicate the cause of the error. Possible errors
include:
EINVAL Either the start or the len field of the uffdio_range structure
was not a multiple of the system page size; or the len field was
zero; or these fields were otherwise invalid.
EINVAL There was an incompatible mapping in the specified address range.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Above: What does "incompatible" mean? │
└─────────────────────────────────────────────────────┘
EINVAL There was no mapping in the specified address range.
UFFDIO_COPY
(Since Linux 4.3.) Atomically copy a contiguous memory chunk into the
userfault registered range and optionally wake up the blocked thread.
The source and destination addresses and the number of bytes to copy
are specified by the src, dst, and len fields of the uffdio_copy struc‐
ture pointed to by argp:
struct uffdio_copy {
__u64 dst; /* Destination of copy */
__u64 src; /* Source of copy */
__u64 len; /* Number of bytes to copy */
__u64 mode; /* Flags controlling behavior of copy */
__s64 copy; /* Number of bytes copied, or negated error */
};
The following value may be bitwise ORed in mode to change the behavior
of the UFFDIO_COPY operation:
UFFDIO_COPY_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
The copy field is used by the kernel to return the number of bytes that
was actually copied, or an error (a negated errno-style value).
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Above: Why is the 'copy' field used to return error │
│values? This should be explained in the manual │
│page. │
└─────────────────────────────────────────────────────┘
If the value returned in copy doesn't match the value that was speci‐
fied in len, the operation fails with the error EAGAIN. The copy field
is output-only; it is not read by the UFFDIO_COPY operation.
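Based on the description above, a caller might interpret the copy
field roughly as follows. The enum and the function name are invented
for this sketch. The same logic would apply to the zeropage field of
UFFDIO_ZEROPAGE.

```c
#include <linux/userfaultfd.h>

enum copy_outcome {
    COPY_DONE,      /* The whole requested length was processed */
    COPY_PARTIAL,   /* Fewer bytes than requested were processed */
    COPY_ERROR      /* The field holds a negated errno-style value */
};

/* Classify the value the kernel stored in the output field, given
   the length that was requested */
static enum copy_outcome
classify_copy_field(__s64 field, __u64 len)
{
    if (field < 0)
        return COPY_ERROR;
    if ((__u64) field == len)
        return COPY_DONE;
    return COPY_PARTIAL;
}
```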
This ioctl(2) operation returns 0 on success. In this case, the entire
area was copied. On error, -1 is returned and errno is set to indicate
the cause of the error. Possible errors include:
EAGAIN The number of bytes copied (i.e., the value returned in the copy
field) does not equal the value that was specified in the len
field.
EINVAL Either dst or len was not a multiple of the system page size, or
the range specified by src and len or dst and len was invalid.
EINVAL An invalid bit was specified in the mode field.
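To satisfy the alignment rule above, the faulting address usually has
to be rounded down to a page boundary before it is used as dst. The
sketch below fills in a uffdio_copy structure that way; prepare_copy()
is a hypothetical name, and page_size is assumed to be a power of two.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>

/* Fill 'req' so that one page at 'src_page' is copied over the page
   containing 'fault_addr' */
static void
prepare_copy(struct uffdio_copy *req, __u64 fault_addr,
             const void *src_page, __u64 page_size, bool dontwake)
{
    req->dst = fault_addr & ~(page_size - 1);  /* Round down to page start */
    req->src = (__u64) (unsigned long) src_page;
    req->len = page_size;
    req->mode = dontwake ? UFFDIO_COPY_MODE_DONTWAKE : 0;
    req->copy = 0;                             /* Output field */
}
```

The filled-in structure would then be passed to
ioctl(uffd, UFFDIO_COPY, &req).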
UFFDIO_ZEROPAGE
(Since Linux 4.3.) Zero out a memory range registered with user‐
faultfd. The requested range is specified by the range field of the
uffdio_zeropage structure pointed to by argp:
struct uffdio_zeropage {
struct uffdio_range range;
__u64 mode; /* Flags controlling behavior of operation */
__s64 zeropage; /* Number of bytes zeroed, or negated error */
};
The following value may be bitwise ORed in mode to change the behavior
of the UFFDIO_ZEROPAGE operation:
UFFDIO_ZEROPAGE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
The zeropage field is used by the kernel to return the number of bytes
that was actually zeroed, or an error in the same manner as UFF‐
DIO_COPY.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Why is the 'zeropage' field used to return error │
│values? This should be explained in the manual │
│page. │
└─────────────────────────────────────────────────────┘
If the value returned in the zeropage field doesn't match the value
that was specified in range.len, the operation fails with the error
EAGAIN. The zeropage field is output-only; it is not read by the
UFFDIO_ZEROPAGE operation.
This ioctl(2) operation returns 0 on success. In this case, the entire
area was zeroed. On error, -1 is returned and errno is set to indicate
the cause of the error. Possible errors include:
EAGAIN The number of bytes zeroed (i.e., the value returned in the
zeropage field) does not equal the value that was specified in
the range.len field.
EINVAL Either range.start or range.len was not a multiple of the system
page size; or range.len was zero; or the range specified was
invalid.
EINVAL An invalid bit was specified in the mode field.
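A call to this operation might be wrapped as shown below. This is a
sketch under the assumptions stated in the comments; uffd_zero_range()
is not an existing function.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Ask the kernel to zero-fill [start, start + len) in a registered
   range. Returns the ioctl(2) result (0 or -1). If 'zeroed' is not
   NULL, it receives the value of the zeropage output field, which
   stays 0 if the kernel never wrote to it. */
static int
uffd_zero_range(int uffd, unsigned long start, unsigned long len,
                long long *zeroed)
{
    struct uffdio_zeropage zp;
    int ret;

    zp.range.start = start;
    zp.range.len = len;
    zp.mode = 0;        /* No DONTWAKE flag: wake the waiting thread */
    zp.zeropage = 0;

    ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
    if (zeroed != NULL)
        *zeroed = zp.zeropage;
    return ret;
}
```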
UFFDIO_WAKE
(Since Linux 4.3.) Wake up the thread waiting for page-fault resolu‐
tion on a specified memory address range. The argp argument is a
pointer to a uffdio_range structure (shown above) that specifies the
address range.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Need more detail here. How is the UFFDIO_WAKE opera‐ │
│tion used? │
└─────────────────────────────────────────────────────┘
This ioctl(2) operation returns 0 on success. On error, -1 is returned
and errno is set to indicate the cause of the error. Possible errors
include:
EINVAL The start or the len field of the uffdio_range structure was not
a multiple of the system page size; or len was zero; or the
specified range was otherwise invalid.
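One plausible use, offered here as an assumption rather than as
confirmed guidance, is to resolve several pages with
UFFDIO_COPY_MODE_DONTWAKE or UFFDIO_ZEROPAGE_MODE_DONTWAKE and then
issue a single wake-up for the whole range. The wake-up step could be
sketched as follows; uffd_wake_range() is a hypothetical helper.

```c
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Wake threads waiting on faults in [start, start + len).
   Returns the ioctl(2) result: 0 on success, -1 on error. */
static int
uffd_wake_range(int uffd, unsigned long start, unsigned long len)
{
    struct uffdio_range range;

    range.start = start;    /* Must be page-aligned */
    range.len = len;        /* Must be a nonzero multiple of page size */

    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```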
RETURN VALUE
See descriptions of the individual operations, above.
ERRORS
See descriptions of the individual operations, above. In addition, the
following general errors can occur for all of the operations described
above:
EFAULT argp does not point to a valid memory address.
EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
has not yet been enabled (via the UFFDIO_API operation).
CONFORMING TO
These ioctl(2) operations are Linux-specific.
EXAMPLE
See userfaultfd(2).
SEE ALSO
ioctl(2), mmap(2), userfaultfd(2)
Documentation/vm/userfaultfd.txt in the Linux kernel source tree
Hello Michael,
On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
>
> Mike: thanks for the page that you sent. I've reworked it
> a bit, and also added a lot of further information,
> and an example program. In the process, I split the page
> into two pieces, with one piece describing the userfaultfd()
> system call and the other describing the ioctl() operations.
>
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes a few FIXMEs to be sorted.
Thanks for the update. I'm addressing the FIXME points you've mentioned
below.
Otherwise, everything seems to be an accurate description of the current
upstream.
4.11 will have quite a few updates to userfault and we'll need to update
this page and ioctl_userfaultfd(2) to address those updates. I am planning
to work on the man update in the next few weeks.
> I've shown the rendered version of the page below.
> The groff source is attached, and can also be found
> at the branch here:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>
> The new ioctl_userfaultfd(2) page follows this mail.
>
> Cheers,
>
> Michael
--
Sincerely yours,
Mike.
> USERFAULTFD(2) Linux Programmer's Manual USERFAULTFD(2)
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Need to describe close(2) semantics for userfaultfd│
> │file descriptor: what happens when the userfaultfd │
> │FD is closed? │
> │ │
> └─────────────────────────────────────────────────────┘
When userfaultfd is closed, it unregisters all memory ranges that were
previously registered with it and flushes the outstanding page fault
events.
> NAME
> userfaultfd - create a file descriptor for handling page faults
> in user space
>
> SYNOPSIS
> #include <sys/types.h>
> #include <linux/userfaultfd.h>
>
> int userfaultfd(int flags);
>
> Note: There is no glibc wrapper for this system call; see
> NOTES.
>
> DESCRIPTION
> userfaultfd() creates a new userfaultfd object that can be used
> for delegation of page-fault handling to a user-space applica‐
> tion, and returns a file descriptor that refers to the new
> object. The new userfaultfd object is configured using
> ioctl(2).
>
> Once the userfaultfd object is configured, the application can
> use read(2) to receive userfaultfd notifications. The reads
> from userfaultfd may be blocking or non-blocking, depending on
> the value of flags used for the creation of the userfaultfd or
> subsequent calls to fcntl(2).
>
> The following values may be bitwise ORed in flags to change the
> behavior of userfaultfd():
>
> O_CLOEXEC
> Enable the close-on-exec flag for the new userfaultfd
> file descriptor. See the description of the O_CLOEXEC
> flag in open(2).
>
> O_NONBLOCK
> Enables non-blocking operation for the userfaultfd
> object. See the description of the O_NONBLOCK flag in
> open(2).
>
> Usage
> The userfaultfd mechanism is designed to allow a thread in a
> multithreaded program to perform user-space paging for the
> other threads in the process. When a page fault occurs for one
> of the regions registered to the userfaultfd object, the fault‐
> ing thread is put to sleep and an event is generated that can
> be read via the userfaultfd file descriptor. The fault-han‐
> dling thread reads events from this file descriptor and ser‐
> vices them using the operations described in ioctl_user‐
> faultfd(2). When servicing the page fault events, the fault-
> handling thread can trigger a wake-up for the sleeping thread.
>
> Userfaultfd operation
> After the userfaultfd object is created with userfaultfd(), the
> application must enable it using the UFFDIO_API ioctl(2) opera‐
> tion. This operation allows a handshake between the kernel and
> user space to determine the API version and supported features.
> This operation must be performed before any of the other
> ioctl(2) operations described below (or those operations fail
> with the EINVAL error).
>
> After a successful UFFDIO_API operation, the application then
> registers memory address ranges using the UFFDIO_REGISTER
> ioctl(2) operation. After successful completion of a UFF‐
> DIO_REGISTER operation, a page fault occurring in the requested
> memory range, and satisfying the mode defined at the registra‐
> tion time, will be forwarded by the kernel to the user-space
> application. The application can then use the UFFDIO_COPY or
> UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.
>
> Details of the various ioctl(2) operations can be found in
> ioctl_userfaultfd(2).
>
> Currently, userfaultfd can be used only with anonymous private
> memory mappings.
>
> Reading from the userfaultfd structure
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │are the details below correct? │
> └─────────────────────────────────────────────────────┘
Yes, at least for the current upstream version. 4.11 will have quite a few
updates to userfaultfd.
> Each read(2) from the userfaultfd file descriptor returns one
> or more uffd_msg structures, each of which describes a page-
> fault event:
>
> struct uffd_msg {
> __u8 event; /* Type of event */
> ...
> union {
> struct {
> __u64 flags; /* Flags describing fault */
> __u64 address; /* Faulting address */
> } pagefault;
> ...
> } arg;
>
> /* Padding fields omitted */
> } __packed;
>
> If multiple events are available and the supplied buffer is
> large enough, read(2) returns as many events as will fit in the
> supplied buffer. If the buffer supplied to read(2) is smaller
> than the size of the uffd_msg structure, the read(2) fails with
> the error EINVAL.
>
> The fields set in the uffd_msg structure are as follows:
>
> event The type of event. Currently, only one value can appear
> in this field: UFFD_EVENT_PAGEFAULT, which indicates a
> page-fault event.
>
> address
> The address that triggered the page fault.
>
> flags A bit mask of flags that describe the event. For
> UFFD_EVENT_PAGEFAULT, the following flag may appear:
>
> UFFD_PAGEFAULT_FLAG_WRITE
> If the address is in a range that was registered
> with the UFFDIO_REGISTER_MODE_MISSING flag (see
> ioctl_userfaultfd(2)) and this flag is set, this is
> a write fault; otherwise it is a read fault.
>
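As an illustration of the read(2) semantics described above, here is a minimal sketch of walking a buffer that may hold several uffd_msg structures. The helper name count_write_faults() is hypothetical; only the structure fields and constants quoted above are assumed to exist in <linux/userfaultfd.h>.

```c
#include <stddef.h>
#include <sys/types.h>
#include <linux/userfaultfd.h>

/* Walk the 'nread' bytes that read(2) placed in 'msgs' and count the
   page-fault events that were write faults. A non-positive 'nread'
   (error or EOF) yields 0. The function name is made up for this
   sketch. */
static size_t
count_write_faults(const struct uffd_msg *msgs, ssize_t nread)
{
    size_t nmsgs, i, nwrite = 0;

    if (nread <= 0)
        return 0;

    /* read(2) returns whole structures, so this division is exact */
    nmsgs = (size_t) nread / sizeof(msgs[0]);

    for (i = 0; i < nmsgs; i++) {
        if (msgs[i].event == UFFD_EVENT_PAGEFAULT &&
            (msgs[i].arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE))
            nwrite++;
    }
    return nwrite;
}
```

A handler would pass the return value of read(uffd, msgs, sizeof(msgs)) as nread, where msgs is an array of struct uffd_msg.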
> A read(2) on a userfaultfd file descriptor can fail with the
> following errors:
>
> EINVAL The userfaultfd object has not yet been enabled using
> the UFFDIO_API ioctl(2) operation.
>
> The userfaultfd file descriptor can be monitored with poll(2),
> select(2), and epoll(7). When events are available, the file
> descriptor indicates as readable.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │But, it seems, the object must be created with │
> │O_NONBLOCK. What is the rationale for this require‐ │
> │ment? Something needs to be said in this manual │
> │page. │
> └─────────────────────────────────────────────────────┘
The object can be created without O_NONBLOCK, so probably the above
sentence can be rephrased as:
When the userfaultfd file descriptor is opened in non-blocking mode, it can
be monitored with ...
> RETURN VALUE
> On success, userfaultfd() returns a new file descriptor that
> refers to the userfaultfd object. On error, -1 is returned,
> and errno is set appropriately.
>
> ERRORS
> EINVAL An unsupported value was specified in flags.
>
> EMFILE The per-process limit on the number of open file
> descriptors has been reached.
>
> ENFILE The system-wide limit on the total number of open files
> has been reached.
>
> ENOMEM Insufficient kernel memory was available.
>
> VERSIONS
> The userfaultfd() system call first appeared in Linux 4.3.
>
> CONFORMING TO
> userfaultfd() is Linux-specific and should not be used in pro‐
> grams intended to be portable.
>
> NOTES
> Glibc does not provide a wrapper for this system call; call it
> using syscall(2).
>
> The userfaultfd mechanism can be used as an alternative to tra‐
> ditional user-space paging techniques based on the use of the
> SIGSEGV signal and mmap(2). It can also be used to implement
> lazy restore for checkpoint/restore mechanisms, as well as
> post-copy migration to allow (nearly) uninterrupted execution
> when transferring virtual machines from one host to another.
>
> EXAMPLE
> The program below demonstrates the use of the userfaultfd mech‐
> anism. The program creates two threads, one of which acts as
> the page-fault handler for the process, for the pages in a
> demand-page zero region created using mmap(2).
>
> The program takes one command-line argument, which is the num‐
> ber of pages that will be created in a mapping whose page
> faults will be handled via userfaultfd. After creating a user‐
> faultfd object, the program then creates an anonymous private
> mapping of the specified size and registers the address range
> of that mapping using the UFFDIO_REGISTER ioctl(2) operation.
> The program then creates a second thread that will perform the
> task of handling page faults.
>
> The main thread then walks through the pages of the mapping
> fetching bytes from successive pages. Because the pages have
> not yet been accessed, the first access of a byte in each page
> will trigger a page-fault event on the userfaultfd file
> descriptor.
>
> Each of the page-fault events is handled by the second thread,
> which sits in a loop processing input from the userfaultfd file
> descriptor. In each loop iteration, the second thread first
> calls poll(2) to check the state of the file descriptor, and
> then reads an event from the file descriptor. All such events
> should be UFFD_EVENT_PAGEFAULT events, which the thread handles
> by copying a page of data into the faulting region using the
> UFFDIO_COPY ioctl(2) operation.
>
> The following is an example of what we see when running the
> program:
>
> $ ./userfaultfd_demo 3
> Address returned by mmap() = 0x7fd30106c000
>
> fault_handler_thread():
> poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
> UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
> (uffdio_copy.copy returned 4096)
> Read address 0x7fd30106c00f in main(): A
> Read address 0x7fd30106c40f in main(): A
> Read address 0x7fd30106c80f in main(): A
> Read address 0x7fd30106cc0f in main(): A
>
> fault_handler_thread():
> poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
> UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
> (uffdio_copy.copy returned 4096)
> Read address 0x7fd30106d00f in main(): B
> Read address 0x7fd30106d40f in main(): B
> Read address 0x7fd30106d80f in main(): B
> Read address 0x7fd30106dc0f in main(): B
>
> fault_handler_thread():
> poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
> UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
> (uffdio_copy.copy returned 4096)
> Read address 0x7fd30106e00f in main(): C
> Read address 0x7fd30106e40f in main(): C
> Read address 0x7fd30106e80f in main(): C
> Read address 0x7fd30106ec0f in main(): C
>
> Program source
>
> /* userfaultfd_demo.c
>
> Licensed under the GNU General Public License version 2 or later.
> */
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <stdio.h>
> #include <linux/userfaultfd.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <poll.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)
>
> static int page_size;
>
> static void *
> fault_handler_thread(void *arg)
> {
> static struct uffd_msg msg; /* Data read from userfaultfd */
> static int fault_cnt = 0; /* Number of faults so far handled */
> long uffd; /* userfaultfd file descriptor */
> static char *page = NULL;
> struct uffdio_copy uffdio_copy;
> ssize_t nread;
>
> uffd = (long) arg;
>
> /* Create a page that will be copied into the faulting region */
>
> if (page == NULL) {
> page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> if (page == MAP_FAILED)
> errExit("mmap");
> }
>
> /* Loop, handling incoming events on the userfaultfd
> file descriptor */
>
> for (;;) {
>
> /* See what poll() tells us about the userfaultfd */
>
> struct pollfd pollfd;
> int nready;
> pollfd.fd = uffd;
> pollfd.events = POLLIN;
> nready = poll(&pollfd, 1, -1);
> if (nready == -1)
> errExit("poll");
>
> printf("\nfault_handler_thread():\n");
> printf(" poll() returns: nready = %d; "
> "POLLIN = %d; POLLERR = %d\n", nready,
> (pollfd.revents & POLLIN) != 0,
> (pollfd.revents & POLLERR) != 0);
>
> /* Read an event from the userfaultfd */
>
> nread = read(uffd, &msg, sizeof(msg));
> if (nread == 0) {
> printf("EOF on userfaultfd!\n");
> exit(EXIT_FAILURE);
> }
>
> if (nread == -1)
> errExit("read");
>
> /* We expect only one kind of event; verify that assumption */
>
> if (msg.event != UFFD_EVENT_PAGEFAULT) {
> fprintf(stderr, "Unexpected event on userfaultfd\n");
> exit(EXIT_FAILURE);
> }
>
> /* Display info about the page-fault event */
>
> printf(" UFFD_EVENT_PAGEFAULT event: ");
> printf("flags = %llx; ", msg.arg.pagefault.flags);
> printf("address = %llx\n", msg.arg.pagefault.address);
>
> /* Copy the page pointed to by 'page' into the faulting
> region. Vary the contents that are copied in, so that it
> is more obvious that each fault is handled separately. */
>
> memset(page, 'A' + fault_cnt % 20, page_size);
> fault_cnt++;
>
> uffdio_copy.src = (unsigned long) page;
>
> /* We need to handle page faults in units of pages(!).
> So, round faulting address down to page boundary */
>
> uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
> ~(page_size - 1);
> uffdio_copy.len = page_size;
> uffdio_copy.mode = 0;
> uffdio_copy.copy = 0;
> if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
> errExit("ioctl-UFFDIO_COPY");
>
> printf(" (uffdio_copy.copy returned %lld)\n",
> uffdio_copy.copy);
> }
> }
>
> int
> main(int argc, char *argv[])
> {
> long uffd; /* userfaultfd file descriptor */
> char *addr; /* Start of region handled by userfaultfd */
> unsigned long len; /* Length of region handled by userfaultfd */
> pthread_t thr; /* ID of thread that handles page faults */
> struct uffdio_api uffdio_api;
> struct uffdio_register uffdio_register;
> int s;
>
> if (argc != 2) {
> fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
> exit(EXIT_FAILURE);
> }
>
> page_size = sysconf(_SC_PAGE_SIZE);
> len = strtoul(argv[1], NULL, 0) * page_size;
>
> /* Create and enable userfaultfd object */
>
> uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> if (uffd == -1)
> errExit("userfaultfd");
>
> uffdio_api.api = UFFD_API;
> uffdio_api.features = 0;
> if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
> errExit("ioctl-UFFDIO_API");
>
> /* Create a private anonymous mapping. The memory will be
> demand-zero paged--that is, not yet allocated. When we
> actually touch the memory, it will be allocated via
> the userfaultfd. */
>
> addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> if (addr == MAP_FAILED)
> errExit("mmap");
>
> printf("Address returned by mmap() = %p\n", addr);
>
> /* Register the memory range of the mapping we just created for
> handling by the userfaultfd object. In mode, we request to track
> missing pages (i.e., pages that have not yet been faulted in). */
>
> uffdio_register.range.start = (unsigned long) addr;
> uffdio_register.range.len = len;
> uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
> errExit("ioctl-UFFDIO_REGISTER");
>
> /* Create a thread that will process the userfaultfd events */
>
> s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
> if (s != 0) {
> errno = s;
> errExit("pthread_create");
> }
>
> /* Main thread now touches memory in the mapping, touching
> locations 1024 bytes apart. This will trigger userfaultfd
> events for all pages in the region. */
>
> int l;
> l = 0xf; /* Ensure that faulting address is not on a page
> boundary, in order to test that we correctly
> handle that case in fault_handler_thread() */
> while (l < len) {
> char c = addr[l];
> printf("Read address %p in main(): ", addr + l);
> printf("%c\n", c);
> l += 1024;
> usleep(100000); /* Slow things down a little */
> }
>
> exit(EXIT_SUCCESS);
> }
>
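The one step in the demo that is easy to get wrong is rounding the faulting address down to a page boundary before UFFDIO_COPY. A minimal sketch of that step in isolation follows; the helper name page_round_down() is hypothetical, and 64-bit types are used throughout so the mask does not depend on sign extension of an int.

```c
#include <assert.h>
#include <stdint.h>

/* Round 'addr' down to the start of the page that contains it.
   'page_size' must be a nonzero power of two, which holds for the
   value returned by sysconf(_SC_PAGE_SIZE) on Linux. The function
   name is made up for this sketch. */
static uint64_t
page_round_down(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}
```

With a 4096-byte page, the faulting address 0x7fd30106c00f from the sample run maps to 0x7fd30106c000, the address returned by mmap().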
> SEE ALSO
> fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)
>
> Documentation/vm/userfaultfd.txt in the Linux kernel source
> tree
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
Hello Michael,
On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
>
> Mike: here's the split out page that describes the
> userfaultfd ioctl() operations.
>
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes quite a few FIXMEs to be
> sorted.
>
> I've shown the rendered version of the page below.
> The groff source is attached, and can also be found
> at the branch here:
>
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>
> The new ioctl_userfaultfd(2) page follows this mail.
>
> Cheers,
>
> Michael
>
> NAME
> ioctl_userfaultfd - create a file descriptor for handling page faults in user
> space
>
> SYNOPSIS
> #include <sys/ioctl.h>
>
> int ioctl(int fd, int cmd, ...);
>
> DESCRIPTION
> Various ioctl(2) operations can be performed on a userfaultfd object
> (created by a call to userfaultfd(2)) using calls of the form:
>
> ioctl(fd, cmd, argp);
>
> In the above, fd is a file descriptor referring to a userfaultfd
> object, cmd is one of the commands listed below, and argp is a pointer
> to a data structure that is specific to cmd.
>
> The various ioctl(2) operations are described below. The UFFDIO_API,
> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> userfaultfd behavior. These operations allow the caller to choose what
> features will be enabled and what kinds of events will be delivered to
> the application. The remaining operations are range operations. These
> operations enable the calling application to resolve page-fault events
> in a consistent way.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "consistent" mean? │
> │ │
> └─────────────────────────────────────────────────────┘
Andrea, can you please help with this one?
> UFFDIO_API
> (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
> handshake. The argp argument is a pointer to a uffdio_api structure,
> defined as:
>
> struct uffdio_api {
> __u64 api; /* Requested API version (input) */
> __u64 features; /* Must be zero */
> __u64 ioctls; /* Available ioctl() operations (output) */
> };
>
> The api field denotes the API version requested by the application.
> Before the call, the features field must be initialized to zero.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Above: Why must the 'features' field be initialized │
> │to zero? │
> └─────────────────────────────────────────────────────┘
Until 4.11 the only supported feature is delegation of missing-page faults,
and the UFFD_API_FEATURES bitmask is 0.
There's a check in the uffdio_api call that the user is not trying to enable
any other functionality, and it asserts that uffdio_api.features is zero [1].
Starting from 4.11 the features negotiation is different. Now the uffdio_api
call verifies that it can support the features the application requested [2].
> The kernel verifies that it can support the requested API version, and
> sets the features and ioctls fields to bit masks representing all the
> available features and the generic ioctl(2) operations available. Cur‐
> rently, zero (i.e., no feature bits) is placed in the features field.
> The returned ioctls field can contain the following bits:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │This user-space API seems not fully polished. Why │
> │are there not constants defined for each of the bit- │
> │mask values listed below? │
> └─────────────────────────────────────────────────────┘
>
> 1 << _UFFDIO_API
> The UFFDIO_API operation is supported.
>
> 1 << _UFFDIO_REGISTER
> The UFFDIO_REGISTER operation is supported.
>
> 1 << _UFFDIO_UNREGISTER
> The UFFDIO_UNREGISTER operation is supported.
Well, I tend to agree. I believe the original intention was to use the
OR'ed mask, like UFFD_API_IOCTLS.
Andrea, can you add something?
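Until there are per-bit constants, a caller has to build the masks by hand. A minimal sketch of such a check follows; the helper name uffd_ioctl_supported() is hypothetical, while the _UFFDIO_* numbers and UFFD_API_IOCTLS are assumed to come from <linux/userfaultfd.h>.

```c
#include <linux/userfaultfd.h>

/* Return nonzero if the bit for ioctl number 'nr' (for example,
   _UFFDIO_REGISTER) is set in an 'ioctls' mask returned by the kernel.
   The cast to __u64 matters: _UFFDIO_API is 0x3F, so shifting a plain
   int by it would overflow. The function name is made up for this
   sketch. */
static int
uffd_ioctl_supported(__u64 ioctls, unsigned int nr)
{
    return (ioctls & ((__u64) 1 << nr)) != 0;
}
```

After a successful UFFDIO_API call, uffd_ioctl_supported(uffdio_api.ioctls, _UFFDIO_REGISTER) would tell the application whether registration is available.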
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Is the above description of the 'ioctls' field cor‐ │
> │rect? Does more need to be said? │
> │ │
> └─────────────────────────────────────────────────────┘
This is correct. I wouldn't add anything else.
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Is the following error list correct? │
> │ │
> └─────────────────────────────────────────────────────┘
There's also -EFAULT in case copy_{from,to}_user fails.
>
> EINVAL The userfaultfd has already been enabled by a previous UFF‐
> DIO_API operation.
>
> EINVAL The API version requested in the api field is not supported by
> this kernel, or the features field was not zero.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │In the above error case, the returned 'uffdio_api' │
> │structure is zeroed out. Why is this done? This should │
> │be explained in the manual page. │
> │ │
> └─────────────────────────────────────────────────────┘
In my understanding the uffdio_api structure is zeroed to allow the caller
to distinguish the reasons for -EINVAL.
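If that reading is right, a caller could tell the two EINVAL cases apart roughly as follows. This is only a sketch under that assumption; the enum and the function name are hypothetical, and it relies on the kernel leaving the structure untouched when the object was already enabled.

```c
#include <errno.h>
#include <linux/userfaultfd.h>

/* Hypothetical classification of a failed UFFDIO_API call */
enum api_failure {
    API_UNSUPPORTED,      /* API version or features rejected */
    API_ALREADY_ENABLED,  /* UFFDIO_API was already performed */
    API_OTHER_ERROR       /* Anything other than EINVAL */
};

/* 'err' is the errno left by the failed ioctl(); 'api' is the structure
   that was passed to it. Assumption: the kernel zeroes the structure
   only when it rejects the requested version or features, so a zero
   'api' field (the caller set it to UFFD_API, which is nonzero) marks
   that case. */
static enum api_failure
classify_api_failure(int err, const struct uffdio_api *api)
{
    if (err != EINVAL)
        return API_OTHER_ERROR;
    return (api->api == 0) ? API_UNSUPPORTED : API_ALREADY_ENABLED;
}
```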
> UFFDIO_REGISTER
> (Since Linux 4.3.) Register a memory address range with the user‐
> faultfd object. The argp argument is a pointer to a uffdio_register
> structure, defined as:
>
> struct uffdio_range {
> __u64 start; /* Start of range */
> __u64 len; /* Length of range (bytes) */
> };
>
> struct uffdio_register {
> struct uffdio_range range;
> __u64 mode; /* Desired mode of operation (input) */
> __u64 ioctls; /* Available ioctl() operations (output) */
> };
>
>
> The range field defines a memory range starting at start and continuing
> for len bytes that should be handled by the userfaultfd.
>
> The mode field defines the mode of operation desired for this memory
> region. The following values may be bitwise ORed to set the user‐
> faultfd mode for the specified range:
>
> UFFDIO_REGISTER_MODE_MISSING
> Track page faults on missing pages.
>
> UFFDIO_REGISTER_MODE_WP
> Track page faults on write-protected pages.
>
> Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
>
> If the operation is successful, the kernel modifies the ioctls bit-mask
> field to indicate which ioctl(2) operations are available for the spec‐
> ified range. This returned bit mask is as for UFFDIO_API.
>
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Is the following error list correct? │
> │ │
> └─────────────────────────────────────────────────────┘
Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
mm_struct has gone by the time userfault grabs it.
> EBUSY A mapping in the specified range is registered with another
> userfaultfd object.
>
> EINVAL An invalid or unsupported bit was specified in the mode field;
> or the mode field was zero.
>
> EINVAL There is no mapping in the specified address range.
>
> EINVAL range.start or range.len is not a multiple of the system page
> size; or, range.len is zero; or these fields are otherwise
> invalid.
>
> EINVAL There is an incompatible mapping in the specified address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "incompatible" mean? │
> │ │
> └─────────────────────────────────────────────────────┘
Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
MAP_PRIVATE mappings.
> UFFDIO_UNREGISTER
> (Since Linux 4.3.) Unregister a memory address range from userfaultfd.
> The address range to unregister is specified in the uffdio_range struc‐
> ture pointed to by argp.
>
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
> EINVAL Either the start or the len field of the uffdio_range structure
> was not a multiple of the system page size; or the len field was
> zero; or these fields were otherwise invalid.
>
> EINVAL There was an incompatible mapping in the specified address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "incompatible" mean? │
> └─────────────────────────────────────────────────────┘
The same comments as for UFFDIO_REGISTER apply here as well.
> EINVAL There was no mapping in the specified address range.
>
> UFFDIO_COPY
> (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
> userfault registered range and optionally wake up the blocked thread.
> The source and destination addresses and the number of bytes to copy
> are specified by the src, dst, and len fields of the uffdio_copy struc‐
> ture pointed to by argp:
>
> struct uffdio_copy {
> __u64 dst; /* Destination of copy */
> __u64 src; /* Source of copy */
> __u64 len; /* Number of bytes to copy */
> __u64 mode; /* Flags controlling behavior of copy */
> __s64 copy; /* Number of bytes copied, or negated error */
> };
>
> The following value may be bitwise ORed in mode to change the behavior
> of the UFFDIO_COPY operation:
>
> UFFDIO_COPY_MODE_DONTWAKE
> Do not wake up the thread that waits for page-fault resolution.
>
> The copy field is used by the kernel to return the number of bytes that
> was actually copied, or an error (a negated errno-style value).
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Above: Why is the 'copy' field used to return error │
> │values? This should be explained in the manual │
> │page. │
> └─────────────────────────────────────────────────────┘
Andrea, can you help with this one, please?
> If the value returned in copy doesn't match the value that was speci‐
> fied in len, the operation fails with the error EAGAIN. The copy field
> is output-only; it is not read by the UFFDIO_COPY operation.
>
> This ioctl(2) operation returns 0 on success. In this case, the entire
> area was copied. On error, -1 is returned and errno is set to indicate
> the cause of the error. Possible errors include:
>
> EAGAIN The number of bytes copied (i.e., the value returned in the copy
> field) does not equal the value that was specified in the len
> field.
>
> EINVAL Either dst or len was not a multiple of the system page size, or
> the range specified by src and len or dst and len was invalid.
>
> EINVAL An invalid bit was specified in the mode field.
>
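To make the interplay between the ioctl() return value, errno, and the copy field concrete, here is a minimal sketch of how a caller might interpret the outcome. The helper name copy_bytes_remaining() is hypothetical, and the idea of retrying the remainder after EAGAIN is an assumption of this sketch, not something the page states.

```c
#include <errno.h>
#include <stdint.h>

/* Interpret the outcome of a UFFDIO_COPY call.
   'ret' is the ioctl() return value, 'err' the errno it left behind,
   and 'copy' and 'len' are the uffdio_copy fields after the call.
   Returns the number of bytes still to be copied (0 means the whole
   area was copied), or -1 for an error other than a short copy.
   The function name is made up for this sketch. */
static int64_t
copy_bytes_remaining(int ret, int err, int64_t copy, uint64_t len)
{
    if (ret == 0)
        return 0;                       /* Entire area was copied */

    /* EAGAIN: 'copy' holds the bytes actually copied, fewer than 'len' */
    if (err == EAGAIN && copy >= 0 && (uint64_t) copy <= len)
        return (int64_t) (len - (uint64_t) copy);

    return -1;                          /* 'copy' holds a negated errno */
}
```

A caller that gets a positive result could advance src and dst by the value of copy and issue UFFDIO_COPY again for the remainder.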
> UFFDIO_ZEROPAGE
> (Since Linux 4.3.) Zero out a memory range registered with user‐
> faultfd. The requested range is specified by the range field of the
> uffdio_zeropage structure pointed to by argp:
>
> struct uffdio_zeropage {
> struct uffdio_range range;
> __u64 mode; /* Flags controlling behavior of zeropage */
> __s64 zeropage; /* Number of bytes zeroed, or negated error */
> };
>
> The following value may be bitwise ORed in mode to change the behavior
> of the UFFDIO_ZEROPAGE operation:
>
> UFFDIO_ZEROPAGE_MODE_DONTWAKE
> Do not wake up the thread that waits for page-fault resolution.
>
> The zeropage field is used by the kernel to return the number of bytes
> that was actually zeroed, or an error in the same manner as UFF‐
> DIO_COPY.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Why is the 'zeropage' field used to return error │
> │values? This should be explained in the manual │
> │page. │
> └─────────────────────────────────────────────────────┘
> If the value returned in the zeropage field doesn't match the value
> that was specified in range.len, the operation fails with the error
> EAGAIN. The zeropage field is output-only; it is not read by the UFF‐
> DIO_ZEROPAGE operation.
>
> This ioctl(2) operation returns 0 on success. In this case, the entire
> area was zeroed. On error, -1 is returned and errno is set to indicate
> the cause of the error. Possible errors include:
>
> EAGAIN The number of bytes zeroed (i.e., the value returned in the
> zeropage field) does not equal the value that was specified in
> the range.len field.
>
> EINVAL Either range.start or range.len was not a multiple of the system
> page size; or range.len was zero; or the range specified was
> invalid.
>
> EINVAL An invalid bit was specified in the mode field.
>
> UFFDIO_WAKE
> (Since Linux 4.3.) Wake up the thread waiting for page-fault resolu‐
> tion on a specified memory address range. The argp argument is a
> pointer to a uffdio_range structure (shown above) that specifies the
> address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Need more detail here. How is the UFFDIO_WAKE opera‐ │
> │tion used? │
> └─────────────────────────────────────────────────────┘
The UFFDIO_WAKE operation is used in conjunction with
UFFDIO_{COPY,ZEROPAGE} operations that have
UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
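A minimal sketch of that batching pattern follows. The function name copy_batch_then_wake() and its parameters are hypothetical; it assumes the destination range is registered with the userfaultfd and that dst is page-aligned.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Fill 'npages' consecutive pages starting at 'dst' from 'src' without
   waking anyone, then wake the waiters on the whole range with a single
   UFFDIO_WAKE. Returns 0 on success, or -1 with errno set by the
   failing ioctl(). The function name is made up for this sketch. */
static int
copy_batch_then_wake(int uffd, unsigned long dst, unsigned long src,
                     size_t npages, size_t page_size)
{
    struct uffdio_range range;
    size_t i;

    for (i = 0; i < npages; i++) {
        struct uffdio_copy copy;

        copy.dst = dst + i * page_size;
        copy.src = src + i * page_size;
        copy.len = page_size;
        copy.mode = UFFDIO_COPY_MODE_DONTWAKE;  /* Defer the wake-up */
        copy.copy = 0;
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            return -1;
    }

    /* One explicit wake-up for everything resolved above */
    range.start = dst;
    range.len = npages * page_size;
    if (ioctl(uffd, UFFDIO_WAKE, &range) == -1)
        return -1;

    return 0;
}
```

The same shape would apply to UFFDIO_ZEROPAGE with UFFDIO_ZEROPAGE_MODE_DONTWAKE.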
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
> EINVAL The start or the len field of the uffdio_range structure was not
> a multiple of the system page size; or len was zero; or the
> specified range was otherwise invalid.
>
> RETURN VALUE
> See descriptions of the individual operations, above.
>
> ERRORS
> See descriptions of the individual operations, above. In addition, the
> following general errors can occur for all of the operations described
> above:
>
> EFAULT argp does not point to a valid memory address.
>
> EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
> has not yet been enabled (via the UFFDIO_API operation).
>
> CONFORMING TO
> These ioctl(2) operations are Linux-specific.
>
> EXAMPLE
> See userfaultfd(2).
>
> SEE ALSO
> ioctl(2), mmap(2), userfaultfd(2)
>
> Documentation/vm/userfaultfd.txt in the Linux kernel source tree
>
[1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680
Hello Mike,
On 03/21/2017 03:01 PM, Mike Rapoport wrote:
> Hello Michael,
>
> On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Andrea, Mike, and all,
>>
>> Mike: thanks for the page that you sent. I've reworked it
>> a bit, and also added a lot of further information,
>> and an example program. In the process, I split the page
>> into two pieces, with one piece describing the userfaultfd()
>> system call and the other describing the ioctl() operations.
>>
>> I'd like to get review input, especially from you and
>> Andrea, but also anyone else, for the current version
>> of this page, which includes a few FIXMEs to be sorted.
>
> Thanks for the update. I'm addressing the FIXME points you've mentioned
> below.
Thanks!
> Otherwise, everything seems to be the right description of the current upstream.
> 4.11 will have quite a few updates to userfault and we'll need to update
> this page and ioctl_userfaultfd(2) to address those updates. I am planning
> to work on the man update in the next few weeks.
>
>> I've shown the rendered version of the page below.
>> The groff source is attached, and can also be found
>> at the branch here:
>
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>
>> The new ioctl_userfaultfd(2) page follows this mail.
>>
>> Cheers,
>>
>> Michael
>
> --
> Sincerely yours,
> Mike.
>
>
>> USERFAULTFD(2) Linux Programmer's Manual USERFAULTFD(2)
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Need to describe close(2) semantics for userfaultfd │
>> │file descriptor: what happens when the userfaultfd │
>> │FD is closed? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> When userfaultfd is closed, it unregisters all memory ranges that were
> previously registered with it and flushes the outstanding page fault
> events.
Presumably, this is more precisely stated as, "when the last
file descriptor referring to a userfaultfd object is closed..."?
I've made the text:
When the last file descriptor referring to a userfaultfd object
is closed, all memory ranges that were registered with the
object are unregistered and unread page-fault events are
flushed.
[...]
>> Reading from the userfaultfd structure
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │are the details below correct? │
>> └─────────────────────────────────────────────────────┘
>
> Yes, at least for the current upstream version. 4.11 will have quite a few
> updates to userfaultfd.
Okay.
>> Each read(2) from the userfaultfd file descriptor returns one
>> or more uffd_msg structures, each of which describes a page-
>> fault event:
>>
>> struct uffd_msg {
>> __u8 event; /* Type of event */
>> ...
>> union {
>> struct {
>> __u64 flags; /* Flags describing fault */
>> __u64 address; /* Faulting address */
>> } pagefault;
>> ...
>> } arg;
>>
>> /* Padding fields omitted */
>> } __packed;
>>
>> If multiple events are available and the supplied buffer is
>> large enough, read(2) returns as many events as will fit in the
>> supplied buffer. If the buffer supplied to read(2) is smaller
>> than the size of the uffd_msg structure, the read(2) fails with
>> the error EINVAL.
>>
>> The fields set in the uffd_msg structure are as follows:
>>
>> event The type of event. Currently, only one value can appear
>> in this field: UFFD_EVENT_PAGEFAULT, which indicates a
>> page-fault event.
>>
>> address
>> The address that triggered the page fault.
>>
>> flags A bit mask of flags that describe the event. For
>> UFFD_EVENT_PAGEFAULT, the following flag may appear:
>>
>> UFFD_PAGEFAULT_FLAG_WRITE
>> If the address is in a range that was registered
>> with the UFFDIO_REGISTER_MODE_MISSING flag (see
>> ioctl_userfaultfd(2)) and this flag is set, this is
>> a write fault; otherwise it is a read fault.
>>
>> A read(2) on a userfaultfd file descriptor can fail with the
>> following errors:
>>
>> EINVAL The userfaultfd object has not yet been enabled using
>> the UFFDIO_API ioctl(2) operation.
>>
>> The userfaultfd file descriptor can be monitored with poll(2),
>> select(2), and epoll(7). When events are available, the file
>> descriptor indicates as readable.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │But, it seems, the object must be created with │
>> │O_NONBLOCK. What is the rationale for this require‐ │
>> │ment? Something needs to be said in this manual │
>> │page. │
>> └─────────────────────────────────────────────────────┘
>
> The object can be created without O_NONBLOCK, so probably the above
> sentence can be rephrased as:
>
> When the userfaultfd file descriptor is opened in non-blocking mode, it can
> be monitored with ...
Yes, but why is there this requirement for poll() etc. with the
O_NONBLOCK flag? I think something about that needs to be said in the
man page. Sorry, my FIXME was not clear enough. I've reworded the text
and the FIXME:
If the O_NONBLOCK flag is enabled in the associated open file
description, the userfaultfd file descriptor can be monitored
with poll(2), select(2), and epoll(7). When events are avail‐
able, the file descriptor indicates as readable. If the O_NON‐
BLOCK flag is not enabled, then poll(2) (always) indicates the
file as having a POLLERR condition, and select(2) indicates the
file descriptor as both readable and writable.
┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│What is the reason for this seemingly odd behavior │
│with respect to the O_NONBLOCK flag? (see user‐ │
│faultfd_poll() in fs/userfaultfd.c). Something │
│needs to be said about this. │
└─────────────────────────────────────────────────────┘
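To make the reworded paragraph concrete, here is a rough sketch of the non-blocking pattern it describes. The helper names uffd_open_nonblock and uffd_wait_readable are hypothetical, and the syscall(2) invocation assumes that the installed headers define __NR_userfaultfd (there is no glibc wrapper). Treating POLLERR as a failure reflects the behavior described above for descriptors without O_NONBLOCK.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a userfaultfd object with O_NONBLOCK set; returns -1 on error. */
int uffd_open_nonblock(void)
{
    return (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
}

/*
 * Wait until fd has something to read.
 * Returns 1 if readable, 0 on timeout, and -1 if poll() fails or
 * reports POLLERR/POLLNVAL for the descriptor.
 */
int uffd_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

The wait helper does not depend on userfaultfd itself, so it can be exercised with any pollable descriptor.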
[...]
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello Mike,
Hello Andrea (we need your help!),
On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> Hello Michael,
>
> On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Andrea, Mike, and all,
>>
>> Mike: here's the split out page that describes the
>> userfaultfd ioctl() operations.
>>
>> I'd like to get review input, especially from you and
>> Andrea, but also anyone else, for the current version
>> of this page, which includes quite a few FIXMEs to be
>> sorted.
>>
>> I've shown the rendered version of the page below.
>> The groff source is attached, and can also be found
>> at the branch here:
>>
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>
>> The new ioctl_userfaultfd(2) page follows this mail.
>>
>> Cheers,
>>
>> Michael
>>
>> NAME
>> ioctl_userfaultfd - create a file descriptor for handling page faults in user
>> space
>>
>> SYNOPSIS
>> #include <sys/ioctl.h>
>>
>> int ioctl(int fd, int cmd, ...);
>>
>> DESCRIPTION
>> Various ioctl(2) operations can be performed on a userfaultfd object
>> (created by a call to userfaultfd(2)) using calls of the form:
>>
>> ioctl(fd, cmd, argp);
>>
>> In the above, fd is a file descriptor referring to a userfaultfd
>> object, cmd is one of the commands listed below, and argp is a pointer
>> to a data structure that is specific to cmd.
>>
>> The various ioctl(2) operations are described below. The UFFDIO_API,
>> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>> userfaultfd behavior. These operations allow the caller to choose what
>> features will be enabled and what kinds of events will be delivered to
>> the application. The remaining operations are range operations. These
>> operations enable the calling application to resolve page-fault events
>> in a consistent way.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Above: What does "consistent" mean? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> Andrea, can you please help with this one?
Let's see what Andrea has to say.
>> UFFDIO_API
>> (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
>> handshake. The argp argument is a pointer to a uffdio_api structure,
>> defined as:
>>
>> struct uffdio_api {
>> __u64 api; /* Requested API version (input) */
>> __u64 features; /* Must be zero */
>> __u64 ioctls; /* Available ioctl() operations (output) */
>> };
>>
>> The api field denotes the API version requested by the application.
>> Before the call, the features field must be initialized to zero.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Above: Why must the 'features' field be initialized │
>> │to zero? │
>> └─────────────────────────────────────────────────────┘
>
> Until 4.11 the only supported feature is delegation of missing page fault
> and the UFFDIO_FEATURES bitmask is 0.
So, the thing that was not clear, but now I think I understand:
'features' is an input field where one can ask about supported features
(but none are supported, before Linux 4.11). Is that correct? I've changed
the text here to read:
Before the call, the features field must be initialized
to zero. In the future, it is intended that this field can be
used to ask whether particular features are supported.
Seem okay?
> There's a check in uffdio_api call that the user is not trying to enable
> any other functionality and it asserts that uffdio_api.features is zero [1].
> Starting from 4.11 the features negotiation is different. Now uffdio_call
> verifies that it can support features the application requested [2].
Okay.
>> The kernel verifies that it can support the requested API version, and
>> sets the features and ioctls fields to bit masks representing all the
>> available features and the generic ioctl(2) operations available. Cur‐
>> rently, zero (i.e., no feature bits) is placed in the features field.
>> The returned ioctls field can contain the following bits:
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │This user-space API seems not fully polished. Why │
>> │are there not constants defined for each of the bit- │
>> │mask values listed below? │
>> └─────────────────────────────────────────────────────┘
>>
>> 1 << _UFFDIO_API
>> The UFFDIO_API operation is supported.
>>
>> 1 << _UFFDIO_REGISTER
>> The UFFDIO_REGISTER operation is supported.
>>
>> 1 << _UFFDIO_UNREGISTER
>> The UFFDIO_UNREGISTER operation is supported.
>
> Well, I tend to agree. I believe the original intention was to use the
> OR'ed mask, like UFFD_API_IOCTLS.
> Andrea, can you add something?
Yes, Andrea, please!
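In the meantime, a sketch of how an application can test the returned mask with the definitions that do exist. The helper names are hypothetical; _UFFDIO_REGISTER and friends, and the ORed mask UFFD_API_IOCTLS that Mike mentions, are taken from <linux/userfaultfd.h>.

```c
#include <assert.h>
#include <linux/userfaultfd.h>

/*
 * Return nonzero if the bit for operation number nr (for example,
 * _UFFDIO_REGISTER) is set in the ioctls mask returned by UFFDIO_API
 * or UFFDIO_REGISTER.
 */
int uffd_ioctl_supported(__u64 ioctls, unsigned int nr)
{
    assert(nr < 64);
    return (ioctls & ((__u64)1 << nr)) != 0;
}

/*
 * Return nonzero if all of the operations in UFFD_API_IOCTLS
 * (UFFDIO_API, UFFDIO_REGISTER, UFFDIO_UNREGISTER) are available.
 */
int uffd_has_basic_ioctls(__u64 ioctls)
{
    return (ioctls & UFFD_API_IOCTLS) == UFFD_API_IOCTLS;
}
```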
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Is the above description of the 'ioctls' field cor‐ │
>> │rect? Does more need to be said? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> This is correct. I wouldn't add anything else.
Thanks.
>> This ioctl(2) operation returns 0 on success. On error, -1 is returned
>> and errno is set to indicate the cause of the error. Possible errors
>> include:
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Is the following error list correct? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> There's also -EFAULT in case copy_{from,to}_user fails.
Okay -- I have added that error.
>>
>> EINVAL The userfaultfd has already been enabled by a previous UFF‐
>> DIO_API operation.
>>
>> EINVAL The API version requested in the api field is not supported by
>> this kernel, or the features field was not zero.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the above error case, the returned 'uffdio_api' │
>> │structure is zeroed out. Why is this done? This │
>> │should be explained in the manual page. │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> In my understanding the uffdio_api structure is zeroed to allow the caller
> to distinguish the reasons for -EINVAL.
Andrea, can you please help here?
>> UFFDIO_REGISTER
>> (Since Linux 4.3.) Register a memory address range with the user‐
>> faultfd object. The argp argument is a pointer to a uffdio_register
>> structure, defined as:
>>
>> struct uffdio_range {
>> __u64 start; /* Start of range */
>> __u64 len; /* Length of range (bytes) */
>> };
>>
>> struct uffdio_register {
>> struct uffdio_range range;
>> __u64 mode; /* Desired mode of operation (input) */
>> __u64 ioctls; /* Available ioctl() operations (output) */
>> };
>>
>>
>> The range field defines a memory range starting at start and continuing
>> for len bytes that should be handled by the userfaultfd.
>>
>> The mode field defines the mode of operation desired for this memory
>> region. The following values may be bitwise ORed to set the user‐
>> faultfd mode for the specified range:
>>
>> UFFDIO_REGISTER_MODE_MISSING
>> Track page faults on missing pages.
>>
>> UFFDIO_REGISTER_MODE_WP
>> Track page faults on write-protected pages.
>>
>> Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
>>
>> If the operation is successful, the kernel modifies the ioctls bit-mask
>> field to indicate which ioctl(2) operations are available for the spec‐
>> ified range. This returned bit mask is as for UFFDIO_API.
>>
>> This ioctl(2) operation returns 0 on success. On error, -1 is returned
>> and errno is set to indicate the cause of the error. Possible errors
>> include:
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Is the following error list correct? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> Here again, it may be -EFAULT, indicating a copy_{from,to}_user failure.
> And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> mm_struct has gone by the time userfault grabs it.
Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
will note the possibility in the page source.
>> EBUSY A mapping in the specified range is registered with another
>> userfaultfd object.
>>
>> EINVAL An invalid or unsupported bit was specified in the mode field;
>> or the mode field was zero.
>>
>> EINVAL There is no mapping in the specified address range.
>>
>> EINVAL range.start or range.len is not a multiple of the system page
>> size; or, range.len is zero; or these fields are otherwise
>> invalid.
>>
>> EINVAL There was an incompatible mapping in the specified address range.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Above: What does "incompatible" mean? │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
> MAP_PRIVATE mappings.
Hmmm -- this restriction is not actually mentioned in the description
of UFFDIO_REGISTER. So, at the start of the description of that operation,
I've made the text as follows:
[[
.SS UFFDIO_REGISTER
(Since Linux 4.3.)
Register a memory address range with the userfaultfd object.
The pages in the range must be "compatible".
In the current implementation,
.\" According to Mike Rapoport, this will change in Linux 4.11.
only private anonymous ranges are compatible for registering with
.BR UFFDIO_REGISTER .
]]
Okay?
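For reference, a sketch of what registering a "compatible" range looks like under the current restriction. Both helpers (map_private_anon and uffd_register_missing) are hypothetical names, and the negated-errno return is a convention chosen for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Map len bytes of private anonymous memory; MAP_FAILED on error. */
void *map_private_anon(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/*
 * Register [addr, addr + len) with uffd for missing-page faults.
 * addr and len are expected to be multiples of the page size.
 * Returns 0 on success (storing the returned ioctls mask in *ioctls
 * if ioctls is non-NULL), or a negated errno value on failure.
 */
int uffd_register_missing(int uffd, void *addr, size_t len, __u64 *ioctls)
{
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };

    assert(addr != NULL && len > 0);

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -errno;
    if (ioctls != NULL)
        *ioctls = reg.ioctls;
    return 0;
}
```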
>> UFFDIO_UNREGISTER
>> (Since Linux 4.3.) Unregister a memory address range from userfaultfd.
>> The address range to unregister is specified in the uffdio_range struc‐
>> ture pointed to by argp.
>>
>> This ioctl(2) operation returns 0 on success. On error, -1 is returned
>> and errno is set to indicate the cause of the error. Possible errors
>> include:
>>
>> EINVAL Either the start or the len field of the uffdio_range structure
>> was not a multiple of the system page size; or the len field was
>> zero; or these fields were otherwise invalid.
>>
>> EINVAL There was an incompatible mapping in the specified address range.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Above: What does "incompatible" mean? │
>> └─────────────────────────────────────────────────────┘
>
> The same comments as for UFFDIO_REGISTER apply here as well.
Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:
[[
.SS UFFDIO_UNREGISTER
(Since Linux 4.3.)
Unregister a memory address range from userfaultfd.
The pages in the range must be "compatible" (see the description of
.BR UFFDIO_REGISTER ).
]]
Okay?
>> EINVAL There was no mapping in the specified address range.
>>
>> UFFDIO_COPY
>> (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
>> userfault registered range and optionally wake up the blocked thread.
>> The source and destination addresses and the number of bytes to copy
>> are specified by the src, dst, and len fields of the uffdio_copy struc‐
>> ture pointed to by argp:
>>
>> struct uffdio_copy {
>> __u64 dst; /* Destination of copy */
>> __u64 src; /* Source of copy */
>> __u64 len; /* Number of bytes to copy */
>> __u64 mode; /* Flags controlling behavior of copy */
>> __s64 copy; /* Number of bytes copied, or negated error */
>> };
>>
>> The following value may be bitwise ORed in mode to change the behavior
>> of the UFFDIO_COPY operation:
>>
>> UFFDIO_COPY_MODE_DONTWAKE
>> Do not wake up the thread that waits for page-fault resolution.
>>
>> The copy field is used by the kernel to return the number of bytes that
>> was actually copied, or an error (a negated errno-style value).
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Above: Why is the 'copy' field used to return error │
>> │values? This should be explained in the manual │
>> │page. │
>> └─────────────────────────────────────────────────────┘
>
> Andrea, can you help with this one, please?
Yes, Andrea, please.
>> If the value returned in copy doesn't match the value that was speci‐
>> fied in len, the operation fails with the error EAGAIN. The copy field
>> is output-only; it is not read by the UFFDIO_COPY operation.
>>
>> This ioctl(2) operation returns 0 on success. In this case, the entire
>> area was copied. On error, -1 is returned and errno is set to indicate
>> the cause of the error. Possible errors include:
>>
>> EAGAIN The number of bytes copied (i.e., the value returned in the copy
>> field) does not equal the value that was specified in the len
>> field.
>>
>> EINVAL Either dst or len was not a multiple of the system page size, or
>> the range specified by src and len or dst and len was invalid.
>>
>> EINVAL An invalid bit was specified in the mode field.
>>
>> UFFDIO_ZEROPAGE
>> (Since Linux 4.3.) Zero out a memory range registered with user‐
>> faultfd. The requested range is specified by the range field of the
>> uffdio_zeropage structure pointed to by argp:
>>
>> struct uffdio_zeropage {
>> struct uffdio_range range;
>> __u64 mode; /* Flags controlling behavior of copy */
>> __s64 zeropage; /* Number of bytes zeroed, or negated error */
>> };
>>
>> The following value may be bitwise ORed in mode to change the behavior
>> of the UFFDIO_ZEROPAGE operation:
>>
>> UFFDIO_ZEROPAGE_MODE_DONTWAKE
>> Do not wake up the thread that waits for page-fault resolution.
>>
>> The zeropage field is used by the kernel to return the number of bytes
>> that was actually zeroed, or an error in the same manner as UFF‐
>> DIO_COPY.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Why is the 'zeropage' field used to return error │
>> │values? This should be explained in the manual │
>> │page. │
>> └─────────────────────────────────────────────────────┘
Help is still needed for this FIXME!
>> If the value returned in the zeropage field doesn't match the value
>> that was specified in range.len, the operation fails with the error
>> EAGAIN. The zeropage field is output-only; it is not read by the
>> UFFDIO_ZEROPAGE operation.
>>
>> This ioctl(2) operation returns 0 on success. In this case, the entire
>> area was zeroed. On error, -1 is returned and errno is set to indicate
>> the cause of the error. Possible errors include:
>>
>> EAGAIN The number of bytes zeroed (i.e., the value returned in the
>> zeropage field) does not equal the value that was specified in
>> the range.len field.
>>
>> EINVAL Either range.start or range.len was not a multiple of the system
>> page size; or range.len was zero; or the range specified was
>> invalid.
>>
>> EINVAL An invalid bit was specified in the mode field.
>>
>> UFFDIO_WAKE
>> (Since Linux 4.3.) Wake up the thread waiting for page-fault resolu‐
>> tion on a specified memory address range. The argp argument is a
>> pointer to a uffdio_range structure (shown above) that specifies the
>> address range.
>>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Need more detail here. How is the UFFDIO_WAKE opera‐ │
>> │tion used? │
>> └─────────────────────────────────────────────────────┘
>
> The UFFDIO_WAKE operation is used in conjunction with
> UFFDIO_{COPY,ZEROPAGE} operations that have
> UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
> The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
> batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
Perfect! I've tweaked that text a little and added to the page.
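To check my understanding of the batching pattern, here is a sketch of it. The function name and its parameters are hypothetical, and it simply stops at the first failing ioctl(2) rather than retrying.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Resolve npages faulting pages starting at dst by copying from src,
 * one page per UFFDIO_COPY call with UFFDIO_COPY_MODE_DONTWAKE set,
 * then wake the waiting thread(s) once with UFFDIO_WAKE.
 * Returns 0 on success, or a negated errno value from the first
 * failing ioctl().
 */
int uffd_copy_pages_then_wake(int uffd, unsigned long dst, const void *src,
                              size_t npages, size_t page_size)
{
    struct uffdio_range range = { .start = dst, .len = npages * page_size };

    assert(src != NULL && npages > 0 && page_size > 0);

    for (size_t i = 0; i < npages; i++) {
        struct uffdio_copy copy = {
            .dst = dst + i * page_size,
            .src = (unsigned long)src + i * page_size,
            .len = page_size,
            .mode = UFFDIO_COPY_MODE_DONTWAKE,  /* defer the wake-up */
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            return -errno;
    }

    /* One explicit wake-up for the whole batch. */
    if (ioctl(uffd, UFFDIO_WAKE, &range) == -1)
        return -errno;
    return 0;
}
```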
>> This ioctl(2) operation returns 0 on success. On error, -1 is returned
>> and errno is set to indicate the cause of the error. Possible errors
>> include:
>>
>> EINVAL The start or the len field of the uffdio_range structure was not
>> a multiple of the system page size; or len was zero; or the
>> specified range was otherwise invalid.
>>
>> RETURN VALUE
>> See descriptions of the individual operations, above.
>>
>> ERRORS
>> See descriptions of the individual operations, above. In addition, the
>> following general errors can occur for all of the operations described
>> above:
>>
>> EFAULT argp does not point to a valid memory address.
>>
>> EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
>> has not yet been enabled (via the UFFDIO_API operation).
>>
>> CONFORMING TO
>> These ioctl(2) operations are Linux-specific.
>>
>> EXAMPLE
>> See userfaultfd(2).
>>
>> SEE ALSO
>> ioctl(2), mmap(2), userfaultfd(2)
>>
>> Documentation/vm/userfaultfd.txt in the Linux kernel source tree
>>
>
> [1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680
The current version of the two pages has been pushed to
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
On Fri, Apr 21, 2017 at 08:30:55AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
>
> On 03/21/2017 03:01 PM, Mike Rapoport wrote:
> > Hello Michael,
> >
> > On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
> >> Hello Andrea, Mike, and all,
> >>
> >> Mike: thanks for the page that you sent. I've reworked it
> >> a bit, and also added a lot of further information,
> >> and an example program. In the process, I split the page
> >> into two pieces, with one piece describing the userfaultfd()
> >> system call and the other describing the ioctl() operations.
> >>
> >> I'd like to get review input, especially from you and
> >> Andrea, but also anyone else, for the current version
> >> of this page, which includes a few FIXMEs to be sorted.
> >
> > Thanks for the update. I'm addressing the FIXME points you've mentioned
> > below.
>
> Thanks!
>
> > Otherwise, everything seems to describe the current upstream correctly.
> > 4.11 will have quite a few updates to userfault and we'll need to update
> > this page and ioctl_userfaultfd(2) to address those updates. I am planning
> > to work on the man update in the next few weeks.
> >
> >> I've shown the rendered version of the page below.
> >> The groff source is attached, and can also be found
> >> at the branch here:
> >
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> >>
> >> The new ioctl_userfaultfd(2) page follows this mail.
> >>
> >> Cheers,
> >>
> >> Michael
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> >
> >> USERFAULTFD(2) Linux Programmer's Manual USERFAULTFD(2)
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Need to describe close(2) semantics for userfaultfd │
> >> │file descriptor: what happens when the userfaultfd │
> >> │FD is closed? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > When userfaultfd is closed, it unregisters all memory ranges that were
> > previously registered with it and flushes the outstanding page fault
> > events.
>
> Presumably, this is more precisely stated as, "when the last
> file descriptor referring to a userfaultfd object is closed..."?
You are right.
> I've made the text:
>
> When the last file descriptor referring to a userfaultfd object
> is closed, all memory ranges that were registered with the
> object are unregistered and unread page-fault events are
> flushed.
>
> [...]
Perfect.
> >> Reading from the userfaultfd structure
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │are the details below correct? │
> >> └─────────────────────────────────────────────────────┘
> >
> > Yes, at least for the current upstream version. 4.11 will have quite a few
> > updates to userfaultfd.
>
> Okay.
>
> >> Each read(2) from the userfaultfd file descriptor returns one
> >> or more uffd_msg structures, each of which describes a page-
> >> fault event:
> >>
> >> struct uffd_msg {
> >> __u8 event; /* Type of event */
> >> ...
> >> union {
> >> struct {
> >> __u64 flags; /* Flags describing fault */
> >> __u64 address; /* Faulting address */
> >> } pagefault;
> >> ...
> >> } arg;
> >>
> >> /* Padding fields omitted */
> >> } __packed;
> >>
> >> If multiple events are available and the supplied buffer is
> >> large enough, read(2) returns as many events as will fit in the
> >> supplied buffer. If the buffer supplied to read(2) is smaller
> >> than the size of the uffd_msg structure, the read(2) fails with
> >> the error EINVAL.
> >>
> >> The fields set in the uffd_msg structure are as follows:
> >>
> >> event The type of event. Currently, only one value can appear
> >> in this field: UFFD_EVENT_PAGEFAULT, which indicates a
> >> page-fault event.
> >>
> >> address
> >> The address that triggered the page fault.
> >>
> >> flags A bit mask of flags that describe the event. For
> >> UFFD_EVENT_PAGEFAULT, the following flag may appear:
> >>
> >> UFFD_PAGEFAULT_FLAG_WRITE
> >> If the address is in a range that was registered
> >> with the UFFDIO_REGISTER_MODE_MISSING flag (see
> >> ioctl_userfaultfd(2)) and this flag is set, this is
> >> a write fault; otherwise it is a read fault.
> >>
> >> A read(2) on a userfaultfd file descriptor can fail with the
> >> following errors:
> >>
> >> EINVAL The userfaultfd object has not yet been enabled using
> >> the UFFDIO_API ioctl(2) operation.
> >>
> >> The userfaultfd file descriptor can be monitored with poll(2),
> >> select(2), and epoll(7). When events are available, the file
> >> descriptor indicates as readable.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │But, it seems, the object must be created with │
> >> │O_NONBLOCK. What is the rationale for this require‐ │
> >> │ment? Something needs to be said in this manual │
> >> │page. │
> >> └─────────────────────────────────────────────────────┘
> >
> > The object can be created without O_NONBLOCK, so probably the above
> > sentence can be rephrased as:
> >
> > When the userfaultfd file descriptor is opened in non-blocking mode, it can
> > be monitored with ...
>
> Yes, but why is there this requirement for poll() etc. with the
> O_NONBLOCK flag? I think something about that needs to be said in the
> man page. Sorry, my FIXME was not clear enough. I've reworded the text
> and the FIXME:
>
> If the O_NONBLOCK flag is enabled in the associated open file
> description, the userfaultfd file descriptor can be monitored
> with poll(2), select(2), and epoll(7). When events are avail‐
> able, the file descriptor indicates as readable. If the O_NON‐
> BLOCK flag is not enabled, then poll(2) (always) indicates the
> file as having a POLLERR condition, and select(2) indicates the
> file descriptor as both readable and writable.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │What is the reason for this seemingly odd behavior │
> │with respect to the O_NONBLOCK flag? (see user‐ │
> │faultfd_poll() in fs/userfaultfd.c). Something │
> │needs to be said about this. │
> └─────────────────────────────────────────────────────┘
Andrea, can you please help with this one as well?
> [...]
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
Sincerely yours,
Mike.
Hello Michael,
On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
> Hello Andrea (we need your help!),
>
> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> > Hello Michael,
> >
> > On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> >> Hello Andrea, Mike, and all,
> >>
> >> Mike: here's the split out page that describes the
> >> userfaultfd ioctl() operations.
> >>
> >> I'd like to get review input, especially from you and
> >> Andrea, but also anyone else, for the current version
> >> of this page, which includes quite a few FIXMEs to be
> >> sorted.
> >>
> >> I've shown the rendered version of the page below.
> >> The groff source is attached, and can also be found
> >> at the branch here:
> >>
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> >>
> >> The new ioctl_userfaultfd(2) page follows this mail.
> >>
> >> Cheers,
> >>
> >> Michael
> >>
> >> NAME
> >> ioctl_userfaultfd - create a file descriptor for handling page faults in user
> >> space
> >>
> >> SYNOPSIS
> >> #include <sys/ioctl.h>
> >>
> >> int ioctl(int fd, int cmd, ...);
> >>
> >> DESCRIPTION
> >> Various ioctl(2) operations can be performed on a userfaultfd object
> >> (created by a call to userfaultfd(2)) using calls of the form:
> >>
> >> ioctl(fd, cmd, argp);
> >>
> >> In the above, fd is a file descriptor referring to a userfaultfd
> >> object, cmd is one of the commands listed below, and argp is a pointer
> >> to a data structure that is specific to cmd.
> >>
> >> The various ioctl(2) operations are described below. The UFFDIO_API,
> >> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> >> userfaultfd behavior. These operations allow the caller to choose what
> >> features will be enabled and what kinds of events will be delivered to
> >> the application. The remaining operations are range operations. These
> >> operations enable the calling application to resolve page-fault events
> >> in a consistent way.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Above: What does "consistent" mean? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > Andrea, can you please help with this one?
>
> Let's see what Andrea has to say.
Actually, I thought I'd copied this text from Andrea's docs, but now I've
found out it was my own wording, and I really don't remember what I
intended by "consistent" :)
My guess is that I was thinking about atomicity of UFFDIO_COPY, or the fact
that from the faulting thread perspective the page fault handling is the
same whether it's done in kernel or via userfaultfd...
That said, maybe it'd be better just to drop "in a consistent way".
> >> UFFDIO_API
> >> (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
> >> handshake. The argp argument is a pointer to a uffdio_api structure,
> >> defined as:
> >>
> >> struct uffdio_api {
> >> __u64 api; /* Requested API version (input) */
> >> __u64 features; /* Must be zero */
> >> __u64 ioctls; /* Available ioctl() operations (output) */
> >> };
> >>
> >> The api field denotes the API version requested by the application.
> >> Before the call, the features field must be initialized to zero.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Above: Why must the 'features' field be initialized │
> >> │to zero? │
> >> └─────────────────────────────────────────────────────┘
> >
> > Until 4.11 the only supported feature is delegation of missing page fault
> > and the UFFDIO_FEATURES bitmask is 0.
>
> So, the thing that was not clear, but now I think I understand:
> 'features' is an input field where one can ask about supported features
> (but none are supported, before Linux 4.11). Is that correct?
Yes.
> I've changed the text here to read:
>
> Before the call, the features field must be initialized
> to zero. In the future, it is intended that this field can be
> used to ask whether particular features are supported.
>
> Seem okay?
Yes.
Just note that the future is only a week or two away, as we are at 4.11-rc7 :)
> > There's a check in uffdio_api call that the user is not trying to enable
> > any other functionality and it asserts that uffdio_api.features is zero [1].
> > Starting from 4.11 the features negotiation is different. Now uffdio_call
> > verifies that it can support features the application requested [2].
>
> Okay.
>
> >> The kernel verifies that it can support the requested API version, and
> >> sets the features and ioctls fields to bit masks representing all the
> >> available features and the generic ioctl(2) operations available. Cur‐
> >> rently, zero (i.e., no feature bits) is placed in the features field.
> >> The returned ioctls field can contain the following bits:
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │This user-space API seems not fully polished. Why │
> >> │are there not constants defined for each of the bit- │
> >> │mask values listed below? │
> >> └─────────────────────────────────────────────────────┘
> >>
> >> 1 << _UFFDIO_API
> >> The UFFDIO_API operation is supported.
> >>
> >> 1 << _UFFDIO_REGISTER
> >> The UFFDIO_REGISTER operation is supported.
> >>
> >> 1 << _UFFDIO_UNREGISTER
> >> The UFFDIO_UNREGISTER operation is supported.
> >
> > Well, I tend to agree. I believe the original intention was to use the
> > OR'ed mask, like UFFD_API_IOCTLS.
> > Andrea, can you add something?
>
> Yes, Andrea, please!
>
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Is the above description of the 'ioctls' field cor‐ │
> >> │rect? Does more need to be said? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > This is correct. I wouldn't add anything else.
>
> Thanks.
>
> >> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> >> and errno is set to indicate the cause of the error. Possible errors
> >> include:
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Is the following error list correct? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > There's also -EFAULT in case copy_{from,to}_user fails.
>
> Okay -- I have added that error.
>
> >>
> >> EINVAL The userfaultfd has already been enabled by a previous UFF‐
> >> DIO_API operation.
> >>
> >> EINVAL The API version requested in the api field is not supported by
> >> this kernel, or the features field was not zero.
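A rough sketch of the handshake as described above, assuming a kernel where features must be zero. The function name `uffd_open` is hypothetical. Since there is no glibc wrapper, the system call is made through syscall(2).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: create a userfaultfd object and perform the
   UFFDIO_API handshake. Returns the file descriptor on success; on
   failure returns -1 with errno set by the failing call. */
int
uffd_open(void)
{
    struct uffdio_api api;
    int uffd, saved;

    /* No glibc wrapper exists, so invoke the system call directly. */
    uffd = (int) syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1)
        return -1;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;     /* Requested API version (input) */
    api.features = 0;       /* Must be zero, per the text above */

    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        saved = errno;      /* Preserve the ioctl() error across close() */
        close(uffd);
        errno = saved;
        return -1;
    }

    return uffd;            /* api.ioctls now holds the available operations */
}
```

Whether the userfaultfd() call itself succeeds depends on the kernel version and on system policy, so callers should expect and handle a -1 return.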
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │In the above error case, the returned 'uffdio_api' │
> >> │structure is zeroed out. Why is this done? This │
> >> │should be explained in the manual page. │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > In my understanding the uffdio_api structure is zeroed to allow the caller
> > to distinguish the reasons for -EINVAL.
>
> Andrea, can you please help here?
>
>
> >> UFFDIO_REGISTER
> >> (Since Linux 4.3.) Register a memory address range with the user‐
> >> faultfd object. The argp argument is a pointer to a uffdio_register
> >> structure, defined as:
> >>
> >> struct uffdio_range {
> >> __u64 start; /* Start of range */
> >> __u64 len; /* Length of range (bytes) */
> >> };
> >>
> >> struct uffdio_register {
> >> struct uffdio_range range;
> >> __u64 mode; /* Desired mode of operation (input) */
> >> __u64 ioctls; /* Available ioctl() operations (output) */
> >> };
> >>
> >>
> >> The range field defines a memory range starting at start and continuing
> >> for len bytes that should be handled by the userfaultfd.
> >>
> >> The mode field defines the mode of operation desired for this memory
> >> region. The following values may be bitwise ORed to set the user‐
> >> faultfd mode for the specified range:
> >>
> >> UFFDIO_REGISTER_MODE_MISSING
> >> Track page faults on missing pages.
> >>
> >> UFFDIO_REGISTER_MODE_WP
> >> Track page faults on write-protected pages.
> >>
> >> Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
> >>
> >> If the operation is successful, the kernel modifies the ioctls bit-mask
> >> field to indicate which ioctl(2) operations are available for the spec‐
> >> ified range. This returned bit mask is as for UFFDIO_API.
> >>
> >> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> >> and errno is set to indicate the cause of the error. Possible errors
> >> include:
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Is the following error list correct? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > Here again, it may be -EFAULT to indicate copy_{from,to}_user failure.
> > And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> > mm_struct has gone by the time userfault grabs it.
>
> Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
> will note the possibility in the page source.
>
> >> EBUSY A mapping in the specified range is registered with another
> >> userfaultfd object.
> >>
> >> EINVAL An invalid or unsupported bit was specified in the mode field;
> >> or the mode field was zero.
> >>
> >> EINVAL There is no mapping in the specified address range.
> >>
> >> EINVAL range.start or range.len is not a multiple of the system page
> >> size; or, range.len is zero; or these fields are otherwise
> >> invalid.
> >>
> >> EINVAL There is an incompatible mapping in the specified address range.
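The page-size rule in the EINVAL list above can be checked in user space before calling ioctl(2). The following is only a sketch: the function name is hypothetical, and the wrap-around test is my own guess at one case of "otherwise invalid", not something the page states. The kernel performs further checks that are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pre-check: return true if 'start' and 'len' are both
   multiples of 'page_size' and 'len' is nonzero. */
bool
uffd_range_is_aligned(uint64_t start, uint64_t len, uint64_t page_size)
{
    if (page_size == 0 || len == 0)
        return false;
    if (start % page_size != 0 || len % page_size != 0)
        return false;

    /* Assumption: a range that wraps around the address space is
       treated as invalid. */
    return start + len > start;
}
```

The page size would normally be obtained with `sysconf(_SC_PAGESIZE)`.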
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Above: What does "incompatible" mean? │
> >> │ │
> >> └─────────────────────────────────────────────────────┘
> >
> > Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
> > MAP_PRIVATE mappings.
>
> Hmmm -- this restriction is not actually mentioned in the description
> of UFFDIO_REGISTER. So, at the start of the description of that operation,
> I've made the text as follows:
>
> [[
> .SS UFFDIO_REGISTER
> (Since Linux 4.3.)
> Register a memory address range with the userfaultfd object.
> The pages in the range must be "compatible".
> In the current implementation,
> .\" According to Mike Rapoport, this will change in Linux 4.11.
> only private anonymous ranges are compatible for registering with
> .BR UFFDIO_REGISTER .
> ]]
>
> Okay?
Yes.
> >> UFFDIO_UNREGISTER
> >> (Since Linux 4.3.) Unregister a memory address range from userfaultfd.
> >> The address range to unregister is specified in the uffdio_range struc‐
> >> ture pointed to by argp.
> >>
> >> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> >> and errno is set to indicate the cause of the error. Possible errors
> >> include:
> >>
> >> EINVAL Either the start or the len field of the uffdio_range structure
> >> was not a multiple of the system page size; or the len field was
> >> zero; or these fields were otherwise invalid.
> >>
> >> EINVAL There was an incompatible mapping in the specified address range.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Above: What does "incompatible" mean? │
> >> └─────────────────────────────────────────────────────┘
> >
> > The same comments as for UFFDIO_REGISTER apply here as well.
>
> Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:
>
> [[
> .SS UFFDIO_UNREGISTER
> (Since Linux 4.3.)
> Unregister a memory address range from userfaultfd.
> The pages in the range must be "compatible" (see the description of
> .BR UFFDIO_REGISTER .)
> ]]
>
> Okay?
Yes.
> >> EINVAL There was no mapping in the specified address range.
> >>
> >> UFFDIO_COPY
> >> (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
> >> userfault registered range and optionally wake up the blocked thread.
> >> The source and destination addresses and the number of bytes to copy
> >> are specified by the src, dst, and len fields of the uffdio_copy struc‐
> >> ture pointed to by argp:
> >>
> >> struct uffdio_copy {
> >> __u64 dst; /* Destination of copy */
> >> __u64 src; /* Source of copy */
> >> __u64 len; /* Number of bytes to copy */
> >> __u64 mode; /* Flags controlling behavior of copy */
> >> __s64 copy; /* Number of bytes copied, or negated error */
> >> };
> >>
> >> The following value may be bitwise ORed in mode to change the behavior
> >> of the UFFDIO_COPY operation:
> >>
> >> UFFDIO_COPY_MODE_DONTWAKE
> >> Do not wake up the thread that waits for page-fault resolution.
> >>
> >> The copy field is used by the kernel to return the number of bytes that
> >> was actually copied, or an error (a negated errno-style value).
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Above: Why is the 'copy' field used to return error │
> >> │values? This should be explained in the manual │
> >> │page. │
> >> └─────────────────────────────────────────────────────┘
> >
> > Andrea, can you help with this one, please?
>
> Yes, Andrea, please.
>
> >> If the value returned in copy doesn't match the value that was speci‐
> >> fied in len, the operation fails with the error EAGAIN. The copy field
> >> is output-only; it is not read by the UFFDIO_COPY operation.
> >>
> >> This ioctl(2) operation returns 0 on success. In this case, the entire
> >> area was copied. On error, -1 is returned and errno is set to indicate
> >> the cause of the error. Possible errors include:
> >>
> >> EAGAIN The number of bytes copied (i.e., the value returned in the copy
> >> field) does not equal the value that was specified in the len
> >> field.
> >>
> >> EINVAL Either dst or len was not a multiple of the system page size, or
> >> the range specified by src and len or dst and len was invalid.
> >>
> >> EINVAL An invalid bit was specified in the mode field.
> >>
> >> UFFDIO_ZEROPAGE
> >> (Since Linux 4.3.) Zero out a memory range registered with user‐
> >> faultfd. The requested range is specified by the range field of the
> >> uffdio_zeropage structure pointed to by argp:
> >>
> >> struct uffdio_zeropage {
> >> struct uffdio_range range;
> >> __u64 mode; /* Flags controlling behavior of zeroing */
> >> __s64 zeropage; /* Number of bytes zeroed, or negated error */
> >> };
> >>
> >> The following value may be bitwise ORed in mode to change the behavior
> >> of the UFFDIO_ZEROPAGE operation:
> >>
> >> UFFDIO_ZEROPAGE_MODE_DONTWAKE
> >> Do not wake up the thread that waits for page-fault resolution.
> >>
> >> The zeropage field is used by the kernel to return the number of bytes
> >> that was actually zeroed, or an error in the same manner as UFF‐
> >> DIO_COPY.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Why is the 'zeropage' field used to return error │
> >> │values? This should be explained in the manual │
> >> │page. │
> >> └─────────────────────────────────────────────────────┘
>
> Help is still needed for this FIXME!
It would be pretty much the same as for the 'copy' field in uffdio_copy...
> >> If the value returned in the zeropage field doesn't match the value
> >> that was specified in range.len, the operation fails with the error
> >> EAGAIN. The zeropage field is output-only; it is not read by the
> >> UFFDIO_ZEROPAGE operation.
> >>
> >> This ioctl(2) operation returns 0 on success. In this case, the entire
> >> area was zeroed. On error, -1 is returned and errno is set to indicate
> >> the cause of the error. Possible errors include:
> >>
> >> EAGAIN The number of bytes zeroed (i.e., the value returned in the
> >> zeropage field) does not equal the value that was specified in
> >> the range.len field.
> >>
> >> EINVAL Either range.start or range.len was not a multiple of the system
> >> page size; or range.len was zero; or the range specified was
> >> invalid.
> >>
> >> EINVAL An invalid bit was specified in the mode field.
> >>
> >> UFFDIO_WAKE
> >> (Since Linux 4.3.) Wake up the thread waiting for page-fault resolu‐
> >> tion on a specified memory address range. The argp argument is a
> >> pointer to a uffdio_range structure (shown above) that specifies the
> >> address range.
> >>
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Need more detail here. How is the UFFDIO_WAKE opera‐ │
> >> │tion used? │
> >> └─────────────────────────────────────────────────────┘
> >
> > The UFFDIO_WAKE operation is used in conjunction with
> > UFFDIO_{COPY,ZEROPAGE} operations that have
> > UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
> > The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
> > batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
>
> Perfect! I've tweaked that text a little and added to the page.
>
> >> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> >> and errno is set to indicate the cause of the error. Possible errors
> >> include:
> >>
> >> EINVAL The start or the len field of the uffdio_range structure was not
> >> a multiple of the system page size; or len was zero; or the
> >> specified range was otherwise invalid.
> >>
> >> RETURN VALUE
> >> See descriptions of the individual operations, above.
> >>
> >> ERRORS
> >> See descriptions of the individual operations, above. In addition, the
> >> following general errors can occur for all of the operations described
> >> above:
> >>
> >> EFAULT argp does not point to a valid memory address.
> >>
> >> EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
> >> has not yet been enabled (via the UFFDIO_API operation).
> >>
> >> CONFORMING TO
> >> These ioctl(2) operations are Linux-specific.
> >>
> >> EXAMPLE
> >> See userfaultfd(2).
> >>
> >> SEE ALSO
> >> ioctl(2), mmap(2), userfaultfd(2)
> >>
> >> Documentation/vm/userfaultfd.txt in the Linux kernel source tree
> >>
> >
> > [1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680
>
> The current version of the two pages has been pushed to
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>
> Cheers,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
>
Hello Mike,
On 04/21/2017 01:06 PM, Mike Rapoport wrote:
> On Fri, Apr 21, 2017 at 08:30:55AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Mike,
>>
>> On 03/21/2017 03:01 PM, Mike Rapoport wrote:
>>> Hello Michael,
>>>
>>> On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
>>>> Hello Andrea, Mike, and all,
>>>>
>>>> Mike: thanks for the page that you sent. I've reworked it
>>>> a bit, and also added a lot of further information,
>>>> and an example program. In the process, I split the page
>>>> into two pieces, with one piece describing the userfaultfd()
>>>> system call and the other describing the ioctl() operations.
>>>>
>>>> I'd like to get review input, especially from you and
>>>> Andrea, but also anyone else, for the current version
>>>> of this page, which includes a few FIXMEs to be sorted.
>>>
>>> Thanks for the update. I'm addressing the FIXME points you've mentioned
>>> below.
>>
>> Thanks!
>>
>>> Otherwise, everything seems to be an accurate description of current upstream.
>>> 4.11 will have quite a few updates to userfault and we'll need to update
>>> this page and ioctl_userfaultfd(2) to address those updates. I am planning
>>> to work on the man update in the next few weeks.
>>>
>>>> I've shown the rendered version of the page below.
>>>> The groff source is attached, and can also be found
>>>> at the branch here:
>>>
>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>>>
>>>> The new ioctl_userfaultfd(2) page follows this mail.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>
>>> --
>>> Sincerely yours,
>>> Mike.
>>>
>>>
>>>> USERFAULTFD(2) Linux Programmer's Manual USERFAULTFD(2)
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Need to describe close(2) semantics for userfaultfd │
>>>> │file descriptor: what happens when the userfaultfd │
>>>> │FD is closed? │
>>>> │ │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> When userfaultfd is closed, it unregisters all memory ranges that were
>>> previously registered with it and flushes the outstanding page fault
>>> events.
>>
>> Presumably, this is more precisely stated as, "when the last
>> file descriptor referring to a userfaultfd object is closed..."?
>
> You are right.
Thanks for the confirmation.
>> I've made the text:
>>
>> When the last file descriptor referring to a userfaultfd object
>> is closed, all memory ranges that were registered with the
>> object are unregistered and unread page-fault events are
>> flushed.
>>
>> [...]
>
> Perfect.
>
[...]
>>>> Each read(2) from the userfaultfd file descriptor returns one
>>>> or more uffd_msg structures, each of which describes a page-
>>>> fault event:
>>>>
>>>> struct uffd_msg {
>>>> __u8 event; /* Type of event */
>>>> ...
>>>> union {
>>>> struct {
>>>> __u64 flags; /* Flags describing fault */
>>>> __u64 address; /* Faulting address */
>>>> } pagefault;
>>>> ...
>>>> } arg;
>>>>
>>>> /* Padding fields omitted */
>>>> } __packed;
>>>>
>>>> If multiple events are available and the supplied buffer is
>>>> large enough, read(2) returns as many events as will fit in the
>>>> supplied buffer. If the buffer supplied to read(2) is smaller
>>>> than the size of the uffd_msg structure, the read(2) fails with
>>>> the error EINVAL.
>>>>
>>>> The fields set in the uffd_msg structure are as follows:
>>>>
>>>> event The type of event. Currently, only one value can appear
>>>> in this field: UFFD_EVENT_PAGEFAULT, which indicates a
>>>> page-fault event.
>>>>
>>>> address
>>>> The address that triggered the page fault.
>>>>
>>>> flags A bit mask of flags that describe the event. For
>>>> UFFD_EVENT_PAGEFAULT, the following flag may appear:
>>>>
>>>> UFFD_PAGEFAULT_FLAG_WRITE
>>>> If the address is in a range that was registered
>>>> with the UFFDIO_REGISTER_MODE_MISSING flag (see
>>>> ioctl_userfaultfd(2)) and this flag is set, this
>>>> is a write fault; otherwise it is a read fault.
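A small sketch of how a monitor might interpret one uffd_msg after read(2), based on the fields described above. The function name is hypothetical, and `page_size` is assumed to be a power of two, as returned by `sysconf(_SC_PAGESIZE)`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: if 'msg' describes a page fault, store the
   page-aligned faulting address and whether it was a write fault,
   and return true. Return false for any other event type. */
bool
uffd_decode_fault(const struct uffd_msg *msg, uint64_t page_size,
                  uint64_t *page_addr, bool *is_write)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return false;

    /* Round the faulting address down to the start of its page. */
    *page_addr = msg->arg.pagefault.address & ~(page_size - 1);

    /* Flag set: write fault; flag clear: read fault. */
    *is_write = (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
    return true;
}
```

The resulting page address is what a handler would typically pass as the destination of UFFDIO_COPY or UFFDIO_ZEROPAGE.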
>>>>
>>>> A read(2) on a userfaultfd file descriptor can fail with the
>>>> following errors:
>>>>
>>>> EINVAL The userfaultfd object has not yet been enabled using
>>>> the UFFDIO_API ioctl(2) operation.
>>>>
>>>> The userfaultfd file descriptor can be monitored with poll(2),
>>>> select(2), and epoll(7). When events are available, the file
>>>> descriptor indicates as readable.
>>>>
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │But, it seems, the object must be created with │
>>>> │O_NONBLOCK. What is the rationale for this require‐ │
>>>> │ment? Something needs to be said in this manual │
>>>> │page. │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> The object can be created without O_NONBLOCK, so probably the above
>>> sentence can be rephrased as:
>>>
>>> When the userfaultfd file descriptor is opened in non-blocking mode, it can
>>> be monitored with ...
>>
>> Yes, but why is there this requirement for poll() etc. with the
>> O_NONBLOCK flag? I think something about that needs to be said in the
>> man page. Sorry, my FIXME was not clear enough. I've reworded the text
>> and the FIXME:
>>
>> If the O_NONBLOCK flag is enabled in the associated open file
>> description, the userfaultfd file descriptor can be monitored
>> with poll(2), select(2), and epoll(7). When events are avail‐
>> able, the file descriptor indicates as readable. If the O_NON‐
>> BLOCK flag is not enabled, then poll(2) (always) indicates the
>> file as having a POLLERR condition, and select(2) indicates the
>> file descriptor as both readable and writable.
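To illustrate the reworded paragraph, here is a minimal poll(2) sketch. The function name is hypothetical. It treats POLLERR as a failure, which per the text above is what a monitor would see on a descriptor that does not have O_NONBLOCK enabled.

```c
#include <poll.h>

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds for 'fd'
   to become readable. Returns 1 if events can be read, 0 on timeout,
   and -1 if poll() fails or reports POLLERR for the descriptor. */
int
uffd_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return n;               /* 0: timeout; -1: error, errno is set */

    if (pfd.revents & POLLERR)
        return -1;              /* For example, O_NONBLOCK not enabled */

    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A return of 1 would be followed by a read(2) of one or more uffd_msg structures.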
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │What is the reason for this seemingly odd behavior │
>> │with respect to the O_NONBLOCK flag? (see user‐ │
>> │faultfd_poll() in fs/userfaultfd.c). Something │
>> │needs to be said about this. │
>> └─────────────────────────────────────────────────────┘
>
> Andrea, can you please help with this one as well?
Let's see what Andrea has to say.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hi Mike,
On 04/21/2017 01:07 PM, Mike Rapoport wrote:
> Hello Michael,
>
> On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Mike,
>> Hello Andrea (we need your help!),
>>
>> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
>>> Hello Michael,
>>>
>>> On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
>>>> Hello Andrea, Mike, and all,
>>>>
>>>> Mike: here's the split out page that describes the
>>>> userfaultfd ioctl() operations.
>>>>
>>>> I'd like to get review input, especially from you and
>>>> Andrea, but also anyone else, for the current version
>>>> of this page, which includes quite a few FIXMEs to be
>>>> sorted.
>>>>
>>>> I've shown the rendered version of the page below.
>>>> The groff source is attached, and can also be found
>>>> at the branch here:
>>>>
>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>>>
>>>> The new ioctl_userfaultfd(2) page follows this mail.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> NAME
>>>> ioctl_userfaultfd - create a file descriptor for handling page faults in user
>>>> space
>>>>
>>>> SYNOPSIS
>>>> #include <sys/ioctl.h>
>>>>
>>>> int ioctl(int fd, int cmd, ...);
>>>>
>>>> DESCRIPTION
>>>> Various ioctl(2) operations can be performed on a userfaultfd object
>>>> (created by a call to userfaultfd(2)) using calls of the form:
>>>>
>>>> ioctl(fd, cmd, argp);
>>>>
>>>> In the above, fd is a file descriptor referring to a userfaultfd
>>>> object, cmd is one of the commands listed below, and argp is a pointer
>>>> to a data structure that is specific to cmd.
>>>>
>>>> The various ioctl(2) operations are described below. The UFFDIO_API,
>>>> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>>>> userfaultfd behavior. These operations allow the caller to choose what
>>>> features will be enabled and what kinds of events will be delivered to
>>>> the application. The remaining operations are range operations. These
>>>> operations enable the calling application to resolve page-fault events
>>>> in a consistent way.
>>>>
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Above: What does "consistent" mean? │
>>>> │ │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> Andrea, can you please help with this one?
>>
>> Let's see what Andrea has to say.
>
> Actually, I thought I'd copied this text from Andrea's docs, but now I've
> found out it was my own wording, and I really don't remember what I
> meant by "consistent" :)
> My guess is that I was thinking about atomicity of UFFDIO_COPY, or the fact
> that from the faulting thread perspective the page fault handling is the
> same whether it's done in kernel or via userfaultfd...
> That said, maybe it'd be better just to drop "in a consistent way".
Okay. Dropped.
>>>> UFFDIO_API
>>>> (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
>>>> handshake. The argp argument is a pointer to a uffdio_api structure,
>>>> defined as:
>>>>
>>>> struct uffdio_api {
>>>> __u64 api; /* Requested API version (input) */
>>>> __u64 features; /* Must be zero */
>>>> __u64 ioctls; /* Available ioctl() operations (output) */
>>>> };
>>>>
>>>> The api field denotes the API version requested by the application.
>>>> Before the call, the features field must be initialized to zero.
>>>>
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Above: Why must the 'features' field be initialized │
>>>> │to zero? │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> Until 4.11 the only supported feature is delegation of missing page fault
>>> and the UFFDIO_FEATURES bitmask is 0.
>>
>> So, the thing that was not clear, but now I think I understand:
>> 'features' is an input field where one can ask about supported features
>> (but none are supported, before Linux 4.11). Is that correct?
>
> Yes.
Thanks.
>> I've changed the text here to read:
>>
>> Before the call, the features field must be initialized
>> to zero. In the future, it is intended that this field can be
>> used to ask whether particular features are supported.
>>
>> Seem okay?
>
> Yes.
> Just the future is only a week or two from today as we are at 4.11-rc7 :)
Yes, I understand :-). So of course there's a *lot* more
new stuff to document, right?
[...]
>>>> UFFDIO_REGISTER
[...]
>>>> EINVAL There is an incompatible mapping in the specified address range.
>>>>
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Above: What does "incompatible" mean? │
>>>> │ │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
>>> MAP_PRIVATE mappings.
>>
>> Hmmm -- this restriction is not actually mentioned in the description
>> of UFFDIO_REGISTER. So, at the start of the description of that operation,
>> I've made the text as follows:
>>
>> [[
>> .SS UFFDIO_REGISTER
>> (Since Linux 4.3.)
>> Register a memory address range with the userfaultfd object.
>> The pages in the range must be "compatible".
>> In the current implementation,
>> .\" According to Mike Rapoport, this will change in Linux 4.11.
>> only private anonymous ranges are compatible for registering with
>> .BR UFFDIO_REGISTER .
>> ]]
>>
>> Okay?
>
> Yes.
Thanks for checking it.
>>>> UFFDIO_UNREGISTER
[...]
>>>> EINVAL There was an incompatible mapping in the specified address range.
>>>>
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Above: What does "incompatible" mean? │
>>>> └─────────────────────────────────────────────────────┘
>>>
>>> The same comments as for UFFDIO_REGISTER apply here as well.
>>
>> Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:
>>
>> [[
>> .SS UFFDIO_UNREGISTER
>> (Since Linux 4.3.)
>> Unregister a memory address range from userfaultfd.
>> The pages in the range must be "compatible" (see the description of
>> .BR UFFDIO_REGISTER .)
>> ]]
>>
>> Okay?
>
> Yes.
Thanks.
[...]
The current version of the two pages has been pushed to
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello Michael,
On Fri, Apr 21, 2017 at 01:41:18PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Mike,
>
[...]
> >
> > Yes.
> > Just the future is only a week or two from today as we are at 4.11-rc7 :)
>
> Yes, I understand :-). So of course there's a *lot* more
> new stuff to document, right?
I've started to add the description of the new functionality to both
userfaultfd.2 and ioctl_userfaultfd.2, and it's somewhat difficult for me to
decide how best to split the information between these two pages and what
the pages' internal structure should be.
I even thought about the possibility of adding a relatively comprehensive
description of userfaultfd as man7/userfaultfd.7 and then keeping the pages
in man2 relatively small, with just a brief description of the APIs and a
SEE ALSO pointing to man7.
Any advice is highly appreciated.
> [...]
--
Sincerely yours,
Mike.
Hi Mike,
On 25 April 2017 at 10:00, Mike Rapoport <[email protected]> wrote:
> Hello Michael,
>
> On Fri, Apr 21, 2017 at 01:41:18PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Mike,
>>
>
> [...]
>
>> >
>> > Yes.
>> > Just the future is only a week or two from today as we are at 4.11-rc7 :)
>>
>> Yes, I understand :-). So of course there's a *lot* more
>> new stuff to document, right?
>
> I've started to add the description of the new functionality to both
> userfaultfd.2 and ioctl_userfaultfd.2
Thanks for doing this!
> and it's somewhat difficult for me to
> decide how best to split the information between these two pages and what
> the pages' internal structure should be.
>
> I even thought about the possibility of adding a relatively comprehensive
> description of userfaultfd as man7/userfaultfd.7 and then keeping the pages
> in man2 relatively small, with just a brief description of the APIs and a
> SEE ALSO pointing to man7.
>
> Any advice is highly appreciated.
I'm not averse to the notion of a userfaultfd.7 page, but it's a
little hard to advise because I'm not sure of the size and scope of
your planned changes.
In the meantime, I've merged the userfaultfd pages into master,
dropped the "draft" branch, and pushed the updates in master to Git.
Can you write your changes as a series of patches, and perhaps first
give a brief outline of the proposed changes before getting too far
into the work? Then we could tweak the direction if needed.
Cheers,
Michael