Date: Wed, 22 Mar 2017 09:54:26 -0400
From: Mike Rapoport
To: "Michael Kerrisk (man-pages)", Andrea Arcangeli
Cc: lkml, "linux-mm@kvack.org", linux-man
Subject: Re: Review request: draft ioctl_userfaultfd(2) manual page
References: <487b2c79-f99b-6d0f-2412-aa75cde65569@gmail.com> <9af29fc6-dce2-f729-0f07-a0bfcc6c3587@gmail.com>
In-Reply-To: <9af29fc6-dce2-f729-0f07-a0bfcc6c3587@gmail.com>
Message-Id: <20170322135423.GB27789@rapoport-lnx>

Hello Michael,

On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
>
> Mike: here's the split out page that describes the
> userfaultfd ioctl() operations.
>
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes quite a few FIXMEs to be
> sorted.
>
> I've shown the rendered version of the page below.
> The groff source is attached, and can also be found
> at the branch here:
>
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>
> The new ioctl_userfaultfd(2) page follows this mail.
>
> Cheers,
>
> Michael
>
> NAME
>        userfaultfd - create a file descriptor for handling page faults in user
>        space
>
> SYNOPSIS
>        #include <sys/ioctl.h>
>
>        int ioctl(int fd, int cmd, ...);
>
> DESCRIPTION
>        Various ioctl(2) operations can be performed on a userfaultfd object
>        (created by a call to userfaultfd(2)) using calls of the form:
>
>            ioctl(fd, cmd, argp);
>
>        In the above, fd is a file descriptor referring to a userfaultfd
>        object, cmd is one of the commands listed below, and argp is a pointer
>        to a data structure that is specific to cmd.
>
>        The various ioctl(2) operations are described below. The UFFDIO_API,
>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>        userfaultfd behavior. These operations allow the caller to choose what
>        features will be enabled and what kinds of events will be delivered to
>        the application. The remaining operations are range operations. These
>        operations enable the calling application to resolve page-fault events
>        in a consistent way.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: What does "consistent" mean?                  │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

Andrea, can you please help with this one?

>    UFFDIO_API
>        (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
>        handshake.
>        The argp argument is a pointer to a uffdio_api structure,
>        defined as:
>
>            struct uffdio_api {
>                __u64 api;        /* Requested API version (input) */
>                __u64 features;   /* Must be zero */
>                __u64 ioctls;     /* Available ioctl() operations (output) */
>            };
>
>        The api field denotes the API version requested by the application.
>        Before the call, the features field must be initialized to zero.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: Why must the 'features' field be initialized  │
>        │to zero?                                             │
>        └─────────────────────────────────────────────────────┘

Before 4.11, the only supported feature was delegation of missing page
faults, and the UFFD_API_FEATURES bitmask was 0. The UFFDIO_API call
checks that the user is not trying to enable any other functionality by
requiring that uffdio_api.features is zero [1]. Starting with 4.11,
feature negotiation is different: the UFFDIO_API call now verifies that
the kernel can support the features the application requested [2].

>        The kernel verifies that it can support the requested API version, and
>        sets the features and ioctls fields to bit masks representing all the
>        available features and the generic ioctl(2) operations available.
>        Currently, zero (i.e., no feature bits) is placed in the features field.
>        The returned ioctls field can contain the following bits:
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │This user-space API seems not fully polished. Why    │
>        │are there not constants defined for each of the bit- │
>        │mask values listed below?                            │
>        └─────────────────────────────────────────────────────┘
>
>        1 << _UFFDIO_API
>               The UFFDIO_API operation is supported.
>
>        1 << _UFFDIO_REGISTER
>               The UFFDIO_REGISTER operation is supported.
>
>        1 << _UFFDIO_UNREGISTER
>               The UFFDIO_UNREGISTER operation is supported.

Well, I tend to agree.
I believe the original intention was to use the OR'ed mask, like
UFFD_API_IOCTLS. Andrea, can you add something?

>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Is the above description of the 'ioctls' field       │
>        │correct? Does more need to be said?                  │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

This is correct. I wouldn't add anything else.

>        This ioctl(2) operation returns 0 on success. On error, -1 is returned
>        and errno is set to indicate the cause of the error. Possible errors
>        include:
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Is the following error list correct?                 │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

There's also -EFAULT in case copy_{from,to}_user fails.

>
>        EINVAL The userfaultfd has already been enabled by a previous
>               UFFDIO_API operation.
>
>        EINVAL The API version requested in the api field is not supported by
>               this kernel, or the features field was not zero.
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │In the above error case, the returned 'uffdio_api'   │
>        │structure is zeroed out. Why is this done? This      │
>        │should be explained in the manual page.              │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

In my understanding, the uffdio_api structure is zeroed to allow the
caller to distinguish the reasons for -EINVAL.

>    UFFDIO_REGISTER
>        (Since Linux 4.3.) Register a memory address range with the
>        userfaultfd object.
>        The argp argument is a pointer to a uffdio_register
>        structure, defined as:
>
>            struct uffdio_range {
>                __u64 start;    /* Start of range */
>                __u64 len;      /* Length of range (bytes) */
>            };
>
>            struct uffdio_register {
>                struct uffdio_range range;
>                __u64 mode;     /* Desired mode of operation (input) */
>                __u64 ioctls;   /* Available ioctl() operations (output) */
>            };
>
>
>        The range field defines a memory range starting at start and continuing
>        for len bytes that should be handled by the userfaultfd.
>
>        The mode field defines the mode of operation desired for this memory
>        region. The following values may be bitwise ORed to set the
>        userfaultfd mode for the specified range:
>
>        UFFDIO_REGISTER_MODE_MISSING
>               Track page faults on missing pages.
>
>        UFFDIO_REGISTER_MODE_WP
>               Track page faults on write-protected pages.
>
>        Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
>
>        If the operation is successful, the kernel modifies the ioctls bit-mask
>        field to indicate which ioctl(2) operations are available for the
>        specified range. This returned bit mask is as for UFFDIO_API.
>
>        This ioctl(2) operation returns 0 on success. On error, -1 is returned
>        and errno is set to indicate the cause of the error. Possible errors
>        include:
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Is the following error list correct?                 │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

Here again, it may be -EFAULT to indicate a copy_{from,to}_user failure.
Also, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and
the mm_struct is gone by the time userfaultfd grabs it.

>        EBUSY  A mapping in the specified range is registered with another
>               userfaultfd object.
>
>        EINVAL An invalid or unsupported bit was specified in the mode field;
>               or the mode field was zero.
>
>        EINVAL There is no mapping in the specified address range.
>
>        EINVAL range.start or range.len is not a multiple of the system page
>               size; or, range.len is zero; or these fields are otherwise
>               invalid.
>
>        EINVAL There is an incompatible mapping in the specified address range.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: What does "incompatible" mean?                │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

Up to 4.10, a userfault context may be registered only for
MAP_ANONYMOUS | MAP_PRIVATE mappings; any other kind of mapping is
"incompatible".

>    UFFDIO_UNREGISTER
>        (Since Linux 4.3.) Unregister a memory address range from userfaultfd.
>        The address range to unregister is specified in the uffdio_range
>        structure pointed to by argp.
>
>        This ioctl(2) operation returns 0 on success. On error, -1 is returned
>        and errno is set to indicate the cause of the error. Possible errors
>        include:
>
>        EINVAL Either the start or the len field of the uffdio_range structure
>               was not a multiple of the system page size; or the len field was
>               zero; or these fields were otherwise invalid.
>
>        EINVAL There was an incompatible mapping in the specified address range.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: What does "incompatible" mean?                │
>        └─────────────────────────────────────────────────────┘

The same comments as for UFFDIO_REGISTER apply here as well.

>        EINVAL There was no mapping in the specified address range.
>
>    UFFDIO_COPY
>        (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
>        userfault registered range and optionally wake up the blocked thread.
>        The source and destination addresses and the number of bytes to copy
>        are specified by the src, dst, and len fields of the uffdio_copy
>        structure pointed to by argp:
>
>            struct uffdio_copy {
>                __u64 dst;    /* Destination of copy */
>                __u64 src;    /* Source of copy */
>                __u64 len;    /* Number of bytes to copy */
>                __u64 mode;   /* Flags controlling behavior of copy */
>                __s64 copy;   /* Number of bytes copied, or negated error */
>            };
>
>        The following value may be bitwise ORed in mode to change the behavior
>        of the UFFDIO_COPY operation:
>
>        UFFDIO_COPY_MODE_DONTWAKE
>               Do not wake up the thread that waits for page-fault resolution.
>
>        The copy field is used by the kernel to return the number of bytes that
>        was actually copied, or an error (a negated errno-style value).
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: Why is the 'copy' field used to return error  │
>        │values? This should be explained in the manual       │
>        │page.                                                │
>        └─────────────────────────────────────────────────────┘

Andrea, can you help with this one, please?

>        If the value returned in copy doesn't match the value that was
>        specified in len, the operation fails with the error EAGAIN. The copy
>        field is output-only; it is not read by the UFFDIO_COPY operation.
>
>        This ioctl(2) operation returns 0 on success. In this case, the entire
>        area was copied. On error, -1 is returned and errno is set to indicate
>        the cause of the error. Possible errors include:
>
>        EAGAIN The number of bytes copied (i.e., the value returned in the copy
>               field) does not equal the value that was specified in the len
>               field.
>
>        EINVAL Either dst or len was not a multiple of the system page size, or
>               the range specified by src and len or dst and len was invalid.
>
>        EINVAL An invalid bit was specified in the mode field.
>
>    UFFDIO_ZEROPAGE
>        (Since Linux 4.3.) Zero out a memory range registered with
>        userfaultfd.
>        The requested range is specified by the range field of the
>        uffdio_zeropage structure pointed to by argp:
>
>            struct uffdio_zeropage {
>                struct uffdio_range range;
>                __u64 mode;       /* Flags controlling behavior of copy */
>                __s64 zeropage;   /* Number of bytes zeroed, or negated error */
>            };
>
>        The following value may be bitwise ORed in mode to change the behavior
>        of the UFFDIO_ZEROPAGE operation:
>
>        UFFDIO_ZEROPAGE_MODE_DONTWAKE
>               Do not wake up the thread that waits for page-fault resolution.
>
>        The zeropage field is used by the kernel to return the number of bytes
>        that was actually zeroed, or an error in the same manner as
>        UFFDIO_COPY.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Why is the 'zeropage' field used to return error     │
>        │values? This should be explained in the manual       │
>        │page.                                                │
>        └─────────────────────────────────────────────────────┘
>        If the value returned in the zeropage field doesn't match the value
>        that was specified in range.len, the operation fails with the error
>        EAGAIN. The zeropage field is output-only; it is not read by the
>        UFFDIO_ZEROPAGE operation.
>
>        This ioctl(2) operation returns 0 on success. In this case, the entire
>        area was zeroed. On error, -1 is returned and errno is set to indicate
>        the cause of the error. Possible errors include:
>
>        EAGAIN The number of bytes zeroed (i.e., the value returned in the
>               zeropage field) does not equal the value that was specified in
>               the range.len field.
>
>        EINVAL Either range.start or range.len was not a multiple of the system
>               page size; or range.len was zero; or the range specified was
>               invalid.
>
>        EINVAL An invalid bit was specified in the mode field.
>
>    UFFDIO_WAKE
>        (Since Linux 4.3.) Wake up the thread waiting for page-fault
>        resolution on a specified memory address range. The argp argument is a
>        pointer to a uffdio_range structure (shown above) that specifies the
>        address range.
>
>
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Need more detail here. How is the UFFDIO_WAKE        │
>        │operation used?                                      │
>        └─────────────────────────────────────────────────────┘

The UFFDIO_WAKE operation is used in conjunction with
UFFDIO_{COPY,ZEROPAGE} operations that have the
UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field. The
userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.

>        This ioctl(2) operation returns 0 on success. On error, -1 is returned
>        and errno is set to indicate the cause of the error. Possible errors
>        include:
>
>        EINVAL The start or the len field of the uffdio_range structure was not
>               a multiple of the system page size; or len was zero; or the
>               specified range was otherwise invalid.
>
> RETURN VALUE
>        See descriptions of the individual operations, above.
>
> ERRORS
>        See descriptions of the individual operations, above. In addition, the
>        following general errors can occur for all of the operations described
>        above:
>
>        EFAULT argp does not point to a valid memory address.
>
>        EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
>               has not yet been enabled (via the UFFDIO_API operation).
>
> CONFORMING TO
>        These ioctl(2) operations are Linux-specific.
>
> EXAMPLE
>        See userfaultfd(2).
>
> SEE ALSO
>        ioctl(2), mmap(2), userfaultfd(2)
>
>        Documentation/vm/userfaultfd.txt in the Linux kernel source tree
>

[1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680
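
P.S. For the UFFDIO_WAKE question, a rough sketch of the batching pattern
described above might look like the following. This is only an
illustration, not text proposed for the page: the helper name
resolve_batch() and its parameters are made up here, and it assumes that
dst and page_size are page-aligned and that the destination range is
already registered with the userfaultfd.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper (not part of any kernel or libc API): fill
 * `npages` pages at `dst` from `src` without waking the faulting
 * thread, then wake the whole range once with UFFDIO_WAKE.
 * Returns 0 on success, or -1 with errno set by the failing ioctl().
 */
static int resolve_batch(int uffd, uint64_t dst, const void *src,
                         size_t page_size, size_t npages)
{
    for (size_t i = 0; i < npages; i++) {
        struct uffdio_copy copy = {
            .dst  = dst + i * page_size,
            .src  = (uint64_t)(uintptr_t)src + i * page_size,
            .len  = page_size,
            /* Defer the wake-up until the whole batch is in place. */
            .mode = UFFDIO_COPY_MODE_DONTWAKE,
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            return -1;
    }

    /* One explicit wake-up covering everything copied above. */
    struct uffdio_range range = {
        .start = dst,
        .len   = npages * page_size,
    };

    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

A monitor would call this with the descriptor returned by userfaultfd(2)
after the UFFDIO_API and UFFDIO_REGISTER steps. The sketch stops at the
first failing UFFDIO_COPY and leaves it to the caller to inspect errno.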