2015-02-13 09:11:44

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

Hi all,

This is the updated series for the new epoll system calls, with the cover
letter rewritten which includes some more explanation. Comments are very
welcome!

Original Motivation
===================

QEMU, and probably many more select/poll based applications, will consider
epoll as an alternative, when its event loop needs to handle a big number of
fds. However, there are currently two concerns with epoll which prevents the
switching from ppoll to epoll.

The major one is the timeout precision.

For example in QEMU, the main loop takes care of calling callbacks at a
specific timeout - the QEMU timer API. The timeout value in ppoll depends on
the next firing timer. epoll_pwait's millisecond timeout is so coarse that
rounding up the timeout will hurt performance badly.

The minor one is the number of system call to update fd set.

While epoll can handle a large number of fds quickly, it still requires one
epoll_ctl per fd update, compared to the one-shot call to select/poll with an
fd array. This may as well make epoll inferior to ppoll in the cases where a
small, but frequently changing set of fds are polled by the event loop.

This series introduces two new epoll APIs to address them respectively. The
idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
suggested clockid as a parameter in epoll_pwait1.

Discussion
==========

[Note: This is the question part regarding the interface contract of
epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
please skip this part and probably start with the man page style documentation.
You can resume to this section later.]

[Thanks to Omar Sandoval <[email protected]>, who pointed out this in
reviewing v1]

We try to report status for each command in epoll_ctl_batch, by writting to
user provided command array (pointed to cmds). The tricky thing in the
implementation is that, copying the results back to userspace comes last, after
the commands are executed. At this point, if the copy_to_user fails, the
effects are done and no return - or if we add some mechanism to revert it, the
code will be too complicated and slow.

In above corner case, the return value of epoll_ctl_batch is smaller than
ncmds, which assures our caller that last N commands failed, where N = ncmds -
ret. But they'll also find that cmd.result is not changed to error code.

I suppose this is not a big deal, because 1) it should happen very rarely. 2)
user does know the actual number of commands that succeed.

So, do we leave it as is? Or is there any way to improve?

One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
before we execute the commands. If it succeeds, it's even less likely the last
copy_to_user could fail, so that we can even probably assert it won't. The
testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
performance impact, especially when @ncmds is big.

Links
=====

[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542

Changelog
=========

Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------

- Rename epoll_ctl_cmd.error_hint to "result". [Michael]

- Add background introduction in cover letter. [Michael]

- Expand the last struct of epoll_pwait1, add clockid and timespec.

- Update man page in cover letter accordingly:

* "error_hint" -> "result".
* The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.

Please review!

Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------

- As discussed in previous thread [1], split the call to epoll_ctl_batch and
epoll_pwait. [Michael]

- Fix memory leaks. [Omar]

- Add a short comment about the ignored copy_to_user failure. [Omar]

- Cover letter rewritten.

Documentation of the new system calls
=====================================

[I tried to write in the familiar man page style, but I am not proficient on
this. Thanks for Michael's help and suggestions in helping me improve it
through previous discussions.]

1) epoll_ctl_batch
------------------

NAME
epoll_ctl_batch - modify an epoll descriptor in batch

SYNOPSIS

#include <sys/epoll.h>

int epoll_ctl_batch(int epfd, int flags,
int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

The system call performs a batch of epoll_ctl operations. It allows
efficient update of events on this epoll descriptor.

Flags is reserved and must be 0.

Each operation is specified as an element in the cmds array, defined as:

struct epoll_ctl_cmd {

/* Reserved flags for future extension, must be 0. */
int flags;

/* The same as epoll_ctl() op parameter. */
int op;

/* The same as epoll_ctl() fd parameter. */
int fd;

/* The same as the "events" field in struct epoll_event. */
uint32_t events;

/* The same as the "data" field in struct epoll_event. */
uint64_t data;

/* Output field, will be set to the return code after this
* command is executed by kernel */
int result;
};

Commands are executed in their order in cmds, and if one of them failed,
the rest after it are not tried.

Not that this call isn't atomic in terms of updating the epoll
descriptor, which means a second epoll_ctl or epoll_ctl_batch call
during the first epoll_ctl_batch may make the operation sequence
interleaved. However, each single epoll_ctl_cmd operation has the same
semantics as a epoll_ctl call.

RETURN VALUE

If one or more of the parameters are incorrect, -1 is returned and errno
is set appropriately. Otherwise, the number of succeeded commands is
returned.

Each 'result' field may be set to the error code or 0, depending on the
result of the command. If the kernel fails to set the field after the
commands are completed or failed, it may also be unchanged, even though
the effects of successful commands are done. In this case, it's still
ensured that 1) the i-th (i = ret) command is the failed command; 2) all
the preceeding commands are successfully executed; 3) all the subsequent
ones are not executed.

ERRORS

Errors for the overall system call (in errno) are:

EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
cmds is NULL.

ENOMEM There was insufficient memory to handle the requested op control
operation.

EFAULT The memory area pointed to by cmds is not accessible with write
permissions.


Errors for each command (in result field) are:

EBADF epfd or fd is not a valid file descriptor.

EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
already registered with this epoll instance.

EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
or the requested operation op is not supported by this interface.

ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
with this epoll instance.

ENOMEM There was insufficient memory to handle the requested op control
operation.

ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
encountered while trying to register (EPOLL_CTL_ADD) a new file
descriptor on an epoll instance. See epoll(7) for further
details.

EPERM The target file fd does not support epoll.

CONFORMING TO

epoll_ctl_batch() is Linux-specific.

SEE ALSO

epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

#include <sys/epoll.h>

int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct epoll_wait_params *params);

DESCRIPTION

The epoll_pwait1 system call differs from epoll_pwait only in
parameter list. The epfd, events and maxevents parameters are the same
as in epoll_wait and epoll_pwait. The flags and params are new.

The flags is reserved and must be zero.

The params is a pointer to a structure parameters for the epoll_pwait1,
defined as:

struct epoll_wait_params {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
size_t sigsetsize;
};

The field clockid must be either CLOCK_REALTIME or CLOCK_MONOTONIC, to
choose the clock type to use for timeout. Note that CLOCK_MONOTONIC is
the implicit and only possible clock type for epoll_pwait.

The timeout field specifies the minimum time that epoll_wait() will
block. (The effective length will be rounded up to the clock
granularity, and kernel scheduling delays mean that the blocking
interval may overrun by a small amount.) Specifying a nagative time
lenght (for example, timeout.tv_sec = 0 and timeout.tv_nsec = -1, or the
other way round) causes epoll_pwait1() to block indefinitely, while
specifying a timeout equal to zero (both fields in timeout are zero)
causes epoll_wait() to return immediately, even if no events are
available.

The sigmask and sigsetsize has the same semantics as epoll_pwait. The
sigmask field may be specified as NULL, in which case epoll_pwait1()
will behave like epoll_wait.

User visibility of sigsetsize

In epoll_pwait and other syscalls, sigsetsize is not visible to
application developer as glibc has a wrapper around epoll_pwait system
call. Now that we pack several parameters in epoll_wait_params, so in
order to hide sigsetsize from application code, we can still wrap it,
either by expanding parameters and build the structure in the wrapper
function, or only ask application to provide the first half:

struct epoll_wait_params_user {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
};

Then in the wrapper function copy to a full structure and fill in
sigsetsize.

RETURN VALUE

The same as said in epoll_pwait(2).

ERRORS

The same as said in man epoll_pwait(2), plus:

EINVAL flags is not zero, or clockid is neither CLOCK_REALTIME nor
CLOCK_MONOTONIC.

EFAULT The memory area pointed to by params is not accessible.


CONFORMING TO

epoll_pwait1() is Linux-specific.

SEE ALSO

epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (7):
epoll: Extract epoll_wait_do and epoll_pwait_do
epoll: Specify clockid explicitly
epoll: Extract ep_ctl_do
epoll: Add implementation for epoll_ctl_batch
x86: Hook up epoll_ctl_batch syscall
epoll: Add implementation for epoll_pwait1
x86: Hook up epoll_pwait1 syscall

arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
fs/eventpoll.c | 258 +++++++++++++++++++++++++--------------
include/linux/syscalls.h | 9 ++
include/uapi/linux/eventpoll.h | 18 +++
5 files changed, 199 insertions(+), 90 deletions(-)

--
1.9.3


2015-02-13 09:05:38

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do

In preparation of new epoll syscalls, this patch allows reusing the code from
epoll_pwait implementation. The new functions uses ktime_t for more accuracy.

Signed-off-by: Fam Zheng <[email protected]>
---
fs/eventpoll.c | 162 ++++++++++++++++++++++++++-------------------------------
1 file changed, 75 insertions(+), 87 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}

-static inline struct timespec ep_set_mstimeout(long ms)
-{
- struct timespec now, ts = {
- .tv_sec = ms / MSEC_PER_SEC,
- .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
- };
-
- ktime_get_ts(&now);
- return timespec_add_safe(now, ts);
-}
-
/**
* ep_poll - Retrieves ready events, and delivers them to the caller supplied
* event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
* @events: Pointer to the userspace buffer where the ready events should be
* stored.
* @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- * milliseconds. If the @timeout is zero, the function will not block,
- * while if the @timeout is less than zero, the function will block
- * until at least one event has been retrieved (or an error
- * occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation. If 0, the
+ * function will not block. If negative, the function will block until
+ * at least one event has been retrieved (or an error occurred).
*
* Returns: Returns the number of ready events which have been fetched, or an
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, long timeout)
+ int maxevents, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;

- if (timeout > 0) {
- struct timespec end_time = ep_set_mstimeout(timeout);
-
- slack = select_estimate_accuracy(&end_time);
- to = &expires;
- *to = timespec_to_ktime(end_time);
- } else if (timeout == 0) {
+ if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
goto check_events;
+ } else if (ktime_to_ns(timeout) > 0) {
+ struct timespec now, end_time;
+
+ ktime_get_ts(&now);
+ end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+ slack = select_estimate_accuracy(&end_time);
+ to = &expires;
+ *to = timespec_to_ktime(end_time);
}

fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
return error;
}

-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
- int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, const ktime_t timeout)
{
int error;
struct fd f;
@@ -2002,91 +1988,93 @@ error_fput:

/*
* Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_wait(2).
+ */
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+ int, maxevents, int, timeout)
+{
+ ktime_t kt = ms_to_ktime(timeout);
+ return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, ktime_t timeout,
+ sigset_t *sigmask, size_t sigsetsize)
+{
+ int error;
+ sigset_t sigsaved;
+
+ /*
+ * If the caller wants a certain signal mask to be set during the wait,
+ * we apply it here.
+ */
+ if (sigmask) {
+ sigsaved = current->blocked;
+ set_current_blocked(sigmask);
+ }
+
+ error = epoll_wait_do(epfd, events, maxevents, timeout);
+
+ /*
+ * If we changed the signal mask, we need to restore the original one.
+ * In case we've got a signal while waiting, we do not restore the
+ * signal mask yet, and we allow do_signal() to deliver the signal on
+ * the way back to userspace, before the signal mask is restored.
+ */
+ if (sigmask) {
+ if (error == -EINTR) {
+ memcpy(&current->saved_sigmask, &sigsaved,
+ sizeof(sigsaved));
+ set_restore_sigmask();
+ } else
+ set_current_blocked(&sigsaved);
+ }
+
+ return error;
+}
+
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
* part of the user space epoll_pwait(2).
*/
SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout, const sigset_t __user *, sigmask,
size_t, sigsetsize)
{
- int error;
- sigset_t ksigmask, sigsaved;
+ ktime_t kt = ms_to_ktime(timeout);
+ sigset_t ksigmask;

- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(sigset_t))
return -EINVAL;
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}
-
- error = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (error == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return error;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}

#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
- struct epoll_event __user *, events,
- int, maxevents, int, timeout,
- const compat_sigset_t __user *, sigmask,
- compat_size_t, sigsetsize)
+ struct epoll_event __user *, events,
+ int, maxevents, int, timeout,
+ const compat_sigset_t __user *, sigmask,
+ compat_size_t, sigsetsize)
{
- long err;
compat_sigset_t csigmask;
- sigset_t ksigmask, sigsaved;
+ sigset_t ksigmask;
+ ktime_t kt = ms_to_ktime(timeout);

- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(compat_sigset_t))
return -EINVAL;
if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
return -EFAULT;
sigset_from_compat(&ksigmask, &csigmask);
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}

- err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (err == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return err;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#endif

--
1.9.3

2015-02-13 09:05:52

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 2/7] epoll: Specify clockid explicitly

Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.

Signed-off-by: Fam Zheng <[email protected]>
---
fs/eventpoll.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..5840b03 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1578,6 +1578,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;

+ if (clockid != CLOCK_MONOTONIC && clockid != CLOCK_REALTIME)
+ return -EINVAL;
if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
@@ -1624,7 +1626,8 @@ fetch_events:
}

spin_unlock_irqrestore(&ep->lock, flags);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!schedule_hrtimeout_range_clock(to, slack,
+ HRTIMER_MODE_ABS, clockid))
timed_out = 1;

spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1948,8 @@ error_return:
}

static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid,
+ const ktime_t timeout)
{
int error;
struct fd f;
@@ -1979,7 +1983,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
ep = f.file->private_data;

/* Time to fish for events ... */
- error = ep_poll(ep, events, maxevents, timeout);
+ error = ep_poll(ep, events, maxevents, clockid, timeout);

error_fput:
fdput(f);
@@ -1994,12 +1998,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
{
ktime_t kt = ms_to_ktime(timeout);
- return epoll_wait_do(epfd, events, maxevents, kt);
+ return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
}

static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
- int maxevents, ktime_t timeout,
- sigset_t *sigmask, size_t sigsetsize)
+ int maxevents,
+ int clockid, ktime_t timeout,
+ sigset_t *sigmask)
{
int error;
sigset_t sigsaved;
@@ -2013,7 +2018,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
set_current_blocked(sigmask);
}

- error = epoll_wait_do(epfd, events, maxevents, timeout);
+ error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);

/*
* If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2055,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}

#ifdef CONFIG_COMPAT
@@ -2073,8 +2078,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
sigset_from_compat(&ksigmask, &csigmask);
}

- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#endif

--
1.9.3

2015-02-13 09:06:13

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_ctl_wait.

Signed-off-by: Fam Zheng <[email protected]>
---
fs/eventpoll.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5840b03..6a2b0a4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1810,22 +1810,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
- struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
{
int error;
int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
- struct epoll_event epds;
struct eventpoll *tep = NULL;

- error = -EFAULT;
- if (ep_op_has_event(op) &&
- copy_from_user(&epds, event, sizeof(struct epoll_event)))
- goto error_return;
-
error = -EBADF;
f = fdget(epfd);
if (!f.file)
@@ -1947,6 +1940,23 @@ error_return:
return error;
}

+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ struct epoll_event __user *, event)
+{
+ struct epoll_event epds;
+
+ if (ep_op_has_event(op) &&
+ copy_from_user(&epds, event, sizeof(struct epoll_event)))
+ return -EFAULT;
+
+ return ep_ctl_do(epfd, op, fd, epds);
+}
+
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
int maxevents, int clockid,
const ktime_t timeout)
--
1.9.3

2015-02-13 09:06:29

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch

This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.

Signed-off-by: Fam Zheng <[email protected]>
---
fs/eventpoll.c | 48 ++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 ++++
include/uapi/linux/eventpoll.h | 11 ++++++++++
3 files changed, 63 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6a2b0a4..12e2e63 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2069,6 +2069,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
sigmask ? &ksigmask : NULL);
}

+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+ int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+ struct epoll_ctl_cmd *kcmds = NULL;
+ int i, r, ret = 0;
+ int cmd_size;
+
+ if (flags)
+ return -EINVAL;
+ if (ncmds <= 0 || !cmds)
+ return -EINVAL;
+ cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+ kcmds = kmalloc(cmd_size, GFP_KERNEL);
+ if (!kcmds)
+ return -ENOMEM;
+ if (copy_from_user(kcmds, cmds, cmd_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < ncmds; i++) {
+ struct epoll_event ev = (struct epoll_event) {
+ .events = kcmds[i].events,
+ .data = kcmds[i].data,
+ };
+ if (kcmds[i].flags) {
+ kcmds[i].result = -EINVAL;
+ goto copy;
+ }
+ kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+ kcmds[i].fd, ev);
+ if (kcmds[i].result)
+ goto copy;
+ ret++;
+ }
+copy:
+ r = copy_to_user(cmds, kcmds,
+ sizeof(struct epoll_ctl_cmd) * ncmds);
+ /* Failing to copy the command results back will leave
+ * userspace no way to know the actual error code, but we still
+ * report the number of succeeded commands with ret, so it's
+ * not a big problem. Ignore it for now.
+ */
+ (void) r;
+out:
+ kfree(kcmds);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
#define _LINUX_SYSCALLS_H

struct epoll_event;
+struct epoll_ctl_cmd;
struct iattr;
struct inode;
struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+ int ncmds,
+ struct epoll_ctl_cmd __user *cmds);
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_sethostname(char __user *name, int len);
asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
#include <linux/fcntl.h>
#include <linux/types.h>

+#include <linux/signal.h>
+
/* Flags for epoll_create1. */
#define EPOLL_CLOEXEC O_CLOEXEC

@@ -61,6 +63,15 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;

+struct epoll_ctl_cmd {
+ int flags;
+ int op;
+ int fd;
+ __u32 events;
+ __u64 data;
+ int result;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3

2015-02-13 09:06:40

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall

Signed-off-by: Fam Zheng <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 epoll_ctl_batch sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 64 epoll_ctl_batch sys_epoll_ctl_batch

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3

2015-02-13 09:07:09

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1

This is the new implementation for poll which has a flags parameter and
packs a number of parameters into a structure.

The main advantage of it over existing epoll_pwait is about timeout:
epoll_pwait expects a relative millisecond value, while epoll_pwait1
accepts 1) a timespec which is in nanosecond granularity; 2) a clockid
to allow using a clock other than CLOCK_MONOTONIC.

The 'flags' field in params is reserved for now and must be zero. The
next step would be allowing absolute timeout value.

Signed-off-by: Fam Zheng <[email protected]>
---
fs/eventpoll.c | 27 +++++++++++++++++++++++++++
include/linux/syscalls.h | 5 +++++
include/uapi/linux/eventpoll.h | 7 +++++++
3 files changed, 39 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 12e2e63..cc51e8c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2117,6 +2117,33 @@ out:
return ret;
}

+SYSCALL_DEFINE5(epoll_pwait1, int, epfd, int, flags,
+ struct epoll_event __user *, events,
+ int, maxevents,
+ struct epoll_wait_params __user *, params)
+{
+ struct epoll_wait_params p;
+ ktime_t kt = { 0 };
+ sigset_t sigmask;
+
+ if (flags)
+ return -EINVAL;
+ if (!params)
+ return -EINVAL;
+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
+ if (p.sigmask) {
+ if (copy_from_user(&sigmask, p.sigmask, sizeof(sigmask)))
+ return -EFAULT;
+ if (p.sigsetsize != sizeof(p.sigmask))
+ return -EINVAL;
+ }
+ kt = timespec_to_ktime(p.timeout);
+
+ return epoll_pwait_do(epfd, events, maxevents, p.clockid,
+ kt, p.sigmask ? &sigmask : NULL);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7d784e3..a4823d9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -13,6 +13,7 @@

struct epoll_event;
struct epoll_ctl_cmd;
+struct epoll_wait_params;
struct iattr;
struct inode;
struct iocb;
@@ -635,6 +636,10 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+ struct epoll_event __user *events,
+ int maxevents,
+ struct epoll_wait_params __user *params);
asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
int ncmds,
struct epoll_ctl_cmd __user *cmds);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 4e18b17..d35c591 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -72,6 +72,13 @@ struct epoll_ctl_cmd {
int result;
} EPOLL_PACKED;

+struct epoll_wait_params {
+ int clockid;
+ struct timespec timeout;
+ sigset_t *sigmask;
+ size_t sigsetsize;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3

2015-02-13 09:07:04

by Fam Zheng

[permalink] [raw]
Subject: [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall

Signed-off-by: Fam Zheng <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
359 i386 epoll_ctl_batch sys_epoll_ctl_batch
+360 i386 epoll_pwait1 sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
321 common bpf sys_bpf
322 64 execveat stub_execveat
323 64 epoll_ctl_batch sys_epoll_ctl_batch
+324 64 epoll_pwait1 sys_epoll_pwait1

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3

2015-02-13 09:54:08

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
> Hi all,
>
> This is the updated series for the new epoll system calls, with the cover
> letter rewritten which includes some more explanation. Comments are very
> welcome!
>
> Original Motivation
> ===================
>
> QEMU, and probably many more select/poll based applications, will consider
> epoll as an alternative, when its event loop needs to handle a big number of
> fds. However, there are currently two concerns with epoll which prevents the
> switching from ppoll to epoll.
>
> The major one is the timeout precision.
>
> For example in QEMU, the main loop takes care of calling callbacks at a
> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
> rounding up the timeout will hurt performance badly.
>
> The minor one is the number of system call to update fd set.
>
> While epoll can handle a large number of fds quickly, it still requires one
> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
> fd array. This may as well make epoll inferior to ppoll in the cases where a
> small, but frequently changing set of fds are polled by the event loop.
>
> This series introduces two new epoll APIs to address them respectively. The
> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
> suggested clockid as a parameter in epoll_pwait1.
>
> Discussion
> ==========
>
> [Note: This is the question part regarding the interface contract of
> epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
> please skip this part and probably start with the man page style documentation.
> You can resume to this section later.]
>
> [Thanks to Omar Sandoval <[email protected]>, who pointed out this in
> reviewing v1]
>
> We try to report status for each command in epoll_ctl_batch, by writting to
> user provided command array (pointed to cmds). The tricky thing in the
> implementation is that, copying the results back to userspace comes last, after
> the commands are executed. At this point, if the copy_to_user fails, the
> effects are done and no return - or if we add some mechanism to revert it, the
> code will be too complicated and slow.
>
> In above corner case, the return value of epoll_ctl_batch is smaller than
> ncmds, which assures our caller that last N commands failed, where N = ncmds -
> ret. But they'll also find that cmd.result is not changed to error code.
>
> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
> user does know the actual number of commands that succeed.
>
> So, do we leave it as is? Or is there any way to improve?
>
> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
> before we execute the commands. If it succeeds, it's even less likely the last
> copy_to_user could fail, so that we can even probably assert it won't. The
> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
> performance impact, especially when @ncmds is big.
>
I don't think this is the right thing to do, since, for example, another
thread could unmap the memory region holding buffer between the "check"
copy_to_user and the actual one.

The two alternatives that I see are:

1. If copy_to_user fails, ignore it and return the number of commands
that succeeded (i.e., what the code in your patch does).
2. If copy_to_user fails, return -EFAULT, regardless of how many
commands succeeded.

The problem with 1 is that it could potentially mask bugs in a user
program. You could imagine a buggy program that passes a read-only
buffer to epoll_ctl_batch and never finds out about it because they
don't get the error. (Then, when there's a real error in one of the
epoll_ctl_cmds, but .result is 0, someone will be very confused.)

So I think 2 is the better option. Sure, the user will have no idea how
many commands were executed, but when would EFAULT on this system call
be part of normal operation, anyways? You'll find the memory bug, fix
it, and rest easy knowing that nothing is silently failing behind your
back.

> Links
> =====
>
> [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
>
> Changelog
> =========
>
> Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
> ----------------------------------------------------
>
> - Rename epoll_ctl_cmd.error_hint to "result". [Michael]
>
> - Add background introduction in cover letter. [Michael]
>
> - Expand the last struct of epoll_pwait1, add clockid and timespec.
>
> - Update man page in cover letter accordingly:
>
> * "error_hint" -> "result".
> * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.
>
> Please review!
>
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
> -----------------------------------------------------
>
> - As discussed in previous thread [1], split the call to epoll_ctl_batch and
> epoll_pwait. [Michael]
>
> - Fix memory leaks. [Omar]
>
> - Add a short comment about the ignored copy_to_user failure. [Omar]
>
> - Cover letter rewritten.
>
[snip]

--
Omar

2015-02-14 00:06:47

by Dan Rosenberg

[permalink] [raw]
Subject: Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch


> + if (ncmds <= 0 || !cmds)
> + return -EINVAL;
> + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> + kcmds = kmalloc(cmd_size, GFP_KERNEL);
You should probably fix the integer overflow in the calculation of the
cmd_size variable, unless you like root vulnerabilities.

2015-02-15 05:31:59

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch

On Fri, 02/13 19:06, Dan Rosenberg wrote:
>
> > + if (ncmds <= 0 || !cmds)
> > + return -EINVAL;
> > + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> > + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> You should probably fix the integer overflow in the calculation of the
> cmd_size variable, unless you like root vulnerabilities.
>

Thanks! In the case of multiply overflow, we allocate a buffer that is smaller
than we think, and consequent writings will corrupt kernel memory after it.
That is the root vulnerabilities here. Will fix!

Fam

2015-02-15 06:44:51

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On Fri, 02/13 01:53, Omar Sandoval wrote:
<snip>
> > Discussion
> > ==========
> >
> > [Note: This is the question part regarding the interface contract of
> > epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
> > please skip this part and probably start with the man page style documentation.
> > You can resume to this section later.]
> >
> > [Thanks to Omar Sandoval <[email protected]>, who pointed out this in
> > reviewing v1]
> >
> > We try to report status for each command in epoll_ctl_batch, by writting to
> > user provided command array (pointed to cmds). The tricky thing in the
> > implementation is that, copying the results back to userspace comes last, after
> > the commands are executed. At this point, if the copy_to_user fails, the
> > effects are done and no return - or if we add some mechanism to revert it, the
> > code will be too complicated and slow.
> >
> > In above corner case, the return value of epoll_ctl_batch is smaller than
> > ncmds, which assures our caller that last N commands failed, where N = ncmds -
> > ret. But they'll also find that cmd.result is not changed to error code.
> >
> > I suppose this is not a big deal, because 1) it should happen very rarely. 2)
> > user does know the actual number of commands that succeed.
> >
> > So, do we leave it as is? Or is there any way to improve?
> >
> > One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
> > before we execute the commands. If it succeeds, it's even less likely the last
> > copy_to_user could fail, so that we can even probably assert it won't. The
> > testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
> > right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
> > performance impact, especially when @ncmds is big.
> >
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
>
> The two alternatives that I see are:
>
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
>
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
>
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.
>

OK, I agree with you here. I'm going to change this to -EFAULT in the next
revision.

Thanks!
Fam

Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> This is the updated series for the new epoll system calls, with the cover
>> letter rewritten which includes some more explanation. Comments are very
>> welcome!
>>
>> Original Motivation
>> ===================
>>
>> QEMU, and probably many more select/poll based applications, will consider
>> epoll as an alternative, when its event loop needs to handle a big number of
>> fds. However, there are currently two concerns with epoll which prevents the
>> switching from ppoll to epoll.
>>
>> The major one is the timeout precision.
>>
>> For example in QEMU, the main loop takes care of calling callbacks at a
>> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
>> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
>> rounding up the timeout will hurt performance badly.
>>
>> The minor one is the number of system call to update fd set.
>>
>> While epoll can handle a large number of fds quickly, it still requires one
>> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
>> fd array. This may as well make epoll inferior to ppoll in the cases where a
>> small, but frequently changing set of fds are polled by the event loop.
>>
>> This series introduces two new epoll APIs to address them respectively. The
>> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
>> suggested clockid as a parameter in epoll_pwait1.
>>
>> Discussion
>> ==========
>>
>> [Note: This is the question part regarding the interface contract of
>> epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
>> please skip this part and probably start with the man page style documentation.
>> You can resume to this section later.]
>>
>> [Thanks to Omar Sandoval <[email protected]>, who pointed out this in
>> reviewing v1]
>>
>> We try to report status for each command in epoll_ctl_batch, by writting to
>> user provided command array (pointed to cmds). The tricky thing in the
>> implementation is that, copying the results back to userspace comes last, after
>> the commands are executed. At this point, if the copy_to_user fails, the
>> effects are done and no return - or if we add some mechanism to revert it, the
>> code will be too complicated and slow.
>>
>> In above corner case, the return value of epoll_ctl_batch is smaller than
>> ncmds, which assures our caller that last N commands failed, where N = ncmds -
>> ret. But they'll also find that cmd.result is not changed to error code.
>>
>> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
>> user does know the actual number of commands that succeed.
>>
>> So, do we leave it as is? Or is there any way to improve?
>>
>> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
>> before we execute the commands. If it succeeds, it's even less likely the last
>> copy_to_user could fail, so that we can even probably assert it won't. The
>> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
>> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
>> performance impact, especially when @ncmds is big.
>>
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
>
> The two alternatives that I see are:
>
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
>
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
>
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.

What Omar says makes sense to me too. Best to have the user get a clear
error indication for this case.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2015-02-15 22:00:16

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On Fri, 13 Feb 2015 17:03:56 +0800
Fam Zheng <[email protected]> wrote:

> SYNOPSIS
>
> #include <sys/epoll.h>
>
> int epoll_pwait1(int epfd, int flags,
> struct epoll_event *events,
> int maxevents,
> struct epoll_wait_params *params);

Quick, possibly dumb question: might it make sense to also pass in
sizeof(struct epoll_wait_params)? That way, when somebody wants to add
another parameter in the future, the kernel can tell which version is in
use and they won't have to do an epoll_pwait2()?

Thanks,

jon

2015-02-16 01:03:38

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On Sun, 02/15 15:00, Jonathan Corbet wrote:
> On Fri, 13 Feb 2015 17:03:56 +0800
> Fam Zheng <[email protected]> wrote:
>
> > SYNOPSIS
> >
> > #include <sys/epoll.h>
> >
> > int epoll_pwait1(int epfd, int flags,
> > struct epoll_event *events,
> > int maxevents,
> > struct epoll_wait_params *params);
>
> Quick, possibly dumb question: might it make sense to also pass in
> sizeof(struct epoll_wait_params)? That way, when somebody wants to add
> another parameter in the future, the kernel can tell which version is in
> use and they won't have to do an epoll_pwait2()?
>

Flags can be used for that, if the change is not radically different.

Fam

2015-02-16 07:25:55

by Seymour, Shane M

[permalink] [raw]
Subject: RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with **** below about what you
describe and what your code does.

1) epoll_ctl_batch
------------------

NAME
epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

#include <sys/epoll.h>

int epoll_ctl_batch(int epfd, int flags,
int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

This system call is an extension of epoll_ctl(). The primary
difference is that this system call allows you to batch multiple
operations with the one system call. This provides a more efficient
interface for updating events on this epoll file descriptor epfd.

The flags argument is reserved and must be 0.

The argument ncmds is the number of cmds entries being passed in.
This number must be greater than 0.

Each operation is specified as an element in the cmds array, defined as:

struct epoll_ctl_cmd {

/* Reserved flags for future extension, must be 0. */
int flags;

/* The same as epoll_ctl() op parameter. */
int op;

/* The same as epoll_ctl() fd parameter. */
int fd;

/* The same as the "events" field in struct epoll_event. */
uint32_t events;

/* The same as the "data" field in struct epoll_event. */
uint64_t data;

/* Output field, will be set to the return code after this
* command is executed by kernel */
int result;
};

This system call is not atomic when updating the epoll descriptor.
All entries in cmds are executed in the order provided. If any cmds
entry fails to be processed no further entries are processed and
the number of successfully processed entries is returned.

Each single operation defined by a struct epoll_ctl_cmd has the same
semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
for more information about how to correctly setup the members of a
struct epoll_ctl_cmd.

Upon completion of the call the result member of each struct epoll_ctl_cmd
may be set to 0 (sucessfully completed) or an error code depending on the
result of the command. If the kernel fails to change the result (for
example the location of the cmds argument is fully or partly read only)
the result member of each struct epoll_ctl_cmd may be unchanged.

RETURN VALUE

epoll_ctl_batch() returns a number greater than 0 to indicate the number
of cmnd entries processed. If all entries have been processed this will
equal the ncmds parameter passed in.

If one or more parameters are incorrect the value returned is -1 with
errno set appropriately - no cmds entries have been processed when this
happens.

If processing any entry in the cmds argument results in an error the
number returned is the number of the failing entry - this number will be
less than ncmds. Since ncmds must be greater than 0 a return value of
0 indicates an error associated with the very first cmds entry.
A return value of 0 does not indicate a successful system call.

To correctly test the return value from epoll_ctl_batch() use code similar
to the following:

ret=epoll_ctl_batch(epfd, flags, ncmds, &cmds);
if (ret < ncmds) {
if (ret == -1) {
/* An argument was invalid */
} else {
/* ret contains the number of successful entries
* processed. If you (mis?)use it as a C index it
* will index directly to the failing entry to
* get the result use cmds[ret].result which may
* contain the errno value associated with the
* entry.
*/
}
} else {
/* Success */
}

ERRORS

EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
cmds is NULL.

ENOMEM There was insufficient memory to handle the requested op control
operation.

EFAULT The memory area pointed to by cmds is not accessible with read
permissions.

In the event that the return value is not the same as the ncmds parameter
the result member of the failing struct epoll_ctl_cmd will contain a
negative errno value related to the error. The errno values that can be set
are those documented on the epoll_ctl(2) manual page.

CONFORMING TO

epoll_ctl_batch() is Linux-specific.

SEE ALSO

epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

#include <sys/epoll.h>

int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct epoll_wait_params *params);

DESCRIPTION

The epoll_pwait1() syscall differs from epoll_pwait() only in
parameter list. The epfd, events and maxevents parameters are the same
as in epoll_wait() and epoll_pwait(). The flags and params are new.

The flags is reserved and must be zero.

The params is a pointer to a struct epoll_wait_params which is
defined as:

struct epoll_wait_params {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
size_t sigsetsize;
};

The clockid member must be either CLOCK_REALTIME or CLOCK_MONOTONIC.
This will choose the clock type to use for timeout. This differs to
epoll_pwait(2) which has an implicit clock type of CLOCK_MONOTONIC.

The timeout member specifies the minimum time that epoll_wait(2) will
block. The time spent waiting will be rounded up to the clock
granularity. Kernel scheduling delays mean that the blocking
interval may overrun by a small amount. Specifying a -1 for either
tv_sec or tv_nsec member of the struct timespec timeout will cause
causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
equal to zero (both tv_sec or tv_nsec member of the struct timespec
timeout are zero) causes epoll_wait(2) to return immediately, even
if no events are available.

**** Are you really really sure about this for the -1 stuff? your code copies in the timespec and just passes it to timespec_to_ktime:

+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
...
+ kt = timespec_to_ktime(p.timeout);

Compare that to something like the futex syscall which does this:

if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;

t = timespec_to_ktime(ts);

If the timespec is not valid it returns -EINVAL back to user space. With your settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of the conversion that could break your code in the future if in the unlikely event someone changes timespec_to_ktime() and should it be:

+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
+ if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
+ /* this is off the top of my head no idea if it will compile */
+ p.timeout.tv_sec = KTIME_SEC_MAX;
+ p.timeout.tv_nsec = 0;
+ }
+ if (!timespec_valid(&p.timeout))
+ return -EINVAL;
...
+ kt = timespec_to_ktime(p.timeout);

I could of course be worried about nothing here is what I've suggested the right thing to do? Anyone feel free to chime in.

Both sigmask and sigsetsize have the same semantics as epoll_pwait(2). The
sigmask field may be specified as NULL, in which case epoll_pwait1(2)
will behave like epoll_wait(2).

User visibility of sigsetsize

In epoll_pwait(2) and other syscalls, sigsetsize is not visible to
an application developer as glibc has a wrapper around epoll_pwait(2).
Now we pack several parameters in epoll_wait_params. In
order to hide sigsetsize from application code this system call also
needs to be wrapped either by expanding parameters and building the
structure in the wrapper function, or by only asking application to
provide this part of the structure:

struct epoll_wait_params_user {
int clockid;
struct timespec timeout;
sigset_t *sigmask;
};

In the wrapper function it would be copied to a full structure and
sigsetsize filled in.

RETURN VALUE

When successful, epoll_wait1() returns the number of file descriptors
ready for the requested I/O, or zero if no file descriptor became ready
during the requested timeout nanoseconds. When an error occurs,
epoll_wait1() returns -1 and errno is set appropriately.

ERRORS

This system call can set errno to the same values as epoll_pwait(2),
as well as the following additional reasons:

EINVAL flags is not zero, or clockid is not one of CLOCK_REALTIME or
CLOCK_MONOTONIC.

EFAULT The memory area pointed to by params is not accessible.


CONFORMING TO

epoll_pwait1() is Linux-specific.

SEE ALSO

epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

2015-02-16 08:13:17

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

Hi Seymour,

On Mon, 02/16 07:25, Seymour, Shane M wrote:
> I found the manual pages really confusing so I had a go at rewriting
> them - there were places in the manual page that didn't match the
> functionality provided by your code as well as I could tell).

Could you point which places don't match the code?

>
> My apologies for a few formatting issues though. I still don't like
> parts of epoll_pwait1 but it's less confusing than it was.

Any other than the timespec question don't you like?

>
> You are free to take some or all or none of the changes.
>
> I did have a question I marked with **** below about what you
> describe and what your code does.
>

<snip>

> The timeout member specifies the minimum time that epoll_wait(2) will
> block. The time spent waiting will be rounded up to the clock
> granularity. Kernel scheduling delays mean that the blocking
> interval may overrun by a small amount. Specifying a -1 for either
> tv_sec or tv_nsec member of the struct timespec timeout will cause
> causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
> equal to zero (both tv_sec or tv_nsec member of the struct timespec
> timeout are zero) causes epoll_wait(2) to return immediately, even
> if no events are available.
>
> **** Are you really really sure about this for the -1 stuff? your code copies
> in the timespec and just passes it to timespec_to_ktime:
>
> + if (copy_from_user(&p, params, sizeof(p)))
> + return -EFAULT;
> ...
> + kt = timespec_to_ktime(p.timeout);
>
> Compare that to something like the futex syscall which does this:
>
> if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
> return -EFAULT;
> if (!timespec_valid(&ts))
> return -EINVAL;
>
> t = timespec_to_ktime(ts);
>
> If the timespec is not valid it returns -EINVAL back to user space. With your
> settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of
> the conversion that could break your code in the future if in the unlikely
> event someone changes timespec_to_ktime() and should it be:
>
> + if (copy_from_user(&p, params, sizeof(p)))
> + return -EFAULT;
> + if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
> + /* this is off the top of my head no idea if it will compile */
> + p.timeout.tv_sec = KTIME_SEC_MAX;
> + p.timeout.tv_nsec = 0;
> + }
> + if (!timespec_valid(&p.timeout))
> + return -EINVAL;
> ...
> + kt = timespec_to_ktime(p.timeout);

OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I
don't think accepting -1 from user is the right thing to do.

We cannot do pointer check as ppoll already because the structure is embedded
in epoll_wait_params.

Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1). What do you
think?

Fam

<snip>

2015-02-18 18:49:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1


* Fam Zheng <[email protected]> wrote:

> On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > On Fri, 13 Feb 2015 17:03:56 +0800
> > Fam Zheng <[email protected]> wrote:
> >
> > > SYNOPSIS
> > >
> > > #include <sys/epoll.h>
> > >
> > > int epoll_pwait1(int epfd, int flags,
> > > struct epoll_event *events,
> > > int maxevents,
> > > struct epoll_wait_params *params);
> >
> > Quick, possibly dumb question: might it make sense to also pass in
> > sizeof(struct epoll_wait_params)? That way, when somebody wants to add
> > another parameter in the future, the kernel can tell which version is in
> > use and they won't have to do an epoll_pwait2()?
> >
>
> Flags can be used for that, if the change is not
> radically different.

Passing in size is generally better than flags, because
that way an extension of the ABI (new field[s])
automatically signals towards the kernel what to do with
old binaries - while extending the functionality of new
binaries, without sacrificing functionality.

With flags you are either limited to the same structure
size - or have to decode a 'size' value from the flags
value - which is fragile (and in which case a real 'size'
parameter is better).

in the perf ABI we use something like that: there's a
perf_attr.size parameter that iterates the ABI forward,
while still being binary compatible with older software.

If old binaries pass in a smaller structure to a newer
kernel then the kernel pads the new fields with zero by
default - that way the kernel internals are never burdened
with compatibility details and data format versions.

If new user-space passes in a large structure than the
kernel can handle then the kernel returns an error - this
way user-space can transparently support conditional
features and fallback logic.

It works really well, we've done literally a hundred perf
ABI extensions this way in the last 4+ years, in a pretty
natural fashion, without littering the kernel (or
user-space) with version legacies and without breaking
existing perf tooling.

Other syscall ABIs already get painful when trying to
handle 2-3 data structure versions, so people either give
up, or add flags kludges or go to new syscall entries:
which is painful in its own fashion and adds unnecessary
latency to feature introduction as well.

Thanks,

Ingo

2015-02-25 03:31:22

by Fam Zheng

[permalink] [raw]
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

On Wed, 02/18 19:49, Ingo Molnar wrote:
>
> * Fam Zheng <[email protected]> wrote:
>
> > On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > > On Fri, 13 Feb 2015 17:03:56 +0800
> > > Fam Zheng <[email protected]> wrote:
> > >
> > > > SYNOPSIS
> > > >
> > > > #include <sys/epoll.h>
> > > >
> > > > int epoll_pwait1(int epfd, int flags,
> > > > struct epoll_event *events,
> > > > int maxevents,
> > > > struct epoll_wait_params *params);
> > >
> > > Quick, possibly dumb question: might it make sense to also pass in
> > > sizeof(struct epoll_wait_params)? That way, when somebody wants to add
> > > another parameter in the future, the kernel can tell which version is in
> > > use and they won't have to do an epoll_pwait2()?
> > >
> >
> > Flags can be used for that, if the change is not
> > radically different.
>
> Passing in size is generally better than flags, because
> that way an extension of the ABI (new field[s])
> automatically signals towards the kernel what to do with
> old binaries - while extending the functionality of new
> binaries, without sacrificing functionality.
>
> With flags you are either limited to the same structure
> size - or have to decode a 'size' value from the flags
> value - which is fragile (and in which case a real 'size'
> parameter is better).
>
> in the perf ABI we use something like that: there's a
> perf_attr.size parameter that iterates the ABI forward,
> while still being binary compatible with older software.
>
> If old binaries pass in a smaller structure to a newer
> kernel then the kernel pads the new fields with zero by
> default - that way the kernel internals are never burdened
> with compatibility details and data format versions.
>
> If new user-space passes in a large structure than the
> kernel can handle then the kernel returns an error - this
> way user-space can transparently support conditional
> features and fallback logic.
>
> It works really well, we've done literally a hundred perf
> ABI extensions this way in the last 4+ years, in a pretty
> natural fashion, without littering the kernel (or
> user-space) with version legacies and without breaking
> existing perf tooling.
>
> Other syscall ABIs already get painful when trying to
> handle 2-3 data structure versions, so people either give
> up, or add flags kludges or go to new syscall entries:
> which is painful in its own fashion and adds unnecessary
> latency to feature introduction as well.
>

Excellent. This now makes a lot of sense to me, thanks to your explanations,
Ingo.

I'll add the "size" field in the next revision.

Thanks,
Fam