Hi,
tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
full numbers.
This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
time that epoll_wait() should wait for events on a given epoll context.
Some justification and numbers are in patch 6; patches 1-5 are really
just prep patches or cleanups.
Sending this out to get some input on the API, basically. This is
obviously a per-context type of operation in this patchset, which isn't
necessarily ideal for every use case. Questions to be debated:
1) Would we want this to be available through epoll_wait() directly?
That would allow this to be done on a per-epoll_wait() basis, rather
than be tied to the specific context.
2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
I think there are pros and cons to both, and perhaps the answer to both is
"yes". There are some benefits to doing this at epoll setup time. For
example, it nicely isolates it to that part rather than needing to be
done dynamically every time epoll_wait() is called. This also helps the
application code, as it can turn off any busyness tracking based on
whether the setup accepted EPOLL_CTL_MIN_WAIT or not.
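A minimal sketch of what that setup-time check could look like in an
application. Assumptions: the op value is the one from the uapi header
in this series (it is not in mainline headers), the wait time is passed
in usec via ev.data.u64 as in the test case in patch 6, and the helper
name setup_min_wait() is made up for illustration. Any failure is simply
treated as "not supported".

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Assumption: value taken from this series, not present in mainline headers */
#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT	4
#endif

/*
 * Hypothetical helper: try to set a per-context minimum wait (in usec).
 * Returns true if the kernel accepted it, false if the application should
 * keep doing its own batching.
 */
static bool setup_min_wait(int efd, uint64_t usec)
{
	struct epoll_event ev = { 0 };

	ev.data.u64 = usec;
	/* with this series applied, the old value (>= 0) is returned */
	return epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) >= 0;
}
```

The result would be checked once after epoll_create1() and cached, so
the epoll_wait() fast path never has to look at it again.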
Anyway, tossing this out there as it yielded quite good results in some
initial testing; we're running more of it. Sending out a v3 now since
someone reported that nonblock issue, which is annoying. Hoping to get some
more discussion this time around, or at least some...
Also available here:
https://git.kernel.dk/cgit/linux-block/log/?h=epoll-min_ts
Since v2:
- Fix an issue with nonblock event checking (timeout given, 0/0 set)
- Add another prep patch, getting rid of passing in a known 'false'
to ep_busy_loop()
--
Jens Axboe
Rather than have two separate branches here, collapse them into a single
one instead. No functional changes here, just a cleanup in preparation
for changes in this area.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 52954d4637b5..3061bdde6cba 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1869,14 +1869,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* important.
*/
eavail = ep_events_available(ep);
- if (!eavail)
+ if (!eavail) {
__add_wait_queue_exclusive(&ep->wq, &wait);
-
- write_unlock_irq(&ep->lock);
-
- if (!eavail)
+ write_unlock_irq(&ep->lock);
timed_out = !schedule_hrtimeout_range(to, slack,
HRTIMER_MODE_ABS);
+ } else {
+ write_unlock_irq(&ep->lock);
+ }
+
__set_current_state(TASK_RUNNING);
/*
--
2.35.1
This just cleans up the checking a bit, in preparation for a change
that will need access to 'ep' earlier.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0994f2eb6adc..962d897bbfc6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2111,6 +2111,20 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
if (!f.file)
goto error_return;
+ /*
+ * We have to check that the file structure underneath the file
+ * descriptor the user passed to us _is_ an eventpoll file.
+ */
+ error = -EINVAL;
+ if (!is_file_epoll(f.file))
+ goto error_fput;
+
+ /*
+ * At this point it is safe to assume that the "private_data" contains
+ * our own data structure.
+ */
+ ep = f.file->private_data;
+
/* Get the "struct file *" for the target file */
tf = fdget(fd);
if (!tf.file)
@@ -2126,12 +2140,10 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
ep_take_care_of_epollwakeup(epds);
/*
- * We have to check that the file structure underneath the file descriptor
- * the user passed to us _is_ an eventpoll file. And also we do not permit
- * adding an epoll file descriptor inside itself.
+ * We do not permit adding an epoll file descriptor inside itself.
*/
error = -EINVAL;
- if (f.file == tf.file || !is_file_epoll(f.file))
+ if (f.file == tf.file)
goto error_tgt_fput;
/*
@@ -2147,12 +2159,6 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
goto error_tgt_fput;
}
- /*
- * At this point it is safe to assume that the "private_data" contains
- * our own data structure.
- */
- ep = f.file->private_data;
-
/*
* When we insert an epoll file descriptor inside another epoll file
* descriptor, there is the chance of creating closed loops, which are
--
2.35.1
Rather than just have a timeout value for waiting on events, add
EPOLL_CTL_MIN_WAIT to allow setting a minimum time that epoll_wait()
should always wait for events to arrive.
For medium workload efficiencies, some production workloads inject
artificial timers or sleeps before calling epoll_wait() to get
better batching and higher efficiencies. While this does help, it's
not as efficient as it could be. By adding support for this directly
in epoll_wait(), we can avoid extra context switches and scheduler
and timer overhead.
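For reference, the userspace hack described above might look roughly
like this. It is only a sketch: the helper name sleep_then_wait() and
its parameters are made up for illustration, and real workloads may use
timers rather than a plain sleep.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical illustration of the batching hack: sleep first so that more
 * events can pile up, then call epoll_wait() as usual. The sleep costs an
 * extra timer and context switch on top of the epoll_wait() call itself.
 */
static int sleep_then_wait(int efd, struct epoll_event *events, int maxevents,
			   useconds_t delay_usec, int timeout_ms)
{
	usleep(delay_usec);
	return epoll_wait(efd, events, maxevents, timeout_ms);
}
```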
As an example, running an AB test on an identical workload at about
370K reqs/second, without this change and with the sleep hack
mentioned above (using 200 usec as the timeout), we're doing 310K-340K
non-voluntary context switches per second. Idle CPU on the host is 27-34%.
With the sleep hack removed and epoll set to the same 200 usec
value, we're handling the exact same load but at 292K-315K non-voluntary
context switches per second and idle CPU of 33-41%, a substantial win.
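As a rough cross-check of those numbers, the arithmetic can be written
down as below. This assumes the low ends and the high ends of each
range pair up with each other, which is not stated anywhere above, and
the helper names are made up for illustration.

```c
/* Busy CPU in percent, given idle CPU in percent */
static int busy_pct(int idle_pct)
{
	return 100 - idle_pct;
}

/* Drop in busy CPU, in percentage points, between two idle readings */
static int busy_delta(int idle_before, int idle_after)
{
	return busy_pct(idle_before) - busy_pct(idle_after);
}

/* Relative drop in percent, e.g. for context switches per second */
static double pct_drop(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

busy_delta(27, 33) is 6 and busy_delta(34, 41) is 7, which lines up with
the 6-7% CPU reduction quoted in the cover letter. pct_drop(310, 292)
and pct_drop(340, 315) come out at about 5.8% and 7.4% fewer
non-voluntary context switches.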
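Basic test case:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/epoll.h>
/* not in installed headers yet, value from this patch */
#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT 4
#endif
struct d {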
int p1, p2;
};
static void *fn(void *data)
{
struct d *d = data;
char b = 0x89;
/* Generate 2 events 10 msec apart, the last one 20 msec from start */
usleep(10000);
write(d->p1, &b, sizeof(b));
usleep(10000);
write(d->p2, &b, sizeof(b));
return NULL;
}
int main(int argc, char *argv[])
{
struct epoll_event ev, events[2];
pthread_t thread;
int p1[2], p2[2];
struct d d;
int efd, ret;
efd = epoll_create1(0);
if (efd < 0) {
perror("epoll_create");
return 1;
}
if (pipe(p1) < 0) {
perror("pipe");
return 1;
}
if (pipe(p2) < 0) {
perror("pipe");
return 1;
}
ev.events = EPOLLIN;
ev.data.fd = p1[0];
if (epoll_ctl(efd, EPOLL_CTL_ADD, p1[0], &ev) < 0) {
perror("epoll add");
return 1;
}
ev.events = EPOLLIN;
ev.data.fd = p2[0];
if (epoll_ctl(efd, EPOLL_CTL_ADD, p2[0], &ev) < 0) {
perror("epoll add");
return 1;
}
/* always wait 200 msec for events */
ev.data.u64 = 200000;
if (epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0) {
perror("epoll add set timeout");
return 1;
}
d.p1 = p1[1];
d.p2 = p2[1];
pthread_create(&thread, NULL, fn, &d);
/* expect to get 2 events here rather than just 1 */
ret = epoll_wait(efd, events, 2, -1);
printf("epoll_wait=%d\n", ret);
return 0;
}
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 97 +++++++++++++++++++++++++++++-----
include/linux/eventpoll.h | 2 +-
include/uapi/linux/eventpoll.h | 1 +
3 files changed, 85 insertions(+), 15 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 962d897bbfc6..9e00f8780ec5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -117,6 +117,9 @@ struct eppoll_entry {
/* The "base" pointer is set to the container "struct epitem" */
struct epitem *base;
+ /* min wait time if (min_wait_ts) & 1 != 0 */
+ ktime_t min_wait_ts;
+
/*
* Wait queue item that will be linked to the target file wait
* queue head.
@@ -217,6 +220,9 @@ struct eventpoll {
u64 gen;
struct hlist_head refs;
+ /* min wait for epoll_wait() */
+ unsigned int min_wait_ts;
+
#ifdef CONFIG_NET_RX_BUSY_POLL
/* used to track busy poll napi_id */
unsigned int napi_id;
@@ -1747,6 +1753,32 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
return to;
}
+struct epoll_wq {
+ wait_queue_entry_t wait;
+ struct hrtimer timer;
+ ktime_t timeout_ts;
+ ktime_t min_wait_ts;
+ struct eventpoll *ep;
+ bool timed_out;
+ int maxevents;
+ int wakeups;
+};
+
+static bool ep_should_min_wait(struct epoll_wq *ewq)
+{
+ if (ewq->min_wait_ts & 1) {
+ /* just an approximation */
+ if (++ewq->wakeups >= ewq->maxevents)
+ goto stop_wait;
+ if (ktime_before(ktime_get_ns(), ewq->min_wait_ts))
+ return true;
+ }
+
+stop_wait:
+ ewq->min_wait_ts &= ~(u64) 1;
+ return false;
+}
+
/*
* autoremove_wake_function, but remove even on failure to wake up, because we
* know that default_wake_function/ttwu will only fail if the thread is already
@@ -1756,27 +1788,37 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
unsigned int mode, int sync, void *key)
{
- int ret = default_wake_function(wq_entry, mode, sync, key);
+ struct epoll_wq *ewq = container_of(wq_entry, struct epoll_wq, wait);
+ int ret;
+
+ /*
+ * If min wait time hasn't been satisfied yet, keep waiting
+ */
+ if (ep_should_min_wait(ewq))
+ return 0;
+ ret = default_wake_function(wq_entry, mode, sync, key);
list_del_init(&wq_entry->entry);
return ret;
}
-struct epoll_wq {
- wait_queue_entry_t wait;
- struct hrtimer timer;
- ktime_t timeout_ts;
- bool timed_out;
-};
-
static enum hrtimer_restart ep_timer(struct hrtimer *timer)
{
struct epoll_wq *ewq = container_of(timer, struct epoll_wq, timer);
struct task_struct *task = ewq->wait.private;
+ const bool is_min_wait = ewq->min_wait_ts & 1;
+
+ if (!is_min_wait || ep_events_available(ewq->ep)) {
+ if (!is_min_wait)
+ ewq->timed_out = true;
+ ewq->min_wait_ts &= ~(u64) 1;
+ wake_up_process(task);
+ return HRTIMER_NORESTART;
+ }
- ewq->timed_out = true;
- wake_up_process(task);
- return HRTIMER_NORESTART;
+ ewq->min_wait_ts &= ~(u64) 1;
+ hrtimer_set_expires_range_ns(&ewq->timer, ewq->timeout_ts, 0);
+ return HRTIMER_RESTART;
}
static void ep_schedule(struct eventpoll *ep, struct epoll_wq *ewq, ktime_t *to,
@@ -1831,12 +1873,16 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
lockdep_assert_irqs_enabled();
+ ewq.min_wait_ts = 0;
+ ewq.ep = ep;
+ ewq.maxevents = maxevents;
ewq.timed_out = false;
+ ewq.wakeups = 0;
if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
slack = select_estimate_accuracy(timeout);
+ ewq.timeout_ts = timespec64_to_ktime(*timeout);
to = &ewq.timeout_ts;
- *to = timespec64_to_ktime(*timeout);
} else if (timeout) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
@@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
ewq.timed_out = true;
}
+ /*
+ * If min_wait is set for this epoll instance, note the min_wait
+ * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
+ * the state bit for whether or not min_wait is enabled.
+ */
+ if (ep->min_wait_ts) {
+ ewq.min_wait_ts = ktime_add_us(ktime_get_ns(),
+ ep->min_wait_ts);
+ ewq.min_wait_ts |= (u64) 1;
+ to = &ewq.min_wait_ts;
+ }
+
/*
* This call is racy: We may or may not see events that are being added
* to the ready list under the lock (e.g., in IRQ callbacks). For cases
@@ -1913,7 +1971,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* important.
*/
eavail = ep_events_available(ep);
- if (!eavail) {
+ if (!eavail || ewq.min_wait_ts & 1) {
__add_wait_queue_exclusive(&ep->wq, &ewq.wait);
write_unlock_irq(&ep->lock);
ep_schedule(ep, &ewq, to, slack);
@@ -2125,6 +2183,17 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
*/
ep = f.file->private_data;
+ /*
+ * Handle EPOLL_CTL_MIN_WAIT upfront as we don't need to care about
+ * the fd being passed in.
+ */
+ if (op == EPOLL_CTL_MIN_WAIT) {
+ /* return old value */
+ error = ep->min_wait_ts;
+ ep->min_wait_ts = epds->data;
+ goto error_fput;
+ }
+
/* Get the "struct file *" for the target file */
tf = fdget(fd);
if (!tf.file)
@@ -2257,7 +2326,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
{
struct epoll_event epds;
- if (ep_op_has_event(op) &&
+ if ((ep_op_has_event(op) || op == EPOLL_CTL_MIN_WAIT) &&
copy_from_user(&epds, event, sizeof(struct epoll_event)))
return -EFAULT;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 3337745d81bd..cbef635cb7e4 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -59,7 +59,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
static inline int ep_op_has_event(int op)
{
- return op != EPOLL_CTL_DEL;
+ return op != EPOLL_CTL_DEL && op != EPOLL_CTL_MIN_WAIT;
}
#else
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..81ecb1ca36e0 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,7 @@
#define EPOLL_CTL_ADD 1
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3
+#define EPOLL_CTL_MIN_WAIT 4
/* Epoll event masks */
#define EPOLLIN (__force __poll_t)0x00000001
--
2.35.1
In preparation for making changes to how wakeups and sleeps are done,
move the timeout scheduling into a helper and manage it rather than
rely on schedule_hrtimeout_range().
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 68 ++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 55 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 64d7331353dd..888f565d0c5f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1762,6 +1762,47 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
return ret;
}
+struct epoll_wq {
+ wait_queue_entry_t wait;
+ struct hrtimer timer;
+ bool timed_out;
+};
+
+static enum hrtimer_restart ep_timer(struct hrtimer *timer)
+{
+ struct epoll_wq *ewq = container_of(timer, struct epoll_wq, timer);
+ struct task_struct *task = ewq->wait.private;
+
+ ewq->timed_out = true;
+ wake_up_process(task);
+ return HRTIMER_NORESTART;
+}
+
+static void ep_schedule(struct eventpoll *ep, struct epoll_wq *ewq, ktime_t *to,
+ u64 slack)
+{
+ if (ewq->timed_out)
+ return;
+ if (to && *to == 0) {
+ ewq->timed_out = true;
+ return;
+ }
+ if (!to) {
+ schedule();
+ return;
+ }
+
+ hrtimer_init_on_stack(&ewq->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ ewq->timer.function = ep_timer;
+ hrtimer_set_expires_range_ns(&ewq->timer, *to, slack);
+ hrtimer_start_expires(&ewq->timer, HRTIMER_MODE_ABS);
+
+ schedule();
+
+ hrtimer_cancel(&ewq->timer);
+ destroy_hrtimer_on_stack(&ewq->timer);
+}
+
/**
* ep_poll - Retrieves ready events, and delivers them to the caller-supplied
* event buffer.
@@ -1782,13 +1823,15 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
int maxevents, struct timespec64 *timeout)
{
- int res, eavail, timed_out = 0;
+ int res, eavail;
u64 slack = 0;
- wait_queue_entry_t wait;
ktime_t expires, *to = NULL;
+ struct epoll_wq ewq;
lockdep_assert_irqs_enabled();
+ ewq.timed_out = false;
+
if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
slack = select_estimate_accuracy(timeout);
to = &expires;
@@ -1798,7 +1841,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
*/
- timed_out = 1;
+ ewq.timed_out = true;
}
/*
@@ -1823,7 +1866,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
return res;
}
- if (timed_out)
+ if (ewq.timed_out)
return 0;
eavail = ep_busy_loop(ep);
@@ -1850,8 +1893,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* performance issue if a process is killed, causing all of its
* threads to wake up without being removed normally.
*/
- init_wait(&wait);
- wait.func = ep_autoremove_wake_function;
+ init_wait(&ewq.wait);
+ ewq.wait.func = ep_autoremove_wake_function;
write_lock_irq(&ep->lock);
/*
@@ -1870,10 +1913,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
*/
eavail = ep_events_available(ep);
if (!eavail) {
- __add_wait_queue_exclusive(&ep->wq, &wait);
+ __add_wait_queue_exclusive(&ep->wq, &ewq.wait);
write_unlock_irq(&ep->lock);
- timed_out = !schedule_hrtimeout_range(to, slack,
- HRTIMER_MODE_ABS);
+ ep_schedule(ep, &ewq, to, slack);
} else {
write_unlock_irq(&ep->lock);
}
@@ -1887,7 +1929,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
*/
eavail = 1;
- if (!list_empty_careful(&wait.entry)) {
+ if (!list_empty_careful(&ewq.wait.entry)) {
write_lock_irq(&ep->lock);
/*
* If the thread timed out and is not on the wait queue,
@@ -1896,9 +1938,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* Thus, when wait.entry is empty, it needs to harvest
* events.
*/
- if (timed_out)
- eavail = list_empty(&wait.entry);
- __remove_wait_queue(&ep->wq, &wait);
+ if (ewq.timed_out)
+ eavail = list_empty(&ewq.wait.entry);
+ __remove_wait_queue(&ep->wq, &ewq.wait);
write_unlock_irq(&ep->lock);
}
}
--
2.35.1
This makes the expiration available to the wakeup handler. No functional
changes expected in this patch, purely in preparation for being able to
use the timeout on the wakeup side.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 888f565d0c5f..0994f2eb6adc 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1765,6 +1765,7 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
struct epoll_wq {
wait_queue_entry_t wait;
struct hrtimer timer;
+ ktime_t timeout_ts;
bool timed_out;
};
@@ -1825,7 +1826,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
{
int res, eavail;
u64 slack = 0;
- ktime_t expires, *to = NULL;
+ ktime_t *to = NULL;
struct epoll_wq ewq;
lockdep_assert_irqs_enabled();
@@ -1834,7 +1835,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
slack = select_estimate_accuracy(timeout);
- to = &expires;
+ to = &ewq.timeout_ts;
*to = timespec64_to_ktime(*timeout);
} else if (timeout) {
/*
--
2.35.1
It's known to be 'false' from the one call site we have, as we break
out of the loop if it's not.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/eventpoll.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 3061bdde6cba..64d7331353dd 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -396,12 +396,12 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
*
* we must do our busy polling with irqs enabled
*/
-static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
+static bool ep_busy_loop(struct eventpoll *ep)
{
unsigned int napi_id = READ_ONCE(ep->napi_id);
if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
- napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep, false,
+ napi_busy_loop(napi_id, ep_busy_loop_end, ep, false,
BUSY_POLL_BUDGET);
if (ep_events_available(ep))
return true;
@@ -453,7 +453,7 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
#else
-static inline bool ep_busy_loop(struct eventpoll *ep, int nonblock)
+static inline bool ep_busy_loop(struct eventpoll *ep)
{
return false;
}
@@ -1826,7 +1826,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (timed_out)
return 0;
- eavail = ep_busy_loop(ep, timed_out);
+ eavail = ep_busy_loop(ep);
if (eavail)
continue;
--
2.35.1
On 11/2/22 11:46 AM, Willem de Bruijn wrote:
> My main question is whether the cycle gains justify the code
> complexity and runtime cost in all other epoll paths.
>
> Syscall overhead is quite dependent on architecture and things like KPTI.
Definitely interested in experiences from other folks, but what other
runtime costs do you see compared to the baseline?
> Indeed, I was also wondering whether an extra timeout arg to
> epoll_wait would give the same feature with less side effects. Then no
> need for that new ctrl API.
That was my main question in this posting - what's the best api? The
current one, epoll_wait() addition, or both? The nice thing about the
current one is that it's easy to integrate into existing use cases, as
the decision to do batching on the userspace side or by utilizing this
feature can be kept in the setup path. If you do epoll_wait() and get
-1/EINVAL or false success on older kernels, then that's either a loss
because of thinking it worked, or the fast path needs to check for this
specifically every time you call epoll_wait() rather than just at
init/setup time.
But this is very much the question I already posed and wanted to
discuss...
--
Jens Axboe
On Sun, Oct 30, 2022 at 6:02 PM Jens Axboe <[email protected]> wrote:
>
> Hi,
>
> tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
> full numbers.
>
> This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
> time that epoll_wait() should wait for events on a given epoll context.
> Some justification and numbers are in patch 6, patches 1-5 are really
> just prep patches or cleanups.
>
> Sending this out to get some input on the API, basically. This is
> obviously a per-context type of operation in this patchset, which isn't
> necessarily ideal for any use case. Questions to be debated:
>
> 1) Would we want this to be available through epoll_wait() directly?
> That would allow this to be done on a per-epoll_wait() basis, rather
> than be tied to the specific context.
>
> 2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
>
> I think there are pros and cons to both, and perhaps the answer to both is
> "yes". There are some benefits to doing this at epoll setup time, for
> example - it nicely isolates it to that part rather than needing to be
> done dynamically everytime epoll_wait() is called. This also helps the
> application code, as it can turn off any busy'ness tracking based on if
> the setup accepted EPOLL_CTL_MIN_WAIT or not.
>
> Anyway, tossing this out there as it yielded quite good results in some
> initial testing, we're running more of it. Sending out a v3 now since
> someone reported that nonblock issue which is annoying. Hoping to get some
> more discussion this time around, or at least some...
My main question is whether the cycle gains justify the code
complexity and runtime cost in all other epoll paths.
Syscall overhead is quite dependent on architecture and things like KPTI.
Indeed, I was also wondering whether an extra timeout arg to
epoll_wait would give the same feature with fewer side effects. Then no
need for that new ctrl API.
On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <[email protected]> wrote:
>
> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
> > My main question is whether the cycle gains justify the code
> > complexity and runtime cost in all other epoll paths.
> >
> > Syscall overhead is quite dependent on architecture and things like KPTI.
>
> Definitely interested in experiences from other folks, but what other
> runtime costs do you see compared to the baseline?
Nothing specific. Possible cost from added branches and moving local
variables into structs with possibly cold cachelines.
> > Indeed, I was also wondering whether an extra timeout arg to
> > epoll_wait would give the same feature with less side effects. Then no
> > need for that new ctrl API.
>
> That was my main question in this posting - what's the best api? The
> current one, epoll_wait() addition, or both? The nice thing about the
> current one is that it's easy to integrate into existing use cases, as
> the decision to do batching on the userspace side or by utilizing this
> feature can be kept in the setup path. If you do epoll_wait() and get
> -1/EINVAL or false success on older kernels, then that's either a loss
> because of thinking it worked, or a fast path need to check for this
> specifically every time you call epoll_wait() rather than just at
> init/setup time.
>
> But this is very much the question I already posed and wanted to
> discuss...
I see the value in being able to detect whether the feature is present.
But a pure epoll_wait implementation seems a lot simpler to me, and
more elegant: timeout is an argument to epoll_wait already.
A new epoll_wait variant would have to be a new system call, so it
would be easy to infer support for the feature.
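A sketch of how an application could infer support for a syscall
variant at init time. epoll_pwait2 is used here purely as a stand-in
for a hypothetical new epoll_wait variant, and the helper name is made
up. This assumes a 64-bit build where struct timespec matches what the
raw syscall expects.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical probe, intended to be called once at init time. Returns
 * false only if the kernel reports the syscall as not implemented.
 */
static bool have_epoll_pwait2(int efd)
{
#ifdef SYS_epoll_pwait2
	struct timespec ts = { 0, 0 };	/* zero timeout: poll and return */
	struct epoll_event ev;
	long ret;

	ret = syscall(SYS_epoll_pwait2, efd, &ev, 1, &ts, NULL, (size_t) 0);
	return !(ret < 0 && errno == ENOSYS);
#else
	(void) efd;
	return false;
#endif
}
```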
On 11/2/22 5:09 PM, Willem de Bruijn wrote:
> On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <[email protected]> wrote:
>>
>> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
>>> My main question is whether the cycle gains justify the code
>>> complexity and runtime cost in all other epoll paths.
>>>
>>> Syscall overhead is quite dependent on architecture and things like KPTI.
>>
>> Definitely interested in experiences from other folks, but what other
>> runtime costs do you see compared to the baseline?
>
> Nothing specific. Possible cost from added branches and moving local
> variables into structs with possibly cold cachelines.
>
>>> Indeed, I was also wondering whether an extra timeout arg to
>>> epoll_wait would give the same feature with less side effects. Then no
>>> need for that new ctrl API.
>>
>> That was my main question in this posting - what's the best api? The
>> current one, epoll_wait() addition, or both? The nice thing about the
>> current one is that it's easy to integrate into existing use cases, as
>> the decision to do batching on the userspace side or by utilizing this
>> feature can be kept in the setup path. If you do epoll_wait() and get
>> -1/EINVAL or false success on older kernels, then that's either a loss
>> because of thinking it worked, or a fast path need to check for this
>> specifically every time you call epoll_wait() rather than just at
>> init/setup time.
>>
>> But this is very much the question I already posed and wanted to
>> discuss...
>
> I see the value in being able to detect whether the feature is present.
>
> But a pure epoll_wait implementation seems a lot simpler to me, and
> more elegant: timeout is an argument to epoll_wait already.
>
> A new epoll_wait variant would have to be a new system call, so it
> would be easy to infer support for the feature.
Right, but it'd still mean that you'd need to check this in the fast
path in the app vs being able to do it at init time. Might there be
merit to doing both? From the conversion that we tried, the CTL variant
definitely made things easier to port. The new syscall would enable
per-call delays, however. There might be some merit to that, though I do
think that max_events + min_time is how you'd control batching anyway,
and that's suitably set in the context itself for most use cases.
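To make the max_events + min_time interplay concrete, here's a rough
userspace emulation of it. This is only a sketch, not what the series
does in the kernel: the names are made up, it polls in 50 usec steps,
and it returns 0 once the min wait has passed with nothing ready rather
than falling back to a regular timeout.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/*
 * Hypothetical emulation: return as soon as maxevents are ready, otherwise
 * keep checking until min_wait_usec has passed. The repeated polling is the
 * kind of overhead a kernel-side min wait would avoid.
 */
static int wait_batch(int efd, struct epoll_event *events, int maxevents,
		      uint64_t min_wait_usec)
{
	const uint64_t end = now_ns() + min_wait_usec * 1000;
	const struct timespec step = { 0, 50 * 1000 };	/* 50 usec */
	int n;

	for (;;) {
		n = epoll_wait(efd, events, maxevents, 0);
		if (n < 0 || n >= maxevents || now_ns() >= end)
			return n;
		nanosleep(&step, NULL);
	}
}
```

With maxevents set to the batch size the app can handle, a full batch
ends the wait early and min_wait_usec only bounds the latency of a
partial one.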
--
Jens Axboe
On 11/2/22 5:51 PM, Willem de Bruijn wrote:
> On Wed, Nov 2, 2022 at 7:42 PM Jens Axboe <[email protected]> wrote:
>>
>> On 11/2/22 5:09 PM, Willem de Bruijn wrote:
>>> On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <[email protected]> wrote:
>>>>
>>>> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
>>>>> On Sun, Oct 30, 2022 at 6:02 PM Jens Axboe <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
>>>>>> full numbers.
>>>>>>
>>>>>> This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
>>>>>> time that epoll_wait() should wait for events on a given epoll context.
>>>>>> Some justification and numbers are in patch 6, patches 1-5 are really
>>>>>> just prep patches or cleanups.
>>>>>>
>>>>>> Sending this out to get some input on the API, basically. This is
>>>>>> obviously a per-context type of operation in this patchset, which isn't
>>>>>> necessarily ideal for any use case. Questions to be debated:
>>>>>>
>>>>>> 1) Would we want this to be available through epoll_wait() directly?
>>>>>> That would allow this to be done on a per-epoll_wait() basis, rather
>>>>>> than be tied to the specific context.
>>>>>>
>>>>>> 2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
>>>>>>
>>>>>> I think there are pros and cons to both, and perhaps the answer to both is
>>>>>> "yes". There are some benefits to doing this at epoll setup time, for
>>>>>> example - it nicely isolates it to that part rather than needing to be
>>>>>> done dynamically everytime epoll_wait() is called. This also helps the
>>>>>> application code, as it can turn off any busy'ness tracking based on if
>>>>>> the setup accepted EPOLL_CTL_MIN_WAIT or not.
>>>>>>
>>>>>> Anyway, tossing this out there as it yielded quite good results in some
>>>>>> initial testing, we're running more of it. Sending out a v3 now since
>>>>>> someone reported that nonblock issue which is annoying. Hoping to get some
>>>>>> more discussion this time around, or at least some...
>>>>>
>>>>> My main question is whether the cycle gains justify the code
>>>>> complexity and runtime cost in all other epoll paths.
>>>>>
>>>>> Syscall overhead is quite dependent on architecture and things like KPTI.
>>>>
>>>> Definitely interested in experiences from other folks, but what other
>>>> runtime costs do you see compared to the baseline?
>>>
>>> Nothing specific. Possible cost from added branches and moving local
>>> variables into structs with possibly cold cachelines.
>>>
>>>>> Indeed, I was also wondering whether an extra timeout arg to
>>>>> epoll_wait would give the same feature with less side effects. Then no
>>>>> need for that new ctrl API.
>>>>
>>>> That was my main question in this posting - what's the best api? The
>>>> current one, epoll_wait() addition, or both? The nice thing about the
>>>> current one is that it's easy to integrate into existing use cases, as
>>>> the decision to do batching on the userspace side or by utilizing this
>>>> feature can be kept in the setup path. If you do epoll_wait() and get
>>>> -1/EINVAL or false success on older kernels, then that's either a loss
>>>> because of thinking it worked, or a fast path need to check for this
>>>> specifically every time you call epoll_wait() rather than just at
>>>> init/setup time.
>>>>
>>>> But this is very much the question I already posed and wanted to
>>>> discuss...
>>>
>>> I see the value in being able to detect whether the feature is present.
>>>
>>> But a pure epoll_wait implementation seems a lot simpler to me, and
>>> more elegant: timeout is an argument to epoll_wait already.
>>>
>>> A new epoll_wait variant would have to be a new system call, so it
>>> would be easy to infer support for the feature.
>>
>> Right, but it'd still mean that you'd need to check this in the fast
>> path in the app vs being able to do it at init time.
>
> A process could call the new syscall with timeout 0 at init time to
> learn whether the feature is supported.
That is pretty clunky, though... It'd work, but not a very elegant API.
>> Might there be
>> merit to doing both? From the conversion that we tried, the CTL variant
>> definitely made things easier to port. The new syscall would make enable
>> per-call delays however. There might be some merit to that, though I do
>> think that max_events + min_time is how you'd control batching anything
>> and that's suitably set in the context itself for most use cases.
>
> I'm surprised a CTL variant is easier to port. An epoll_pwait3 with an
> extra argument only needs to pass that argument to do_epoll_wait.
It's literally adding two lines of code, that's it. A new syscall is way
worse both in terms of the userspace and kernel side for archs, and for
changing call sites in the app.
> FWIW, when adding nsec resolution I initially opted for an init-based
> approach, passing a new flag to epoll_create1. Feedback then was that
> it was odd to have one syscall affect the behavior of another. The
> final version just added a new epoll_pwait2 with timespec.
I'm fine with just doing a pure syscall variant too, it was my original
plan. Only changed it to allow for easier experimentation and adoption,
and based on the fact that most use cases would likely use a fixed value
per context anyway.
I think it'd be a shame to drop the ctl, unless there are strong arguments
against it. I'm quite happy to add a syscall variant too, that's not a
big deal and would be a minor addition. Patch 6 should probably cut out
the ctl addition and leave that for a patch 7, and then a patch 8 for
adding a syscall.
--
Jens Axboe
On Wed, Nov 2, 2022 at 7:42 PM Jens Axboe <[email protected]> wrote:
>
> On 11/2/22 5:09 PM, Willem de Bruijn wrote:
> > On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <[email protected]> wrote:
> >>
> >> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
> >>> On Sun, Oct 30, 2022 at 6:02 PM Jens Axboe <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
> >>>> full numbers.
> >>>>
> >>>> This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
> >>>> time that epoll_wait() should wait for events on a given epoll context.
> >>>> Some justification and numbers are in patch 6, patches 1-5 are really
> >>>> just prep patches or cleanups.
> >>>>
> >>>> Sending this out to get some input on the API, basically. This is
> >>>> obviously a per-context type of operation in this patchset, which isn't
> >>>> necessarily ideal for any use case. Questions to be debated:
> >>>>
> >>>> 1) Would we want this to be available through epoll_wait() directly?
> >>>> That would allow this to be done on a per-epoll_wait() basis, rather
> >>>> than be tied to the specific context.
> >>>>
> >>>> 2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
> >>>>
> >>>> I think there are pros and cons to both, and perhaps the answer to both is
> >>>> "yes". There are some benefits to doing this at epoll setup time, for
> >>>> example - it nicely isolates it to that part rather than needing to be
> >>>> done dynamically everytime epoll_wait() is called. This also helps the
> >>>> application code, as it can turn off any busy'ness tracking based on if
> >>>> the setup accepted EPOLL_CTL_MIN_WAIT or not.
> >>>>
> >>>> Anyway, tossing this out there as it yielded quite good results in some
> >>>> initial testing, we're running more of it. Sending out a v3 now since
> >>>> someone reported that nonblock issue which is annoying. Hoping to get some
> >>>> more discussion this time around, or at least some...
> >>>
> >>> My main question is whether the cycle gains justify the code
> >>> complexity and runtime cost in all other epoll paths.
> >>>
> >>> Syscall overhead is quite dependent on architecture and things like KPTI.
> >>
> >> Definitely interested in experiences from other folks, but what other
> >> runtime costs do you see compared to the baseline?
> >
> > Nothing specific. Possible cost from added branches and moving local
> > variables into structs with possibly cold cachelines.
> >
> >>> Indeed, I was also wondering whether an extra timeout arg to
> >>> epoll_wait would give the same feature with less side effects. Then no
> >>> need for that new ctrl API.
> >>
> >> That was my main question in this posting - what's the best api? The
> >> current one, epoll_wait() addition, or both? The nice thing about the
> >> current one is that it's easy to integrate into existing use cases, as
> >> the decision to do batching on the userspace side or by utilizing this
> >> feature can be kept in the setup path. If you do epoll_wait() and get
> >> -1/EINVAL or false success on older kernels, then that's either a loss
> >> because of thinking it worked, or a fast path need to check for this
> >> specifically every time you call epoll_wait() rather than just at
> >> init/setup time.
> >>
> >> But this is very much the question I already posed and wanted to
> >> discuss...
> >
> > I see the value in being able to detect whether the feature is present.
> >
> > But a pure epoll_wait implementation seems a lot simpler to me, and
> > more elegant: timeout is an argument to epoll_wait already.
> >
> > A new epoll_wait variant would have to be a new system call, so it
> > would be easy to infer support for the feature.
>
> Right, but it'd still mean that you'd need to check this in the fast
> path in the app vs being able to do it at init time.
A process could call the new syscall with timeout 0 at init time to
learn whether the feature is supported.
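
A rough sketch of that init-time probe. Since epoll_pwait3 does not
exist, this uses the existing epoll_pwait2 syscall as a stand-in to show
the pattern: call it once with a zero timeout and treat ENOSYS as "not
supported". have_epoll_pwait2() is a hypothetical helper name, and
passing struct timespec assumes a 64-bit time_t so the layout matches
what the raw syscall expects.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns 1 if the kernel knows the syscall, 0 if it reports ENOSYS or
 * the build headers do not define the syscall number at all.
 */
static int have_epoll_pwait2(int epfd)
{
#ifdef SYS_epoll_pwait2
    struct epoll_event ev;
    struct timespec zero = { 0, 0 }; /* zero timeout: never blocks */
    long ret;

    ret = syscall(SYS_epoll_pwait2, epfd, &ev, 1, &zero, NULL, (size_t)0);
    return !(ret < 0 && errno == ENOSYS);
#else
    (void)epfd;
    return 0;
#endif
}
```

The result would be cached in application state so the fast path never
has to look at errno for this purpose.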
> Might there be
> merit to doing both? From the conversion that we tried, the CTL variant
> definitely made things easier to port. The new syscall would make enable
> per-call delays however. There might be some merit to that, though I do
> think that max_events + min_time is how you'd control batching anything
> and that's suitably set in the context itself for most use cases.
I'm surprised a CTL variant is easier to port. An epoll_pwait3 with an
extra argument only needs to pass that argument to do_epoll_wait.
FWIW, when adding nsec resolution I initially opted for an init-based
approach, passing a new flag to epoll_create1. Feedback then was that
it was odd to have one syscall affect the behavior of another. The
final version just added a new epoll_pwait2 with timespec.
>> FWIW, when adding nsec resolution I initially opted for an init-based
>> approach, passing a new flag to epoll_create1. Feedback then was that
>> it was odd to have one syscall affect the behavior of another. The
>> final version just added a new epoll_pwait2 with timespec.
>
> I'm fine with just doing a pure syscall variant too, it was my original
> plan. Only changed it to allow for easier experimentation and adoption,
> and based on the fact that most use cases would likely use a fixed value
> per context anyway.
>
> I think it'd be a shame to drop the ctl, unless there's strong arguments
> against it. I'm quite happy to add a syscall variant too, that's not a
> big deal and would be a minor addition. Patch 6 should probably cut out
> the ctl addition and leave that for a patch 7, and then a patch 8 for
> adding a syscall.
I split the ctl patch out from the core change, and then took a look at
doing a syscall variant too. But there are a few complications there...
It would seem to make the most sense to build this on top of the newest
epoll wait syscall, epoll_pwait2(). But we're already at the max number
of arguments there...
Arguably pwait2 should've been converted to use some kind of versioned
struct instead. I'm going to take a stab at pwait3 with that kind of
interface.
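
One possible shape for such a versioned struct, as a sketch only. It
assumes a size-based scheme modeled on the kernel's
copy_struct_from_user() helper (the pattern used by clone3 and openat2):
the caller passes the struct size, older callers get new fields
zero-filled, and newer callers are rejected only if they set fields the
kernel does not know. struct epoll_wait_args_v1, its field names, and
copy_args_compat() are hypothetical, and the copy is modeled in plain
userspace C.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument block; field names are illustrative only. */
struct epoll_wait_args_v1 {
    uint64_t timeout_nsec;
    uint64_t min_wait_nsec;
};

/*
 * Userspace model of a size-versioned copy:
 *  - usize < ksize: old caller, missing tail is zero-filled
 *  - usize > ksize: new caller, extra tail must be all zero or -E2BIG
 */
static int copy_args_compat(void *dst, size_t ksize,
                            const void *src, size_t usize)
{
    const unsigned char *p = src;
    size_t n = usize < ksize ? usize : ksize;
    size_t i;

    for (i = ksize; i < usize; i++)
        if (p[i])
            return -E2BIG;

    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```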
--
Jens Axboe
On Sat, Nov 5, 2022 at 1:39 PM Jens Axboe <[email protected]> wrote:
>
> >> FWIW, when adding nsec resolution I initially opted for an init-based
> >> approach, passing a new flag to epoll_create1. Feedback then was that
> >> it was odd to have one syscall affect the behavior of another. The
> >> final version just added a new epoll_pwait2 with timespec.
> >
> > I'm fine with just doing a pure syscall variant too, it was my original
> > plan. Only changed it to allow for easier experimentation and adoption,
> > and based on the fact that most use cases would likely use a fixed value
> > per context anyway.
> >
> > I think it'd be a shame to drop the ctl, unless there's strong arguments
> > against it. I'm quite happy to add a syscall variant too, that's not a
> > big deal and would be a minor addition. Patch 6 should probably cut out
> > the ctl addition and leave that for a patch 7, and then a patch 8 for
> > adding a syscall.
> I split the ctl patch out from the core change, and then took a look at
> doing a syscall variant too. But there are a few complications there...
> It would seem to make the most sense to build this on top of the newest
> epoll wait syscall, epoll_pwait2(). But we're already at the max number
> of arguments there...
>
> Arguably pwait2 should've been converted to use some kind of versioned
> struct instead. I'm going to take a stab at pwait3 with that kind of
> interface.
Don't convert to a syscall approach based solely on my feedback. It
would be good to hear from others.
At a high level, I'm somewhat uncomfortable merging two syscalls for
behavior that already works, just to save half the syscall overhead.
There is no shortage of calls that may make some sense for a workload
to merge. Is the quoted 6-7% cpu cycle reduction due to saving one
SYSENTER/SYSEXIT (as the high resolution timer wake-up will be the
same), or am I missing something more fundamental?
On 11/5/22 12:05 PM, Willem de Bruijn wrote:
> On Sat, Nov 5, 2022 at 1:39 PM Jens Axboe <[email protected]> wrote:
>>
>>>> FWIW, when adding nsec resolution I initially opted for an init-based
>>>> approach, passing a new flag to epoll_create1. Feedback then was that
>>>> it was odd to have one syscall affect the behavior of another. The
>>>> final version just added a new epoll_pwait2 with timespec.
>>>
>>> I'm fine with just doing a pure syscall variant too, it was my original
>>> plan. Only changed it to allow for easier experimentation and adoption,
>>> and based on the fact that most use cases would likely use a fixed value
>>> per context anyway.
>>>
>>> I think it'd be a shame to drop the ctl, unless there's strong arguments
>>> against it. I'm quite happy to add a syscall variant too, that's not a
>>> big deal and would be a minor addition. Patch 6 should probably cut out
>>> the ctl addition and leave that for a patch 7, and then a patch 8 for
>>> adding a syscall.
>> I split the ctl patch out from the core change, and then took a look at
>> doing a syscall variant too. But there are a few complications there...
>> It would seem to make the most sense to build this on top of the newest
>> epoll wait syscall, epoll_pwait2(). But we're already at the max number
>> of arguments there...
>>
>> Arguably pwait2 should've been converted to use some kind of versioned
>> struct instead. I'm going to take a stab at pwait3 with that kind of
>> interface.
>
> Don't convert to a syscall approach based solely on my feedback. It
> would be good to hear from others.
It's not just based on your feedback. If you read the original cover
letter, that is the question posed in terms of API: ctl to
modify it, new syscall, or both? So I figured I should at least try and
see what the syscall would look like.
> At a high level, I'm somewhat uncomfortable merging two syscalls for
> behavior that already works, just to save half the syscall overhead.
> There is no shortage of calls that may make some sense for a workload
> to merge. Is the quoted 6-7% cpu cycle reduction due to saving one
> SYSENTER/SYSEXIT (as the high resolution timer wake-up will be the
> same), or am I missing something more fundamental?
No, it's not really related to saving a single syscall, and you'd
potentially save more than just one as well. If we look at the two
extremes of applications, one will be low load and you're handling
probably just 1 event per loop. Not really interesting. At the other
end, you're fully loaded, and by the time you check for events, you have
'maxevents' (or close to) available. That obviously reduces system
calls, but more importantly, it also allows the application to get some
batching effects from processing these events.
In the medium range, there's enough processing to react pretty quickly
to events coming in, and you then end up doing just 1 event (or close to
that). To overcome that, we have some applications that detect this
medium range and do an artificial sleep before calling epoll_wait().
That was a nice efficiency win for them. But we can do this a lot more
efficiently in the kernel. That was the idea behind this, and the
initial results from TAO (which does that sleep hack) proved it to be
more than worthwhile. Syscall reduction is one thing, improved batching
another, and just as important is the reduction in sleep+wakeup cycles.
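
To make the medium-load case concrete, here is a toy model (not kernel
code, and not the measurements from patch 6) that counts how many wait
calls an event loop needs for a given arrival pattern. It assumes a call
returns once at least one event is pending and either min_wait usec have
passed since entry or maxevents are pending. count_wait_calls() and the
fixed per-event processing cost are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * arrival[] holds sorted event arrival times in usec. Each loop
 * iteration models one wait call entered at time `now`, followed by
 * proc_usec of processing per returned event. maxevents must be >= 1.
 */
static unsigned count_wait_calls(const unsigned *arrival, size_t n,
                                 unsigned maxevents, unsigned min_wait,
                                 unsigned proc_usec)
{
    unsigned calls = 0, now = 0;
    size_t next = 0;

    while (next < n) {
        /* Earliest return with >= 1 event and min_wait satisfied. */
        unsigned ret = now + min_wait;
        unsigned got = 0;

        if (arrival[next] > ret)
            ret = arrival[next];

        /* A full batch ends the wait early. */
        if (n - next >= maxevents) {
            unsigned t_full = arrival[next + maxevents - 1];

            if (t_full < now)
                t_full = now;
            if (t_full < ret)
                ret = t_full;
        }

        /* Hand back everything that arrived by the return time. */
        while (next < n && got < maxevents && arrival[next] <= ret) {
            next++;
            got++;
        }
        now = ret + got * proc_usec;
        calls++;
    }
    return calls;
}
```

With eight events spaced 50 usec apart and 10 usec of processing each,
this model needs eight calls with min_wait 0 and three calls with
min_wait 200.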
--
Jens Axboe
From: Jens Axboe
> Sent: 05 November 2022 17:39
>
> >> FWIW, when adding nsec resolution I initially opted for an init-based
> >> approach, passing a new flag to epoll_create1. Feedback then was that
> >> it was odd to have one syscall affect the behavior of another. The
> >> final version just added a new epoll_pwait2 with timespec.
> >
> > I'm fine with just doing a pure syscall variant too, it was my original
> > plan. Only changed it to allow for easier experimentation and adoption,
> > and based on the fact that most use cases would likely use a fixed value
> > per context anyway.
> >
> > I think it'd be a shame to drop the ctl, unless there's strong arguments
> > against it. I'm quite happy to add a syscall variant too, that's not a
> > big deal and would be a minor addition. Patch 6 should probably cut out
> > the ctl addition and leave that for a patch 7, and then a patch 8 for
> > adding a syscall.
>
> I split the ctl patch out from the core change, and then took a look at
> doing a syscall variant too. But there are a few complications there...
> It would seem to make the most sense to build this on top of the newest
> epoll wait syscall, epoll_pwait2(). But we're already at the max number
> of arguments there...
>
> Arguably pwait2 should've been converted to use some kind of versioned
> struct instead. I'm going to take a stab at pwait3 with that kind of
> interface.
Adding an extra copy_from_user() adds a measurable overhead
to a system call - so you really don't want to do it unless
absolutely necessary.
I was wondering if you actually need two timeout parameters?
Could you just use a single bit (I presume one is available)
to request that the timeout be restarted when the first message
arrives, with the syscall then returning when either the timer
expires or the full number of events has been returned?
David
On Sat, Nov 5, 2022 at 2:46 PM Jens Axboe <[email protected]> wrote:
>
> On 11/5/22 12:05 PM, Willem de Bruijn wrote:
> > On Sat, Nov 5, 2022 at 1:39 PM Jens Axboe <[email protected]> wrote:
> >>
> >>>> FWIW, when adding nsec resolution I initially opted for an init-based
> >>>> approach, passing a new flag to epoll_create1. Feedback then was that
> >>>> it was odd to have one syscall affect the behavior of another. The
> >>>> final version just added a new epoll_pwait2 with timespec.
> >>>
> >>> I'm fine with just doing a pure syscall variant too, it was my original
> >>> plan. Only changed it to allow for easier experimentation and adoption,
> >>> and based on the fact that most use cases would likely use a fixed value
> >>> per context anyway.
> >>>
> >>> I think it'd be a shame to drop the ctl, unless there's strong arguments
> >>> against it. I'm quite happy to add a syscall variant too, that's not a
> >>> big deal and would be a minor addition. Patch 6 should probably cut out
> >>> the ctl addition and leave that for a patch 7, and then a patch 8 for
> >>> adding a syscall.
> >> I split the ctl patch out from the core change, and then took a look at
> >> doing a syscall variant too. But there are a few complications there...
> >> It would seem to make the most sense to build this on top of the newest
> >> epoll wait syscall, epoll_pwait2(). But we're already at the max number
> >> of arguments there...
> >>
> >> Arguably pwait2 should've been converted to use some kind of versioned
> >> struct instead. I'm going to take a stab at pwait3 with that kind of
> >> interface.
> >
> > Don't convert to a syscall approach based solely on my feedback. It
> > would be good to hear from others.
>
> It's not just based on your feedback, if you read the original cover
> letter, then that is the question that is posed in terms of API - ctl to
> modify it, new syscall, or both? So figured I should at least try and
> see what the syscall would look like.
>
> > At a high level, I'm somewhat uncomfortable merging two syscalls for
> > behavior that already works, just to save half the syscall overhead.
> > There is no shortage of calls that may make some sense for a workload
> > to merge. Is the quoted 6-7% cpu cycle reduction due to saving one
> > SYSENTER/SYSEXIT (as the high resolution timer wake-up will be the
> > same), or am I missing something more fundamental?
>
> No, it's not really related to saving a single syscall, and you'd
> potentially save more than just one as well. If we look at the two
> extremes of applications, one will be low load and you're handling
> probably just 1 event per loop. Not really interesting. At the other
> end, you're fully loaded, and by the time you check for events, you have
> 'maxevents' (or close to) available. That obviously reduces system
> calls, but more importantly, it also allows the application to get some
> batching effects from processing these events.
>
> In the medium range, there's enough processing to react pretty quickly
> to events coming in, and you then end up doing just 1 event (or close to
> that). To overcome that, we have some applications that detect this
> medium range and do an artificial sleep before calling epoll_wait().
> That was a nice effiency win for them. But we can do this a lot more
> efficiently in the kernel. That was the idea behind this, and the
> initial results from TAO (which does that sleep hack) proved it to be
> more than worthwhile. Syscall reduction is one thing, improved batching
> another, and just as importanly is sleep+wakeup reductions.
Thanks for the context.
So this is akin to interrupt moderation in network interfaces. Would
it make sense to wait for timeout or nr of events, whichever comes
first, similar to rx_usecs/rx_frames, instead of an unconditional
sleep at the start?
On 11/7/22 6:25 AM, Willem de Bruijn wrote:
> On Sat, Nov 5, 2022 at 2:46 PM Jens Axboe <[email protected]> wrote:
>>
>> On 11/5/22 12:05 PM, Willem de Bruijn wrote:
>>> On Sat, Nov 5, 2022 at 1:39 PM Jens Axboe <[email protected]> wrote:
>>>>
>>>>>> FWIW, when adding nsec resolution I initially opted for an init-based
>>>>>> approach, passing a new flag to epoll_create1. Feedback then was that
>>>>>> it was odd to have one syscall affect the behavior of another. The
>>>>>> final version just added a new epoll_pwait2 with timespec.
>>>>>
>>>>> I'm fine with just doing a pure syscall variant too, it was my original
>>>>> plan. Only changed it to allow for easier experimentation and adoption,
>>>>> and based on the fact that most use cases would likely use a fixed value
>>>>> per context anyway.
>>>>>
>>>>> I think it'd be a shame to drop the ctl, unless there's strong arguments
>>>>> against it. I'm quite happy to add a syscall variant too, that's not a
>>>>> big deal and would be a minor addition. Patch 6 should probably cut out
>>>>> the ctl addition and leave that for a patch 7, and then a patch 8 for
>>>>> adding a syscall.
>>>> I split the ctl patch out from the core change, and then took a look at
>>>> doing a syscall variant too. But there are a few complications there...
>>>> It would seem to make the most sense to build this on top of the newest
>>>> epoll wait syscall, epoll_pwait2(). But we're already at the max number
>>>> of arguments there...
>>>>
>>>> Arguably pwait2 should've been converted to use some kind of versioned
>>>> struct instead. I'm going to take a stab at pwait3 with that kind of
>>>> interface.
>>>
>>> Don't convert to a syscall approach based solely on my feedback. It
>>> would be good to hear from others.
>>
>> It's not just based on your feedback, if you read the original cover
>> letter, then that is the question that is posed in terms of API - ctl to
>> modify it, new syscall, or both? So figured I should at least try and
>> see what the syscall would look like.
>>
>>> At a high level, I'm somewhat uncomfortable merging two syscalls for
>>> behavior that already works, just to save half the syscall overhead.
>>> There is no shortage of calls that may make some sense for a workload
>>> to merge. Is the quoted 6-7% cpu cycle reduction due to saving one
>>> SYSENTER/SYSEXIT (as the high resolution timer wake-up will be the
>>> same), or am I missing something more fundamental?
>>
>> No, it's not really related to saving a single syscall, and you'd
>> potentially save more than just one as well. If we look at the two
>> extremes of applications, one will be low load and you're handling
>> probably just 1 event per loop. Not really interesting. At the other
>> end, you're fully loaded, and by the time you check for events, you have
>> 'maxevents' (or close to) available. That obviously reduces system
>> calls, but more importantly, it also allows the application to get some
>> batching effects from processing these events.
>>
>> In the medium range, there's enough processing to react pretty quickly
>> to events coming in, and you then end up doing just 1 event (or close to
>> that). To overcome that, we have some applications that detect this
>> medium range and do an artificial sleep before calling epoll_wait().
>> That was a nice effiency win for them. But we can do this a lot more
>> efficiently in the kernel. That was the idea behind this, and the
>> initial results from TAO (which does that sleep hack) proved it to be
>> more than worthwhile. Syscall reduction is one thing, improved batching
>> another, and just as importanly is sleep+wakeup reductions.
>
> Thanks for the context.
>
> So this is akin to interrupt moderation in network interfaces. Would
> it make sense to wait for timeout or nr of events, whichever comes
> first, similar to rx_usecs/rx_frames. Instead of an unconditional
> sleep at the start.
There's no unconditional sleep at the start with my patches, not sure
where you are getting that from. You already have 'nr of events', that's
the maxevents being passed in. If nr_available >= maxevents, then no
sleep will take place. We did debate doing a minevents kind of thing as
well, but the time based metric is more usable.
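
As a sketch of the semantics described here (a model of the behavior as
stated in this thread, not the actual fs/eventpoll.c logic), the return
condition can be written as a single predicate. should_return() and its
parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a wait call should return now.
 *  - a full batch returns immediately, min_wait never delays it
 *  - the overall timeout always ends the wait
 *  - otherwise at least one event AND min_wait elapsed are required
 */
static bool should_return(unsigned nr_available, unsigned maxevents,
                          uint64_t elapsed_usec, uint64_t min_wait_usec,
                          uint64_t timeout_usec)
{
    if (nr_available >= maxevents)
        return true;
    if (elapsed_usec >= timeout_usec)
        return true;
    return nr_available > 0 && elapsed_usec >= min_wait_usec;
}
```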
--
Jens Axboe
Hi Jens,
NICs and storage controllers have interrupt mitigation/coalescing
mechanisms that are similar.
NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
(counter) value. When a completion occurs, the device waits until the
timeout or until the completion counter value is reached.
If I've read the code correctly, min_wait is computed at the beginning
of epoll_wait(2). NVMe's Aggregation Time is computed from the first
completion.
It makes me wonder which approach is more useful for applications. With
the Aggregation Time approach applications can control how much extra
latency is added. What do you think about that approach?
Stefan
On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> Hi Jens,
> NICs and storage controllers have interrupt mitigation/coalescing
> mechanisms that are similar.
Yep
> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> (counter) value. When a completion occurs, the device waits until the
> timeout or until the completion counter value is reached.
>
> If I've read the code correctly, min_wait is computed at the beginning
> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> completion.
>
> It makes me wonder which approach is more useful for applications. With
> the Aggregation Time approach applications can control how much extra
> latency is added. What do you think about that approach?
We only tested the current approach, which is time noted from entry, not
from when the first event arrives. I suspect the NVMe approach is better
suited to the hw side. The epoll timeout helps ensure that we batch
within xx usec rather than xx usec + whatever the delay until the first
one arrives, which is why it's handled that way currently. That gives
you a fixed batch latency.
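
The difference between the two timing schemes can be sketched with a
small model. It ignores maxevents and assumes a single batching window;
struct wake and both helper names are hypothetical. Starting the window
at entry bounds the delay added to the first event by the window length,
while starting it at the first event (the NVMe-style Aggregation Time)
always adds the full window on top of however long the first event took
to arrive.

```c
#include <assert.h>

struct wake {
    unsigned ret_time;          /* when the wait call returns (usec) */
    unsigned first_event_delay; /* how long the first event sat pending */
};

/* Window measured from entry into the wait call (current approach). */
static struct wake wake_from_entry(unsigned entry, unsigned first_event,
                                   unsigned window)
{
    struct wake w;
    unsigned deadline = entry + window;

    w.ret_time = first_event > deadline ? first_event : deadline;
    w.first_event_delay = w.ret_time - first_event;
    return w;
}

/* Window measured from the first event (aggregation-time style). */
static struct wake wake_from_first_event(unsigned entry,
                                         unsigned first_event,
                                         unsigned window)
{
    struct wake w;
    unsigned start = first_event > entry ? first_event : entry;

    w.ret_time = start + window;
    w.first_event_delay = w.ret_time - first_event;
    return w;
}
```

For a 200 usec window entered at time 0, a first event at 150 usec is
returned at 200 usec in the entry-based model and at 350 usec in the
first-event model.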
--
Jens Axboe
On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>> Hi Jens,
>>> NICs and storage controllers have interrupt mitigation/coalescing
>>> mechanisms that are similar.
>>
>> Yep
>>
>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>> (counter) value. When a completion occurs, the device waits until the
>>> timeout or until the completion counter value is reached.
>>>
>>> If I've read the code correctly, min_wait is computed at the beginning
>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>> completion.
>>>
>>> It makes me wonder which approach is more useful for applications. With
>>> the Aggregation Time approach applications can control how much extra
>>> latency is added. What do you think about that approach?
>>
>> We only tested the current approach, which is time noted from entry, not
>> from when the first event arrives. I suspect the nvme approach is better
>> suited to the hw side, the epoll timeout helps ensure that we batch
>> within xx usec rather than xx usec + whatever the delay until the first
>> one arrives. Which is why it's handled that way currently. That gives
>> you a fixed batch latency.
>
> min_wait is fine when the goal is just maximizing throughput without any
> latency targets.
That's not true at all, I think you're in different time scales than
this would be used for.
> The min_wait approach makes it hard to set a useful upper bound on
> latency because unlucky requests that complete early experience much
> more latency than requests that complete later.
As mentioned in the cover letter or the main patch, this is most useful
for the medium load kind of scenarios. For high load, the min_wait time
ends up not mattering because you will hit maxevents first anyway. For
the testing that we did, the target was 200-300 usec, and 200 usec was
used for the actual test. Depending on the kind of traffic the
server is serving, that's usually not much of a concern. From your
reply, I'm guessing you're thinking of much higher min_wait numbers. I
don't think those would make sense. If your rate of arrival is low
enough that min_wait needs to be high to make a difference, then the
load is low enough anyway that it doesn't matter. Hence I'd argue that
it is indeed NOT hard to set a useful upper bound on latency, because
that is very much what min_wait is.
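
A back-of-the-envelope sketch of that argument, assuming a steady
arrival rate. expected_batch() is a hypothetical helper, and the rates
and the maxevents value of 64 in the usage note are illustrative, not
numbers from the test. It estimates how many events one wait call would
return: at low rates the window holds less than one event, at high rates
maxevents caps the batch before min_wait matters.

```c
#include <assert.h>

/*
 * Estimated events per wait call for a steady arrival rate.
 * The result is clamped to [1, maxevents]: the call still returns once
 * a single event shows up, and never returns more than maxevents.
 */
static unsigned expected_batch(unsigned events_per_sec,
                               unsigned min_wait_usec,
                               unsigned maxevents)
{
    unsigned long long in_window =
        (unsigned long long)events_per_sec * min_wait_usec / 1000000ULL;

    if (in_window < 1)
        in_window = 1;
    return in_window > maxevents ? maxevents : (unsigned)in_window;
}
```

With min_wait at 200 usec and maxevents at 64, 1,000 events/sec gives a
batch of 1, 50,000 events/sec gives 10, and 1,000,000 events/sec is
capped at 64.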
I'm happy to argue merits of one approach over another, but keep in mind
that this particular approach was not pulled out of thin air AND it has
actually been tested and verified successfully on a production workload.
This isn't a hypothetical benchmark kind of setup.
--
Jens Axboe
On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> > Hi Jens,
> > NICs and storage controllers have interrupt mitigation/coalescing
> > mechanisms that are similar.
>
> Yep
>
> > NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> > (counter) value. When a completion occurs, the device waits until the
> > timeout or until the completion counter value is reached.
> >
> > If I've read the code correctly, min_wait is computed at the beginning
> > of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> > completion.
> >
> > It makes me wonder which approach is more useful for applications. With
> > the Aggregation Time approach applications can control how much extra
> > latency is added. What do you think about that approach?
>
> We only tested the current approach, which is time noted from entry, not
> from when the first event arrives. I suspect the nvme approach is better
> suited to the hw side, the epoll timeout helps ensure that we batch
> within xx usec rather than xx usec + whatever the delay until the first
> one arrives. Which is why it's handled that way currently. That gives
> you a fixed batch latency.
min_wait is fine when the goal is just maximizing throughput without any
latency targets.
The min_wait approach makes it hard to set a useful upper bound on
latency because unlucky requests that complete early experience much
more latency than requests that complete later.
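To make the difference concrete, here is a rough model, not taken from
the patches, of how long an event sits queued before the waiter returns
under each scheme. It ignores maxevents and scheduling overhead, and the
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extra time (usec) an event spends queued when the batching window
 * starts at epoll_wait() entry, as min_wait does in this series.
 * 'arrival' is measured in usec from entry.
 */
static uint64_t delay_entry_anchored(uint64_t arrival, uint64_t window)
{
	return arrival < window ? window - arrival : 0;
}

/*
 * Same quantity when the window starts at the first event instead
 * (NVMe-style Aggregation Time). 'first' is the arrival time of the
 * first event, also in usec from entry.
 */
static uint64_t delay_first_anchored(uint64_t arrival, uint64_t first,
				     uint64_t window)
{
	uint64_t deadline = first + window;

	assert(arrival >= first);
	return arrival < deadline ? deadline - arrival : 0;
}
```

With a 200 usec window, an event arriving 10 usec after entry waits
190 usec in the entry-anchored model, while one arriving at 150 usec
waits 50 usec. In the first-event model the first event always waits
the full window, but the time spent in the syscall is no longer fixed.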
Stefan
On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>> Hi Jens,
>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>> mechanisms that are similar.
>>>>
>>>> Yep
>>>>
>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>> (counter) value. When a completion occurs, the device waits until the
>>>>> timeout or until the completion counter value is reached.
>>>>>
>>>>> If I've read the code correctly, min_wait is computed at the beginning
>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>>>> completion.
>>>>>
>>>>> It makes me wonder which approach is more useful for applications. With
>>>>> the Aggregation Time approach applications can control how much extra
>>>>> latency is added. What do you think about that approach?
>>>>
>>>> We only tested the current approach, which is time noted from entry, not
>>>> from when the first event arrives. I suspect the nvme approach is better
>>>> suited to the hw side, the epoll timeout helps ensure that we batch
>>>> within xx usec rather than xx usec + whatever the delay until the first
>>>> one arrives. Which is why it's handled that way currently. That gives
>>>> you a fixed batch latency.
>>>
>>> min_wait is fine when the goal is just maximizing throughput without any
>>> latency targets.
>>
>> That's not true at all, I think you're in different time scales than
>> this would be used for.
>>
>>> The min_wait approach makes it hard to set a useful upper bound on
>>> latency because unlucky requests that complete early experience much
>>> more latency than requests that complete later.
>>
>> As mentioned in the cover letter or the main patch, this is most useful
>> for the medium load kind of scenarios. For high load, the min_wait time
>> ends up not mattering because you will hit maxevents first anyway. For
>> the testing that we did, the target was 2-300 usec, and 200 usec was
>> used for the actual test. Depending on what the kind of traffic the
>> server is serving, that's usually not much of a concern. From your
>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>> don't think those would make sense. If your rate of arrival is low
>> enough that min_wait needs to be high to make a difference, then the
>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>> it is indeed NOT hard to set a useful upper bound on latency, because
>> that is very much what min_wait is.
>>
>> I'm happy to argue merits of one approach over another, but keep in mind
>> that this particular approach was not pulled out of thin air AND it has
>> actually been tested and verified successfully on a production workload.
>> This isn't a hypothetical benchmark kind of setup.
>
> Fair enough. I just wanted to make sure the syscall interface that gets
> merged is as useful as possible.
That is indeed the main discussion as far as I'm concerned - syscall,
ctl, or both? At this point I'm inclined to just push forward with the
ctl addition. A new syscall can always be added, and if we do, then it'd
be nice to make one that will work going forward so we don't have to
keep adding epoll_wait variants...
--
Jens Axboe
On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> > On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>> Hi Jens,
> >>> NICs and storage controllers have interrupt mitigation/coalescing
> >>> mechanisms that are similar.
> >>
> >> Yep
> >>
> >>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>> (counter) value. When a completion occurs, the device waits until the
> >>> timeout or until the completion counter value is reached.
> >>>
> >>> If I've read the code correctly, min_wait is computed at the beginning
> >>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>> completion.
> >>>
> >>> It makes me wonder which approach is more useful for applications. With
> >>> the Aggregation Time approach applications can control how much extra
> >>> latency is added. What do you think about that approach?
> >>
> >> We only tested the current approach, which is time noted from entry, not
> >> from when the first event arrives. I suspect the nvme approach is better
> >> suited to the hw side, the epoll timeout helps ensure that we batch
> >> within xx usec rather than xx usec + whatever the delay until the first
> >> one arrives. Which is why it's handled that way currently. That gives
> >> you a fixed batch latency.
> >
> > min_wait is fine when the goal is just maximizing throughput without any
> > latency targets.
>
> That's not true at all, I think you're in different time scales than
> this would be used for.
>
> > The min_wait approach makes it hard to set a useful upper bound on
> > latency because unlucky requests that complete early experience much
> > more latency than requests that complete later.
>
> As mentioned in the cover letter or the main patch, this is most useful
> for the medium load kind of scenarios. For high load, the min_wait time
> ends up not mattering because you will hit maxevents first anyway. For
> the testing that we did, the target was 2-300 usec, and 200 usec was
> used for the actual test. Depending on what the kind of traffic the
> server is serving, that's usually not much of a concern. From your
> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> don't think those would make sense. If your rate of arrival is low
> enough that min_wait needs to be high to make a difference, then the
> load is low enough anyway that it doesn't matter. Hence I'd argue that
> it is indeed NOT hard to set a useful upper bound on latency, because
> that is very much what min_wait is.
>
> I'm happy to argue merits of one approach over another, but keep in mind
> that this particular approach was not pulled out of thin air AND it has
> actually been tested and verified successfully on a production workload.
> This isn't a hypothetical benchmark kind of setup.
Fair enough. I just wanted to make sure the syscall interface that gets
merged is as useful as possible.
Thanks,
Stefan
On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>> Hi Jens,
> >>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>> mechanisms that are similar.
> >>>>
> >>>> Yep
> >>>>
> >>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>> (counter) value. When a completion occurs, the device waits until the
> >>>>> timeout or until the completion counter value is reached.
> >>>>>
> >>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>> completion.
> >>>>>
> >>>>> It makes me wonder which approach is more useful for applications. With
> >>>>> the Aggregation Time approach applications can control how much extra
> >>>>> latency is added. What do you think about that approach?
> >>>>
> >>>> We only tested the current approach, which is time noted from entry, not
> >>>> from when the first event arrives. I suspect the nvme approach is better
> >>>> suited to the hw side, the epoll timeout helps ensure that we batch
> >>>> within xx usec rather than xx usec + whatever the delay until the first
> >>>> one arrives. Which is why it's handled that way currently. That gives
> >>>> you a fixed batch latency.
> >>>
> >>> min_wait is fine when the goal is just maximizing throughput without any
> >>> latency targets.
> >>
> >> That's not true at all, I think you're in different time scales than
> >> this would be used for.
> >>
> >>> The min_wait approach makes it hard to set a useful upper bound on
> >>> latency because unlucky requests that complete early experience much
> >>> more latency than requests that complete later.
> >>
> >> As mentioned in the cover letter or the main patch, this is most useful
> >> for the medium load kind of scenarios. For high load, the min_wait time
> >> ends up not mattering because you will hit maxevents first anyway. For
> >> the testing that we did, the target was 2-300 usec, and 200 usec was
> >> used for the actual test. Depending on what the kind of traffic the
> >> server is serving, that's usually not much of a concern. From your
> >> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >> don't think those would make sense. If your rate of arrival is low
> >> enough that min_wait needs to be high to make a difference, then the
> >> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >> it is indeed NOT hard to set a useful upper bound on latency, because
> >> that is very much what min_wait is.
> >>
> >> I'm happy to argue merits of one approach over another, but keep in mind
> >> that this particular approach was not pulled out of thin air AND it has
> >> actually been tested and verified successfully on a production workload.
> >> This isn't a hypothetical benchmark kind of setup.
> >
> > Fair enough. I just wanted to make sure the syscall interface that gets
> > merged is as useful as possible.
>
> That is indeed the main discussion as far as I'm concerned - syscall,
> ctl, or both? At this point I'm inclined to just push forward with the
> ctl addition. A new syscall can always be added, and if we do, then it'd
> be nice to make one that will work going forward so we don't have to
> keep adding epoll_wait variants...
epoll_wait3() would be consistent with how maxevents and timeout work.
It does not suffer from extra ctl syscall overhead when applications
need to change min_wait.
The way the current patches add min_wait into epoll_ctl() seems hacky to
me. struct epoll_event was meant for file descriptor event entries. It
won't necessarily be large enough for future extensions (luckily
min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
ioctl()/setsockopt()-style interface, which is bad for anything that
needs to understand syscalls, like seccomp. A properly typed
epoll_wait3() seems cleaner to me.
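For reference, a small sketch of how the value travels today: the test
case in the patch stores the min_wait time (in usec) in the 64-bit data
union of struct epoll_event. The helper name below is made up; only
struct epoll_event and its data.u64 member come from the existing uapi.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

/* The data union is exactly 64 bits wide, which is all min_wait needs. */
static_assert(sizeof(epoll_data_t) == sizeof(uint64_t),
	      "epoll_data_t is expected to be 64 bits wide");

/* Pack a min_wait value (usec) the way the test case in the patch does. */
static struct epoll_event pack_min_wait(uint64_t usec)
{
	struct epoll_event ev;

	memset(&ev, 0, sizeof(ev));	/* ev.events carries no meaning here */
	ev.data.u64 = usec;
	return ev;
}
```

Anything wider than 64 bits, or a second parameter, would not fit
without repurposing ev.events, which is the extensibility concern.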
Stefan
On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
>> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
>>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>>>> Hi Jens,
>>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>>>> mechanisms that are similar.
>>>>>>
>>>>>> Yep
>>>>>>
>>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>>>> (counter) value. When a completion occurs, the device waits until the
>>>>>>> timeout or until the completion counter value is reached.
>>>>>>>
>>>>>>> If I've read the code correctly, min_wait is computed at the beginning
>>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>>>>>> completion.
>>>>>>>
>>>>>>> It makes me wonder which approach is more useful for applications. With
>>>>>>> the Aggregation Time approach applications can control how much extra
>>>>>>> latency is added. What do you think about that approach?
>>>>>>
>>>>>> We only tested the current approach, which is time noted from entry, not
>>>>>> from when the first event arrives. I suspect the nvme approach is better
>>>>>> suited to the hw side, the epoll timeout helps ensure that we batch
>>>>>> within xx usec rather than xx usec + whatever the delay until the first
>>>>>> one arrives. Which is why it's handled that way currently. That gives
>>>>>> you a fixed batch latency.
>>>>>
>>>>> min_wait is fine when the goal is just maximizing throughput without any
>>>>> latency targets.
>>>>
>>>> That's not true at all, I think you're in different time scales than
>>>> this would be used for.
>>>>
>>>>> The min_wait approach makes it hard to set a useful upper bound on
>>>>> latency because unlucky requests that complete early experience much
>>>>> more latency than requests that complete later.
>>>>
>>>> As mentioned in the cover letter or the main patch, this is most useful
>>>> for the medium load kind of scenarios. For high load, the min_wait time
>>>> ends up not mattering because you will hit maxevents first anyway. For
>>>> the testing that we did, the target was 2-300 usec, and 200 usec was
>>>> used for the actual test. Depending on what the kind of traffic the
>>>> server is serving, that's usually not much of a concern. From your
>>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>>>> don't think those would make sense. If your rate of arrival is low
>>>> enough that min_wait needs to be high to make a difference, then the
>>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>>>> it is indeed NOT hard to set a useful upper bound on latency, because
>>>> that is very much what min_wait is.
>>>>
>>>> I'm happy to argue merits of one approach over another, but keep in mind
>>>> that this particular approach was not pulled out of thin air AND it has
>>>> actually been tested and verified successfully on a production workload.
>>>> This isn't a hypothetical benchmark kind of setup.
>>>
>>> Fair enough. I just wanted to make sure the syscall interface that gets
>>> merged is as useful as possible.
>>
>> That is indeed the main discussion as far as I'm concerned - syscall,
>> ctl, or both? At this point I'm inclined to just push forward with the
>> ctl addition. A new syscall can always be added, and if we do, then it'd
>> be nice to make one that will work going forward so we don't have to
>> keep adding epoll_wait variants...
>
> epoll_wait3() would be consistent with how maxevents and timeout work.
> It does not suffer from extra ctl syscall overhead when applications
> need to change min_wait.
>
> The way the current patches add min_wait into epoll_ctl() seems hacky to
> me. struct epoll_event was meant for file descriptor event entries. It
> won't necessarily be large enough for future extensions (luckily
> min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> ioctl()/setsockopt()-style interface, which is bad for anything that
> needs to understand syscalls, like seccomp. A properly typed
> epoll_wait3() seems cleaner to me.
The ctl method is definitely a bit of an oddball. I've highlighted why
I went that way in earlier emails, but in summary:
- Makes it easy to adopt, just adding two lines at init time.
- Moves detection of availability to init time as well, rather than
the fast path.
I don't think anyone would want to often change the wait, it's
something you'd set at init time. If you often want to change values
for some reason, then obviously a syscall parameter would be a lot
better.
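As a sketch of what the init-time adoption could look like in an
application: EPOLL_CTL_MIN_WAIT and its value of 4 are taken from this
series and are not in mainline headers, and the helper name is made up
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Value proposed by this series; not part of mainline uapi headers. */
#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT 4
#endif

/*
 * Try to set min_wait (usec) on an epoll fd at init time.
 * Returns true if the kernel accepted it. On false, the application
 * keeps whatever sleep/batching logic it already had.
 */
static bool try_set_min_wait(int efd, uint64_t usec)
{
	struct epoll_event ev;

	memset(&ev, 0, sizeof(ev));
	ev.data.u64 = usec;
	/* The patch returns the old value on success, so >= 0 means OK. */
	return epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) >= 0;
}
```

On a kernel without the patch the call is expected to fail, so the
result can be cached once and the fast path never has to probe again.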
epoll_pwait3() would be vastly different than the other ones, simply
because epoll_pwait2() is already using the maximum number of args.
We'd need to add an epoll syscall struct at that point, probably
with flags telling us if signal_struct or timeout is actually valid.
This is not to say I don't think we should add a syscall interface,
just some of the arguments pro and con from having actually looked
at it.
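Purely as an illustration of the struct-plus-flags idea, something
along these lines. Every name and the layout here are hypothetical;
nothing like this exists in the patches or in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: which optional fields the caller filled in. */
#define EPW_HAS_TIMEOUT		(1U << 0)
#define EPW_HAS_SIGMASK		(1U << 1)
#define EPW_HAS_MIN_WAIT	(1U << 2)
#define EPW_KNOWN_FLAGS \
	(EPW_HAS_TIMEOUT | EPW_HAS_SIGMASK | EPW_HAS_MIN_WAIT)

/* Hypothetical argument block; the layout is illustrative only. */
struct epoll_wait_args {
	uint32_t flags;
	uint32_t maxevents;
	uint64_t events;	/* user pointer to the event array */
	uint64_t timeout;	/* user pointer, if EPW_HAS_TIMEOUT */
	uint64_t sigmask;	/* user pointer, if EPW_HAS_SIGMASK */
	uint64_t sigsetsize;
	uint64_t min_wait_usec;	/* valid if EPW_HAS_MIN_WAIT */
};

/* Reject unknown flags so that new bits can be added later. */
static bool epoll_wait_args_valid(const struct epoll_wait_args *a)
{
	if (a->flags & ~EPW_KNOWN_FLAGS)
		return false;
	/* A value without its flag is most likely a caller bug. */
	if (!(a->flags & EPW_HAS_MIN_WAIT) && a->min_wait_usec)
		return false;
	return a->maxevents > 0;
}
```

Rejecting unknown flag bits up front is what would let such a struct
grow without needing yet another epoll_wait variant.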
--
Jens Axboe
On Tue, Nov 08, 2022 at 10:28:37AM -0700, Jens Axboe wrote:
> On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> >> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> >>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>>>> Hi Jens,
> >>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>>>> mechanisms that are similar.
> >>>>>>
> >>>>>> Yep
> >>>>>>
> >>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>>>> (counter) value. When a completion occurs, the device waits until the
> >>>>>>> timeout or until the completion counter value is reached.
> >>>>>>>
> >>>>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>>>> completion.
> >>>>>>>
> >>>>>>> It makes me wonder which approach is more useful for applications. With
> >>>>>>> the Aggregation Time approach applications can control how much extra
> >>>>>>> latency is added. What do you think about that approach?
> >>>>>>
> >>>>>> We only tested the current approach, which is time noted from entry, not
> >>>>>> from when the first event arrives. I suspect the nvme approach is better
> >>>>>> suited to the hw side, the epoll timeout helps ensure that we batch
> >>>>>> within xx usec rather than xx usec + whatever the delay until the first
> >>>>>> one arrives. Which is why it's handled that way currently. That gives
> >>>>>> you a fixed batch latency.
> >>>>>
> >>>>> min_wait is fine when the goal is just maximizing throughput without any
> >>>>> latency targets.
> >>>>
> >>>> That's not true at all, I think you're in different time scales than
> >>>> this would be used for.
> >>>>
> >>>>> The min_wait approach makes it hard to set a useful upper bound on
> >>>>> latency because unlucky requests that complete early experience much
> >>>>> more latency than requests that complete later.
> >>>>
> >>>> As mentioned in the cover letter or the main patch, this is most useful
> >>>> for the medium load kind of scenarios. For high load, the min_wait time
> >>>> ends up not mattering because you will hit maxevents first anyway. For
> >>>> the testing that we did, the target was 2-300 usec, and 200 usec was
> >>>> used for the actual test. Depending on what the kind of traffic the
> >>>> server is serving, that's usually not much of a concern. From your
> >>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >>>> don't think those would make sense. If your rate of arrival is low
> >>>> enough that min_wait needs to be high to make a difference, then the
> >>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >>>> it is indeed NOT hard to set a useful upper bound on latency, because
> >>>> that is very much what min_wait is.
> >>>>
> >>>> I'm happy to argue merits of one approach over another, but keep in mind
> >>>> that this particular approach was not pulled out of thin air AND it has
> >>>> actually been tested and verified successfully on a production workload.
> >>>> This isn't a hypothetical benchmark kind of setup.
> >>>
> >>> Fair enough. I just wanted to make sure the syscall interface that gets
> >>> merged is as useful as possible.
> >>
> >> That is indeed the main discussion as far as I'm concerned - syscall,
> >> ctl, or both? At this point I'm inclined to just push forward with the
> >> ctl addition. A new syscall can always be added, and if we do, then it'd
> >> be nice to make one that will work going forward so we don't have to
> >> keep adding epoll_wait variants...
> >
> > epoll_wait3() would be consistent with how maxevents and timeout work.
> > It does not suffer from extra ctl syscall overhead when applications
> > need to change min_wait.
> >
> > The way the current patches add min_wait into epoll_ctl() seems hacky to
> > me. struct epoll_event was meant for file descriptor event entries. It
> > won't necessarily be large enough for future extensions (luckily
> > min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> > ioctl()/setsockopt()-style interface, which is bad for anything that
> > needs to understand syscalls, like seccomp. A properly typed
> > epoll_wait3() seems cleaner to me.
>
> The ctl method is definitely a bit of an oddball. I've highlighted why
> I went that way in earlier emails, but in summary:
>
> - Makes it easy to adopt, just adding two lines at init time.
>
> - Moves detection of availability to init time as well, rather than
> the fast path.
Add an epoll_create1() flag to test for availability?
> I don't think anyone would want to often change the wait, it's
> something you'd set at init time. If you often want to change values
> for some reason, then obviously a syscall parameter would be a lot
> better.
>
> epoll_pwait3() would be vastly different than the other ones, simply
> because epoll_pwait2() is already using the maximum number of args.
> We'd need to add an epoll syscall struct at that point, probably
> with flags telling us if signal_struct or timeout is actually valid.
Yes :/.
> This is not to say I don't think we should add a syscall interface,
> just some of the arguments pro and con from having actually looked
> at it.
>
> --
> Jens Axboe
>
>
On Sun, Oct 30, 2022 at 04:02:03PM -0600, Jens Axboe wrote:
> Rather than just have a timeout value for waiting on events, add
> EPOLL_CTL_MIN_WAIT to allow setting a minimum time that epoll_wait()
> should always wait for events to arrive.
>
> For medium workload efficiencies, some production workloads inject
> artificial timers or sleeps before calling epoll_wait() to get
> better batching and higher efficiencies. While this does help, it's
> not as efficient as it could be. By adding support for epoll_wait()
> for this directly, we can avoid extra context switches and scheduler
> and timer overhead.
>
> As an example, running an AB test on an identical workload at about
> ~370K reqs/second, without this change and with the sleep hack
> mentioned above (using 200 usec as the timeout), we're doing 310K-340K
> non-voluntary context switches per second. Idle CPU on the host is 27-34%.
> With the sleep hack removed and epoll set to the same 200 usec
> value, we're handling the exact same load but at 292K-315K non-voluntary
> context switches and idle CPU of 33-41%, a substantial win.
>
> Basic test case:
>
> struct d {
> int p1, p2;
> };
>
> static void *fn(void *data)
> {
> struct d *d = data;
> char b = 0x89;
>
> /* Generate 2 events 10 msec apart */
> usleep(10000);
> write(d->p1, &b, sizeof(b));
> usleep(10000);
> write(d->p2, &b, sizeof(b));
>
> return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> struct epoll_event ev, events[2];
> pthread_t thread;
> int p1[2], p2[2];
> struct d d;
> int efd, ret;
>
> efd = epoll_create1(0);
> if (efd < 0) {
> perror("epoll_create");
> return 1;
> }
>
> if (pipe(p1) < 0) {
> perror("pipe");
> return 1;
> }
> if (pipe(p2) < 0) {
> perror("pipe");
> return 1;
> }
>
> ev.events = EPOLLIN;
> ev.data.fd = p1[0];
> if (epoll_ctl(efd, EPOLL_CTL_ADD, p1[0], &ev) < 0) {
> perror("epoll add");
> return 1;
> }
> ev.events = EPOLLIN;
> ev.data.fd = p2[0];
> if (epoll_ctl(efd, EPOLL_CTL_ADD, p2[0], &ev) < 0) {
> perror("epoll add");
> return 1;
> }
>
> /* always wait 200 msec for events */
> ev.data.u64 = 200000;
> if (epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0) {
> perror("epoll add set timeout");
> return 1;
> }
>
> d.p1 = p1[1];
> d.p2 = p2[1];
> pthread_create(&thread, NULL, fn, &d);
>
> /* expect to get 2 events here rather than just 1 */
> ret = epoll_wait(efd, events, 2, -1);
> printf("epoll_wait=%d\n", ret);
>
> return 0;
> }
It might be worth adding a note in the commit message stating that
EPOLL_CTL_MIN_WAIT is a no-op when timeout is 0. This is the desired
behavior, but it's not easy to see in the flow.
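Spelled out as a tiny userspace model of that rule (the function name is
made up, and this only mirrors the behavior described above, not the
actual kernel flow):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * timeout_ms is the value passed to epoll_wait(): 0 means a
 * nonblocking poll, -1 means wait forever. min_wait_usec is the value
 * set through EPOLL_CTL_MIN_WAIT, with 0 meaning "not set".
 */
static bool min_wait_applies(int timeout_ms, unsigned int min_wait_usec)
{
	/* A nonblocking poll must return immediately, min_wait or not. */
	if (timeout_ms == 0)
		return false;
	return min_wait_usec != 0;
}
```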
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/eventpoll.c | 97 +++++++++++++++++++++++++++++-----
> include/linux/eventpoll.h | 2 +-
> include/uapi/linux/eventpoll.h | 1 +
> 3 files changed, 85 insertions(+), 15 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 962d897bbfc6..9e00f8780ec5 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -117,6 +117,9 @@ struct eppoll_entry {
> /* The "base" pointer is set to the container "struct epitem" */
> struct epitem *base;
>
> + /* min wait time if (min_wait_ts) & 1 != 0 */
> + ktime_t min_wait_ts;
> +
> /*
> * Wait queue item that will be linked to the target file wait
> * queue head.
> @@ -217,6 +220,9 @@ struct eventpoll {
> u64 gen;
> struct hlist_head refs;
>
> + /* min wait for epoll_wait() */
> + unsigned int min_wait_ts;
> +
> #ifdef CONFIG_NET_RX_BUSY_POLL
> /* used to track busy poll napi_id */
> unsigned int napi_id;
> @@ -1747,6 +1753,32 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
> return to;
> }
>
> +struct epoll_wq {
> + wait_queue_entry_t wait;
> + struct hrtimer timer;
> + ktime_t timeout_ts;
> + ktime_t min_wait_ts;
> + struct eventpoll *ep;
> + bool timed_out;
> + int maxevents;
> + int wakeups;
> +};
> +
> +static bool ep_should_min_wait(struct epoll_wq *ewq)
> +{
> + if (ewq->min_wait_ts & 1) {
> + /* just an approximation */
> + if (++ewq->wakeups >= ewq->maxevents)
> + goto stop_wait;
Is there a way to short cut the wait if the process is being terminated?
We had issues in production systems in the past where too many threads were
in epoll_wait and the process got terminated. It'd be nice if these
threads could exit the syscall as fast as possible.
> + if (ktime_before(ktime_get_ns(), ewq->min_wait_ts))
> + return true;
> + }
> +
> +stop_wait:
> + ewq->min_wait_ts &= ~(u64) 1;
> + return false;
> +}
> +
> /*
> * autoremove_wake_function, but remove even on failure to wake up, because we
> * know that default_wake_function/ttwu will only fail if the thread is already
> @@ -1756,27 +1788,37 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
> static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
> unsigned int mode, int sync, void *key)
> {
> - int ret = default_wake_function(wq_entry, mode, sync, key);
> + struct epoll_wq *ewq = container_of(wq_entry, struct epoll_wq, wait);
> + int ret;
> +
> + /*
> + * If min wait time hasn't been satisfied yet, keep waiting
> + */
> + if (ep_should_min_wait(ewq))
> + return 0;
>
> + ret = default_wake_function(wq_entry, mode, sync, key);
> list_del_init(&wq_entry->entry);
> return ret;
> }
>
> -struct epoll_wq {
> - wait_queue_entry_t wait;
> - struct hrtimer timer;
> - ktime_t timeout_ts;
> - bool timed_out;
> -};
> -
> static enum hrtimer_restart ep_timer(struct hrtimer *timer)
> {
> struct epoll_wq *ewq = container_of(timer, struct epoll_wq, timer);
> struct task_struct *task = ewq->wait.private;
> + const bool is_min_wait = ewq->min_wait_ts & 1;
> +
> + if (!is_min_wait || ep_events_available(ewq->ep)) {
> + if (!is_min_wait)
> + ewq->timed_out = true;
> + ewq->min_wait_ts &= ~(u64) 1;
> + wake_up_process(task);
> + return HRTIMER_NORESTART;
> + }
>
> - ewq->timed_out = true;
> - wake_up_process(task);
> - return HRTIMER_NORESTART;
> + ewq->min_wait_ts &= ~(u64) 1;
> + hrtimer_set_expires_range_ns(&ewq->timer, ewq->timeout_ts, 0);
> + return HRTIMER_RESTART;
> }
>
> static void ep_schedule(struct eventpoll *ep, struct epoll_wq *ewq, ktime_t *to,
> @@ -1831,12 +1873,16 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>
> lockdep_assert_irqs_enabled();
>
> + ewq.min_wait_ts = 0;
> + ewq.ep = ep;
> + ewq.maxevents = maxevents;
> ewq.timed_out = false;
> + ewq.wakeups = 0;
>
> if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
> slack = select_estimate_accuracy(timeout);
> + ewq.timeout_ts = timespec64_to_ktime(*timeout);
> to = &ewq.timeout_ts;
> - *to = timespec64_to_ktime(*timeout);
> } else if (timeout) {
> /*
> * Avoid the unnecessary trip to the wait queue loop, if the
> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> ewq.timed_out = true;
> }
>
> + /*
> + * If min_wait is set for this epoll instance, note the min_wait
> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
> + * the state bit for whether or not min_wait is enabled.
> + */
> + if (ep->min_wait_ts) {
Can we limit this block to "!ewq.timed_out && ep->min_wait_ts"?
AFAICT, the code we run here is completely wasted if timeout is 0.
> + ewq.min_wait_ts = ktime_add_us(ktime_get_ns(),
> + ep->min_wait_ts);
> + ewq.min_wait_ts |= (u64) 1;
> + to = &ewq.min_wait_ts;
> + }
> +
> /*
> * This call is racy: We may or may not see events that are being added
> * to the ready list under the lock (e.g., in IRQ callbacks). For cases
> @@ -1913,7 +1971,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> * important.
> */
> eavail = ep_events_available(ep);
> - if (!eavail) {
> + if (!eavail || ewq.min_wait_ts & 1) {
> __add_wait_queue_exclusive(&ep->wq, &ewq.wait);
> write_unlock_irq(&ep->lock);
> ep_schedule(ep, &ewq, to, slack);
> @@ -2125,6 +2183,17 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
> */
> ep = f.file->private_data;
>
> + /*
> + * Handle EPOLL_CTL_MIN_WAIT upfront as we don't need to care about
> + * the fd being passed in.
> + */
> + if (op == EPOLL_CTL_MIN_WAIT) {
> + /* return old value */
> + error = ep->min_wait_ts;
> + ep->min_wait_ts = epds->data;
> + goto error_fput;
> + }
> +
> /* Get the "struct file *" for the target file */
> tf = fdget(fd);
> if (!tf.file)
> @@ -2257,7 +2326,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> {
> struct epoll_event epds;
>
> - if (ep_op_has_event(op) &&
> + if ((ep_op_has_event(op) || op == EPOLL_CTL_MIN_WAIT) &&
> copy_from_user(&epds, event, sizeof(struct epoll_event)))
> return -EFAULT;
>
> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
> index 3337745d81bd..cbef635cb7e4 100644
> --- a/include/linux/eventpoll.h
> +++ b/include/linux/eventpoll.h
> @@ -59,7 +59,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
> /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
> static inline int ep_op_has_event(int op)
> {
> - return op != EPOLL_CTL_DEL;
> + return op != EPOLL_CTL_DEL && op != EPOLL_CTL_MIN_WAIT;
> }
>
> #else
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index 8a3432d0f0dc..81ecb1ca36e0 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,7 @@
> #define EPOLL_CTL_ADD 1
> #define EPOLL_CTL_DEL 2
> #define EPOLL_CTL_MOD 3
> +#define EPOLL_CTL_MIN_WAIT 4
Have you considered introducing another epoll_pwait syscall variant?
That has the major benefit that min wait can be different per poller,
on the different epollfd. The usage would also be more readable:
"epoll for X amount of time but don't return sooner than Y."
This would be similar to the approach that [email protected] used
when introducing epoll_pwait2.
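To illustrate the call shape, here is a userspace approximation of
"wait up to X, but don't return sooner than Y". The wrapper name and
signature are hypothetical, and it is essentially the sleep hack from
the commit message, so it has the overhead the patch is trying to
remove and does not return early once maxevents is reached.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: sleep for min_wait first, then collect events. */
static int epoll_wait_min(int epfd, struct epoll_event *events,
			  int maxevents, int timeout_ms,
			  unsigned int min_wait_usec)
{
	struct timespec ts = {
		.tv_sec = min_wait_usec / 1000000,
		.tv_nsec = (long)(min_wait_usec % 1000000) * 1000,
	};
	int ret;

	/* A zero timeout is a nonblocking poll; min_wait does not apply. */
	if (timeout_ms == 0 || min_wait_usec == 0)
		return epoll_wait(epfd, events, maxevents, timeout_ms);

	/* Let events accumulate; an early wakeup by a signal is tolerated. */
	nanosleep(&ts, NULL);

	/* Pick up whatever arrived during the sleep without blocking. */
	ret = epoll_wait(epfd, events, maxevents, 0);
	if (ret != 0)
		return ret;

	/* Nothing yet: do a normal wait for the rest of the timeout. */
	if (timeout_ms > 0) {
		int min_ms = (int)(min_wait_usec / 1000);

		timeout_ms = timeout_ms > min_ms ? timeout_ms - min_ms : 0;
	}
	return epoll_wait(epfd, events, maxevents, timeout_ms);
}
```

A caller would use it like epoll_wait() with one extra argument, for
example epoll_wait_min(efd, events, 64, -1, 200), and each poller could
pick its own minimum.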
>
> /* Epoll event masks */
> #define EPOLLIN (__force __poll_t)0x00000001
> --
> 2.35.1
>
On 11/8/22 3:14 PM, Soheil Hassas Yeganeh wrote:
> On Sun, Oct 30, 2022 at 04:02:03PM -0600, Jens Axboe wrote:
>> Rather than just have a timeout value for waiting on events, add
>> EPOLL_CTL_MIN_WAIT to allow setting a minimum time that epoll_wait()
>> should always wait for events to arrive.
>>
>> For medium workload efficiencies, some production workloads inject
>> artificial timers or sleeps before calling epoll_wait() to get
>> better batching and higher efficiencies. While this does help, it's
>> not as efficient as it could be. By adding support for epoll_wait()
>> for this directly, we can avoid extra context switches and scheduler
>> and timer overhead.
>>
>> As an example, running an AB test on an identical workload at about
>> ~370K reqs/second, without this change and with the sleep hack
>> mentioned above (using 200 usec as the timeout), we're doing 310K-340K
>> non-voluntary context switches per second. Idle CPU on the host is 27-34%.
>> With the sleep hack removed and epoll set to the same 200 usec
>> value, we're handling the exact same load but at 292K-315k non-voluntary
>> context switches and idle CPU of 33-41%, a substantial win.
>>
>> Basic test case:
>>
>> struct d {
>> int p1, p2;
>> };
>>
>> static void *fn(void *data)
>> {
>> struct d *d = data;
>> char b = 0x89;
>>
>> /* Generate 2 events 10 msec apart */
>> usleep(10000);
>> write(d->p1, &b, sizeof(b));
>> usleep(10000);
>> write(d->p2, &b, sizeof(b));
>>
>> return NULL;
>> }
>>
>> int main(int argc, char *argv[])
>> {
>> struct epoll_event ev, events[2];
>> pthread_t thread;
>> int p1[2], p2[2];
>> struct d d;
>> int efd, ret;
>>
>> efd = epoll_create1(0);
>> if (efd < 0) {
>> perror("epoll_create");
>> return 1;
>> }
>>
>> if (pipe(p1) < 0) {
>> perror("pipe");
>> return 1;
>> }
>> if (pipe(p2) < 0) {
>> perror("pipe");
>> return 1;
>> }
>>
>> ev.events = EPOLLIN;
>> ev.data.fd = p1[0];
>> if (epoll_ctl(efd, EPOLL_CTL_ADD, p1[0], &ev) < 0) {
>> perror("epoll add");
>> return 1;
>> }
>> ev.events = EPOLLIN;
>> ev.data.fd = p2[0];
>> if (epoll_ctl(efd, EPOLL_CTL_ADD, p2[0], &ev) < 0) {
>> perror("epoll add");
>> return 1;
>> }
>>
>> /* always wait 200 msec for events */
>> ev.data.u64 = 200000;
>> if (epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0) {
>> perror("epoll add set timeout");
>> return 1;
>> }
>>
>> d.p1 = p1[1];
>> d.p2 = p2[1];
>> pthread_create(&thread, NULL, fn, &d);
>>
>> /* expect to get 2 events here rather than just 1 */
>> ret = epoll_wait(efd, events, 2, -1);
>> printf("epoll_wait=%d\n", ret);
>>
>> return 0;
>> }
>
> It might be worth adding a note in the commit message stating that
> EPOLL_CTL_MIN_WAIT is a no-op when timeout is 0. This is a desired
> behavior but it's not easy to see in the flow.
True, will do.
>> +struct epoll_wq {
>> + wait_queue_entry_t wait;
>> + struct hrtimer timer;
>> + ktime_t timeout_ts;
>> + ktime_t min_wait_ts;
>> + struct eventpoll *ep;
>> + bool timed_out;
>> + int maxevents;
>> + int wakeups;
>> +};
>> +
>> +static bool ep_should_min_wait(struct epoll_wq *ewq)
>> +{
>> + if (ewq->min_wait_ts & 1) {
>> + /* just an approximation */
>> + if (++ewq->wakeups >= ewq->maxevents)
>> + goto stop_wait;
>
> Is there a way to short cut the wait if the process is being terminated?
>
> We had issues in production systems in the past where too many threads were
> in epoll_wait and the process got terminated. It'd be nice if these
> threads could exit the syscall as fast as possible.
Good point, it'd be a bit racy though as this is called from the waitq
callback and hence not in the task itself. But probably Good Enough for
most use cases?
This should probably be a separate patch though, as it seems this
affects regular waits too without min_wait set?
>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>> ewq.timed_out = true;
>> }
>>
>> + /*
>> + * If min_wait is set for this epoll instance, note the min_wait
>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
>> + * the state bit for whether or not min_wait is enabled.
>> + */
>> + if (ep->min_wait_ts) {
>
> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
> AFAICT, the code we run here is completely wasted if timeout is 0.
Yep certainly, I can gate it on both of those conditions.
>> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
>> index 8a3432d0f0dc..81ecb1ca36e0 100644
>> --- a/include/uapi/linux/eventpoll.h
>> +++ b/include/uapi/linux/eventpoll.h
>> @@ -26,6 +26,7 @@
>> #define EPOLL_CTL_ADD 1
>> #define EPOLL_CTL_DEL 2
>> #define EPOLL_CTL_MOD 3
>> +#define EPOLL_CTL_MIN_WAIT 4
>
> Have you considered introducing another epoll_pwait syscall variant?
>
> That has a major benefit that min wait can be different per poller,
> on the different epollfd. The usage would also be more readable:
>
> "epoll for X amount of time but don't return sooner than Y."
>
> This would be similar to the approach that [email protected] used
> when introducing epoll_pwait2.
I have, see other replies in this thread, notably the ones with Stefan
today. Happy to do that, and my current branch does split out the ctl
addition from the meat of the min_wait support for this reason. Can't
seem to find a great way to do it, as we'd need to move to a struct
argument for this as epoll_pwait2() is already at max arguments for a
syscall. Suggestions more than welcome.
--
Jens Axboe
> > This would be similar to the approach that [email protected] used
> > when introducing epoll_pwait2.
>
> I have, see other replies in this thread, notably the ones with Stefan
> today. Happy to do that, and my current branch does split out the ctl
> addition from the meat of the min_wait support for this reason. Can't
> seem to find a great way to do it, as we'd need to move to a struct
> argument for this as epoll_pwait2() is already at max arguments for a
> syscall. Suggestions more than welcome.
Expect an array of two timespecs as fourth argument?
On 11/8/22 3:25 PM, Willem de Bruijn wrote:
>>> This would be similar to the approach that [email protected] used
>>> when introducing epoll_pwait2.
>>
>> I have, see other replies in this thread, notably the ones with Stefan
>> today. Happy to do that, and my current branch does split out the ctl
>> addition from the meat of the min_wait support for this reason. Can't
>> seem to find a great way to do it, as we'd need to move to a struct
>> argument for this as epoll_pwait2() is already at max arguments for a
>> syscall. Suggestions more than welcome.
>
> Expect an array of two timespecs as fourth argument?
Unfortunately even epoll_pwait2() doesn't have any kind of flags
argument to be able to do tricks like that... But I guess we could do
that with epoll_pwait3(), but it'd be an extra indirection for the copy
at that point (copy array of pointers, copy pointer if not NULL), which
would be unfortunate. I'd hate to have to argue that API to anyone, let
alone Linus, when pushing the series.
--
Jens Axboe
On Tue, Nov 8, 2022 at 5:20 PM Jens Axboe <[email protected]> wrote:
> > Is there a way to short cut the wait if the process is being terminated?
> >
> > We had issues in production systems in the past where too many threads were
> > in epoll_wait and the process got terminated. It'd be nice if these
> > threads could exit the syscall as fast as possible.
>
> Good point, it'd be a bit racy though as this is called from the waitq
> callback and hence not in the task itself. But probably Good Enough for
> most use cases?
Sounds good. We can definitely do that as a follow up later.
> This should probably be a separate patch though, as it seems this
> affects regular waits too without min_wait set?
>
> >> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >> ewq.timed_out = true;
> >> }
> >>
> >> + /*
> >> + * If min_wait is set for this epoll instance, note the min_wait
> >> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
> >> + * the state bit for whether or not min_wait is enabled.
> >> + */
> >> + if (ep->min_wait_ts) {
> >
> > Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
> > AFAICT, the code we run here is completely wasted if timeout is 0.
>
> Yep certainly, I can gate it on both of those conditions.
Thanks. I think that would help. You might also want to restructure the if/else
condition above but it's your call.
On Tue, Nov 8, 2022 at 5:29 PM Jens Axboe <[email protected]> wrote:
>
> On 11/8/22 3:25 PM, Willem de Bruijn wrote:
> >>> This would be similar to the approach that [email protected] used
> >>> when introducing epoll_pwait2.
> >>
> >> I have, see other replies in this thread, notably the ones with Stefan
> >> today. Happy to do that, and my current branch does split out the ctl
> >> addition from the meat of the min_wait support for this reason. Can't
> >> seem to find a great way to do it, as we'd need to move to a struct
> >> argument for this as epoll_pwait2() is already at max arguments for a
> >> syscall. Suggestions more than welcome.
> >
> > Expect an array of two timespecs as fourth argument?
>
> Unfortunately even epoll_pwait2() doesn't have any kind of flags
> argument to be able to do tricks like that... But I guess we could do
> that with epoll_pwait3(), but it'd be an extra indirection for the copy
> at that point (copy array of pointers, copy pointer if not NULL), which
> would be unfortunate. I'd hate to have to argue that API to anyone, let
> alone Linus, when pushing the series.
I personally like what Willem suggested. It feels more natural to me
and as you suggested previously it can be a struct argument.
The overheads would be similar to any syscall that accepts itimerspec.
I understand your concern on "epoll_pwait3". I wish Linus would weigh
in here. :-)
On Tue, Nov 8, 2022 at 5:30 PM Jens Axboe <[email protected]> wrote:
>
> On 11/8/22 3:25 PM, Willem de Bruijn wrote:
> >>> This would be similar to the approach that [email protected] used
> >>> when introducing epoll_pwait2.
> >>
> >> I have, see other replies in this thread, notably the ones with Stefan
> >> today. Happy to do that, and my current branch does split out the ctl
> >> addition from the meat of the min_wait support for this reason. Can't
> >> seem to find a great way to do it, as we'd need to move to a struct
> >> argument for this as epoll_pwait2() is already at max arguments for a
> >> syscall. Suggestions more than welcome.
> >
> > Expect an array of two timespecs as fourth argument?
>
> Unfortunately even epoll_pwait2() doesn't have any kind of flags
> argument to be able to do tricks like that... But I guess we could do
> that with epoll_pwait3(), but it'd be an extra indirection for the copy
> at that point (copy array of pointers, copy pointer if not NULL), which
> would be unfortunate. I'd hate to have to argue that API to anyone, let
> alone Linus, when pushing the series.
I did mean for a new syscall epoll_pwait3. But not an array of
pointers, an array of structs. The second arg is then mandatory for
this epoll_pwait_minwait variant of the syscall.
It would indeed have been nicer to be able to do this in epoll_pwait2
based on a flag. It's just doubling the size in copy_from_user in
get_timespec64.
Btw, when I added epoll_pwait2, there was a reasonable request to
also update the manpages and add a basic test to
tools/testing/selftests/filesystems/epoll. That is some extra work with
a syscall based approach.
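To make the "array of two structs" idea concrete, here is a rough userspace sketch of what such an argument could look like. Everything in it is hypothetical: epoll_pwait3() does not exist, and the index names, the helper names, and the rule that the minimum wait may not exceed the timeout are assumptions made for illustration only. It just shows that the pair can be moved in a single copy of twice the size and validated together.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/*
 * HYPOTHETICAL layout: no kernel defines epoll_pwait3() or these names.
 * Index 0 is the usual maximum timeout, index 1 the minimum wait.
 */
enum { EP_TS_TIMEOUT = 0, EP_TS_MIN_WAIT = 1, EP_TS_COUNT = 2 };

/* Bytes a single copy_from_user()-style copy would move for the pair. */
static size_t ep_ts_copy_size(void)
{
	return EP_TS_COUNT * sizeof(struct timespec);
}

/* A timespec is well formed if it is non-negative and normalized. */
static bool ts_valid(const struct timespec *ts)
{
	return ts->tv_sec >= 0 && ts->tv_nsec >= 0 &&
	       ts->tv_nsec < 1000000000L;
}

/* Returns true if a <= b. */
static bool ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/*
 * Assumed rule: both values must be well formed, and the minimum wait
 * must not be longer than the overall timeout.
 */
static bool ep_ts_pair_ok(const struct timespec ts[EP_TS_COUNT])
{
	return ts_valid(&ts[EP_TS_TIMEOUT]) &&
	       ts_valid(&ts[EP_TS_MIN_WAIT]) &&
	       ts_le(&ts[EP_TS_MIN_WAIT], &ts[EP_TS_TIMEOUT]);
}
```

With this shape the second element is always present, so there is no extra pointer to chase before copying.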
From: Stefan Hajnoczi
> Sent: 08 November 2022 17:24
...
> The way the current patches add min_wait into epoll_ctl() seems hacky to
> me. struct epoll_event was meant for file descriptor event entries. It
> won't necessarily be large enough for future extensions (luckily
> min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> ioctl()/setsockopt()-style interface, which is bad for anything that
> needs to understand syscalls, like seccomp. A properly typed
> epoll_wait3() seems cleaner to me.
Is there any reason you can't use an ioctl() on an epoll fd?
That would be cleaner than hacking at epoll_ctl().
It would also be easier to modify to allow (strange) things like:
- return if no events for 10ms.
- return 200us after the first event.
- return after 10 events.
- return at most 100 events.
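As a rough illustration of how those four knobs might combine, here is a userspace model of the return decision. It is only a sketch under stated assumptions: no such ioctl or structure exists for epoll, all names are made up, and "no events for 10ms" is read here as "nothing has arrived since the wait began".

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL policy; no such ioctl or structure exists for epoll. */
struct ep_batch_policy {
	uint64_t idle_timeout_us;	/* return if nothing arrived for this long */
	uint64_t after_first_us;	/* return this long after the first event */
	uint32_t min_events;		/* return once this many are ready */
	uint32_t max_events;		/* never hand back more than this */
};

/* Snapshot of one waiter, also hypothetical. */
struct ep_batch_state {
	uint64_t waited_us;		/* time since the wait began */
	uint64_t since_first_us;	/* time since first event; unused if none */
	uint32_t nr_ready;		/* events currently ready */
};

/* Decide whether the waiter should return now. */
static bool ep_batch_should_return(const struct ep_batch_policy *p,
				   const struct ep_batch_state *s)
{
	if (s->nr_ready == 0)
		return s->waited_us >= p->idle_timeout_us;
	if (s->nr_ready >= p->min_events)
		return true;
	return s->since_first_us >= p->after_first_us;
}

/* Cap the number of events handed back to the caller. */
static uint32_t ep_batch_nr_to_return(const struct ep_batch_policy *p,
				      const struct ep_batch_state *s)
{
	return s->nr_ready < p->max_events ? s->nr_ready : p->max_events;
}
```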
David
On Tue, Nov 8, 2022 at 3:09 PM Jens Axboe <[email protected]> wrote:
>
> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> > On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>> Hi Jens,
> >>> NICs and storage controllers have interrupt mitigation/coalescing
> >>> mechanisms that are similar.
> >>
> >> Yep
> >>
> >>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>> (counter) value. When a completion occurs, the device waits until the
> >>> timeout or until the completion counter value is reached.
> >>>
> >>> If I've read the code correctly, min_wait is computed at the beginning
> >>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>> completion.
> >>>
> >>> It makes me wonder which approach is more useful for applications. With
> >>> the Aggregation Time approach applications can control how much extra
> >>> latency is added. What do you think about that approach?
> >>
> >> We only tested the current approach, which is time noted from entry, not
> >> from when the first event arrives. I suspect the nvme approach is better
> >> suited to the hw side, the epoll timeout helps ensure that we batch
> >> within xx usec rather than xx usec + whatever the delay until the first
> >> one arrives. Which is why it's handled that way currently. That gives
> >> you a fixed batch latency.
> >
> > min_wait is fine when the goal is just maximizing throughput without any
> > latency targets.
>
> That's not true at all, I think you're in different time scales than
> this would be used for.
>
> > The min_wait approach makes it hard to set a useful upper bound on
> > latency because unlucky requests that complete early experience much
> > more latency than requests that complete later.
>
> As mentioned in the cover letter or the main patch, this is most useful
> for the medium load kind of scenarios. For high load, the min_wait time
> ends up not mattering because you will hit maxevents first anyway. For
> the testing that we did, the target was 2-300 usec, and 200 usec was
> used for the actual test. Depending on what the kind of traffic the
> server is serving, that's usually not much of a concern. From your
> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> don't think those would make sense. If your rate of arrival is low
> enough that min_wait needs to be high to make a difference, then the
> load is low enough anyway that it doesn't matter. Hence I'd argue that
> it is indeed NOT hard to set a useful upper bound on latency, because
> that is very much what min_wait is.
>
> I'm happy to argue merits of one approach over another, but keep in mind
> that this particular approach was not pulled out of thin air AND it has
> actually been tested and verified successfully on a production workload.
> This isn't a hypothetical benchmark kind of setup.
Following up on the interrupt mitigation analogy. This also reminds
me somewhat of SO_RCVLOWAT. That sets a lower bound on received data
before waking up a single thread.
Would it be more useful to define a minevents event count, rather than
a minwait timeout? That might give the same amount of preferred batch
size, without adding latency when unnecessary, or having to infer a
reasonable bound from expected event rate. Bounded still by the max
timeout.
>>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>> ewq.timed_out = true;
>>> }
>>>
>>> + /*
>>> + * If min_wait is set for this epoll instance, note the min_wait
>>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
>>> + * the state bit for whether or not min_wait is enabled.
>>> + */
>>> + if (ep->min_wait_ts) {
>>
>> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
>> AFAICT, the code we run here is completely wasted if timeout is 0.
>
> Yep certainly, I can gate it on both of those conditions.
Looking at this for a respin, I think it should be gated on
!ewq.timed_out? timed_out == true is the path that it's wasted on
anyway.
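To spell out the reasoning, here is a tiny userspace model of the gate. It is a simplification with made-up function names, not the kernel code: it assumes timed_out is set when the caller passed a timeout of 0/0 (the nonblocking case noted in the v3 changelog), which is the only case where arming min_wait is wasted work.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model (hypothetical helper): a wait counts as already timed
 * out when the caller passed a timeout and that timeout is zero.
 */
static bool model_timed_out(bool timeout_given, uint64_t timeout_ns)
{
	return timeout_given && timeout_ns == 0;
}

/*
 * Arm min_wait only if it is configured on the epoll instance and the
 * wait has not already timed out. Gating on timed_out == true would
 * invert this and arm it only on the nonblocking path.
 */
static bool model_arm_min_wait(bool timed_out, uint64_t min_wait_us)
{
	return !timed_out && min_wait_us != 0;
}
```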
--
Jens Axboe
On 12/1/22 11:39 AM, Soheil Hassas Yeganeh wrote:
> On Thu, Dec 1, 2022 at 1:00 PM Jens Axboe <[email protected]> wrote:
>>
>>>>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>>>> ewq.timed_out = true;
>>>>> }
>>>>>
>>>>> + /*
>>>>> + * If min_wait is set for this epoll instance, note the min_wait
>>>>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
>>>>> + * the state bit for whether or not min_wait is enabled.
>>>>> + */
>>>>> + if (ep->min_wait_ts) {
>>>>
>>>> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
>>>> AFAICT, the code we run here is completely wasted if timeout is 0.
>>>
>>> Yep certainly, I can gate it on both of those conditions.
>> Looking at this for a respin, I think it should be gated on
>> !ewq.timed_out? timed_out == true is the path that it's wasted on
>> anyway.
>
> Ah, yes, that's a good point. The check should be !ewq.timed_out.
The just posted v4 has the check (and the right one :-))
--
Jens Axboe
On Thu, Dec 1, 2022 at 1:00 PM Jens Axboe <[email protected]> wrote:
>
> >>> @@ -1845,6 +1891,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>> ewq.timed_out = true;
> >>> }
> >>>
> >>> + /*
> >>> + * If min_wait is set for this epoll instance, note the min_wait
> >>> + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's
> >>> + * the state bit for whether or not min_wait is enabled.
> >>> + */
> >>> + if (ep->min_wait_ts) {
> >>
> >> Can we limit this block to "ewq.timed_out && ep->min_wait_ts"?
> >> AFAICT, the code we run here is completely wasted if timeout is 0.
> >
> > Yep certainly, I can gate it on both of those conditions.
> Looking at this for a respin, I think it should be gated on
> !ewq.timed_out? timed_out == true is the path that it's wasted on
> anyway.
Ah, yes, that's a good point. The check should be !ewq.timed_out.
Thanks,
Soheil
> --
> Jens Axboe
>