2019-06-24 14:44:01

by Roman Penyaev

Subject: [PATCH v5 00/14] epoll: support pollable epoll from userspace

Hi all,

This is v5 which introduces pollable epoll from userspace.

v5:
- Do not support architectures with a reduced set of atomic ops,
namely arc-plat-eznps, sparc32, parisc. Return -EOPNOTSUPP on an
attempt to create uepoll on these archs.

- Explicitly hint the compiler that the return value of the 'xchg()' call
is not used; this fixes compilation on ia64.
Reported-by: kbuild test robot <[email protected]>

v4:
Changes based on Peter Zijlstra remarks:

- remove wmb() which was used incorrectly on event path
- remove "atomic_" prefix from some local helpers in order not to
mix them with the atomic_t API.
- store and read all members shared with userspace using WRITE_ONCE/READ_ONCE
in order to avoid store/load tearing.
- do xchg(&epi->event.events, event->events) in ep_modify() instead of
plain write to avoid value corruption on archs (parisc, sparc32 and
arc-eznps) where atomic ops support is limited.
- change the lockless algorithm which adds events to the uring, mainly
to get rid of the busy loop on the user side, which could last
uncontrollably long. The drawbacks are two additional cmpxchg ops on
the hot event path and the complexity of the add_event_to_uring() logic.
- no gcc atomic builtins are used

- add epoll_create2 syscall to all syscall*.tbl files (Arnd Bergmann)
- add uepoll kselftests: testing/selftests/uepoll/* (Andrew Morton)
- reduce size of epitem structure for the original epoll case (Eric Wong)

- queue work to the system's high priority workqueue

v3:
- Measurements made, represented below.

- Fix alignment of the epoll_uitem structure on all 64-bit archs except
x86-64. epoll_uitem should always be 16 bytes; a proper BUILD_BUG_ON
is added. (Linus)

- Check pollflags explicitly for 0 inside the work callback, and do nothing
if it is 0.

v2:
- No reallocations, the max number of items (thus size of the user ring)
is specified by the caller.

- The interface is simplified: -ENOSPC is returned on an attempt to add a new
epoll item if the max number has been reached, nothing more.

- Allocated pages are accounted using user->locked_vm and limited to the
RLIMIT_MEMLOCK value.

- EPOLLONESHOT is handled.

This series introduces pollable epoll from userspace, i.e. the user creates
an epfd with a new EPOLL_USERPOLL flag, mmaps the epoll descriptor, gets
header and ring pointers and then consumes ready events from the ring,
avoiding the epoll_wait() call. When the ring is empty, the user has to call
epoll_wait() in order to wait for new events. epoll_wait() returns -ESTALE
if the user ring still has events (an indication that the user has to
consume events from the user ring first; I could not invent anything better
than returning -ESTALE).
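
A rough sketch of how the resulting user-side event loop is supposed to look
is shown below. uepoll_consume() and handle_event() are hypothetical helpers
(the former drains the mmaped user ring, see "Principle of operation" below);
the exact epoll_wait() calling convention for a userpoll epfd is whatever the
patches define, here it is only used to sleep:

    struct epoll_event ev;
    int rc;

    for (;;) {
        if (uepoll_consume(header, user_index, &ev)) {
            /* Got an event from the user ring, no syscall needed */
            handle_event(&ev);
            continue;
        }
        /* Ring is empty: go to the kernel and sleep */
        rc = epoll_wait(epfd, &ev, 1, -1);
        if (rc < 0 && errno != ESTALE)
            break;  /* real error */
        /* rc >= 0 or -ESTALE: events should now be in the user ring */
    }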

For the user header and user ring allocation I used vmalloc_user(). I found
that it is much easier to reuse remap_vmalloc_range_partial() than to deal
with the page cache (like aio.c does). What is also nice is that the virtual
address is properly aligned on SHMLBA, thus there should not be any d-cache
aliasing problems on archs with VIVT or VIPT caches.

** Measurements

In order to measure polling from userspace, libevent was modified [1] and the
bench_http benchmark (client and server) was used:

o EPOLLET, original epoll:

20000 requests in 0.551306 sec. (36277.49 throughput)
Each took about 5.54 msec latency
1600000bytes read. 0 errors.

o EPOLLET + polling from userspace:

20000 requests in 0.475585 sec. (42053.47 throughput)
Each took about 4.78 msec latency
1600000bytes read. 0 errors.

So harvesting events from userspace gives a 15% gain. bench_http is not an
ideal benchmark, but at least it is part of libevent and was easy to modify.

Worth mentioning that uepoll is very sensitive to the CPU, e.g. the gain above
is observed on a desktop "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz", but on
an "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz" the measurements are almost
the same for both runs.

** Limitations

1. No support on architectures with a reduced set of atomic ops,
namely arc-plat-eznps, sparc32, parisc. For these archs
-EOPNOTSUPP is returned.

2. The EPOLLET flag is always expected for new epoll items (edge-triggered
behavior): obviously we can't call vfs_poll() from userspace to get
level-triggered behaviour.

3. No support for EPOLLWAKEUP:
events are consumed from userspace, thus there is no way to call __pm_relax().

4. No support for EPOLLEXCLUSIVE:
If a device does not pass pollflags to wake_up() there is no way to
call poll() from the context under the spinlock, thus a special work is
scheduled to offload polling. In this specific case we can't
support exclusive wakeups, because we do not know the actual result
of the scheduled work and have to wake up every waiter.

** Principle of operation

* Basic structures shared with userspace:

In order to consume events from userspace, all inserted items are stored in
an items array; each item keeps the original epoll_event fields plus a
32-bit field for the ready events, i.e. each item has the following struct:

    struct epoll_uitem {
        __poll_t ready_events;
        __poll_t events;
        __u64 data;
    };
    BUILD_BUG_ON(sizeof(struct epoll_uitem) != 16);

And the following is a header, which is seen by userspace:

    struct epoll_uheader {
        u32 magic;          /* epoll user header magic */
        u32 header_length;  /* length of the header + items */
        u32 index_length;   /* length of the index ring, always pow2 */
        u32 max_items_nr;   /* max num of items */
        u32 head;           /* updated by userland */
        u32 tail;           /* updated by kernel */

        struct epoll_uitem items[] __aligned(128);
    };

    /* Header is 128 bytes, thus items are aligned on CPU cache */
    BUILD_BUG_ON(sizeof(struct epoll_uheader) != 128);

In order to poll the epfd from userspace, the application has to call:

    epoll_create2(EPOLL_USERPOLL, max_items_nr);

Ready events are kept in a ring buffer, which is simply an index table,
where each element points to an item in the header:

    unsigned int *user_index;
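
A possible setup sequence on the user side is sketched below. Assumptions
(mine, not spelled out in this letter): __NR_epoll_create2, EPOLL_USERPOLL
and the header layout come from the patched kernel headers, the header with
its items is mapped at offset 0, the index ring at offset header_length
(page-aligned), both lengths are in bytes, and error handling is omitted:

    struct epoll_uheader *header;
    unsigned int *index, hlen, ilen;
    int epfd;

    epfd = syscall(__NR_epoll_create2, EPOLL_USERPOLL, max_items_nr);

    /* Map the first page to learn the real region lengths */
    header = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
    assert(header->magic == EPOLL_USERPOLL_HEADER_MAGIC);
    hlen = header->header_length;
    ilen = header->index_length;
    munmap(header, 4096);

    /* Map header + items and the index ring itself */
    header = mmap(NULL, hlen, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
    index  = mmap(NULL, ilen, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, hlen);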

* How is a new event accounted on the kernel side and how is it consumed from userspace?

When a new event comes for some epoll item, the kernel does the following:

    struct epoll_uitem *uitem;

    /* Each item has a bit (index in user items array), discussed later */
    uitem = &user_header->items[epi->bit];

    if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
        /* Increase all three subcounters at once */
        cnt = atomic64_add_return_acquire(0x100010001, &ep->shadow_cnt);

        idx = cnt_to_monotonic(cnt) - 1;
        item_idx = &ep->user_index[idx & index_mask];

        /* Add a bit to the uring */
        WRITE_ONCE(*item_idx, epi->bit);

        do {
            old = cnt;
            if (cnt_to_refs(cnt) == 1) {
                /* We are the last, we will advance the tail */
                advance = cnt_to_advance(cnt);
                WARN_ON(!advance);
                /* Zero out all fields except monotonic counter */
                cnt &= ~MONOTONIC_MASK;
            } else {
                /* Someone else will advance, only drop the ref */
                advance = 0;
                cnt -= 1;
            }
        } while ((cnt = atomic64_cmpxchg_release(&ep->shadow_cnt,
                                                 old, cnt)) != old);

        if (advance) {
            /*
             * Advance the tail. Tail is shared with userspace, thus we
             * can't use kernel atomic_t for just atomic add, so use
             * cmpxchg(). Sigh.
             *
             * We can race here with another cpu which also advances the
             * tail. This is absolutely ok, since the tail is advanced
             * in one direction and eventually addition is commutative.
             */
            unsigned int old, tail = READ_ONCE(ep->user_header->tail);

            do {
                old = tail;
            } while ((tail = cmpxchg(&ep->user_header->tail,
                                     old, old + advance)) != old);
        }
    }

The most complicated part is how a new event is added to the ring locklessly.
To achieve that, ->shadow_cnt is split into 3 subcounters, and the whole
layout of the counter can be represented as follows:

    struct counter_t {
        unsigned long long monotonic :32;
        unsigned long long advance   :16;
        unsigned long long refs      :16;
    };

    'monotonic' - Monotonically increases on each event insertion,
                  never decreases. Used as an index for an event
                  in the uring.

    'advance'   - Represents the number of events by which the user ->tail
                  has to be advanced. Monotonically increases if
                  events are coming in parallel from different cpus
                  while the reference number keeps > 1.

    'refs'      - Represents the reference number, i.e. the number of cpus
                  inserting events in parallel. Once there is a
                  last inserter (the reference is 1), it should
                  zero out the 'advance' member and advance the tail
                  for the userspace.

What is this all about? The main problem is that, since an event can be
inserted from many cpus in parallel, we can't advance the tail if a
previous insertion has not been fully completed. The idea to solve
this is simple: the last one advances the tail. Who exactly is the
last? The one who detects that the reference number is equal to 1.
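
The cnt_to_*() helpers and MONOTONIC_MASK used in the snippet above live in
the uring patch; a rough sketch of what they could look like, with the bit
positions inferred from the 0x100010001 increment and the
'cnt &= ~MONOTONIC_MASK' line (refs in bits 0-15, advance in bits 16-31,
monotonic in bits 32-63):

    /* Mask of everything below the monotonic part (advance + refs) */
    #define MONOTONIC_MASK ((1ull << 32) - 1)

    static inline unsigned int cnt_to_monotonic(u64 cnt)
    {
        return cnt >> 32;
    }

    static inline unsigned int cnt_to_advance(u64 cnt)
    {
        return (cnt >> 16) & 0xffff;
    }

    static inline unsigned int cnt_to_refs(u64 cnt)
    {
        return cnt & 0xffff;
    }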

The other thing worth mentioning is that the ring can't grow infinitely and
corrupt other elements, because the kernel always checks that an item was
marked as ready, so userspace has to clear the ready_events field.

On the user side, the following code should be used in order to consume
events:

    tail = READ_ONCE(header->tail);
    for (; header->head != tail; header->head++) {
        item_idx_ptr = &index[header->head & index_mask];

        /* Load index */
        idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE);

        item = &header->items[idx];

        /*
         * Fetch data first, if event is cleared by the kernel we drop the
         * data returning false.
         */
        event->data = item->data;
        event->events = __atomic_exchange_n(&item->ready_events, 0,
                                            __ATOMIC_RELEASE);
    }

* How does a new epoll item get its index inside the user items array?

The kernel has a bitmap for that and takes a free bit on each attempt to
insert a new epoll item. When the bitmap is full, -ENOSPC is returned.
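
Roughly it could look like the sketch below ('items_bm' and 'max_items_nr'
are assumed names for the bitmap and its size in struct eventpoll; the real
code lives in the insert/remove patches):

    /* Called under ep->mtx */
    bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr);
    if (bit >= ep->max_items_nr)
        return -ENOSPC;
    set_bit(bit, ep->items_bm);
    epi->bit = bit;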

* Is there any testing app available?

There is a small app [2] which starts many threads with many event fds and
produces many events, while a single consumer fetches them from userspace
and goes to the kernel from time to time in order to wait.

Also the libevent modification [1] is available, see the "Measurements"
section above.

[1] https://github.com/libevent/libevent/pull/801
[2] https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c

Roman Penyaev (14):
epoll: move private helpers from a header to the source
epoll: introduce user structures for polling from userspace
epoll: allocate user header and user events ring for polling from
userspace
epoll: some sanity flags checks for epoll syscalls for polling from
userspace
epoll: offload polling to a work in case of epfd polled from userspace
epoll: introduce helpers for adding/removing events to uring
epoll: call ep_add_event_to_uring() from ep_poll_callback()
epoll: support polling from userspace for ep_insert()
epoll: support polling from userspace for ep_remove()
epoll: support polling from userspace for ep_modify()
epoll: support polling from userspace for ep_poll()
epoll: support mapping for epfd when polled from userspace
epoll: implement epoll_create2() syscall
kselftest: add uepoll-test which tests polling from userspace

arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 2 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/eventpoll.c | 951 ++++++++++++++++--
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/eventpoll.h | 47 +-
kernel/sys_ni.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/uepoll/.gitignore | 1 +
tools/testing/selftests/uepoll/Makefile | 16 +
.../uepoll/atomic-builtins-support.c | 13 +
tools/testing/selftests/uepoll/uepoll-test.c | 603 +++++++++++
28 files changed, 1566 insertions(+), 103 deletions(-)
create mode 100644 tools/testing/selftests/uepoll/.gitignore
create mode 100644 tools/testing/selftests/uepoll/Makefile
create mode 100644 tools/testing/selftests/uepoll/atomic-builtins-support.c
create mode 100644 tools/testing/selftests/uepoll/uepoll-test.c

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Azat Khuzhin <[email protected]>
Cc: Eric Wong <[email protected]>
Cc: [email protected]
Cc: [email protected]

--
2.21.0


2019-06-24 14:44:17

by Roman Penyaev

Subject: [PATCH v5 02/14] epoll: introduce user structures for polling from userspace

This patch introduces the structures of the user items array:

struct epoll_uheader -
    the header describing the inserted epoll items.

struct epoll_uitem -
    a single epoll item visible to userspace.

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/eventpoll.c | 11 +++++++++++
include/uapi/linux/eventpoll.h | 29 +++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 622b6c9ef8c9..6d7a5fe4a831 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -4,6 +4,7 @@
* Copyright (C) 2001,...,2009 Davide Libenzi
*
* Davide Libenzi <[email protected]>
+ * Polling from userspace support by Roman Penyaev <[email protected]>
*/

#include <linux/init.h>
@@ -104,6 +105,16 @@

#define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry))

+/*
+ * That is around 1.3mb of allocated memory for one epfd. What is more
+ * important is ->index_length, which should be ^2, so do not increase
+ * max items number to avoid size doubling of user index.
+ *
+ * Before increasing the value see add_event_to_uring() and especially
+ * cnt_to_advance() functions and change them accordingly.
+ */
+#define EP_USERPOLL_MAX_ITEMS_NR 65536
+
struct epoll_filefd {
struct file *file;
int fd;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 39dfc29f0f52..3317901b19c4 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,4 +79,33 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;

+#define EPOLL_USERPOLL_HEADER_MAGIC 0xeb01eb01
+#define EPOLL_USERPOLL_HEADER_SIZE 128
+
+/*
+ * Item, shared with userspace. Unfortunately we can't embed epoll_event
+ * structure, because it is badly aligned on all 64-bit archs, except
+ * x86-64 (see EPOLL_PACKED). sizeof(epoll_uitem) == 16
+ */
+struct epoll_uitem {
+ __poll_t ready_events;
+ __poll_t events;
+ __u64 data;
+};
+
+/*
+ * Header, shared with userspace. sizeof(epoll_uheader) == 128
+ */
+struct epoll_uheader {
+ __u32 magic; /* epoll user header magic */
+ __u32 header_length; /* length of the header + items */
+ __u32 index_length; /* length of the index ring, always pow2 */
+ __u32 max_items_nr; /* max number of items */
+ __u32 head; /* updated by userland */
+ __u32 tail; /* updated by kernel */
+
+ struct epoll_uitem items[]
+ __attribute__((__aligned__(EPOLL_USERPOLL_HEADER_SIZE)));
+};
+
#endif /* _UAPI_LINUX_EVENTPOLL_H */
--
2.21.0

2019-06-24 14:44:26

by Roman Penyaev

Subject: [PATCH v5 01/14] epoll: move private helpers from a header to the source

Those helpers will access the private eventpoll structure in future patches,
so keep them close to their callers.

Nothing important here.

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/eventpoll.c | 13 +++++++++++++
include/uapi/linux/eventpoll.h | 12 ------------
2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8bc064630be0..622b6c9ef8c9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -460,6 +460,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)

#endif /* CONFIG_NET_RX_BUSY_POLL */

+#ifdef CONFIG_PM_SLEEP
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+ if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
+ epev->events &= ~EPOLLWAKEUP;
+}
+#else
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+ epev->events &= ~EPOLLWAKEUP;
+}
+#endif /* CONFIG_PM_SLEEP */
+
/**
* ep_call_nested - Perform a bound (possibly) nested call, by checking
* that the recursion limit is not exceeded, and that
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..39dfc29f0f52 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,16 +79,4 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;

-#ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
- if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
- epev->events &= ~EPOLLWAKEUP;
-}
-#else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
- epev->events &= ~EPOLLWAKEUP;
-}
-#endif
#endif /* _UAPI_LINUX_EVENTPOLL_H */
--
2.21.0

2019-06-24 14:44:58

by Roman Penyaev

Subject: [PATCH v5 05/14] epoll: offload polling to a work in case of epfd polled from userspace

Not every device reports pollflags on wake_up(), expecting that it will be
polled later. vfs_poll() can't be called from ep_poll_callback(), because
ep_poll_callback() is called under the spinlock. Obviously userspace can't
call vfs_poll() either, thus epoll has to offload vfs_poll() to a work and
then call ep_poll_callback() with the pollflags in hand.

In order not to bloat the size of `struct epitem` with a `struct work_struct`,
a new `struct uepitem` includes the original epi and the work. Thus for an ep
pollable from userspace a `struct uepitem` will be allocated. This will be
done in the following patches.

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/eventpoll.c | 131 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 107 insertions(+), 24 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4087efb1fbf3..f2a2be93bc4b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
#include <linux/seq_file.h>
#include <linux/compat.h>
#include <linux/rculist.h>
+#include <linux/workqueue.h>
#include <net/busy_poll.h>

/*
@@ -184,6 +185,18 @@ struct epitem {
struct epoll_event event;
};

+/*
+ * The only purpose of this structure is not to inflate the size of the
+ * 'struct epitem' if polling from user is not used.
+ */
+struct uepitem {
+ /* Includes ep item */
+ struct epitem epi;
+
+ /* Work for offloading event callback */
+ struct work_struct work;
+};
+
/*
* This structure is stored inside the "private_data" member of the file
* structure and represents the main data structure for the eventpoll
@@ -313,6 +326,9 @@ static struct nested_calls poll_loop_ncalls;
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;

+/* Slab cache used to allocate "struct uepitem" */
+static struct kmem_cache *uepi_cache __read_mostly;
+
/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;

@@ -391,6 +407,12 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p)
return container_of(p, struct ep_pqueue, pt)->epi;
}

+/* Get the "struct uepitem" from a generic epitem. */
+static inline struct uepitem *uep_item_from_epi(struct epitem *p)
+{
+ return container_of(p, struct uepitem, epi);
+}
+
/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
static inline int ep_op_has_event(int op)
{
@@ -719,6 +741,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
ep_remove_wait_queue(pwq);
kmem_cache_free(pwq_cache, pwq);
}
+ if (ep_polled_by_user(ep)) {
+ /*
+ * Events polled by user require offloading to a work,
+ * thus we have to be sure everything which was queued
+ * has run to a completion.
+ */
+ flush_work(&uep_item_from_epi(epi)->work);
+ }
}

/* call only when ep->mtx is held */
@@ -1369,9 +1399,8 @@ static inline bool chain_epi_lockless(struct epitem *epi)
}

/*
- * This is the callback that is passed to the wait queue wakeup
- * mechanism. It is called by the stored file descriptors when they
- * have events to report.
+ * This is the callback that is called directly from wake queue wakeup or
+ * from a work.
*
* This callback takes a read lock in order not to content with concurrent
* events from another file descriptors, thus all modifications to ->rdllist
@@ -1386,14 +1415,11 @@ static inline bool chain_epi_lockless(struct epitem *epi)
* queues are used should be detected accordingly. This is detected using
* cmpxchg() operation.
*/
-static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
{
- int pwake = 0;
- struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;
- __poll_t pollflags = key_to_poll(key);
+ int pwake = 0, ewake = 0;
unsigned long flags;
- int ewake = 0;

read_lock_irqsave(&ep->lock, flags);

@@ -1411,8 +1437,9 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
/*
* Check the events coming with the callback. At this stage, not
* every device reports the events in the "key" parameter of the
- * callback. We need to be able to handle both cases here, hence the
- * test for "key" != NULL before the event match test.
+ * callback (for ep_poll_callback() case special worker is used).
+ * We need to be able to handle both cases here, hence the test
+ * for "key" != NULL before the event match test.
*/
if (pollflags && !(pollflags & epi->event.events))
goto out_unlock;
@@ -1472,23 +1499,68 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
if (!(epi->event.events & EPOLLEXCLUSIVE))
ewake = 1;

- if (pollflags & POLLFREE) {
- /*
- * If we race with ep_remove_wait_queue() it can miss
- * ->whead = NULL and do another remove_wait_queue() after
- * us, so we can't use __remove_wait_queue().
- */
- list_del_init(&wait->entry);
+ return ewake;
+}
+
+static void ep_poll_callback_work(struct work_struct *work)
+{
+ struct uepitem *uepi = container_of(work, typeof(*uepi), work);
+ struct epitem *epi = &uepi->epi;
+ __poll_t pollflags;
+ poll_table pt;
+
+ WARN_ON(!ep_polled_by_user(epi->ep));
+
+ init_poll_funcptr(&pt, NULL);
+ pollflags = ep_item_poll(epi, &pt, 1);
+ if (pollflags)
+ (void)ep_poll_callback(epi, pollflags);
+}
+
+/*
+ * This is the callback that is passed to the wait queue wakeup
+ * mechanism. It is called by the stored file descriptors when they
+ * have events to report.
+ */
+static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+ int sync, void *key)
+{
+
+ struct epitem *epi = ep_item_from_wait(wait);
+ struct eventpoll *ep = epi->ep;
+ __poll_t pollflags = key_to_poll(key);
+ int rc;
+
+ if (!ep_polled_by_user(ep) || pollflags) {
+ rc = ep_poll_callback(epi, pollflags);
+
+ if (pollflags & POLLFREE) {
+ /*
+ * If we race with ep_remove_wait_queue() it can miss
+ * ->whead = NULL and do another remove_wait_queue()
+ * after us, so we can't use __remove_wait_queue().
+ */
+ list_del_init(&wait->entry);
+ /*
+ * ->whead != NULL protects us from the race with
+ * ep_free() or ep_remove(), ep_remove_wait_queue()
+ * takes whead->lock held by the caller. Once we nullify
+ * it, nothing protects ep/epi or even wait.
+ */
+ smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+ }
+ } else {
+ queue_work(system_highpri_wq, &uep_item_from_epi(epi)->work);
+
/*
- * ->whead != NULL protects us from the race with ep_free()
- * or ep_remove(), ep_remove_wait_queue() takes whead->lock
- * held by the caller. Once we nullify it, nothing protects
- * ep/epi or even wait.
+ * Here on this path we are absolutely sure that for file
+ * descriptors which are pollable from userspace we do not
+ * support EPOLLEXCLUSIVE, so it is safe to return 1.
*/
- smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+ rc = 1;
}

- return ewake;
+ return rc;
}

/*
@@ -1502,7 +1574,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
struct eppoll_entry *pwq;

if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
- init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
+ init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup);
pwq->whead = whead;
pwq->base = epi;
if (epi->event.events & EPOLLEXCLUSIVE)
@@ -2575,6 +2647,10 @@ static int __init eventpoll_init(void)
/*
* We can have many thousands of epitems, so prevent this from
* using an extra cache line on 64-bit (and smaller) CPUs
+ *
+ * 'struct uepitem' is used for polling from userspace, it is
+ * slightly bigger than 128b because of the work struct, thus
+ * it is excluded from the check below.
*/
BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);

@@ -2582,6 +2658,13 @@ static int __init eventpoll_init(void)
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

+ /*
+ * Allocates slab cache used to allocate "struct uepitem" items,
+ * used only for polling from user.
+ */
+ uepi_cache = kmem_cache_create("eventpoll_uepi", sizeof(struct uepitem),
+ 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
+
/* Allocates slab cache used to allocate "struct eppoll_entry" */
pwq_cache = kmem_cache_create("eventpoll_pwq",
sizeof(struct eppoll_entry), 0, SLAB_PANIC|SLAB_ACCOUNT, NULL);
--
2.21.0

2019-06-24 22:03:55

by Linus Torvalds

Subject: Re: [PATCH v5 00/14] epoll: support pollable epoll from userspace

On Mon, Jun 24, 2019 at 10:42 PM Roman Penyaev <[email protected]> wrote:
>
> So harvesting events from userspace gives 15% gain. Though bench_http
> is not ideal benchmark, but at least it is the part of libevent and was
> easy to modify.
>
> Worth to mention that uepoll is very sensible to CPU, e.g. the gain above
> is observed on desktop "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz", but on
> "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz" measurements are almost the
> same for both runs.

Hmm. 15% may be big in a big picture thing, but when it comes to what
is pretty much a micro-benchmark, I'm not sure how meaningful it is.

And the CPU sensitivity thing worries me. Did you check _why_ it
doesn't seem to make any difference on the Xeon 4110? Is it just
because at that point the machine has enough cores that you might as
well just sit in epoll() in the kernel and uepoll doesn't give you
much? Or is there something else going on?

Because this is a big enough change and UAPI thing that it's a bit
concerning if there isn't a clear and unambiguous win.

Linus

2019-06-25 04:07:03

by Eric Wong

Subject: Re: [PATCH v5 00/14] epoll: support pollable epoll from userspace

Roman Penyaev <[email protected]> wrote:
> Hi all,

+cc Jason Baron

> ** Limitations

<snip>

> 4. No support for EPOLLEXCLUSIVE
> If device does not pass pollflags to wake_up() there is no way to
> call poll() from the context under spinlock, thus special work is
> scheduled to offload polling. In this specific case we can't
> support exclusive wakeups, because we do not know actual result
> of scheduled work and have to wake up every waiter.

Lacking EPOLLEXCLUSIVE support is probably a showstopper for
common applications using per-task epoll combined with
non-blocking accept4() (e.g. nginx).


Fwiw, I'm still a weirdo who prefers a dedicated thread doing
blocking accept4 for distribution between tasks (so epoll never
sees a listen socket). But, depending on what runtime/language
I'm using, I can't always dedicate a blocking thread, so I
recently started using EPOLLEXCLUSIVE from Perl5 where I
couldn't rely on threads being available.


If I could dedicate time to improving epoll; I'd probably
add writev() support for batching epoll_ctl modifications
to reduce syscall traffic, or pick-up the kevent()-like interface
started long ago:
https://lore.kernel.org/lkml/[email protected]/
(but I'm not sure I want to increase the size of the syscall table).

2019-06-25 12:56:42

by Roman Penyaev

Subject: Re: [PATCH v5 00/14] epoll: support pollable epoll from userspace

On 2019-06-25 02:24, Eric Wong wrote:
> Roman Penyaev <[email protected]> wrote:
>> Hi all,
>
> +cc Jason Baron
>
>> ** Limitations
>
> <snip>
>
>> 4. No support for EPOLLEXCLUSIVE
>> If device does not pass pollflags to wake_up() there is no way to
>> call poll() from the context under spinlock, thus special work is
>> scheduled to offload polling. In this specific case we can't
>> support exclusive wakeups, because we do not know actual result
>> of scheduled work and have to wake up every waiter.
>
> Lacking EPOLLEXCLUSIVE support is probably a showstopper for
> common applications using per-task epoll combined with
> non-blocking accept4() (e.g. nginx).

For the 'accept' case it seems SO_REUSEPORT can be used:

https://lwn.net/Articles/542629/

Although I've never tried it in an O_NONBLOCK + epoll scenario.

But I've dived again into this add-wait-exclusive logic, and it seems
possible to support EPOLLEXCLUSIVE by iterating over all "epis" for the
particular fd which has been woken up.

For now I want to leave it as is, just not to overcomplicate the code.

> Fwiw, I'm still a weirdo who prefers a dedicated thread doing
> blocking accept4 for distribution between tasks (so epoll never
> sees a listen socket). But, depending on what runtime/language
> I'm using, I can't always dedicate a blocking thread, so I
> recently started using EPOLLEXCLUSIVE from Perl5 where I
> couldn't rely on threads being available.
>
>
> If I could dedicate time to improving epoll; I'd probably
> add writev() support for batching epoll_ctl modifications
> to reduce syscall traffic, or pick-up the kevent()-like interface
> started long ago:
> https://lore.kernel.org/lkml/[email protected]/
> (but I'm not sure I want to increase the size of the syscall table).

There is also the fresh fs/io_uring.c thingy, which supports polling and
batching (among other IO things). But polling there acts only as a single
shot, so it might make sense to support event subscription there instead of
resurrecting kevent and co.

--
Roman

2019-06-25 13:00:23

by Roman Penyaev

Subject: Re: [PATCH v5 00/14] epoll: support pollable epoll from userspace

On 2019-06-24 22:38, Linus Torvalds wrote:
> On Mon, Jun 24, 2019 at 10:42 PM Roman Penyaev <[email protected]>
> wrote:
>>
>> So harvesting events from userspace gives 15% gain. Though bench_http
>> is not ideal benchmark, but at least it is the part of libevent and
>> was
>> easy to modify.
>>
>> Worth to mention that uepoll is very sensible to CPU, e.g. the gain
>> above
>> is observed on desktop "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz",
>> but on
>> "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz" measurements are almost
>> the
>> same for both runs.
>
> Hmm. 15% may be big in a big picture thing, but when it comes to what
> is pretty much a micro-benchmark, I'm not sure how meaningful it is.
>
> And the CPU sensitivity thing worries me. Did you check _why_ it
> doesn't seem to make any difference on the Xeon 4110? Is it just
> because at that point the machine has enough cores that you might as
> well just sit in epoll() in the kernel and uepoll doesn't give you
> much? Or is there something else going on?

This http tool is a single-threaded test, i.e. client and server
work as standalone processes and each has a single event thread
for everything.

What I saw there is that events come slowly (or the event loop acts
faster?), so when the time comes to harvest events there is nothing,
we take the slow path and go to the kernel in order to sleep.
That does not explain the main "why", unfortunately.

I would like to retest that with more clients added to the server, so the
server is more likely to observe events in the ring, avoiding the sleep.

--
Roman