2019-05-16 09:02:43

by Roman Penyaev

Subject: [PATCH v3 00/13] epoll: support pollable epoll from userspace

Hi all,

This is v3 which introduces pollable epoll from userspace.

v3:
- Measurements made, represented below.

- Fix alignment for the epoll_uitem structure on all 64-bit archs except
  x86-64. epoll_uitem should always be 16 bytes, a proper BUILD_BUG_ON
  is added. (Linus)

- Check pollflags explicitly on 0 inside work callback, and do nothing
if 0.

v2:
- No reallocations, the max number of items (thus size of the user ring)
is specified by the caller.

- Interface is simplified: -ENOSPC is returned on an attempt to add a new
  epoll item if the number of items has reached the max, nothing more.

- Allocated pages are accounted using user->locked_vm and limited to the
  RLIMIT_MEMLOCK value.

- EPOLLONESHOT is handled.

This series introduces pollable epoll from userspace, i.e. the user creates
an epfd with a new EPOLL_USERPOLL flag, mmaps the epoll descriptor, gets header
and ring pointers and then consumes ready events from the ring, avoiding the
epoll_wait() call. When the ring is empty, the user has to call epoll_wait()
in order to wait for new events. epoll_wait() returns -ESTALE if the user
ring still has events in it (a kind of indication that the user has to consume
events from the user ring first; I could not invent anything better than
returning -ESTALE).

For the user header and user ring allocation I used vmalloc_user(). I found
that it is much easier to reuse remap_vmalloc_range_partial() than to deal
with the page cache (like aio.c does). What is also nice is that the
virtual address is properly aligned on SHMLBA, thus there should not be
any d-cache aliasing problems on archs with VIVT or VIPT caches.

** Measurements

In order to measure polling from userspace, libevent was modified [1] and
the bench_http benchmark (client and server) was used:

o EPOLLET, original epoll:

20000 requests in 0.551306 sec. (36277.49 throughput)
Each took about 5.54 msec latency
1600000bytes read. 0 errors.

o EPOLLET + polling from userspace:

20000 requests in 0.475585 sec. (42053.47 throughput)
Each took about 4.78 msec latency
1600000bytes read. 0 errors.

So harvesting events from userspace gives a ~15% gain. bench_http
is not an ideal benchmark, but at least it is part of libevent and was
easy to modify.

Worth mentioning that uepoll is very sensitive to the CPU, e.g. the gain above
is observed on a desktop "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz", but on an
"Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz" the measurements are almost the
same for both runs.

** Limitations

1. Always expect the EPOLLET flag for new epoll items (edge-triggered behavior)
   obviously we can't call vfs_poll() from userspace to have level-triggered
   behaviour.

2. No support for EPOLLWAKEUP
   events are consumed from userspace, thus there is no way to call __pm_relax()

3. No support for EPOLLEXCLUSIVE
   If the device does not pass pollflags to wake_up() there is no way to
   call poll() from the wake-up context under spinlock, thus a special work
   is scheduled to offload polling. In this specific case we can't
   support exclusive wakeups, because we do not know the actual result
   of the scheduled work and have to wake up every waiter.

** Principle of operation

* Basic structures shared with userspace:

In order to consume events from userspace all inserted items are
stored in an items array, which keeps the original epoll_event field and a u32
field for the ready events, i.e. each item has the following struct:

	struct epoll_uitem {
		__poll_t ready_events;
		struct epoll_event event;
	};
	BUILD_BUG_ON(sizeof(struct epoll_uitem) != 16);

And the following is a header, which is seen by userspace:

	struct epoll_uheader {
		u32 magic;          /* epoll user header magic */
		u32 header_length;  /* length of the header + items */
		u32 index_length;   /* length of the index ring, always pow2 */
		u32 max_items_nr;   /* max num of items */
		u32 head;           /* updated by userland */
		u32 tail;           /* updated by kernel */

		struct epoll_uitem items[] __aligned(128);
	};

/* Header is 128 bytes, thus items are aligned on CPU cache */
BUILD_BUG_ON(sizeof(struct epoll_uheader) != 128);

In order to poll the epfd from userspace the application has to call:

epoll_create2(EPOLL_USERPOLL, max_items_nr);

Ready events are kept in a ring buffer, which is simply an index table,
where each element points to an item in the header:

	unsigned int *user_index;
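
Just for illustration, the setup on the user side can look roughly like the
following. This is only a sketch (not the exact code of the test app [2]): it
assumes the whole shared area can be mapped with a single mmap(), with the
index ring following the header + items at offset ->header_length, that
mapping the first page is enough to discover the lengths, and that a
userspace-usable definition of the structures comes from the patched headers.
epoll_create2() is invoked via syscall(2), since there is no libc wrapper yet:

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <assert.h>
	#include <linux/eventpoll.h>	/* epoll_uheader, EPOLL_USERPOLL from this series */

	static int uepoll_create(unsigned int max_items_nr,
				 struct epoll_uheader **hdr, unsigned int **index)
	{
		struct epoll_uheader *h;
		size_t len;
		int epfd;

		epfd = syscall(__NR_epoll_create2, EPOLL_USERPOLL, max_items_nr);
		assert(epfd >= 0);

		/* Map a single page first, just to read the lengths */
		h = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
		assert(h != MAP_FAILED);
		assert(h->magic == EPOLL_USERPOLL_HEADER_MAGIC);

		len = h->header_length + h->index_length;
		munmap(h, 4096);

		/* Now map everything: header + items, then the index ring */
		h = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
		assert(h != MAP_FAILED);

		*hdr = h;
		*index = (unsigned int *)((char *)h + h->header_length);
		return epfd;
	}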

* How is a new event accounted on the kernel side? How is it consumed from
* userspace?

When a new event comes for some epoll item, the kernel does the following:

	struct epoll_uitem *uitem;

	/* Each item has a bit (index in user items array), discussed later */
	uitem = user_header->items[epi->bit];

	if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
		i = atomic_add(&ep->user_header->tail, 1);

		item_idx = &user_index[i & index_mask];

		/* Signal with a bit, user spins on index expecting value > 0 */
		*item_idx = idx + 1;

		/*
		 * Want index update be flushed from CPU write buffer and
		 * immediately visible on userspace side to avoid long busy
		 * loops.
		 */
		smp_wmb();
	}

The important thing here is that the ring can't grow infinitely and corrupt
other elements, because the kernel always checks that the item was marked as
ready, so userspace has to clear the ready_events field.

On the user side the following code should be used in order to consume
events:

	tail = READ_ONCE(header->tail);
	for (i = 0; header->head != tail; header->head++) {
		item_idx_ptr = &index[idx & indeces_mask];

		/*
		 * Spin here till we see valid index
		 */
		while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
			;

		item = &header->items[idx - 1];

		/*
		 * Mark index as invalid, that is for userspace only, kernel does
		 * not care and will refill this pointer only when observes that
		 * event is cleared, which happens below.
		 */
		*item_idx_ptr = 0;

		/*
		 * Fetch data first, if event is cleared by the kernel we drop
		 * the data returning false.
		 */
		event->data = item->event.data;
		event->events = __atomic_exchange_n(&item->ready_events, 0,
						    __ATOMIC_RELEASE);
	}
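
Putting it together, a complete wait on the user side might look roughly like
this (again only a sketch: uepoll_wait() and uepoll_consume() are made-up
names, uepoll_consume() stands for the ring-draining loop above and is assumed
to return the number of harvested events):

	static int uepoll_wait(int epfd, struct epoll_uheader *header,
			       unsigned int *index,
			       struct epoll_event *events, int maxevents)
	{
		for (;;) {
			/* Drain the user ring first (the loop shown above) */
			int n = uepoll_consume(header, index, events, maxevents);
			if (n > 0)
				return n;

			/* Ring is empty: block in the kernel until new events arrive */
			n = epoll_wait(epfd, events, maxevents, -1);
			if (n < 0 && errno != ESTALE)
				return n;	/* real error */

			/* ESTALE (or a plain wakeup): events are in the user ring */
		}
	}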

* How does a new epoll item get its index inside the user items array?

The kernel has a bitmap for that and takes a free bit on an attempt to insert
a new epoll item. When the bitmap is full, -ENOSPC is returned.
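
In pseudo-code that is simply the following (a sketch: the helper and the
field names ep->items_bm / ep->max_items_nr are made up here, the real patch
may differ):

	static int ep_get_bit(struct eventpoll *ep)
	{
		unsigned int bit;

		do {
			bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr);
			if (bit >= ep->max_items_nr)
				return -ENOSPC;
			/* Claim the bit atomically, retry if someone else raced us */
		} while (test_and_set_bit(bit, ep->items_bm));

		return bit;
	}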

* Is there any testing app available?

There is a small app [2] which starts many threads with many event fds and
produces many events, while a single consumer fetches them from userspace
and goes to the kernel from time to time in order to wait.

[1] https://github.com/libevent/libevent/pull/801
[2] https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c


Roman Penyaev (13):
epoll: move private helpers from a header to the source
epoll: introduce user structures for polling from userspace
epoll: allocate user header and user events ring for polling from
userspace
epoll: some sanity flags checks for epoll syscalls for polling from
userspace
epoll: offload polling to a work in case of epfd polled from userspace
epoll: introduce helpers for adding/removing events to uring
epoll: call ep_add_event_to_uring() from ep_poll_callback()
epoll: support polling from userspace for ep_insert()
epoll: support polling from userspace for ep_remove()
epoll: support polling from userspace for ep_modify()
epoll: support polling from userspace for ep_poll()
epoll: support mapping for epfd when polled from userspace
epoll: implement epoll_create2() syscall

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/eventpoll.c | 714 ++++++++++++++++++++++---
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/eventpoll.h | 46 +-
kernel/sys_ni.c | 1 +
7 files changed, 669 insertions(+), 99 deletions(-)

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
--
2.21.0


2019-05-16 09:02:51

by Roman Penyaev

Subject: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()

ep_poll_callback() is called when an fd calls wakeup() on the epfd.
So account the new event in the user ring.

The tricky part here is EPOLLONESHOT. Since we are lockless we
have to deal with ep_poll_callback()s called in parallel, thus we
use cmpxchg to clear the public event bits and filter out a concurrent
call from another CPU.

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2f551c005640..55612da9651e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
}
#endif /* CONFIG_CHECKPOINT_RESTORE */

+/**
+ * Atomically clear public event bits and return %true if the old value has
+ * public event bits set.
+ */
+static inline bool ep_clear_public_event_bits(struct epitem *epi)
+{
+ __poll_t old, flags;
+
+ /*
+ * Here we race with ourselves and with ep_modify(), which can
+ * change the event bits. In order not to override events updated
+ * by ep_modify() we have to do cmpxchg.
+ */
+
+ old = epi->event.events;
+ do {
+ flags = old;
+ } while ((old = cmpxchg(&epi->event.events, flags,
+ flags & EP_PRIVATE_BITS)) != flags);
+
+ return flags & ~EP_PRIVATE_BITS;
+}
+
/**
* Adds a new entry to the tail of the list in a lockless way, i.e.
* multiple CPUs are allowed to call this function concurrently.
@@ -1526,6 +1549,20 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
if (pollflags && !(pollflags & epi->event.events))
goto out_unlock;

+ if (ep_polled_by_user(ep)) {
+ /*
+ * For polled descriptor from user we have to disable events on
+ * callback path in case of one-shot.
+ */
+ if ((epi->event.events & EPOLLONESHOT) &&
+ !ep_clear_public_event_bits(epi))
+ /* Race is lost, another callback has cleared events */
+ goto out_unlock;
+
+ ep_add_event_to_uring(epi, pollflags);
+ goto wakeup;
+ }
+
/*
* If we are transferring events to userspace, we can hold no locks
* (because we're accessing user memory, and because of linux f_op->poll()
@@ -1545,6 +1582,7 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
ep_pm_stay_awake_rcu(epi);
}

+wakeup:
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
--
2.21.0

2019-05-16 09:03:07

by Roman Penyaev

Subject: [PATCH v3 02/13] epoll: introduce user structures for polling from userspace

This patch introduces the structures for the user items array:

struct epoll_uheader -
describes inserted epoll items.

struct epoll_uitem -
single epoll item visible to userspace.

Signed-off-by: Roman Penyaev <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2cc183e86a29..f598442512f3 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -9,6 +9,8 @@
*
* Davide Libenzi <[email protected]>
*
+ * Polling from userspace support by Roman Penyaev <[email protected]>
+ * (C) Copyright 2019 SUSE, All Rights Reserved
*/

#include <linux/init.h>
@@ -109,6 +111,13 @@

#define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry))

+/*
+ * That is around 1.3mb of allocated memory for one epfd. What is more
+ * important is ->index_length, which should be ^2, so do not increase
+ * max items number to avoid size doubling of user index.
+ */
+#define EP_USERPOLL_MAX_ITEMS_NR 65536
+
struct epoll_filefd {
struct file *file;
int fd;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 39dfc29f0f52..161dfcbcf3b5 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,4 +79,32 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;

+#define EPOLL_USERPOLL_HEADER_MAGIC 0xeb01eb01
+#define EPOLL_USERPOLL_HEADER_SIZE 128
+
+/*
+ * Item, shared with userspace. Unfortunately we can't embed epoll_event
+ * structure, because it is badly aligned on all 64-bit archs, except
+ * x86-64 (see EPOLL_PACKED). sizeof(epoll_uitem) == 16
+ */
+struct epoll_uitem {
+ __poll_t ready_events;
+ __poll_t events;
+ __u64 data;
+};
+
+/*
+ * Header, shared with userspace. sizeof(epoll_uheader) == 128
+ */
+struct epoll_uheader {
+ u32 magic; /* epoll user header magic */
+ u32 header_length; /* length of the header + items */
+ u32 index_length; /* length of the index ring, always pow2 */
+ u32 max_items_nr; /* max number of items */
+ u32 head; /* updated by userland */
+ u32 tail; /* updated by kernel */
+
+ struct epoll_uitem items[] __aligned(EPOLL_USERPOLL_HEADER_SIZE);
+};
+
#endif /* _UAPI_LINUX_EVENTPOLL_H */
--
2.21.0

2019-05-31 09:58:06

by Peter Zijlstra

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
> When new event comes for some epoll item kernel does the following:
>
> struct epoll_uitem *uitem;
>
> /* Each item has a bit (index in user items array), discussed later */
> uitem = user_header->items[epi->bit];
>
> if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
> i = atomic_add(&ep->user_header->tail, 1);
>
> item_idx = &user_index[i & index_mask];
>
> /* Signal with a bit, user spins on index expecting value > 0 */
> *item_idx = idx + 1;
>
> /*
> * Want index update be flushed from CPU write buffer and
> * immediately visible on userspace side to avoid long busy
> * loops.
> */

That is garbage; smp_wmb() does no such thing.

> smp_wmb();
> }

2019-05-31 09:58:16

by Peter Zijlstra

Subject: Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()

On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
> So account new event in user ring.
>
> The tricky part here is EPOLLONESHOT. Since we are lockless we
> have to be deal with ep_poll_callbacks() called in paralle, thus
> use cmpxchg to clear public event bits and filter out concurrent
> call from another cpu.
>
> Signed-off-by: Roman Penyaev <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 2f551c005640..55612da9651e 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
> }
> #endif /* CONFIG_CHECKPOINT_RESTORE */
>
> +/**
> + * Atomically clear public event bits and return %true if the old value has
> + * public event bits set.
> + */
> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> +{
> + __poll_t old, flags;
> +
> + /*
> + * Here we race with ourselves and with ep_modify(), which can
> + * change the event bits. In order not to override events updated
> + * by ep_modify() we have to do cmpxchg.
> + */
> +
> + old = epi->event.events;
> + do {
> + flags = old;
> + } while ((old = cmpxchg(&epi->event.events, flags,
> + flags & EP_PRIVATE_BITS)) != flags);
> +
> + return flags & ~EP_PRIVATE_BITS;
> +}

AFAICT epi->event.events also has normal writes to it, eg. in
ep_modify(). A number of architectures cannot handle concurrent normal
writes and cmpxchg() to the same variable.

2019-05-31 11:24:24

by Roman Penyaev

Subject: Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()

On 2019-05-31 11:56, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
>> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
>> So account new event in user ring.
>>
>> The tricky part here is EPOLLONESHOT. Since we are lockless we
>> have to be deal with ep_poll_callbacks() called in paralle, thus
>> use cmpxchg to clear public event bits and filter out concurrent
>> call from another cpu.
>>
>> Signed-off-by: Roman Penyaev <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 2f551c005640..55612da9651e 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct
>> file *file, int tfd,
>> }
>> #endif /* CONFIG_CHECKPOINT_RESTORE */
>>
>> +/**
>> + * Atomically clear public event bits and return %true if the old
>> value has
>> + * public event bits set.
>> + */
>> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> +{
>> + __poll_t old, flags;
>> +
>> + /*
>> + * Here we race with ourselves and with ep_modify(), which can
>> + * change the event bits. In order not to override events updated
>> + * by ep_modify() we have to do cmpxchg.
>> + */
>> +
>> + old = epi->event.events;
>> + do {
>> + flags = old;
>> + } while ((old = cmpxchg(&epi->event.events, flags,
>> + flags & EP_PRIVATE_BITS)) != flags);
>> +
>> + return flags & ~EP_PRIVATE_BITS;
>> +}
>
> AFAICT epi->event.events also has normal writes to it, eg. in
> ep_modify(). A number of architectures cannot handle concurrent normal
> writes and cmpxchg() to the same variable.

Yes, we race with the current function and with ep_modify(). Then
ep_modify() should do something like the following:

- epi->event.events = event->events
+ xchg(&epi->event.events, event->events);

Is that ok?

Just curious: what are these archs?

Thanks.

--
Roman



2019-05-31 13:08:06

by Peter Zijlstra

Subject: Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()

On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:56, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:

> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> > > +{
> > > + __poll_t old, flags;
> > > +
> > > + /*
> > > + * Here we race with ourselves and with ep_modify(), which can
> > > + * change the event bits. In order not to override events updated
> > > + * by ep_modify() we have to do cmpxchg.
> > > + */
> > > +
> > > + old = epi->event.events;
> > > + do {
> > > + flags = old;
> > > + } while ((old = cmpxchg(&epi->event.events, flags,
> > > + flags & EP_PRIVATE_BITS)) != flags);
> > > +
> > > + return flags & ~EP_PRIVATE_BITS;
> > > +}
> >
> > AFAICT epi->event.events also has normal writes to it, eg. in
> > ep_modify(). A number of architectures cannot handle concurrent normal
> > writes and cmpxchg() to the same variable.
>
> Yes, we race with the current function and with ep_modify(). Then,
> ep_modify()
> should do something as the following:
>
> - epi->event.events = event->events
> + xchg(&epi->event.events, event->events);
>
> Is that ok?

That should be correct, but at that point I think we should also always
read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
then becomes sensible to change the type to atomic_t.

atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
those 'dodgy' platforms.
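
Something like this perhaps (completely untested sketch, with
epi->event.events converted to an atomic_t):

	static inline bool ep_clear_public_event_bits(struct epitem *epi)
	{
		int old, flags;

		old = atomic_read(&epi->event.events);
		do {
			flags = old;
		} while ((old = atomic_cmpxchg(&epi->event.events, flags,
					       flags & EP_PRIVATE_BITS)) != flags);

		return flags & ~EP_PRIVATE_BITS;
	}

and ep_modify() then simply does atomic_set(&epi->event.events, event->events).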

> Just curious: what are these archs?

Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems only
have a single truly atomic op (something from the xchg / test-and-set
family) and the rest is fudged on top of that.

2019-05-31 14:50:29

by Jens Axboe

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 5/16/19 2:57 AM, Roman Penyaev wrote:
> Hi all,
>
> This is v3 which introduces pollable epoll from userspace.
>
> v3:
> - Measurements made, represented below.
>
> - Fix alignment for epoll_uitem structure on all 64-bit archs except
> x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
> is added. (Linus)
>
> - Check pollflags explicitly on 0 inside work callback, and do nothing
> if 0.
>
> v2:
> - No reallocations, the max number of items (thus size of the user ring)
> is specified by the caller.
>
> - Interface is simplified: -ENOSPC is returned on attempt to add a new
> epoll item if number is reached the max, nothing more.
>
> - Alloced pages are accounted using user->locked_vm and limited to
> RLIMIT_MEMLOCK value.
>
> - EPOLLONESHOT is handled.
>
> This series introduces pollable epoll from userspace, i.e. user creates
> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets header
> and ring pointers and then consumes ready events from a ring, avoiding
> epoll_wait() call. When ring is empty, user has to call epoll_wait()
> in order to wait for new events. epoll_wait() returns -ESTALE if user
> ring has events in the ring (kind of indication, that user has to consume
> events from the user ring first, I could not invent anything better than
> returning -ESTALE).
>
> For user header and user ring allocation I used vmalloc_user(). I found
> that it is much easy to reuse remap_vmalloc_range_partial() instead of
> dealing with page cache (like aio.c does). What is also nice is that
> virtual address is properly aligned on SHMLBA, thus there should not be
> any d-cache aliasing problems on archs with vivt or vipt caches.

Why aren't we just adding support to io_uring for this instead? Then we
don't need yet another entirely new ring that's just a little
different from what we have.

I haven't looked into the details of your implementation, just curious
if there's anything that makes using io_uring a non-starter for this
purpose?

--
Jens Axboe

2019-05-31 15:07:53

by Roman Penyaev

Subject: Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()

On 2019-05-31 15:05, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 11:56, Peter Zijlstra wrote:
>> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
>
>> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> > > +{
>> > > + __poll_t old, flags;
>> > > +
>> > > + /*
>> > > + * Here we race with ourselves and with ep_modify(), which can
>> > > + * change the event bits. In order not to override events updated
>> > > + * by ep_modify() we have to do cmpxchg.
>> > > + */
>> > > +
>> > > + old = epi->event.events;
>> > > + do {
>> > > + flags = old;
>> > > + } while ((old = cmpxchg(&epi->event.events, flags,
>> > > + flags & EP_PRIVATE_BITS)) != flags);
>> > > +
>> > > + return flags & ~EP_PRIVATE_BITS;
>> > > +}
>> >
>> > AFAICT epi->event.events also has normal writes to it, eg. in
>> > ep_modify(). A number of architectures cannot handle concurrent normal
>> > writes and cmpxchg() to the same variable.
>>
>> Yes, we race with the current function and with ep_modify(). Then,
>> ep_modify()
>> should do something as the following:
>>
>> - epi->event.events = event->events
>> + xchg(&epi->event.events, event->events);
>>
>> Is that ok?
>
> That should be correct, but at that point I think we should also always
> read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
> then becomes sensible to change the type to atomic_t.

But it seems that if we are afraid of load tearing, that should be fixed
separately, independently of this patchset, because epi->event.events is
already updated in ep_modify() and races with ep_poll_callback(), which reads
the value in a couple of places.

Probably nothing terrible will happen, because eventually the event comes
or is just ignored.


> atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
> those 'dodgy' platforms.
>
>> Just curious: what are these archs?
>
> Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
> arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems
> only
> have a single truly atomic op (something from the xchg / test-and-set
> family) and the rest is fudged on top of that.

Locks, nice.

--
Roman


2019-05-31 16:06:37

by Roman Penyaev

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 2019-05-31 16:48, Jens Axboe wrote:
> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>> Hi all,
>>
>> This is v3 which introduces pollable epoll from userspace.
>>
>> v3:
>> - Measurements made, represented below.
>>
>> - Fix alignment for epoll_uitem structure on all 64-bit archs except
>> x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>> is added. (Linus)
>>
>> - Check pollflags explicitly on 0 inside work callback, and do
>> nothing
>> if 0.
>>
>> v2:
>> - No reallocations, the max number of items (thus size of the user
>> ring)
>> is specified by the caller.
>>
>> - Interface is simplified: -ENOSPC is returned on attempt to add a
>> new
>> epoll item if number is reached the max, nothing more.
>>
>> - Alloced pages are accounted using user->locked_vm and limited to
>> RLIMIT_MEMLOCK value.
>>
>> - EPOLLONESHOT is handled.
>>
>> This series introduces pollable epoll from userspace, i.e. user
>> creates
>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>> header
>> and ring pointers and then consumes ready events from a ring, avoiding
>> epoll_wait() call. When ring is empty, user has to call epoll_wait()
>> in order to wait for new events. epoll_wait() returns -ESTALE if user
>> ring has events in the ring (kind of indication, that user has to
>> consume
>> events from the user ring first, I could not invent anything better
>> than
>> returning -ESTALE).
>>
>> For user header and user ring allocation I used vmalloc_user(). I
>> found
>> that it is much easy to reuse remap_vmalloc_range_partial() instead of
>> dealing with page cache (like aio.c does). What is also nice is that
>> virtual address is properly aligned on SHMLBA, thus there should not
>> be
>> any d-cache aliasing problems on archs with vivt or vipt caches.
>
> Why aren't we just adding support to io_uring for this instead? Then we
> don't need yet another entirely new ring, that's is just a little
> different from what we have.
>
> I haven't looked into the details of your implementation, just curious
> if there's anything that makes using io_uring a non-starter for this
> purpose?

Afaict the main difference is that you do not need to recharge an fd
(submit a new poll request in terms of io_uring): once the fd has been added
to epoll with epoll_ctl() - we get events. When you have thousands of fds -
that should matter.

Another interesting question is how difficult it is to modify existing event
loops in event libraries in order to support recharging (EPOLLONESHOT in
terms of epoll).

Maybe Azat, who maintains libevent, can shed light on this (currently I see
that libevent does not support "EPOLLONESHOT" logic).


--
Roman


2019-05-31 16:35:07

by Peter Zijlstra

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
> When new event comes for some epoll item kernel does the following:
>
> struct epoll_uitem *uitem;
>
> /* Each item has a bit (index in user items array), discussed later */
> uitem = user_header->items[epi->bit];
>
> if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
> i = atomic_add(&ep->user_header->tail, 1);

So this is where you increment tail

>
> item_idx = &user_index[i & index_mask];
>
> /* Signal with a bit, user spins on index expecting value > 0 */
> *item_idx = idx + 1;

IIUC, this is where you write the idx into shared memory, which is
_after_ tail has already been incremented.

> }
>
> Important thing here is that ring can't infinitely grow and corrupt other
> elements, because kernel always checks that item was marked as ready, so
> userspace has to clear ready_events field.
>
> On userside events the following code should be used in order to consume
> events:
>
> tail = READ_ONCE(header->tail);
> for (i = 0; header->head != tail; header->head++) {
> item_idx_ptr = &index[idx & indeces_mask];
>
> /*
> * Spin here till we see valid index
> */
> while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
> ;

Which you then try and fix up by busy waiting for @idx to become !0 ?!

Why not write the idx first, then increment the ->tail, such that when
we see ->tail, we already know idx must be correct?

>
> item = &header->items[idx - 1];
>
> /*
> * Mark index as invalid, that is for userspace only, kernel does not care
> * and will refill this pointer only when observes that event is cleared,
> * which happens below.
> */
> *item_idx_ptr = 0;

That avoids this store too.

>
> /*
> * Fetch data first, if event is cleared by the kernel we drop the data
> * returning false.
> */
> event->data = item->event.data;
> event->events = __atomic_exchange_n(&item->ready_events, 0,
> __ATOMIC_RELEASE);
>
> }

Aside from that, you have to READ/WRITE_ONCE() on ->head, to avoid
load/store tearing.


That would give something like:

kernel:

	slot = atomic_fetch_inc(&ep->slot);
	item_idx = &user_index[slot & idx_mask];
	WRITE_ONCE(*item_idx, idx);
	smp_store_release(&ep->user_header->tail, slot);

userspace:

	tail = smp_load_acquire(&header->tail);
	for (head = READ_ONCE(header->head); head != tail; head++) {
		idx = READ_ONCE(index[head & idx_mask]);
		itemp = &header->items[idx];

		...
	}
	smp_store_release(&header->head, head);


2019-05-31 16:57:43

by Jens Axboe

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>> - Measurements made, represented below.
>>>
>>> - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>> x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>>> is added. (Linus)
>>>
>>> - Check pollflags explicitly on 0 inside work callback, and do
>>> nothing
>>> if 0.
>>>
>>> v2:
>>> - No reallocations, the max number of items (thus size of the user
>>> ring)
>>> is specified by the caller.
>>>
>>> - Interface is simplified: -ENOSPC is returned on attempt to add a
>>> new
>>> epoll item if number is reached the max, nothing more.
>>>
>>> - Alloced pages are accounted using user->locked_vm and limited to
>>> RLIMIT_MEMLOCK value.
>>>
>>> - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. user
>>> creates
>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>> header
>>> and ring pointers and then consumes ready events from a ring, avoiding
>>> epoll_wait() call. When ring is empty, user has to call epoll_wait()
>>> in order to wait for new events. epoll_wait() returns -ESTALE if user
>>> ring has events in the ring (kind of indication, that user has to
>>> consume
>>> events from the user ring first, I could not invent anything better
>>> than
>>> returning -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user(). I
>>> found
>>> that it is much easy to reuse remap_vmalloc_range_partial() instead of
>>> dealing with page cache (like aio.c does). What is also nice is that
>>> virtual address is properly aligned on SHMLBA, thus there should not
>>> be
>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then we
>> don't need yet another entirely new ring, that's is just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
>
> Afaict the main difference is that you do not need to recharge an fd
> (submit new poll request in terms of io_uring): once fd has been added
> to
> epoll with epoll_ctl() - we get events. When you have thousands of fds
> -
> that should matter.
>
> Also interesting question is how difficult to modify existing event
> loops
> in event libraries in order to support recharging (EPOLLONESHOT in terms
> of epoll).
>
> Maybe Azat who maintains libevent can shed light on this (currently I
> see
> that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.

--
Jens Axboe

2019-05-31 18:52:53

by Roman Penyaev

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 2019-05-31 18:33, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
>> When new event comes for some epoll item kernel does the following:
>>
>> struct epoll_uitem *uitem;
>>
>> /* Each item has a bit (index in user items array), discussed later
>> */
>> uitem = user_header->items[epi->bit];
>>
>> if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
>> i = atomic_add(&ep->user_header->tail, 1);
>
> So this is where you increment tail
>
>>
>> item_idx = &user_index[i & index_mask];
>>
>> /* Signal with a bit, user spins on index expecting value > 0 */
>> *item_idx = idx + 1;
>
> IUC, this is where you write the idx into shared memory, which is
> _after_ tail has already been incremented.
>
>> }
>>
>> Important thing here is that ring can't infinitely grow and corrupt
>> other
>> elements, because kernel always checks that item was marked as ready,
>> so
>> userspace has to clear ready_events field.
>>
>> On userside events the following code should be used in order to
>> consume
>> events:
>>
>> tail = READ_ONCE(header->tail);
>> for (i = 0; header->head != tail; header->head++) {
>> item_idx_ptr = &index[idx & indeces_mask];
>>
>> /*
>> * Spin here till we see valid index
>> */
>> while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
>> ;
>
> Which you then try and fix up by busy waiting for @idx to become !0 ?!
>
> Why not write the idx first, then increment the ->tail, such that when
> we see ->tail, we already know idx must be correct?
>
>>
>> item = &header->items[idx - 1];
>>
>> /*
>> * Mark index as invalid, that is for userspace only, kernel does
>> not care
>> * and will refill this pointer only when observes that event is
>> cleared,
>> * which happens below.
>> */
>> *item_idx_ptr = 0;
>
> That avoids this store too.
>
>>
>> /*
>> * Fetch data first, if event is cleared by the kernel we drop
>> the data
>> * returning false.
>> */
>> event->data = item->event.data;
>> event->events = __atomic_exchange_n(&item->ready_events, 0,
>> __ATOMIC_RELEASE);
>>
>> }
>
> Aside from that, you have to READ/WRITE_ONCE() on ->head, to avoid
> load/store tearing.

Yes, clear. Thanks.

>
>
> That would give something like:
>
> kernel:
>
> slot = atomic_fetch_inc(&ep->slot);
> item_idx = &user_index[slot & idx_mask];
> WRITE_ONCE(*item_idx, idx);
> smp_store_release(&ep->user_header->tail, slot);

This can't be called from many CPUs: the tail can be overwritten with an "old"
value (e.g. two CPUs get consecutive slots, the second one stores the new tail
first, then the first one overwrites it with a smaller value, so the tail moves
backwards). That is what I try to solve.

--
Roman

2019-05-31 19:47:05

by Roman Penyaev

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 2019-05-31 18:54, Jens Axboe wrote:
> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>> On 2019-05-31 16:48, Jens Axboe wrote:
>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>> Hi all,
>>>>
>>>> This is v3 which introduces pollable epoll from userspace.
>>>>
>>>> v3:
>>>> - Measurements made, represented below.
>>>>
>>>> - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>> except
>>>> x86-64. epoll_uitem should be always 16 bit, proper
>>>> BUILD_BUG_ON
>>>> is added. (Linus)
>>>>
>>>> - Check pollflags explicitly on 0 inside work callback, and do
>>>> nothing
>>>> if 0.
>>>>
>>>> v2:
>>>> - No reallocations, the max number of items (thus size of the
>>>> user
>>>> ring)
>>>> is specified by the caller.
>>>>
>>>> - Interface is simplified: -ENOSPC is returned on attempt to add
>>>> a
>>>> new
>>>> epoll item if number is reached the max, nothing more.
>>>>
>>>> - Alloced pages are accounted using user->locked_vm and limited
>>>> to
>>>> RLIMIT_MEMLOCK value.
>>>>
>>>> - EPOLLONESHOT is handled.
>>>>
>>>> This series introduces pollable epoll from userspace, i.e. user
>>>> creates
>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>> header
>>>> and ring pointers and then consumes ready events from a ring,
>>>> avoiding
>>>> epoll_wait() call. When ring is empty, user has to call
>>>> epoll_wait()
>>>> in order to wait for new events. epoll_wait() returns -ESTALE if
>>>> user
>>>> ring has events in the ring (kind of indication, that user has to
>>>> consume
>>>> events from the user ring first, I could not invent anything better
>>>> than
>>>> returning -ESTALE).
>>>>
>>>> For user header and user ring allocation I used vmalloc_user(). I
>>>> found
>>>> that it is much easy to reuse remap_vmalloc_range_partial() instead
>>>> of
>>>> dealing with page cache (like aio.c does). What is also nice is
>>>> that
>>>> virtual address is properly aligned on SHMLBA, thus there should not
>>>> be
>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>
>>> Why aren't we just adding support to io_uring for this instead? Then
>>> we
>>> don't need yet another entirely new ring, that's is just a little
>>> different from what we have.
>>>
>>> I haven't looked into the details of your implementation, just
>>> curious
>>> if there's anything that makes using io_uring a non-starter for this
>>> purpose?
>>
>> Afaict the main difference is that you do not need to recharge an fd
>> (submit new poll request in terms of io_uring): once fd has been added
>> to
>> epoll with epoll_ctl() - we get events. When you have thousands of
>> fds
>> -
>> that should matter.
>>
>> Also interesting question is how difficult to modify existing event
>> loops
>> in event libraries in order to support recharging (EPOLLONESHOT in
>> terms
>> of epoll).
>>
>> Maybe Azat who maintains libevent can shed light on this (currently I
>> see
>> that libevent does not support "EPOLLONESHOT" logic).
>
> In terms of existing io_uring poll support, which is what I'm guessing
> you're referring to, it is indeed just one-shot.

Yes, yes.

> But there's no reason why we can't have it persist until explicitly
> canceled with POLL_REMOVE.

It seems not so easy. The main problem is that with only a ring it is
impossible to figure out on the kernel side what event bits have already been
seen by userspace and what bits are new. So every new cqe would have to
be added to the completion ring on each wake_up_interruptible() call
(I mean, when the fd wants to report that something is ready).

IMO that can lead to many duplicate events (tens? hundreds? honestly no
idea), which userspace has to handle with subsequent read/write calls.
It can kill all the performance benefits of a uring.

In uepoll this is solved with another piece of shared memory, where
userspace atomically clears bits and the kernel side sets bits. If the kernel
observes that the bits were clear (i.e. userspace has already consumed the
previous event for this item), a new index is added to the ring; otherwise
the item is already signalled there and no duplicate is produced.

Can we extend the io_uring API to support this behavior? It would also
be great if we could make the event path lockless. With a big number of fds
and frequent events this matters; please take a look, recently I
did some measurements: https://lkml.org/lkml/2018/12/12/305

--
Roman

2019-05-31 21:11:15

by Jens Axboe

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 5/31/19 1:45 PM, Roman Penyaev wrote:
> On 2019-05-31 18:54, Jens Axboe wrote:
>> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>>> On 2019-05-31 16:48, Jens Axboe wrote:
>>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>>> Hi all,
>>>>>
>>>>> This is v3 which introduces pollable epoll from userspace.
>>>>>
>>>>> v3:
>>>>> - Measurements made, represented below.
>>>>>
>>>>> - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>>> except
>>>>> x86-64. epoll_uitem should be always 16 bit, proper
>>>>> BUILD_BUG_ON
>>>>> is added. (Linus)
>>>>>
>>>>> - Check pollflags explicitly on 0 inside work callback, and do
>>>>> nothing
>>>>> if 0.
>>>>>
>>>>> v2:
>>>>> - No reallocations, the max number of items (thus size of the
>>>>> user
>>>>> ring)
>>>>> is specified by the caller.
>>>>>
>>>>> - Interface is simplified: -ENOSPC is returned on attempt to add
>>>>> a
>>>>> new
>>>>> epoll item if number is reached the max, nothing more.
>>>>>
>>>>> - Alloced pages are accounted using user->locked_vm and limited
>>>>> to
>>>>> RLIMIT_MEMLOCK value.
>>>>>
>>>>> - EPOLLONESHOT is handled.
>>>>>
>>>>> This series introduces pollable epoll from userspace, i.e. user
>>>>> creates
>>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>>> header
>>>>> and ring pointers and then consumes ready events from a ring,
>>>>> avoiding
>>>>> epoll_wait() call. When ring is empty, user has to call
>>>>> epoll_wait()
>>>>> in order to wait for new events. epoll_wait() returns -ESTALE if
>>>>> user
>>>>> ring has events in the ring (kind of indication, that user has to
>>>>> consume
>>>>> events from the user ring first, I could not invent anything better
>>>>> than
>>>>> returning -ESTALE).
>>>>>
>>>>> For user header and user ring allocation I used vmalloc_user(). I
>>>>> found
>>>>> that it is much easy to reuse remap_vmalloc_range_partial() instead
>>>>> of
>>>>> dealing with page cache (like aio.c does). What is also nice is
>>>>> that
>>>>> virtual address is properly aligned on SHMLBA, thus there should not
>>>>> be
>>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>>
>>>> Why aren't we just adding support to io_uring for this instead? Then
>>>> we
>>>> don't need yet another entirely new ring, that's is just a little
>>>> different from what we have.
>>>>
>>>> I haven't looked into the details of your implementation, just
>>>> curious
>>>> if there's anything that makes using io_uring a non-starter for this
>>>> purpose?
>>>
>>> Afaict the main difference is that you do not need to recharge an fd
>>> (submit new poll request in terms of io_uring): once fd has been added
>>> to
>>> epoll with epoll_ctl() - we get events. When you have thousands of
>>> fds
>>> -
>>> that should matter.
>>>
>>> Also interesting question is how difficult to modify existing event
>>> loops
>>> in event libraries in order to support recharging (EPOLLONESHOT in
>>> terms
>>> of epoll).
>>>
>>> Maybe Azat who maintains libevent can shed light on this (currently I
>>> see
>>> that libevent does not support "EPOLLONESHOT" logic).
>>
>> In terms of existing io_uring poll support, which is what I'm guessing
>> you're referring to, it is indeed just one-shot.
>
> Yes, yes.
>
>> But there's no reason why we can't have it persist until explicitly
>> canceled with POLL_REMOVE.
>
> It seems not so easy. The main problem is that with only a ring it is
> impossible to figure out on kernel side what event bits have been
> already
> seen by the userspace and what bits are new. So every new cqe has to
> be added to a completion ring on each wake_up_interruptible() call.
> (I mean when fd wants to report that something is ready).
>
> IMO that can lead to many duplicate events (tens? hundreds? honestly no
> idea), which userspace has to handle with subsequent read/write calls.
> It can kill all performance benefits of a uring.
>
> In uepoll this is solved with another piece of shared memory, where
> userspace atomically clears bits and kernel side sets bits. If kernel
> observes that bits were set (i.e. userspace has not seen this event)
> - new index is added to a ring.

Those are good points.

> Can we extend the io_uring API to support this behavior? Also would
> be great if we can make event path lockless. On a big number of fds
> and frequent events - this matters, please take a look, recently I
> did some measurements: https://lkml.org/lkml/2018/12/12/305

Yeah, I'd be happy to entertain that idea, and lockless completions as
well. We already do that for polled IO, but consider any "normal"
completion to be IRQ driven and hence need locking.

--
Jens Axboe

2019-06-05 06:18:50

by Roman Penyaev

Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 2019-05-31 23:09, Jens Axboe wrote:
> On 5/31/19 1:45 PM, Roman Penyaev wrote:
>> On 2019-05-31 18:54, Jens Axboe wrote:
>>> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>>>> On 2019-05-31 16:48, Jens Axboe wrote:
>>>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> This is v3 which introduces pollable epoll from userspace.
>>>>>>
>>>>>> v3:
>>>>>> - Measurements made, represented below.
>>>>>>
>>>>>> - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>>>> except
>>>>>> x86-64. epoll_uitem should be always 16 bit, proper
>>>>>> BUILD_BUG_ON
>>>>>> is added. (Linus)
>>>>>>
>>>>>> - Check pollflags explicitly on 0 inside work callback, and do
>>>>>> nothing
>>>>>> if 0.
>>>>>>
>>>>>> v2:
>>>>>> - No reallocations, the max number of items (thus size of the
>>>>>> user
>>>>>> ring)
>>>>>> is specified by the caller.
>>>>>>
>>>>>> - Interface is simplified: -ENOSPC is returned on attempt to
>>>>>> add
>>>>>> a
>>>>>> new
>>>>>> epoll item if number is reached the max, nothing more.
>>>>>>
>>>>>> - Alloced pages are accounted using user->locked_vm and
>>>>>> limited
>>>>>> to
>>>>>> RLIMIT_MEMLOCK value.
>>>>>>
>>>>>> - EPOLLONESHOT is handled.
>>>>>>
>>>>>> This series introduces pollable epoll from userspace, i.e. user
>>>>>> creates
>>>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>>>> header
>>>>>> and ring pointers and then consumes ready events from a ring,
>>>>>> avoiding
>>>>>> epoll_wait() call. When ring is empty, user has to call
>>>>>> epoll_wait()
>>>>>> in order to wait for new events. epoll_wait() returns -ESTALE if
>>>>>> user
>>>>>> ring has events in the ring (kind of indication, that user has to
>>>>>> consume
>>>>>> events from the user ring first, I could not invent anything
>>>>>> better
>>>>>> than
>>>>>> returning -ESTALE).
>>>>>>
>>>>>> For user header and user ring allocation I used vmalloc_user(). I
>>>>>> found
>>>>>> that it is much easy to reuse remap_vmalloc_range_partial()
>>>>>> instead
>>>>>> of
>>>>>> dealing with page cache (like aio.c does). What is also nice is
>>>>>> that
>>>>>> virtual address is properly aligned on SHMLBA, thus there should
>>>>>> not
>>>>>> be
>>>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>>>
>>>>> Why aren't we just adding support to io_uring for this instead?
>>>>> Then
>>>>> we
>>>>> don't need yet another entirely new ring, that's is just a little
>>>>> different from what we have.
>>>>>
>>>>> I haven't looked into the details of your implementation, just
>>>>> curious
>>>>> if there's anything that makes using io_uring a non-starter for
>>>>> this
>>>>> purpose?
>>>>
>>>> Afaict the main difference is that you do not need to recharge an fd
>>>> (submit new poll request in terms of io_uring): once fd has been
>>>> added
>>>> to
>>>> epoll with epoll_ctl() - we get events. When you have thousands of
>>>> fds
>>>> -
>>>> that should matter.
>>>>
>>>> Also interesting question is how difficult to modify existing event
>>>> loops
>>>> in event libraries in order to support recharging (EPOLLONESHOT in
>>>> terms
>>>> of epoll).
>>>>
>>>> Maybe Azat who maintains libevent can shed light on this (currently
>>>> I
>>>> see
>>>> that libevent does not support "EPOLLONESHOT" logic).
>>>
>>> In terms of existing io_uring poll support, which is what I'm
>>> guessing
>>> you're referring to, it is indeed just one-shot.
>>
>> Yes, yes.
>>
>>> But there's no reason why we can't have it persist until explicitly
>>> canceled with POLL_REMOVE.
>>
>> It seems not so easy. The main problem is that with only a ring it is
>> impossible to figure out on kernel side what event bits have been
>> already
>> seen by the userspace and what bits are new. So every new cqe has to
>> be added to a completion ring on each wake_up_interruptible() call.
>> (I mean when fd wants to report that something is ready).
>>
>> IMO that can lead to many duplicate events (tens? hundreds? honestly
>> no
>> idea), which userspace has to handle with subsequent read/write calls.
>> It can kill all performance benefits of a uring.
>>
>> In uepoll this is solved with another piece of shared memory, where
>> userspace atomically clears bits and kernel side sets bits. If kernel
>> observes that bits were set (i.e. userspace has not seen this event)
>> - new index is added to a ring.
>
> Those are good points.
>
>> Can we extend the io_uring API to support this behavior? Also would
>> be great if we can make event path lockless. On a big number of fds
>> and frequent events - this matters, please take a look, recently I
>> did some measurements: https://lkml.org/lkml/2018/12/12/305
>
> Yeah, I'd be happy to entertain that idea, and lockless completions as
> well. We already do that for polled IO, but consider any "normal"
> completion to be IRQ driven and hence need locking.

I would like to contribute as much as I can. "Subscription" on events
along with a lockless ring seems reasonable to do for io_uring. I still
tend to think that uepoll and io_uring poll can coexist, at least
because it can be difficult to adapt current event libraries to the async
nature of the "add fd" / "remove fd" requests of io_uring, e.g. when
epoll_ctl() is called in order to remove an fd, the caller expects that no
events come after epoll_ctl() returns. Async behavior can break the
event loop. What could help is the ability to wait on a particular request,
which seems not possible without ugly tricks, right? (By ugly tricks
I mean something like: wait for any event, traverse the completion ring
in order to find the particular completion, repeat if nothing is found.)

Also, epoll_ctl() can be called from another thread in order to
add/remove an fd, and I suppose that is also successfully used by event
loop libraries or users of these libraries (not quite sure though, but I
can imagine why it can be useful). Fixing that would require introducing
locks on the submission path of io_uring callers (I mean on the user side,
inside these libraries), which can impact performance for the generic
case (only submission though).

What I want to say is that polling using io_uring can be used in some
new io/event stacks, but adoption by current event libraries can be
non-trivial, where old plain epoll with a ring can be the easiest way.
But of course that's only my speculation.

--
Roman