2012-10-26 00:08:57

by Paton J. Lewis

Subject: [PATCH v3] epoll: Support for disabling items, and a self-test app.

From: "Paton J. Lewis" <[email protected]>

It is not currently possible to reliably delete epoll items when using the
same epoll set from multiple threads. After calling epoll_ctl with
EPOLL_CTL_DEL, another thread might still be executing code related to an
event for that epoll item (in response to epoll_wait). Therefore the deleting
thread does not know when it is safe to delete resources pertaining to the
associated epoll item because another thread might be using those resources.

The deleting thread could wait an arbitrary amount of time after calling
epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
inefficient and could result in the destruction of resources before another
thread is done handling an event returned by epoll_wait.

This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables an
epoll item. If epoll_ctl returns -EBUSY in this case, then another thread may
be handling a return from epoll_wait for this item. Otherwise, if epoll_ctl
returns 0, then it is safe to delete the epoll item. This allows multiple
threads to use a mutex to determine when it is safe to delete an epoll item
and its associated resources, which allows epoll items to be deleted both
efficiently and without error in a multi-threaded environment. Note that
EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT, and using
EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL.

This patch also adds a new test_epoll self-test program to both demonstrate
the need for this feature and test it.

Signed-off-by: Paton J. Lewis <[email protected]>
---
fs/eventpoll.c | 40 ++-
include/linux/eventpoll.h | 1 +
tools/testing/selftests/Makefile | 2 +-
tools/testing/selftests/epoll/Makefile | 11 +
tools/testing/selftests/epoll/test_epoll.c | 364 ++++++++++++++++++++++++++++
5 files changed, 414 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/epoll/Makefile
create mode 100644 tools/testing/selftests/epoll/test_epoll.c

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 739b098..c718afd 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -339,7 +339,7 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p)
/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
static inline int ep_op_has_event(int op)
{
- return op != EPOLL_CTL_DEL;
+ return op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD;
}

/* Initialize the poll safe wake up structure */
@@ -664,6 +664,36 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
return 0;
}

+/*
+ * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY if the item
+ * had no event flags set, indicating that another thread may be currently
+ * handling that item's events (in the case that EPOLLONESHOT was being
+ * used). Otherwise a zero result indicates that the item has been disabled
+ * from receiving events. A disabled item may be re-enabled via
+ * EPOLL_CTL_MOD. Must be called with "mtx" held.
+ */
+static int ep_disable(struct eventpoll *ep, struct epitem *epi)
+{
+ int result = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ep->lock, flags);
+ if (epi->event.events & EPOLLONESHOT) {
+ if (epi->event.events & ~EP_PRIVATE_BITS) {
+ if (ep_is_linked(&epi->rdllink))
+ list_del_init(&epi->rdllink);
+ /* Ensure ep_poll_callback will not add epi back onto
+ ready list: */
+ epi->event.events &= EP_PRIVATE_BITS;
+ } else
+ result = -EBUSY;
+ } else
+ result = -EINVAL;
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ return result;
+}
+
static void ep_free(struct eventpoll *ep)
{
struct rb_node *rbp;
@@ -996,8 +1026,6 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
rb_insert_color(&epi->rbn, &ep->rbr);
}

-
-
#define PATH_ARR_SIZE 5
/*
* These are the number paths of length 1 to 5, that we are allowing to emanate
@@ -1701,6 +1729,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
} else
error = -ENOENT;
break;
+ case EPOLL_CTL_DISABLE:
+ if (epi)
+ error = ep_disable(ep, epi);
+ else
+ error = -ENOENT;
+ break;
}
mutex_unlock(&ep->mtx);

diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 657ab55..e91f7e3 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,6 +25,7 @@
#define EPOLL_CTL_ADD 1
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3
+#define EPOLL_CTL_DISABLE 4

/* Set the One Shot behaviour for the target file descriptor */
#define EPOLLONESHOT (1 << 30)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 28bc57e..4cf477f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -1,4 +1,4 @@
-TARGETS = breakpoints vm
+TARGETS = breakpoints epoll vm

all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/epoll/Makefile b/tools/testing/selftests/epoll/Makefile
new file mode 100644
index 0000000..19806ed
--- /dev/null
+++ b/tools/testing/selftests/epoll/Makefile
@@ -0,0 +1,11 @@
+# Makefile for epoll selftests
+
+all: test_epoll
+%: %.c
+ gcc -pthread -g -o $@ $^
+
+run_tests: all
+ ./test_epoll
+
+clean:
+ $(RM) test_epoll
diff --git a/tools/testing/selftests/epoll/test_epoll.c b/tools/testing/selftests/epoll/test_epoll.c
new file mode 100644
index 0000000..54284eb
--- /dev/null
+++ b/tools/testing/selftests/epoll/test_epoll.c
@@ -0,0 +1,364 @@
+/*
+ * tools/testing/selftests/epoll/test_epoll.c
+ *
+ * Copyright 2012 Adobe Systems Incorporated
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Paton J. Lewis <[email protected]>
+ *
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/socket.h>
+
+/*
+ * A pointer to an epoll_item_private structure will be stored in the epoll
+ * item's event structure so that we can get access to the epoll_item_private
+ * data after calling epoll_wait:
+ */
+struct epoll_item_private {
+ int index; /* Position of this struct within the epoll_items array. */
+ int fd;
+ uint32_t events;
+ pthread_mutex_t mutex; /* Guards the following variables... */
+ int stop;
+ int status; /* Stores any error encountered while handling item. */
+ /* The following variable allows us to test whether we have encountered
+ a problem while attempting to cancel and delete the associated
+ event. When the test program exits, 'deleted' should be exactly
+ one. If it is greater than one, then the failed test reflects a real
+ world situation where we would have tried to access the epoll item's
+ private data after deleting it: */
+ int deleted;
+};
+
+struct epoll_item_private *epoll_items;
+
+/*
+ * Delete the specified item from the epoll set. In a real-world scenario this
+ * is where we would free the associated data structure, but in this testing
+ * environment we retain the structure so that we can test for double-deletion:
+ */
+void delete_item(int index)
+{
+ __sync_fetch_and_add(&epoll_items[index].deleted, 1);
+}
+
+/*
+ * A pointer to a read_thread_data structure will be passed as the argument to
+ * each read thread:
+ */
+struct read_thread_data {
+ int stop;
+ int status; /* Indicates any error encountered by the read thread. */
+ int epoll_set;
+};
+
+/*
+ * The function executed by the read threads:
+ */
+void *read_thread_function(void *function_data)
+{
+ struct read_thread_data *thread_data =
+ (struct read_thread_data *)function_data;
+ struct epoll_event event_data;
+ struct epoll_item_private *item_data;
+ char socket_data;
+
+ /* Handle events until we encounter an error or this thread's 'stop'
+ condition is set: */
+ while (1) {
+ int result = epoll_wait(thread_data->epoll_set,
+ &event_data,
+ 1, /* Number of desired events */
+ 1000); /* Timeout in ms */
+ if (result < 0) {
+ /* Breakpoints signal all threads. Ignore that while
+ debugging: */
+ if (errno == EINTR)
+ continue;
+ thread_data->status = errno;
+ return 0;
+ } else if (thread_data->stop)
+ return 0;
+ else if (result == 0) /* Timeout */
+ continue;
+
+ /* We need the mutex here because checking for the stop
+ condition and re-enabling the epoll item need to be done
+ together as one atomic operation when EPOLL_CTL_DISABLE is
+ available: */
+ item_data = (struct epoll_item_private *)event_data.data.ptr;
+ pthread_mutex_lock(&item_data->mutex);
+
+ /* Remove the item from the epoll set if we want to stop
+ handling that event: */
+ if (item_data->stop)
+ delete_item(item_data->index);
+ else {
+ /* Clear the data that was written to the other end of
+ our non-blocking socket: */
+ do {
+ if (read(item_data->fd, &socket_data, 1) < 1) {
+ if ((errno == EAGAIN) ||
+ (errno == EWOULDBLOCK))
+ break;
+ else
+ goto error_unlock;
+ }
+ } while (item_data->events & EPOLLET);
+
+ /* The item was one-shot, so re-enable it: */
+ event_data.events = item_data->events;
+ if (epoll_ctl(thread_data->epoll_set,
+ EPOLL_CTL_MOD,
+ item_data->fd,
+ &event_data) < 0)
+ goto error_unlock;
+ }
+
+ pthread_mutex_unlock(&item_data->mutex);
+ }
+
+error_unlock:
+ thread_data->status = item_data->status = errno;
+ pthread_mutex_unlock(&item_data->mutex);
+ return 0;
+}
+
+/*
+ * A pointer to a write_thread_data structure will be passed as the argument to
+ * the write thread:
+ */
+struct write_thread_data {
+ int stop;
+ int status; /* Indicates any error encountered by the write thread. */
+ int n_fds;
+ int *fds;
+};
+
+/*
+ * The function executed by the write thread. It writes a single byte to each
+ * socket in turn until the stop condition for this thread is set. If writing to
+ * a socket would block (i.e. errno was EAGAIN), we leave that socket alone for
+ * the moment and just move on to the next socket in the list. We don't care
+ * about the order in which we deliver events to the epoll set. In fact we don't
+ * care about the data we're writing to the pipes at all; we just want to
+ * trigger epoll events:
+ */
+void *write_thread_function(void *function_data)
+{
+ const char data = 'X';
+ int index;
+ struct write_thread_data *thread_data =
+ (struct write_thread_data *)function_data;
+ while (!thread_data->stop)
+ for (index = 0;
+ !thread_data->stop && (index < thread_data->n_fds);
+ ++index)
+ if ((write(thread_data->fds[index], &data, 1) < 1) &&
+ (errno != EAGAIN) &&
+ (errno != EWOULDBLOCK)) {
+ thread_data->status = errno;
+ return 0;
+ }
+ return 0;
+}
+
+/*
+ * Arguments are currently ignored:
+ */
+int main(int argc, char **argv)
+{
+ const int n_read_threads = 100;
+ const int n_epoll_items = 500;
+ int index;
+ int epoll_set = epoll_create1(0);
+ struct write_thread_data write_thread_data = {
+ 0, 0, n_epoll_items, malloc(n_epoll_items * sizeof(int))
+ };
+ struct read_thread_data *read_thread_data =
+ malloc(n_read_threads * sizeof(struct read_thread_data));
+ pthread_t *read_threads = malloc(n_read_threads * sizeof(pthread_t));
+ pthread_t write_thread;
+ int socket_pair[2];
+ struct epoll_event event_data;
+
+ printf("-----------------\n");
+ printf("Running test_epoll\n");
+ printf("-----------------\n");
+
+ epoll_items = malloc(n_epoll_items * sizeof(struct epoll_item_private));
+
+ if (epoll_set < 0 || epoll_items == 0 || write_thread_data.fds == 0 ||
+ read_thread_data == 0 || read_threads == 0)
+ goto error;
+
+ if (sysconf(_SC_NPROCESSORS_ONLN) < 2) {
+ printf("Error: please run this test on a multi-core system.\n");
+ goto error;
+ }
+
+ /* Create the socket pairs and epoll items: */
+ for (index = 0; index < n_epoll_items; ++index) {
+ if (socketpair(AF_UNIX,
+ SOCK_STREAM | SOCK_NONBLOCK,
+ 0,
+ socket_pair) < 0)
+ goto error;
+ write_thread_data.fds[index] = socket_pair[0];
+ epoll_items[index].index = index;
+ epoll_items[index].fd = socket_pair[1];
+ if (pthread_mutex_init(&epoll_items[index].mutex, NULL) != 0)
+ goto error;
+ /* We always use EPOLLONESHOT because this test is currently
+ structured to demonstrate the need for EPOLL_CTL_DISABLE,
+ which only produces useful information in the EPOLLONESHOT
+ case (without EPOLLONESHOT, calling epoll_ctl with
+ EPOLL_CTL_DISABLE will never return EBUSY). If support for
+ testing events without EPOLLONESHOT is desired, it should
+ probably be implemented in a separate unit test. */
+ epoll_items[index].events = EPOLLIN | EPOLLONESHOT;
+ if (index < n_epoll_items / 2)
+ epoll_items[index].events |= EPOLLET;
+ epoll_items[index].stop = 0;
+ epoll_items[index].status = 0;
+ epoll_items[index].deleted = 0;
+ event_data.events = epoll_items[index].events;
+ event_data.data.ptr = &epoll_items[index];
+ if (epoll_ctl(epoll_set,
+ EPOLL_CTL_ADD,
+ epoll_items[index].fd,
+ &event_data) < 0)
+ goto error;
+ }
+
+#ifdef EPOLL_CTL_DISABLE
+ /* Test to make sure that using EPOLL_CTL_DISABLE without EPOLLONESHOT
+ returns a clear error: */
+ if (socketpair(AF_UNIX,
+ SOCK_STREAM | SOCK_NONBLOCK,
+ 0,
+ socket_pair) < 0)
+ goto error;
+ event_data.events = EPOLLIN;
+ event_data.data.ptr = NULL;
+ if (epoll_ctl(epoll_set, EPOLL_CTL_ADD,
+ socket_pair[1], &event_data) < 0)
+ goto error;
+ if ((epoll_ctl(epoll_set, EPOLL_CTL_DISABLE,
+ socket_pair[1], NULL) == 0) || (errno != EINVAL))
+ goto error;
+ if (epoll_ctl(epoll_set, EPOLL_CTL_DEL, socket_pair[1], NULL) != 0)
+ goto error;
+#endif
+
+ /* Create and start the read threads: */
+ for (index = 0; index < n_read_threads; ++index) {
+ read_thread_data[index].stop = 0;
+ read_thread_data[index].status = 0;
+ read_thread_data[index].epoll_set = epoll_set;
+ if (pthread_create(&read_threads[index],
+ NULL,
+ read_thread_function,
+ &read_thread_data[index]) != 0)
+ goto error;
+ }
+
+ if (pthread_create(&write_thread,
+ NULL,
+ write_thread_function,
+ &write_thread_data) != 0)
+ goto error;
+
+ /* Cancel all event pollers: */
+#ifdef EPOLL_CTL_DISABLE
+ for (index = 0; index < n_epoll_items; ++index) {
+ pthread_mutex_lock(&epoll_items[index].mutex);
+ ++epoll_items[index].stop;
+ if (epoll_ctl(epoll_set,
+ EPOLL_CTL_DISABLE,
+ epoll_items[index].fd,
+ NULL) == 0)
+ delete_item(index);
+ else if (errno != EBUSY) {
+ pthread_mutex_unlock(&epoll_items[index].mutex);
+ goto error;
+ }
+ /* EBUSY means events were being handled; allow the other thread
+ to delete the item. */
+ pthread_mutex_unlock(&epoll_items[index].mutex);
+ }
+#else
+ for (index = 0; index < n_epoll_items; ++index) {
+ pthread_mutex_lock(&epoll_items[index].mutex);
+ ++epoll_items[index].stop;
+ pthread_mutex_unlock(&epoll_items[index].mutex);
+ /* Wait in case a thread running read_thread_function is
+ currently executing code between epoll_wait and
+ pthread_mutex_lock with this item. Note that a longer delay
+ would make double-deletion less likely (at the expense of
+ performance), but there is no guarantee that any delay would
+ ever be sufficient. Note also that we delete all event
+ pollers at once for testing purposes, but in a real-world
+ environment we are likely to want to be able to cancel event
+ pollers at arbitrary times. Therefore we can't improve this
+ situation by just splitting this loop into two loops
+ (i.e. signal 'stop' for all items, sleep, and then delete all
+ items). We also can't fix the problem via EPOLL_CTL_DEL
+ because that command can't prevent the case where some other
+ thread is executing read_thread_function within the region
+ mentioned above: */
+ usleep(1);
+ pthread_mutex_lock(&epoll_items[index].mutex);
+ if (!epoll_items[index].deleted)
+ delete_item(index);
+ pthread_mutex_unlock(&epoll_items[index].mutex);
+ }
+#endif
+
+ /* Shut down the read threads: */
+ for (index = 0; index < n_read_threads; ++index)
+ __sync_fetch_and_add(&read_thread_data[index].stop, 1);
+ for (index = 0; index < n_read_threads; ++index) {
+ if (pthread_join(read_threads[index], NULL) != 0)
+ goto error;
+ if (read_thread_data[index].status)
+ goto error;
+ }
+
+ /* Shut down the write thread: */
+ __sync_fetch_and_add(&write_thread_data.stop, 1);
+ if ((pthread_join(write_thread, NULL) != 0) || write_thread_data.status)
+ goto error;
+
+ /* Check for final error conditions: */
+ for (index = 0; index < n_epoll_items; ++index) {
+ if (epoll_items[index].status != 0)
+ goto error;
+ if (pthread_mutex_destroy(&epoll_items[index].mutex) < 0)
+ goto error;
+ }
+ for (index = 0; index < n_epoll_items; ++index)
+ if (epoll_items[index].deleted != 1) {
+ printf("Error: item data deleted %1d times.\n",
+ epoll_items[index].deleted);
+ goto error;
+ }
+
+ printf("[PASS]\n");
+ return 0;
+
+ error:
+ printf("[FAIL]\n");
+ return errno;
+}
--
1.7.9.5


2012-10-29 20:42:27

by Andrew Morton

Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On Thu, 25 Oct 2012 17:08:32 -0700
"Paton J. Lewis" <[email protected]> wrote:

> It is not currently possible to reliably delete epoll items when using the
> same epoll set from multiple threads. After calling epoll_ctl with
> EPOLL_CTL_DEL, another thread might still be executing code related to an
> event for that epoll item (in response to epoll_wait). Therefore the deleting
> thread does not know when it is safe to delete resources pertaining to the
> associated epoll item because another thread might be using those resources.
>
> The deleting thread could wait an arbitrary amount of time after calling
> epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
> inefficient and could result in the destruction of resources before another
> thread is done handling an event returned by epoll_wait.
>
> This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables an
> epoll item. If epoll_ctl returns -EBUSY in this case, then another thread may
> be handling a return from epoll_wait for this item. Otherwise, if epoll_ctl
> returns 0, then it is safe to delete the epoll item. This allows multiple
> threads to use a mutex to determine when it is safe to delete an epoll item
> and its associated resources, which allows epoll items to be deleted both
> efficiently and without error in a multi-threaded environment. Note that
> EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT, and using
> EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL.
>
> This patch also adds a new test_epoll self-test program to both demonstrate
> the need for this feature and test it.

Thanks. I queued this for testing and for consideration for 3.8. I
shall send the revert patches in to Linus to remove this interface
change from 3.7.

2012-10-31 06:32:39

by Michael wang

Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
> From: "Paton J. Lewis" <[email protected]>
>
> It is not currently possible to reliably delete epoll items when using the
> same epoll set from multiple threads. After calling epoll_ctl with
> EPOLL_CTL_DEL, another thread might still be executing code related to an
> event for that epoll item (in response to epoll_wait). Therefore the deleting
> thread does not know when it is safe to delete resources pertaining to the
> associated epoll item because another thread might be using those resources.
>
> The deleting thread could wait an arbitrary amount of time after calling
> epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
> inefficient and could result in the destruction of resources before another
> thread is done handling an event returned by epoll_wait.
>
> This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables an
> epoll item. If epoll_ctl returns -EBUSY in this case, then another thread may
> be handling a return from epoll_wait for this item. Otherwise, if epoll_ctl
> returns 0, then it is safe to delete the epoll item. This allows multiple
> threads to use a mutex to determine when it is safe to delete an epoll item
> and its associated resources, which allows epoll items to be deleted both
> efficiently and without error in a multi-threaded environment. Note that
> EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT, and using
> EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL.
>
> This patch also adds a new test_epoll self-test program to both demonstrate
> the need for this feature and test it.

Hi, Paton

I've just been thinking that maybe we could do it this way.

It seems that currently we depend on epoll_wait() to mark the start of the
safe section and epoll_ctl(MOD) to mark its end, like:

while () {
	epoll_wait()			--------------

	fd event arrived

	clear fd epi->event.events	  safe section

	if (fd need stop)
		continue;
					--------------
	...fd data process...

	epoll_ctl(MOD)			  danger section

	set fd epi->event.events	--------------

	continue;
}

So we have a safe section, and doing the delete work inside that section won't
cause trouble, since there is a stop check directly after it.

Actually what we want is to make sure no one will touch the fd any more
after we DISABLE it.

Then what about adding a ref count and a stop flag to epi, maintained like this:

epoll_wait()

	check user events and
	dec the ref count of fd		---------------------------

	...

fd event arrived			  safe sec if ref count is 0

	if epi stop flag set
		do nothing
	else
		inc epi ref count	---------------------------
		send event

And what DISABLE does is:

	set epi stop flag

	if epi ref count is not 0
		wait until the ref count drops to 0

So after DISABLE returns, we can safely delete anything related to that epi.

One thing is that the user should not change the events info returned by
epoll_wait().

It's just a proposal, but if it works, there will be no ONESHOT limitation
any more ;-)

Regards,
Michael Wang

>
> Signed-off-by: Paton J. Lewis <[email protected]>
> ---
> fs/eventpoll.c | 40 ++-
> include/linux/eventpoll.h | 1 +
> tools/testing/selftests/Makefile | 2 +-
> tools/testing/selftests/epoll/Makefile | 11 +
> tools/testing/selftests/epoll/test_epoll.c | 364 ++++++++++++++++++++++++++++
> 5 files changed, 414 insertions(+), 4 deletions(-)
> create mode 100644 tools/testing/selftests/epoll/Makefile
> create mode 100644 tools/testing/selftests/epoll/test_epoll.c
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 739b098..c718afd 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -339,7 +339,7 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p)
> /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
> static inline int ep_op_has_event(int op)
> {
> - return op != EPOLL_CTL_DEL;
> + return op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD;
> }
>
> /* Initialize the poll safe wake up structure */
> @@ -664,6 +664,36 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> return 0;
> }
>
> +/*
> + * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY if the item
> + * had no event flags set, indicating that another thread may be currently
> + * handling that item's events (in the case that EPOLLONESHOT was being
> + * used). Otherwise a zero result indicates that the item has been disabled
> + * from receiving events. A disabled item may be re-enabled via
> + * EPOLL_CTL_MOD. Must be called with "mtx" held.
> + */
> +static int ep_disable(struct eventpoll *ep, struct epitem *epi)
> +{
> + int result = 0;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&ep->lock, flags);
> + if (epi->event.events & EPOLLONESHOT) {
> + if (epi->event.events & ~EP_PRIVATE_BITS) {
> + if (ep_is_linked(&epi->rdllink))
> + list_del_init(&epi->rdllink);
> + /* Ensure ep_poll_callback will not add epi back onto
> + ready list: */
> + epi->event.events &= EP_PRIVATE_BITS;
> + } else
> + result = -EBUSY;
> + } else
> + result = -EINVAL;
> + spin_unlock_irqrestore(&ep->lock, flags);
> +
> + return result;
> +}
> +
> static void ep_free(struct eventpoll *ep)
> {
> struct rb_node *rbp;
> @@ -996,8 +1026,6 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
> rb_insert_color(&epi->rbn, &ep->rbr);
> }
>
> -
> -
> #define PATH_ARR_SIZE 5
> /*
> * These are the number paths of length 1 to 5, that we are allowing to emanate
> @@ -1701,6 +1729,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> } else
> error = -ENOENT;
> break;
> + case EPOLL_CTL_DISABLE:
> + if (epi)
> + error = ep_disable(ep, epi);
> + else
> + error = -ENOENT;
> + break;
> }
> mutex_unlock(&ep->mtx);
>
> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
> index 657ab55..e91f7e3 100644
> --- a/include/linux/eventpoll.h
> +++ b/include/linux/eventpoll.h
> @@ -25,6 +25,7 @@
> #define EPOLL_CTL_ADD 1
> #define EPOLL_CTL_DEL 2
> #define EPOLL_CTL_MOD 3
> +#define EPOLL_CTL_DISABLE 4
>
> /* Set the One Shot behaviour for the target file descriptor */
> #define EPOLLONESHOT (1 << 30)
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 28bc57e..4cf477f 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -1,4 +1,4 @@
> -TARGETS = breakpoints vm
> +TARGETS = breakpoints epoll vm
>
> all:
> for TARGET in $(TARGETS); do \
> diff --git a/tools/testing/selftests/epoll/Makefile b/tools/testing/selftests/epoll/Makefile
> new file mode 100644
> index 0000000..19806ed
> --- /dev/null
> +++ b/tools/testing/selftests/epoll/Makefile
> @@ -0,0 +1,11 @@
> +# Makefile for epoll selftests
> +
> +all: test_epoll
> +%: %.c
> + gcc -pthread -g -o $@ $^
> +
> +run_tests: all
> + ./test_epoll
> +
> +clean:
> + $(RM) test_epoll
> diff --git a/tools/testing/selftests/epoll/test_epoll.c b/tools/testing/selftests/epoll/test_epoll.c
> new file mode 100644
> index 0000000..54284eb
> --- /dev/null
> +++ b/tools/testing/selftests/epoll/test_epoll.c
> @@ -0,0 +1,364 @@
> +/*
> + * tools/testing/selftests/epoll/test_epoll.c
> + *
> + * Copyright 2012 Adobe Systems Incorporated
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * Paton J. Lewis <[email protected]>
> + *
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <sys/epoll.h>
> +#include <sys/socket.h>
> +
> +/*
> + * A pointer to an epoll_item_private structure will be stored in the epoll
> + * item's event structure so that we can get access to the epoll_item_private
> + * data after calling epoll_wait:
> + */
> +struct epoll_item_private {
> + int index; /* Position of this struct within the epoll_items array. */
> + int fd;
> + uint32_t events;
> + pthread_mutex_t mutex; /* Guards the following variables... */
> + int stop;
> + int status; /* Stores any error encountered while handling item. */
> + /* The following variable allows us to test whether we have encountered
> + a problem while attempting to cancel and delete the associated
> + event. When the test program exits, 'deleted' should be exactly
> + one. If it is greater than one, then the failed test reflects a real
> + world situation where we would have tried to access the epoll item's
> + private data after deleting it: */
> + int deleted;
> +};
> +
> +struct epoll_item_private *epoll_items;
> +
> +/*
> + * Delete the specified item from the epoll set. In a real-world secneario this
> + * is where we would free the associated data structure, but in this testing
> + * environment we retain the structure so that we can test for double-deletion:
> + */
> +void delete_item(int index)
> +{
> + __sync_fetch_and_add(&epoll_items[index].deleted, 1);
> +}
> +
> +/*
> + * A pointer to a read_thread_data structure will be passed as the argument to
> + * each read thread:
> + */
> +struct read_thread_data {
> + int stop;
> + int status; /* Indicates any error encountered by the read thread. */
> + int epoll_set;
> +};
> +
> +/*
> + * The function executed by the read threads:
> + */
> +void *read_thread_function(void *function_data)
> +{
> + struct read_thread_data *thread_data =
> + (struct read_thread_data *)function_data;
> + struct epoll_event event_data;
> + struct epoll_item_private *item_data;
> + char socket_data;
> +
> + /* Handle events until we encounter an error or this thread's 'stop'
> + condition is set: */
> + while (1) {
> + int result = epoll_wait(thread_data->epoll_set,
> + &event_data,
> + 1, /* Number of desired events */
> + 1000); /* Timeout in ms */
> + if (result < 0) {
> + /* Breakpoints signal all threads. Ignore that while
> + debugging: */
> + if (errno == EINTR)
> + continue;
> + thread_data->status = errno;
> + return 0;
> + } else if (thread_data->stop)
> + return 0;
> + else if (result == 0) /* Timeout */
> + continue;
> +
> + /* We need the mutex here because checking for the stop
> + condition and re-enabling the epoll item need to be done
> + together as one atomic operation when EPOLL_CTL_DISABLE is
> + available: */
> + item_data = (struct epoll_item_private *)event_data.data.ptr;
> + pthread_mutex_lock(&item_data->mutex);
> +
> + /* Remove the item from the epoll set if we want to stop
> + handling that event: */
> + if (item_data->stop)
> + delete_item(item_data->index);
> + else {
> + /* Clear the data that was written to the other end of
> + our non-blocking socket: */
> + do {
> + if (read(item_data->fd, &socket_data, 1) < 1) {
> + if ((errno == EAGAIN) ||
> + (errno == EWOULDBLOCK))
> + break;
> + else
> + goto error_unlock;
> + }
> + } while (item_data->events & EPOLLET);
> +
> + /* The item was one-shot, so re-enable it: */
> + event_data.events = item_data->events;
> + if (epoll_ctl(thread_data->epoll_set,
> + EPOLL_CTL_MOD,
> + item_data->fd,
> + &event_data) < 0)
> + goto error_unlock;
> + }
> +
> + pthread_mutex_unlock(&item_data->mutex);
> + }
> +
> +error_unlock:
> + thread_data->status = item_data->status = errno;
> + pthread_mutex_unlock(&item_data->mutex);
> + return 0;
> +}
> +
> +/*
> + * A pointer to a write_thread_data structure will be passed as the argument to
> + * the write thread:
> + */
> +struct write_thread_data {
> + int stop;
> + int status; /* Indicates any error encountered by the write thread. */
> + int n_fds;
> + int *fds;
> +};
> +
> +/*
> + * The function executed by the write thread. It writes a single byte to each
> + * socket in turn until the stop condition for this thread is set. If writing to
> + * a socket would block (i.e. errno was EAGAIN), we leave that socket alone for
> + * the moment and just move on to the next socket in the list. We don't care
> + * about the order in which we deliver events to the epoll set. In fact we don't
> + * care about the data we're writing to the pipes at all; we just want to
> + * trigger epoll events:
> + */
> +void *write_thread_function(void *function_data)
> +{
> + const char data = 'X';
> + int index;
> + struct write_thread_data *thread_data =
> + (struct write_thread_data *)function_data;
> + while (!thread_data->stop)
> + for (index = 0;
> + !thread_data->stop && (index < thread_data->n_fds);
> + ++index)
> + if ((write(thread_data->fds[index], &data, 1) < 1) &&
> + (errno != EAGAIN) &&
> + (errno != EWOULDBLOCK)) {
> + thread_data->status = errno;
> + return;
> + }
> +}
> +
> +/*
> + * Arguments are currently ignored:
> + */
> +int main(int argc, char **argv)
> +{
> + const int n_read_threads = 100;
> + const int n_epoll_items = 500;
> + int index;
> + int epoll_set = epoll_create1(0);
> + struct write_thread_data write_thread_data = {
> + 0, 0, n_epoll_items, malloc(n_epoll_items * sizeof(int))
> + };
> + struct read_thread_data *read_thread_data =
> + malloc(n_read_threads * sizeof(struct read_thread_data));
> + pthread_t *read_threads = malloc(n_read_threads * sizeof(pthread_t));
> + pthread_t write_thread;
> + int socket_pair[2];
> + struct epoll_event event_data;
> +
> + printf("-----------------\n");
> + printf("Running test_epoll\n");
> + printf("-----------------\n");
> +
> + epoll_items = malloc(n_epoll_items * sizeof(struct epoll_item_private));
> +
> + if (epoll_set < 0 || epoll_items == 0 || write_thread_data.fds == 0 ||
> + read_thread_data == 0 || read_threads == 0)
> + goto error;
> +
> + if (sysconf(_SC_NPROCESSORS_ONLN) < 2) {
> + printf("Error: please run this test on a multi-core system.\n");
> + goto error;
> + }
> +
> + /* Create the socket pairs and epoll items: */
> + for (index = 0; index < n_epoll_items; ++index) {
> + if (socketpair(AF_UNIX,
> + SOCK_STREAM | SOCK_NONBLOCK,
> + 0,
> + socket_pair) < 0)
> + goto error;
> + write_thread_data.fds[index] = socket_pair[0];
> + epoll_items[index].index = index;
> + epoll_items[index].fd = socket_pair[1];
> + if (pthread_mutex_init(&epoll_items[index].mutex, NULL) != 0)
> + goto error;
> + /* We always use EPOLLONESHOT because this test is currently
> + structured to demonstrate the need for EPOLL_CTL_DISABLE,
> + which only produces useful information in the EPOLLONESHOT
> + case (without EPOLLONESHOT, calling epoll_ctl with
> + EPOLL_CTL_DISABLE will never return EBUSY). If support for
> + testing events without EPOLLONESHOT is desired, it should
> + probably be implemented in a separate unit test. */
> + epoll_items[index].events = EPOLLIN | EPOLLONESHOT;
> + if (index < n_epoll_items / 2)
> + epoll_items[index].events |= EPOLLET;
> + epoll_items[index].stop = 0;
> + epoll_items[index].status = 0;
> + epoll_items[index].deleted = 0;
> + event_data.events = epoll_items[index].events;
> + event_data.data.ptr = &epoll_items[index];
> + if (epoll_ctl(epoll_set,
> + EPOLL_CTL_ADD,
> + epoll_items[index].fd,
> + &event_data) < 0)
> + goto error;
> + }
> +
> +#ifdef EPOLL_CTL_DISABLE
> + /* Test to make sure that using EPOLL_CTL_DISABLE without EPOLLONESHOT
> + returns a clear error: */
> + if (socketpair(AF_UNIX,
> + SOCK_STREAM | SOCK_NONBLOCK,
> + 0,
> + socket_pair) < 0)
> + goto error;
> + event_data.events = EPOLLIN;
> + event_data.data.ptr = NULL;
> + if (epoll_ctl(epoll_set, EPOLL_CTL_ADD,
> + socket_pair[1], &event_data) < 0)
> + goto error;
> + if ((epoll_ctl(epoll_set, EPOLL_CTL_DISABLE,
> + socket_pair[1], NULL) == 0) || (errno != EINVAL))
> + goto error;
> + if (epoll_ctl(epoll_set, EPOLL_CTL_DEL, socket_pair[1], NULL) != 0)
> + goto error;
> +#endif
> +
> + /* Create and start the read threads: */
> + for (index = 0; index < n_read_threads; ++index) {
> + read_thread_data[index].stop = 0;
> + read_thread_data[index].status = 0;
> + read_thread_data[index].epoll_set = epoll_set;
> + if (pthread_create(&read_threads[index],
> + NULL,
> + read_thread_function,
> + &read_thread_data[index]) != 0)
> + goto error;
> + }
> +
> + if (pthread_create(&write_thread,
> + NULL,
> + write_thread_function,
> + &write_thread_data) != 0)
> + goto error;
> +
> + /* Cancel all event pollers: */
> +#ifdef EPOLL_CTL_DISABLE
> + for (index = 0; index < n_epoll_items; ++index) {
> + pthread_mutex_lock(&epoll_items[index].mutex);
> + ++epoll_items[index].stop;
> + if (epoll_ctl(epoll_set,
> + EPOLL_CTL_DISABLE,
> + epoll_items[index].fd,
> + NULL) == 0)
> + delete_item(index);
> + else if (errno != EBUSY) {
> + pthread_mutex_unlock(&epoll_items[index].mutex);
> + goto error;
> + }
> + /* EBUSY means events were being handled; allow the other thread
> + to delete the item. */
> + pthread_mutex_unlock(&epoll_items[index].mutex);
> + }
> +#else
> + for (index = 0; index < n_epoll_items; ++index) {
> + pthread_mutex_lock(&epoll_items[index].mutex);
> + ++epoll_items[index].stop;
> + pthread_mutex_unlock(&epoll_items[index].mutex);
> + /* Wait in case a thread running read_thread_function is
> + currently executing code between epoll_wait and
> + pthread_mutex_lock with this item. Note that a longer delay
> + would make double-deletion less likely (at the expense of
> + performance), but there is no guarantee that any delay would
> + ever be sufficient. Note also that we delete all event
> + pollers at once for testing purposes, but in a real-world
> + environment we are likely to want to be able to cancel event
> + pollers at arbitrary times. Therefore we can't improve this
> + situation by just splitting this loop into two loops
> + (i.e. signal 'stop' for all items, sleep, and then delete all
> + items). We also can't fix the problem via EPOLL_CTL_DEL
> + because that command can't prevent the case where some other
> + thread is executing read_thread_function within the region
> + mentioned above: */
> + usleep(1);
> + pthread_mutex_lock(&epoll_items[index].mutex);
> + if (!epoll_items[index].deleted)
> + delete_item(index);
> + pthread_mutex_unlock(&epoll_items[index].mutex);
> + }
> +#endif
> +
> + /* Shut down the read threads: */
> + for (index = 0; index < n_read_threads; ++index)
> + __sync_fetch_and_add(&read_thread_data[index].stop, 1);
> + for (index = 0; index < n_read_threads; ++index) {
> + if (pthread_join(read_threads[index], NULL) != 0)
> + goto error;
> + if (read_thread_data[index].status)
> + goto error;
> + }
> +
> + /* Shut down the write thread: */
> + __sync_fetch_and_add(&write_thread_data.stop, 1);
> + if ((pthread_join(write_thread, NULL) != 0) || write_thread_data.status)
> + goto error;
> +
> + /* Check for final error conditions: */
> + for (index = 0; index < n_epoll_items; ++index) {
> + if (epoll_items[index].status != 0)
> + goto error;
> + if (pthread_mutex_destroy(&epoll_items[index].mutex) < 0)
> + goto error;
> + }
> + for (index = 0; index < n_epoll_items; ++index)
> + if (epoll_items[index].deleted != 1) {
> + printf("Error: item data deleted %1d times.\n",
> + epoll_items[index].deleted);
> + goto error;
> + }
> +
> + printf("[PASS]\n");
> + return 0;
> +
> + error:
> + printf("[FAIL]\n");
> + return errno;
> +}
>

2012-10-31 18:57:54

by Paton J. Lewis

[permalink] [raw]
Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On 10/30/12 11:32 PM, Michael Wang wrote:
> On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
>> From: "Paton J. Lewis" <[email protected]>
>>
>> [...]
>
> Hi, Paton
>
> I'm just thinking that maybe we could do it this way.
>
> It seems that currently we depend on epoll_ctl() to indicate the start
> point of the safe section and epoll_wait() for the end point, like:
>
> while () {
>     epoll_wait()                    --------------
>
>     fd event arrived                safe section
>
>     clear fd epi->event.events
>                                     --------------
>     if (fd need stop)
>         continue;
>                                     --------------
>     ...fd data process...
>
>     epoll_ctl(MOD)                  danger section
>
>     set fd epi->event.events        --------------
>
>     continue;
> }
>
> So we have a safe section, and doing the delete work in this section
> won't cause trouble since we have a stop check directly after it.
>
> Actually what we want is to make sure no one will touch the fd any more
> after we DISABLE it.
>
> Then what about adding a ref count and a stop flag in epi, maintained like:
>
> epoll_wait()
>
>     check user events and
>     dec the ref count of fd         ---------------------------
>
>     ...
>
> fd event arrived                    safe sec if ref count is 0
>
>     if epi stop flag set
>         do nothing
>     else
>         inc epi ref count           ---------------------------

The pseudocode you provide below (for "DISABLE") seems to indicate that
this "epi ref count" must be maintained by the kernel. Therefore any
userspace modification of a ref count associated with an epoll item will
require a new or changed kernel API.

> send event
>
> And what DISABLE do is:
>
> set epi stop flag
>
> if epi ref count is not 0
>     wait until ref count be 0

Perhaps I don't fully understand what you're proposing, but I don't
think it's reasonable for a kernel API (epoll_ctl in this case) to block
while waiting for a userspace action (decrementing the ref count) that
might never occur.

Andrew Morton also proposed using ref counting in response to my initial
patch submission; my reply to his proposal might also be applicable to
your proposal. A link to that discussion thread:
http://thread.gmane.org/gmane.linux.kernel/1311457/focus=1315096

Sorry if I am misunderstanding your proposal, but I don't see how it
solves the original problem.

Pat

> So after DISABLE return, we can safely delete any thing related to that epi.
>
> One thing is that the user should not change the events info returned by
> epoll_wait().
>
> It's just a propose, but if it works, there will be no limit on ONESHOT
> any more ;-)
>
> Regards,
> Michael Wang

2012-11-01 00:44:01

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On 11/01/2012 02:57 AM, Paton J. Lewis wrote:
> On 10/30/12 11:32 PM, Michael Wang wrote:
>> On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
>>> From: "Paton J. Lewis" <[email protected]>
>>>
>>> [...]
>>
>> Hi, Paton
>>
>> [...]
>
> The pseudocode you provide below (for "DISABLE") seems to indicate that
> this "epi ref count" must be maintained by the kernel. Therefore any
> userspace modification of a ref count associated with an epoll item will
> require a new or changed kernel API.
>
>> send event
>>
>> And what DISABLE do is:
>>
>> set epi stop flag
>>
>> if epi ref count is not 0
>> wait until ref count be 0
>
> Perhaps I don't fully understand what you're proposing, but I don't
> think it's reasonable for a kernel API (epoll_ctl in this case) to block
> while waiting for a userspace action (decrementing the ref count) that
> might never occur.
>
> Andrew Morton also proposed using ref counting in response to my initial
> patch submission; my reply to his proposal might also be applicable to
> your proposal. A link to that discussion thread:
> http://thread.gmane.org/gmane.linux.kernel/1311457/focus=1315096
>
> Sorry if I am misunderstanding your proposal, but I don't see how it
> solves the original problem.

I'm just trying to find out whether we could use DISABLE without ONESHOT :)

My current understanding is:

1. we actually want to delimit the section between successive
epoll_wait() calls in a while().

2. we can't count on epoll_wait() itself, since no info is passed to the
kernel to indicate whether it was invoked after another epoll_wait() in
the same while().

3. so we need epoll_ctl(MOD) to tell the kernel that the user finished
processing data after epoll_wait(), and which epi that data belongs to.

4. since 3 we need ONESHOT to be enabled.


Is this the reason we rely on ONESHOT to be enabled?


If it is, then suppose we could change epoll_wait() to have the
ability to:

1. indicate whether it is invoked after another epoll_wait() in the same while()

2. indicate which epi has been fired by last epoll_wait()

3. require no API changes for the user

Then we could mark the safe section inside epoll_wait(), and we wouldn't
need MOD or ONESHOT any more. Is that correct?


Regards,
Michael Wang

>
> Pat
>
>> So after DISABLE return, we can safely delete any thing related to
>> that epi.
>>
>> One thing is that the user should not change the events info returned by
>> epoll_wait().
>>
>> It's just a propose, but if it works, there will be no limit on ONESHOT
>> any more ;-)
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> Signed-off-by: Paton J. Lewis <[email protected]>
>>> ---
>>> fs/eventpoll.c | 40 ++-
>>> include/linux/eventpoll.h | 1 +
>>> tools/testing/selftests/Makefile | 2 +-
>>> tools/testing/selftests/epoll/Makefile | 11 +
>>> tools/testing/selftests/epoll/test_epoll.c | 364
>>> ++++++++++++++++++++++++++++
>>> 5 files changed, 414 insertions(+), 4 deletions(-)
>>> create mode 100644 tools/testing/selftests/epoll/Makefile
>>> create mode 100644 tools/testing/selftests/epoll/test_epoll.c
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 739b098..c718afd 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -339,7 +339,7 @@ static inline struct epitem
>>> *ep_item_from_epqueue(poll_table *p)
>>> /* Tells if the epoll_ctl(2) operation needs an event copy from
>>> userspace */
>>> static inline int ep_op_has_event(int op)
>>> {
>>> - return op != EPOLL_CTL_DEL;
>>> + return op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD;
>>> }
>>>
>>> /* Initialize the poll safe wake up structure */
>>> @@ -664,6 +664,36 @@ static int ep_remove(struct eventpoll *ep,
>>> struct epitem *epi)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY
>>> if the item
>>> + * had no event flags set, indicating that another thread may be
>>> currently
>>> + * handling that item's events (in the case that EPOLLONESHOT was being
>>> + * used). Otherwise a zero result indicates that the item has been
>>> disabled
>>> + * from receiving events. A disabled item may be re-enabled via
>>> + * EPOLL_CTL_MOD. Must be called with "mtx" held.
>>> + */
>>> +static int ep_disable(struct eventpoll *ep, struct epitem *epi)
>>> +{
>>> + int result = 0;
>>> + unsigned long flags;
>>> +
>>> + spin_lock_irqsave(&ep->lock, flags);
>>> + if (epi->event.events & EPOLLONESHOT) {
>>> + if (epi->event.events & ~EP_PRIVATE_BITS) {
>>> + if (ep_is_linked(&epi->rdllink))
>>> + list_del_init(&epi->rdllink);
>>> + /* Ensure ep_poll_callback will not add epi
>>> back onto
>>> + ready list: */
>>> + epi->event.events &= EP_PRIVATE_BITS;
>>> + } else
>>> + result = -EBUSY;
>>> + } else
>>> + result = -EINVAL;
>>> + spin_unlock_irqrestore(&ep->lock, flags);
>>> +
>>> + return result;
>>> +}
>>> +
>>> static void ep_free(struct eventpoll *ep)
>>> {
>>> struct rb_node *rbp;
>>> @@ -996,8 +1026,6 @@ static void ep_rbtree_insert(struct eventpoll
>>> *ep, struct epitem *epi)
>>> rb_insert_color(&epi->rbn, &ep->rbr);
>>> }
>>>
>>> -
>>> -
>>> #define PATH_ARR_SIZE 5
>>> /*
>>> * These are the number paths of length 1 to 5, that we are
>>> allowing to emanate
>>> @@ -1701,6 +1729,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op,
>>> int, fd,
>>> } else
>>> error = -ENOENT;
>>> break;
>>> + case EPOLL_CTL_DISABLE:
>>> + if (epi)
>>> + error = ep_disable(ep, epi);
>>> + else
>>> + error = -ENOENT;
>>> + break;
>>> }
>>> mutex_unlock(&ep->mtx);
>>>
>>> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
>>> index 657ab55..e91f7e3 100644
>>> --- a/include/linux/eventpoll.h
>>> +++ b/include/linux/eventpoll.h
>>> @@ -25,6 +25,7 @@
>>> #define EPOLL_CTL_ADD 1
>>> #define EPOLL_CTL_DEL 2
>>> #define EPOLL_CTL_MOD 3
>>> +#define EPOLL_CTL_DISABLE 4
>>>
>>> /* Set the One Shot behaviour for the target file descriptor */
>>> #define EPOLLONESHOT (1 << 30)
>>> diff --git a/tools/testing/selftests/Makefile
>>> b/tools/testing/selftests/Makefile
>>> index 28bc57e..4cf477f 100644
>>> --- a/tools/testing/selftests/Makefile
>>> +++ b/tools/testing/selftests/Makefile
>>> @@ -1,4 +1,4 @@
>>> -TARGETS = breakpoints vm
>>> +TARGETS = breakpoints epoll vm
>>>
>>> all:
>>> for TARGET in $(TARGETS); do \
>>> diff --git a/tools/testing/selftests/epoll/Makefile
>>> b/tools/testing/selftests/epoll/Makefile
>>> new file mode 100644
>>> index 0000000..19806ed
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/epoll/Makefile
>>> @@ -0,0 +1,11 @@
>>> +# Makefile for epoll selftests
>>> +
>>> +all: test_epoll
>>> +%: %.c
>>> + gcc -pthread -g -o $@ $^
>>> +
>>> +run_tests: all
>>> + ./test_epoll
>>> +
>>> +clean:
>>> + $(RM) test_epoll
>>> diff --git a/tools/testing/selftests/epoll/test_epoll.c
>>> b/tools/testing/selftests/epoll/test_epoll.c
>>> new file mode 100644
>>> index 0000000..54284eb
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/epoll/test_epoll.c
>>> @@ -0,0 +1,364 @@
>>> +/*
>>> + * tools/testing/selftests/epoll/test_epoll.c
>>> + *
>>> + * Copyright 2012 Adobe Systems Incorporated
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> modify
>>> + * it under the terms of the GNU General Public License as
>>> published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * Paton J. Lewis <[email protected]>
>>> + *
>>> + */
>>> +
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <pthread.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <unistd.h>
>>> +#include <sys/epoll.h>
>>> +#include <sys/socket.h>
>>> +
>>> +/*
>>> + * A pointer to an epoll_item_private structure will be stored in
>>> the epoll
>>> + * item's event structure so that we can get access to the
>>> epoll_item_private
>>> + * data after calling epoll_wait:
>>> + */
>>> +struct epoll_item_private {
>>> + int index; /* Position of this struct within the epoll_items
>>> array. */
>>> + int fd;
>>> + uint32_t events;
>>> + pthread_mutex_t mutex; /* Guards the following variables... */
>>> + int stop;
>>> + int status; /* Stores any error encountered while handling
>>> item. */
>>> + /* The following variable allows us to test whether we have
>>> encountered
>>> + a problem while attempting to cancel and delete the associated
>>> + event. When the test program exits, 'deleted' should be exactly
>>> + one. If it is greater than one, then the failed test
>>> reflects a real
>>> + world situation where we would have tried to access the
>>> epoll item's
>>> + private data after deleting it: */
>>> + int deleted;
>>> +};
>>> +
>>> +struct epoll_item_private *epoll_items;
>>> +
>>> +/*
>>> + * Delete the specified item from the epoll set. In a real-world
>>> scenario this
>>> + * is where we would free the associated data structure, but in this
>>> testing
>>> + * environment we retain the structure so that we can test for
>>> double-deletion:
>>> + */
>>> +void delete_item(int index)
>>> +{
>>> + __sync_fetch_and_add(&epoll_items[index].deleted, 1);
>>> +}
>>> +
>>> +/*
>>> + * A pointer to a read_thread_data structure will be passed as the
>>> argument to
>>> + * each read thread:
>>> + */
>>> +struct read_thread_data {
>>> + int stop;
>>> + int status; /* Indicates any error encountered by the read
>>> thread. */
>>> + int epoll_set;
>>> +};
>>> +
>>> +/*
>>> + * The function executed by the read threads:
>>> + */
>>> +void *read_thread_function(void *function_data)
>>> +{
>>> + struct read_thread_data *thread_data =
>>> + (struct read_thread_data *)function_data;
>>> + struct epoll_event event_data;
>>> + struct epoll_item_private *item_data;
>>> + char socket_data;
>>> +
>>> + /* Handle events until we encounter an error or this thread's
>>> 'stop'
>>> + condition is set: */
>>> + while (1) {
>>> + int result = epoll_wait(thread_data->epoll_set,
>>> + &event_data,
>>> + 1, /* Number of desired
>>> events */
>>> + 1000); /* Timeout in ms */
>>> + if (result < 0) {
>>> + /* Breakpoints signal all threads. Ignore that
>>> while
>>> + debugging: */
>>> + if (errno == EINTR)
>>> + continue;
>>> + thread_data->status = errno;
>>> + return 0;
>>> + } else if (thread_data->stop)
>>> + return 0;
>>> + else if (result == 0) /* Timeout */
>>> + continue;
>>> +
>>> + /* We need the mutex here because checking for the stop
>>> + condition and re-enabling the epoll item need to be
>>> done
>>> + together as one atomic operation when
>>> EPOLL_CTL_DISABLE is
>>> + available: */
>>> + item_data = (struct epoll_item_private
>>> *)event_data.data.ptr;
>>> + pthread_mutex_lock(&item_data->mutex);
>>> +
>>> + /* Remove the item from the epoll set if we want to stop
>>> + handling that event: */
>>> + if (item_data->stop)
>>> + delete_item(item_data->index);
>>> + else {
>>> + /* Clear the data that was written to the other
>>> end of
>>> + our non-blocking socket: */
>>> + do {
>>> + if (read(item_data->fd, &socket_data,
>>> 1) < 1) {
>>> + if ((errno == EAGAIN) ||
>>> + (errno == EWOULDBLOCK))
>>> + break;
>>> + else
>>> + goto error_unlock;
>>> + }
>>> + } while (item_data->events & EPOLLET);
>>> +
>>> + /* The item was one-shot, so re-enable it: */
>>> + event_data.events = item_data->events;
>>> + if (epoll_ctl(thread_data->epoll_set,
>>> + EPOLL_CTL_MOD,
>>> + item_data->fd,
>>> + &event_data) < 0)
>>> + goto error_unlock;
>>> + }
>>> +
>>> + pthread_mutex_unlock(&item_data->mutex);
>>> + }
>>> +
>>> +error_unlock:
>>> + thread_data->status = item_data->status = errno;
>>> + pthread_mutex_unlock(&item_data->mutex);
>>> + return 0;
>>> +}
>>> +
>>> +/*
>>> + * A pointer to a write_thread_data structure will be passed as the
>>> argument to
>>> + * the write thread:
>>> + */
>>> +struct write_thread_data {
>>> + int stop;
>>> + int status; /* Indicates any error encountered by the write
>>> thread. */
>>> + int n_fds;
>>> + int *fds;
>>> +};
>>> +
>>> +/*
>>> + * The function executed by the write thread. It writes a single
>>> byte to each
>>> + * socket in turn until the stop condition for this thread is set.
>>> If writing to
>>> + * a socket would block (i.e. errno was EAGAIN), we leave that
>>> socket alone for
>>> + * the moment and just move on to the next socket in the list. We
>>> don't care
>>> + * about the order in which we deliver events to the epoll set. In
>>> fact we don't
>>> + * care about the data we're writing to the pipes at all; we just
>>> want to
>>> + * trigger epoll events:
>>> + */
>>> +void *write_thread_function(void *function_data)
>>> +{
>>> + const char data = 'X';
>>> + int index;
>>> + struct write_thread_data *thread_data =
>>> + (struct write_thread_data *)function_data;
>>> + while (!thread_data->stop)
>>> + for (index = 0;
>>> + !thread_data->stop && (index < thread_data->n_fds);
>>> + ++index)
>>> + if ((write(thread_data->fds[index], &data, 1) <
>>> 1) &&
>>> + (errno != EAGAIN) &&
>>> + (errno != EWOULDBLOCK)) {
>>> + thread_data->status = errno;
>>> + return 0;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +/*
>>> + * Arguments are currently ignored:
>>> + */
>>> +int main(int argc, char **argv)
>>> +{
>>> + const int n_read_threads = 100;
>>> + const int n_epoll_items = 500;
>>> + int index;
>>> + int epoll_set = epoll_create1(0);
>>> + struct write_thread_data write_thread_data = {
>>> + 0, 0, n_epoll_items, malloc(n_epoll_items * sizeof(int))
>>> + };
>>> + struct read_thread_data *read_thread_data =
>>> + malloc(n_read_threads * sizeof(struct read_thread_data));
>>> + pthread_t *read_threads = malloc(n_read_threads *
>>> sizeof(pthread_t));
>>> + pthread_t write_thread;
>>> + int socket_pair[2];
>>> + struct epoll_event event_data;
>>> +
>>> + printf("-----------------\n");
>>> + printf("Running test_epoll\n");
>>> + printf("-----------------\n");
>>> +
>>> + epoll_items = malloc(n_epoll_items * sizeof(struct
>>> epoll_item_private));
>>> +
>>> + if (epoll_set < 0 || epoll_items == 0 || write_thread_data.fds
>>> == 0 ||
>>> + read_thread_data == 0 || read_threads == 0)
>>> + goto error;
>>> +
>>> + if (sysconf(_SC_NPROCESSORS_ONLN) < 2) {
>>> + printf("Error: please run this test on a multi-core
>>> system.\n");
>>> + goto error;
>>> + }
>>> +
>>> + /* Create the socket pairs and epoll items: */
>>> + for (index = 0; index < n_epoll_items; ++index) {
>>> + if (socketpair(AF_UNIX,
>>> + SOCK_STREAM | SOCK_NONBLOCK,
>>> + 0,
>>> + socket_pair) < 0)
>>> + goto error;
>>> + write_thread_data.fds[index] = socket_pair[0];
>>> + epoll_items[index].index = index;
>>> + epoll_items[index].fd = socket_pair[1];
>>> + if (pthread_mutex_init(&epoll_items[index].mutex, NULL)
>>> != 0)
>>> + goto error;
>>> + /* We always use EPOLLONESHOT because this test is
>>> currently
>>> + structured to demonstrate the need for
>>> EPOLL_CTL_DISABLE,
>>> + which only produces useful information in the
>>> EPOLLONESHOT
>>> + case (without EPOLLONESHOT, calling epoll_ctl with
>>> + EPOLL_CTL_DISABLE will never return EBUSY). If
>>> support for
>>> + testing events without EPOLLONESHOT is desired, it
>>> should
>>> + probably be implemented in a separate unit test. */
>>> + epoll_items[index].events = EPOLLIN | EPOLLONESHOT;
>>> + if (index < n_epoll_items / 2)
>>> + epoll_items[index].events |= EPOLLET;
>>> + epoll_items[index].stop = 0;
>>> + epoll_items[index].status = 0;
>>> + epoll_items[index].deleted = 0;
>>> + event_data.events = epoll_items[index].events;
>>> + event_data.data.ptr = &epoll_items[index];
>>> + if (epoll_ctl(epoll_set,
>>> + EPOLL_CTL_ADD,
>>> + epoll_items[index].fd,
>>> + &event_data) < 0)
>>> + goto error;
>>> + }
>>> +
>>> +#ifdef EPOLL_CTL_DISABLE
>>> + /* Test to make sure that using EPOLL_CTL_DISABLE without
>>> EPOLLONESHOT
>>> + returns a clear error: */
>>> + if (socketpair(AF_UNIX,
>>> + SOCK_STREAM | SOCK_NONBLOCK,
>>> + 0,
>>> + socket_pair) < 0)
>>> + goto error;
>>> + event_data.events = EPOLLIN;
>>> + event_data.data.ptr = NULL;
>>> + if (epoll_ctl(epoll_set, EPOLL_CTL_ADD,
>>> + socket_pair[1], &event_data) < 0)
>>> + goto error;
>>> + if ((epoll_ctl(epoll_set, EPOLL_CTL_DISABLE,
>>> + socket_pair[1], NULL) == 0) || (errno != EINVAL))
>>> + goto error;
>>> + if (epoll_ctl(epoll_set, EPOLL_CTL_DEL, socket_pair[1], NULL)
>>> != 0)
>>> + goto error;
>>> +#endif
>>> +
>>> + /* Create and start the read threads: */
>>> + for (index = 0; index < n_read_threads; ++index) {
>>> + read_thread_data[index].stop = 0;
>>> + read_thread_data[index].status = 0;
>>> + read_thread_data[index].epoll_set = epoll_set;
>>> + if (pthread_create(&read_threads[index],
>>> + NULL,
>>> + read_thread_function,
>>> + &read_thread_data[index]) != 0)
>>> + goto error;
>>> + }
>>> +
>>> + if (pthread_create(&write_thread,
>>> + NULL,
>>> + write_thread_function,
>>> + &write_thread_data) != 0)
>>> + goto error;
>>> +
>>> + /* Cancel all event pollers: */
>>> +#ifdef EPOLL_CTL_DISABLE
>>> + for (index = 0; index < n_epoll_items; ++index) {
>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>> + ++epoll_items[index].stop;
>>> + if (epoll_ctl(epoll_set,
>>> + EPOLL_CTL_DISABLE,
>>> + epoll_items[index].fd,
>>> + NULL) == 0)
>>> + delete_item(index);
>>> + else if (errno != EBUSY) {
>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>> + goto error;
>>> + }
>>> + /* EBUSY means events were being handled; allow the
>>> other thread
>>> + to delete the item. */
>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>> + }
>>> +#else
>>> + for (index = 0; index < n_epoll_items; ++index) {
>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>> + ++epoll_items[index].stop;
>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>> + /* Wait in case a thread running read_thread_function is
>>> + currently executing code between epoll_wait and
>>> + pthread_mutex_lock with this item. Note that a
>>> longer delay
>>> + would make double-deletion less likely (at the
>>> expense of
>>> + performance), but there is no guarantee that any
>>> delay would
>>> + ever be sufficient. Note also that we delete all event
>>> + pollers at once for testing purposes, but in a
>>> real-world
>>> + environment we are likely to want to be able to
>>> cancel event
>>> + pollers at arbitrary times. Therefore we can't
>>> improve this
>>> + situation by just splitting this loop into two loops
>>> + (i.e. signal 'stop' for all items, sleep, and then
>>> delete all
>>> + items). We also can't fix the problem via EPOLL_CTL_DEL
>>> + because that command can't prevent the case where
>>> some other
>>> + thread is executing read_thread_function within the
>>> region
>>> + mentioned above: */
>>> + usleep(1);
>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>> + if (!epoll_items[index].deleted)
>>> + delete_item(index);
>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>> + }
>>> +#endif
>>> +
>>> + /* Shut down the read threads: */
>>> + for (index = 0; index < n_read_threads; ++index)
>>> + __sync_fetch_and_add(&read_thread_data[index].stop, 1);
>>> + for (index = 0; index < n_read_threads; ++index) {
>>> + if (pthread_join(read_threads[index], NULL) != 0)
>>> + goto error;
>>> + if (read_thread_data[index].status)
>>> + goto error;
>>> + }
>>> +
>>> + /* Shut down the write thread: */
>>> + __sync_fetch_and_add(&write_thread_data.stop, 1);
>>> + if ((pthread_join(write_thread, NULL) != 0) ||
>>> write_thread_data.status)
>>> + goto error;
>>> +
>>> + /* Check for final error conditions: */
>>> + for (index = 0; index < n_epoll_items; ++index) {
>>> + if (epoll_items[index].status != 0)
>>> + goto error;
>>> + if (pthread_mutex_destroy(&epoll_items[index].mutex) < 0)
>>> + goto error;
>>> + }
>>> + for (index = 0; index < n_epoll_items; ++index)
>>> + if (epoll_items[index].deleted != 1) {
>>> + printf("Error: item data deleted %1d times.\n",
>>> + epoll_items[index].deleted);
>>> + goto error;
>>> + }
>>> +
>>> + printf("[PASS]\n");
>>> + return 0;
>>> +
>>> + error:
>>> + printf("[FAIL]\n");
>>> + return errno;
>>> +}
>>>
>>
>> .
>>
>

2012-11-01 18:48:01

by Paton J. Lewis

[permalink] [raw]
Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On 10/31/12 5:43 PM, Michael Wang wrote:
> On 11/01/2012 02:57 AM, Paton J. Lewis wrote:
>> On 10/30/12 11:32 PM, Michael Wang wrote:
>>> On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
>>>> From: "Paton J. Lewis" <[email protected]>
>>>>
>>>> It is not currently possible to reliably delete epoll items when
>>>> using the
>>>> same epoll set from multiple threads. After calling epoll_ctl with
>>>> EPOLL_CTL_DEL, another thread might still be executing code related
>>>> to an
>>>> event for that epoll item (in response to epoll_wait). Therefore the
>>>> deleting
>>>> thread does not know when it is safe to delete resources pertaining
>>>> to the
>>>> associated epoll item because another thread might be using those
>>>> resources.
>>>>
>>>> The deleting thread could wait an arbitrary amount of time after calling
>>>> epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
>>>> inefficient and could result in the destruction of resources before
>>>> another
>>>> thread is done handling an event returned by epoll_wait.
>>>>
>>>> This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which
>>>> disables an
>>>> epoll item. If epoll_ctl returns -EBUSY in this case, then another
>>>> thread may
>>>> be handling a return from epoll_wait for this item. Otherwise if epoll_ctl
>>>> returns 0, then it is safe to delete the epoll item. This allows
>>>> multiple
>>>> threads to use a mutex to determine when it is safe to delete an
>>>> epoll item
>>>> and its associated resources, which allows epoll items to be deleted
>>>> both
>>>> efficiently and without error in a multi-threaded environment. Note that
>>>> EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT,
>>>> and using
>>>> EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL.
>>>>
>>>> This patch also adds a new test_epoll self-test program to both
>>>> demonstrate
>>>> the need for this feature and test it.
>>>
>>> Hi, Paton
>>>
>>> I'm just thinking that maybe we could do it this way.
>>>
>>> It seems that currently we depend on epoll_ctl() to mark the start of
>>> the safe section and epoll_wait() to mark the end, like:
>>>
>>> while () {
>>> epoll_wait() --------------
>>>
>>> fd event arrived safe section
>>>
>>> clear fd epi->event.events
>>> --------------
>>> if (fd need stop)
>>> continue;
>>> --------------
>>> ...fd data process...
>>>
>>> epoll_ctl(MOD) danger section
>>>
>>> set fd epi->event.events --------------
>>>
>>> continue;
>>> }
>>>
>>> So we get a safe section, and doing the delete work in this section
>>> won't cause trouble since we have a stop check directly after it.
>>>
>>> Actually what we want is to make sure no one will touch the fd any more
>>> after we DISABLE it.
>>>
>>> Then what if we add a ref count and a stop flag to epi, and maintain
>>> them like:
>>>
>>> epoll_wait()
>>>
>>> check user events and
>>> dec the ref count of fd ---------------------------
>>>
>>> ...
>>>
>>> fd event arrived safe sec if ref count is 0
>>>
>>> if epi stop flag set
>>> do nothing
>>> else
>>> inc epi ref count ---------------------------
>>
>> The pseudocode you provide below (for "DISABLE") seems to indicate that
>> this "epi ref count" must be maintained by the kernel. Therefore any
>> userspace modification of a ref count associated with an epoll item will
>> require a new or changed kernel API.
>>
>>> send event
>>>
>>> And what DISABLE do is:
>>>
>>> set epi stop flag
>>>
>>> if epi ref count is not 0
>>> wait until ref count be 0
>>
>> Perhaps I don't fully understand what you're proposing, but I don't
>> think it's reasonable for a kernel API (epoll_ctl in this case) to block
>> while waiting for a userspace action (decrementing the ref count) that
>> might never occur.
>>
>> Andrew Morton also proposed using ref counting in response to my initial
>> patch submission; my reply to his proposal might also be applicable to
>> your proposal. A link to that discussion thread:
>> http://thread.gmane.org/gmane.linux.kernel/1311457/focus=1315096
>>
>> Sorry if I am misunderstanding your proposal, but I don't see how it
>> solves the original problem.
>
> I was just trying to find out whether we could use DISABLE without ONESHOT :)
>
> My current understanding is:
>
> 1. We actually want to protect the section between successive epoll_wait()
> calls in a while() loop.
>
> 2. We can't rely on epoll_wait() itself, since no information is passed to
> the kernel to indicate whether it was invoked after another epoll_wait()
> in the same while() loop.
>
> 3. So we need epoll_ctl(MOD) to tell the kernel that the user has finished
> processing the data returned by epoll_wait(), and which epi that data
> belongs to.
>
> 4. Because of 3, we need ONESHOT to be enabled.
>
>
> Is this the reason we rely on ONESHOT being enabled?

No, we need to use EPOLLONESHOT to ensure that only one worker thread at
a time can ever be handling private data associated with a given fd.
This constraint allows a deletion thread to coordinate with that worker
thread via a mutex to cleanly and quickly delete the associated private
data (provided that the deletion thread knows whether or not such a
worker thread exists in that state, which is the question that
EPOLL_CTL_DISABLE answers).

Pat

> If so, then suppose we changed epoll_wait() to:
>
> 1. indicate whether it was invoked after another epoll_wait() in the same
> while() loop
>
> 2. indicate which epi was fired by the last epoll_wait()
>
> 3. require no changes to the API for the user
>
> Then we could mark the safe section inside epoll_wait(), and we wouldn't
> need MOD or ONESHOT any more. Is that correct?
>
>
> Regards,
> Michael Wang
>
>>
>> Pat
>>
>>> So after DISABLE returns, we can safely delete anything related to
>>> that epi.
>>>
>>> One thing is that the user should not change the events info returned by
>>> epoll_wait().
>>>
>>> It's just a proposal, but if it works, there will be no limit on ONESHOT
>>> any more ;-)
>>>
>>> Regards,
>>> Michael Wang
>>>
>>>>
>>>> Signed-off-by: Paton J. Lewis <[email protected]>
>>>> ---
>>>> fs/eventpoll.c | 40 ++-
>>>> include/linux/eventpoll.h | 1 +
>>>> tools/testing/selftests/Makefile | 2 +-
>>>> tools/testing/selftests/epoll/Makefile | 11 +
>>>> tools/testing/selftests/epoll/test_epoll.c | 364
>>>> ++++++++++++++++++++++++++++
>>>> 5 files changed, 414 insertions(+), 4 deletions(-)
>>>> create mode 100644 tools/testing/selftests/epoll/Makefile
>>>> create mode 100644 tools/testing/selftests/epoll/test_epoll.c
>>>>
>>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>>> index 739b098..c718afd 100644
>>>> --- a/fs/eventpoll.c
>>>> +++ b/fs/eventpoll.c
>>>> @@ -339,7 +339,7 @@ static inline struct epitem
>>>> *ep_item_from_epqueue(poll_table *p)
>>>> /* Tells if the epoll_ctl(2) operation needs an event copy from
>>>> userspace */
>>>> static inline int ep_op_has_event(int op)
>>>> {
>>>> - return op != EPOLL_CTL_DEL;
>>>> + return op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD;
>>>> }
>>>>
>>>> /* Initialize the poll safe wake up structure */
>>>> @@ -664,6 +664,36 @@ static int ep_remove(struct eventpoll *ep,
>>>> struct epitem *epi)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY
>>>> if the item
>>>> + * had no event flags set, indicating that another thread may be
>>>> currently
>>>> + * handling that item's events (in the case that EPOLLONESHOT was being
>>>> + * used). Otherwise a zero result indicates that the item has been
>>>> disabled
>>>> + * from receiving events. A disabled item may be re-enabled via
>>>> + * EPOLL_CTL_MOD. Must be called with "mtx" held.
>>>> + */
>>>> +static int ep_disable(struct eventpoll *ep, struct epitem *epi)
>>>> +{
>>>> + int result = 0;
>>>> + unsigned long flags;
>>>> +
>>>> + spin_lock_irqsave(&ep->lock, flags);
>>>> + if (epi->event.events & EPOLLONESHOT) {
>>>> + if (epi->event.events & ~EP_PRIVATE_BITS) {
>>>> + if (ep_is_linked(&epi->rdllink))
>>>> + list_del_init(&epi->rdllink);
>>>> + /* Ensure ep_poll_callback will not add epi
>>>> back onto
>>>> + ready list: */
>>>> + epi->event.events &= EP_PRIVATE_BITS;
>>>> + } else
>>>> + result = -EBUSY;
>>>> + } else
>>>> + result = -EINVAL;
>>>> + spin_unlock_irqrestore(&ep->lock, flags);
>>>> +
>>>> + return result;
>>>> +}
>>>> +
>>>> static void ep_free(struct eventpoll *ep)
>>>> {
>>>> struct rb_node *rbp;
>>>> @@ -996,8 +1026,6 @@ static void ep_rbtree_insert(struct eventpoll
>>>> *ep, struct epitem *epi)
>>>> rb_insert_color(&epi->rbn, &ep->rbr);
>>>> }
>>>>
>>>> -
>>>> -
>>>> #define PATH_ARR_SIZE 5
>>>> /*
>>>> * These are the number paths of length 1 to 5, that we are
>>>> allowing to emanate
>>>> @@ -1701,6 +1729,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op,
>>>> int, fd,
>>>> } else
>>>> error = -ENOENT;
>>>> break;
>>>> + case EPOLL_CTL_DISABLE:
>>>> + if (epi)
>>>> + error = ep_disable(ep, epi);
>>>> + else
>>>> + error = -ENOENT;
>>>> + break;
>>>> }
>>>> mutex_unlock(&ep->mtx);
>>>>
>>>> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
>>>> index 657ab55..e91f7e3 100644
>>>> --- a/include/linux/eventpoll.h
>>>> +++ b/include/linux/eventpoll.h
>>>> @@ -25,6 +25,7 @@
>>>> #define EPOLL_CTL_ADD 1
>>>> #define EPOLL_CTL_DEL 2
>>>> #define EPOLL_CTL_MOD 3
>>>> +#define EPOLL_CTL_DISABLE 4
>>>>
>>>> /* Set the One Shot behaviour for the target file descriptor */
>>>> #define EPOLLONESHOT (1 << 30)
>>>> diff --git a/tools/testing/selftests/Makefile
>>>> b/tools/testing/selftests/Makefile
>>>> index 28bc57e..4cf477f 100644
>>>> --- a/tools/testing/selftests/Makefile
>>>> +++ b/tools/testing/selftests/Makefile
>>>> @@ -1,4 +1,4 @@
>>>> -TARGETS = breakpoints vm
>>>> +TARGETS = breakpoints epoll vm
>>>>
>>>> all:
>>>> for TARGET in $(TARGETS); do \
>>>> diff --git a/tools/testing/selftests/epoll/Makefile
>>>> b/tools/testing/selftests/epoll/Makefile
>>>> new file mode 100644
>>>> index 0000000..19806ed
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/epoll/Makefile
>>>> @@ -0,0 +1,11 @@
>>>> +# Makefile for epoll selftests
>>>> +
>>>> +all: test_epoll
>>>> +%: %.c
>>>> + gcc -pthread -g -o $@ $^
>>>> +
>>>> +run_tests: all
>>>> + ./test_epoll
>>>> +
>>>> +clean:
>>>> + $(RM) test_epoll
>>>> diff --git a/tools/testing/selftests/epoll/test_epoll.c
>>>> b/tools/testing/selftests/epoll/test_epoll.c
>>>> new file mode 100644
>>>> index 0000000..54284eb
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/epoll/test_epoll.c
>>>> @@ -0,0 +1,364 @@
>>>> +/*
>>>> + * tools/testing/selftests/epoll/test_epoll.c
>>>> + *
>>>> + * Copyright 2012 Adobe Systems Incorporated
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or
>>>> modify
>>>> + * it under the terms of the GNU General Public License as
>>>> published by
>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>> + * (at your option) any later version.
>>>> + *
>>>> + * Paton J. Lewis <[email protected]>
>>>> + *
>>>> + */
>>>> +
>>>> +#include <errno.h>
>>>> +#include <fcntl.h>
>>>> +#include <pthread.h>
>>>> +#include <stdio.h>
>>>> +#include <stdlib.h>
>>>> +#include <unistd.h>
>>>> +#include <sys/epoll.h>
>>>> +#include <sys/socket.h>
>>>> +
>>>> +/*
>>>> + * A pointer to an epoll_item_private structure will be stored in
>>>> the epoll
>>>> + * item's event structure so that we can get access to the
>>>> epoll_item_private
>>>> + * data after calling epoll_wait:
>>>> + */
>>>> +struct epoll_item_private {
>>>> + int index; /* Position of this struct within the epoll_items
>>>> array. */
>>>> + int fd;
>>>> + uint32_t events;
>>>> + pthread_mutex_t mutex; /* Guards the following variables... */
>>>> + int stop;
>>>> + int status; /* Stores any error encountered while handling
>>>> item. */
>>>> + /* The following variable allows us to test whether we have
>>>> encountered
>>>> + a problem while attempting to cancel and delete the associated
>>>> + event. When the test program exits, 'deleted' should be exactly
>>>> + one. If it is greater than one, then the failed test
>>>> reflects a real
>>>> + world situation where we would have tried to access the
>>>> epoll item's
>>>> + private data after deleting it: */
>>>> + int deleted;
>>>> +};
>>>> +
>>>> +struct epoll_item_private *epoll_items;
>>>> +
>>>> +/*
>>>> + * Delete the specified item from the epoll set. In a real-world
>>>> scenario this
>>>> + * is where we would free the associated data structure, but in this
>>>> testing
>>>> + * environment we retain the structure so that we can test for
>>>> double-deletion:
>>>> + */
>>>> +void delete_item(int index)
>>>> +{
>>>> + __sync_fetch_and_add(&epoll_items[index].deleted, 1);
>>>> +}
>>>> +
>>>> +/*
>>>> + * A pointer to a read_thread_data structure will be passed as the
>>>> argument to
>>>> + * each read thread:
>>>> + */
>>>> +struct read_thread_data {
>>>> + int stop;
>>>> + int status; /* Indicates any error encountered by the read
>>>> thread. */
>>>> + int epoll_set;
>>>> +};
>>>> +
>>>> +/*
>>>> + * The function executed by the read threads:
>>>> + */
>>>> +void *read_thread_function(void *function_data)
>>>> +{
>>>> + struct read_thread_data *thread_data =
>>>> + (struct read_thread_data *)function_data;
>>>> + struct epoll_event event_data;
>>>> + struct epoll_item_private *item_data;
>>>> + char socket_data;
>>>> +
>>>> + /* Handle events until we encounter an error or this thread's
>>>> 'stop'
>>>> + condition is set: */
>>>> + while (1) {
>>>> + int result = epoll_wait(thread_data->epoll_set,
>>>> + &event_data,
>>>> + 1, /* Number of desired
>>>> events */
>>>> + 1000); /* Timeout in ms */
>>>> + if (result < 0) {
>>>> + /* Breakpoints signal all threads. Ignore that
>>>> while
>>>> + debugging: */
>>>> + if (errno == EINTR)
>>>> + continue;
>>>> + thread_data->status = errno;
>>>> + return 0;
>>>> + } else if (thread_data->stop)
>>>> + return 0;
>>>> + else if (result == 0) /* Timeout */
>>>> + continue;
>>>> +
>>>> + /* We need the mutex here because checking for the stop
>>>> + condition and re-enabling the epoll item need to be
>>>> done
>>>> + together as one atomic operation when
>>>> EPOLL_CTL_DISABLE is
>>>> + available: */
>>>> + item_data = (struct epoll_item_private
>>>> *)event_data.data.ptr;
>>>> + pthread_mutex_lock(&item_data->mutex);
>>>> +
>>>> + /* Remove the item from the epoll set if we want to stop
>>>> + handling that event: */
>>>> + if (item_data->stop)
>>>> + delete_item(item_data->index);
>>>> + else {
>>>> + /* Clear the data that was written to the other end of
>>>> + our non-blocking socket: */
>>>> + do {
>>>> + if (read(item_data->fd, &socket_data, 1) < 1) {
>>>> + if ((errno == EAGAIN) ||
>>>> + (errno == EWOULDBLOCK))
>>>> + break;
>>>> + else
>>>> + goto error_unlock;
>>>> + }
>>>> + } while (item_data->events & EPOLLET);
>>>> +
>>>> + /* The item was one-shot, so re-enable it: */
>>>> + event_data.events = item_data->events;
>>>> + if (epoll_ctl(thread_data->epoll_set,
>>>> + EPOLL_CTL_MOD,
>>>> + item_data->fd,
>>>> + &event_data) < 0)
>>>> + goto error_unlock;
>>>> + }
>>>> +
>>>> + pthread_mutex_unlock(&item_data->mutex);
>>>> + }
>>>> +
>>>> +error_unlock:
>>>> + thread_data->status = item_data->status = errno;
>>>> + pthread_mutex_unlock(&item_data->mutex);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * A pointer to a write_thread_data structure will be passed as the argument to
>>>> + * the write thread:
>>>> + */
>>>> +struct write_thread_data {
>>>> + int stop;
>>>> + int status; /* Indicates any error encountered by the write thread. */
>>>> + int n_fds;
>>>> + int *fds;
>>>> +};
>>>> +
>>>> +/*
>>>> + * The function executed by the write thread. It writes a single byte to each
>>>> + * socket in turn until the stop condition for this thread is set. If writing to
>>>> + * a socket would block (i.e. errno was EAGAIN), we leave that socket alone for
>>>> + * the moment and just move on to the next socket in the list. We don't care
>>>> + * about the order in which we deliver events to the epoll set. In fact we don't
>>>> + * care about the data we're writing to the pipes at all; we just want to
>>>> + * trigger epoll events:
>>>> + */
>>>> +void *write_thread_function(void *function_data)
>>>> +{
>>>> + const char data = 'X';
>>>> + int index;
>>>> + struct write_thread_data *thread_data =
>>>> + (struct write_thread_data *)function_data;
>>>> + while (!thread_data->stop)
>>>> + for (index = 0;
>>>> + !thread_data->stop && (index < thread_data->n_fds);
>>>> + ++index)
>>>> + if ((write(thread_data->fds[index], &data, 1) < 1) &&
>>>> + (errno != EAGAIN) &&
>>>> + (errno != EWOULDBLOCK)) {
>>>> + thread_data->status = errno;
>>>> + return NULL;
>>>> + }
>>>> + return NULL;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Arguments are currently ignored:
>>>> + */
>>>> +int main(int argc, char **argv)
>>>> +{
>>>> + const int n_read_threads = 100;
>>>> + const int n_epoll_items = 500;
>>>> + int index;
>>>> + int epoll_set = epoll_create1(0);
>>>> + struct write_thread_data write_thread_data = {
>>>> + 0, 0, n_epoll_items, malloc(n_epoll_items * sizeof(int))
>>>> + };
>>>> + struct read_thread_data *read_thread_data =
>>>> + malloc(n_read_threads * sizeof(struct read_thread_data));
>>>> + pthread_t *read_threads = malloc(n_read_threads * sizeof(pthread_t));
>>>> + pthread_t write_thread;
>>>> + int socket_pair[2];
>>>> + struct epoll_event event_data;
>>>> +
>>>> + printf("------------------\n");
>>>> + printf("Running test_epoll\n");
>>>> + printf("------------------\n");
>>>> +
>>>> + epoll_items = malloc(n_epoll_items * sizeof(struct epoll_item_private));
>>>> +
>>>> + if (epoll_set < 0 || epoll_items == 0 || write_thread_data.fds == 0 ||
>>>> + read_thread_data == 0 || read_threads == 0)
>>>> + goto error;
>>>> +
>>>> + if (sysconf(_SC_NPROCESSORS_ONLN) < 2) {
>>>> + printf("Error: please run this test on a multi-core system.\n");
>>>> + goto error;
>>>> + }
>>>> +
>>>> + /* Create the socket pairs and epoll items: */
>>>> + for (index = 0; index < n_epoll_items; ++index) {
>>>> + if (socketpair(AF_UNIX,
>>>> + SOCK_STREAM | SOCK_NONBLOCK,
>>>> + 0,
>>>> + socket_pair) < 0)
>>>> + goto error;
>>>> + write_thread_data.fds[index] = socket_pair[0];
>>>> + epoll_items[index].index = index;
>>>> + epoll_items[index].fd = socket_pair[1];
>>>> + if (pthread_mutex_init(&epoll_items[index].mutex, NULL) != 0)
>>>> + goto error;
>>>> + /* We always use EPOLLONESHOT because this test is currently
>>>> + structured to demonstrate the need for EPOLL_CTL_DISABLE,
>>>> + which only produces useful information in the EPOLLONESHOT
>>>> + case (without EPOLLONESHOT, calling epoll_ctl with
>>>> + EPOLL_CTL_DISABLE will never return EBUSY). If support for
>>>> + testing events without EPOLLONESHOT is desired, it should
>>>> + probably be implemented in a separate unit test. */
>>>> + epoll_items[index].events = EPOLLIN | EPOLLONESHOT;
>>>> + if (index < n_epoll_items / 2)
>>>> + epoll_items[index].events |= EPOLLET;
>>>> + epoll_items[index].stop = 0;
>>>> + epoll_items[index].status = 0;
>>>> + epoll_items[index].deleted = 0;
>>>> + event_data.events = epoll_items[index].events;
>>>> + event_data.data.ptr = &epoll_items[index];
>>>> + if (epoll_ctl(epoll_set,
>>>> + EPOLL_CTL_ADD,
>>>> + epoll_items[index].fd,
>>>> + &event_data) < 0)
>>>> + goto error;
>>>> + }
>>>> +
>>>> +#ifdef EPOLL_CTL_DISABLE
>>>> + /* Test to make sure that using EPOLL_CTL_DISABLE without EPOLLONESHOT
>>>> + returns a clear error: */
>>>> + if (socketpair(AF_UNIX,
>>>> + SOCK_STREAM | SOCK_NONBLOCK,
>>>> + 0,
>>>> + socket_pair) < 0)
>>>> + goto error;
>>>> + event_data.events = EPOLLIN;
>>>> + event_data.data.ptr = NULL;
>>>> + if (epoll_ctl(epoll_set, EPOLL_CTL_ADD,
>>>> + socket_pair[1], &event_data) < 0)
>>>> + goto error;
>>>> + if ((epoll_ctl(epoll_set, EPOLL_CTL_DISABLE,
>>>> + socket_pair[1], NULL) == 0) || (errno != EINVAL))
>>>> + goto error;
>>>> + if (epoll_ctl(epoll_set, EPOLL_CTL_DEL, socket_pair[1], NULL) != 0)
>>>> + goto error;
>>>> +#endif
>>>> +
>>>> + /* Create and start the read threads: */
>>>> + for (index = 0; index < n_read_threads; ++index) {
>>>> + read_thread_data[index].stop = 0;
>>>> + read_thread_data[index].status = 0;
>>>> + read_thread_data[index].epoll_set = epoll_set;
>>>> + if (pthread_create(&read_threads[index],
>>>> + NULL,
>>>> + read_thread_function,
>>>> + &read_thread_data[index]) != 0)
>>>> + goto error;
>>>> + }
>>>> +
>>>> + if (pthread_create(&write_thread,
>>>> + NULL,
>>>> + write_thread_function,
>>>> + &write_thread_data) != 0)
>>>> + goto error;
>>>> +
>>>> + /* Cancel all event pollers: */
>>>> +#ifdef EPOLL_CTL_DISABLE
>>>> + for (index = 0; index < n_epoll_items; ++index) {
>>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>>> + ++epoll_items[index].stop;
>>>> + if (epoll_ctl(epoll_set,
>>>> + EPOLL_CTL_DISABLE,
>>>> + epoll_items[index].fd,
>>>> + NULL) == 0)
>>>> + delete_item(index);
>>>> + else if (errno != EBUSY) {
>>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>>> + goto error;
>>>> + }
>>>> + /* EBUSY means events were being handled; allow the other thread
>>>> + to delete the item. */
>>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>>> + }
>>>> +#else
>>>> + for (index = 0; index < n_epoll_items; ++index) {
>>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>>> + ++epoll_items[index].stop;
>>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>>> + /* Wait in case a thread running read_thread_function is
>>>> + currently executing code between epoll_wait and
>>>> + pthread_mutex_lock with this item. Note that a longer delay
>>>> + would make double-deletion less likely (at the expense of
>>>> + performance), but there is no guarantee that any delay would
>>>> + ever be sufficient. Note also that we delete all event
>>>> + pollers at once for testing purposes, but in a real-world
>>>> + environment we are likely to want to be able to cancel event
>>>> + pollers at arbitrary times. Therefore we can't improve this
>>>> + situation by just splitting this loop into two loops
>>>> + (i.e. signal 'stop' for all items, sleep, and then delete all
>>>> + items). We also can't fix the problem via EPOLL_CTL_DEL
>>>> + because that command can't prevent the case where some other
>>>> + thread is executing read_thread_function within the region
>>>> + mentioned above: */
>>>> + usleep(1);
>>>> + pthread_mutex_lock(&epoll_items[index].mutex);
>>>> + if (!epoll_items[index].deleted)
>>>> + delete_item(index);
>>>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>>>> + }
>>>> +#endif
>>>> +
>>>> + /* Shut down the read threads: */
>>>> + for (index = 0; index < n_read_threads; ++index)
>>>> + __sync_fetch_and_add(&read_thread_data[index].stop, 1);
>>>> + for (index = 0; index < n_read_threads; ++index) {
>>>> + if (pthread_join(read_threads[index], NULL) != 0)
>>>> + goto error;
>>>> + if (read_thread_data[index].status)
>>>> + goto error;
>>>> + }
>>>> +
>>>> + /* Shut down the write thread: */
>>>> + __sync_fetch_and_add(&write_thread_data.stop, 1);
>>>> + if ((pthread_join(write_thread, NULL) != 0) || write_thread_data.status)
>>>> + goto error;
>>>> +
>>>> + /* Check for final error conditions: */
>>>> + for (index = 0; index < n_epoll_items; ++index) {
>>>> + if (epoll_items[index].status != 0)
>>>> + goto error;
>>>> + if (pthread_mutex_destroy(&epoll_items[index].mutex) < 0)
>>>> + goto error;
>>>> + }
>>>> + for (index = 0; index < n_epoll_items; ++index)
>>>> + if (epoll_items[index].deleted != 1) {
>>>> + printf("Error: item data deleted %1d times.\n",
>>>> + epoll_items[index].deleted);
>>>> + goto error;
>>>> + }
>>>> +
>>>> + printf("[PASS]\n");
>>>> + return 0;
>>>> +
>>>> + error:
>>>> + printf("[FAIL]\n");
>>>> + return errno;
>>>> +}

2012-11-02 02:27:05

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test app.

On 11/02/2012 02:47 AM, Paton J. Lewis wrote:
> On 10/31/12 5:43 PM, Michael Wang wrote:
>> On 11/01/2012 02:57 AM, Paton J. Lewis wrote:
>>> On 10/30/12 11:32 PM, Michael Wang wrote:
>>>> On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
>>>>> From: "Paton J. Lewis" <[email protected]>
>>>>>
>>>>> It is not currently possible to reliably delete epoll items when using the
>>>>> same epoll set from multiple threads. After calling epoll_ctl with
>>>>> EPOLL_CTL_DEL, another thread might still be executing code related to an
>>>>> event for that epoll item (in response to epoll_wait). Therefore the
>>>>> deleting thread does not know when it is safe to delete resources
>>>>> pertaining to the associated epoll item because another thread might be
>>>>> using those resources.
>>>>>
>>>>> The deleting thread could wait an arbitrary amount of time after calling
>>>>> epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
>>>>> inefficient and could result in the destruction of resources before
>>>>> another thread is done handling an event returned by epoll_wait.
>>>>>
>>>>> This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which
>>>>> disables an epoll item. If epoll_ctl returns -EBUSY in this case, then
>>>>> another thread may be handling a return from epoll_wait for this item.
>>>>> Otherwise if epoll_ctl returns 0, then it is safe to delete the epoll
>>>>> item. This allows multiple threads to use a mutex to determine when it is
>>>>> safe to delete an epoll item and its associated resources, which allows
>>>>> epoll items to be deleted both efficiently and without error in a
>>>>> multi-threaded environment. Note that EPOLL_CTL_DISABLE is only useful in
>>>>> conjunction with EPOLLONESHOT, and using EPOLL_CTL_DISABLE on an epoll
>>>>> item without EPOLLONESHOT returns -EINVAL.
>>>>>
>>>>> This patch also adds a new test_epoll self-test program to both
>>>>> demonstrate the need for this feature and test it.
>>>>
>>>> Hi, Paton
>>>>
>>>> I'm just thinking that maybe we could use this approach.
>>>>
>>>> Seems like currently we are depending on the epoll_ctl() to indicate the
>>>> start point of safe section and epoll_wait() for the end point, like:
>>>>
>>>> while () {
>>>> epoll_wait() --------------
>>>>
>>>> fd event arrived safe section
>>>>
>>>> clear fd epi->event.events
>>>> --------------
>>>> if (fd need stop)
>>>> continue;
>>>> --------------
>>>> ...fd data process...
>>>>
>>>> epoll_ctl(MOD) danger section
>>>>
>>>> set fd epi->event.events --------------
>>>>
>>>> continue;
>>>> }
>>>>
>>>> So we get a safe section, and doing the delete work in this section won't
>>>> cause trouble since we have a stop check directly after it.
>>>>
>>>> Actually what we want is to make sure no one will touch the fd any more
>>>> after we DISABLE it.
>>>>
>>>> Then what if we add a ref count and a stop flag in epi, and maintain them
>>>> like:
>>>>
>>>> epoll_wait()
>>>>
>>>> check user events and
>>>> dec the ref count of fd ---------------------------
>>>>
>>>> ...
>>>>
>>>> fd event arrived safe sec if ref count is 0
>>>>
>>>> if epi stop flag set
>>>> do nothing
>>>> else
>>>> inc epi ref count ---------------------------
>>>
>>> The pseudocode you provide below (for "DISABLE") seems to indicate that
>>> this "epi ref count" must be maintained by the kernel. Therefore any
>>> userspace modification of a ref count associated with an epoll item will
>>> require a new or changed kernel API.
>>>
>>>> send event
>>>>
>>>> And what DISABLE do is:
>>>>
>>>> set epi stop flag
>>>>
>>>> if epi ref count is not 0
>>>> wait until ref count be 0
>>>
>>> Perhaps I don't fully understand what you're proposing, but I don't
>>> think it's reasonable for a kernel API (epoll_ctl in this case) to block
>>> while waiting for a userspace action (decrementing the ref count) that
>>> might never occur.
>>>
>>> Andrew Morton also proposed using ref counting in response to my initial
>>> patch submission; my reply to his proposal might also be applicable to
>>> your proposal. A link to that discussion thread:
>>> http://thread.gmane.org/gmane.linux.kernel/1311457/focus=1315096
>>>
>>> Sorry if I am misunderstanding your proposal, but I don't see how it
>>> solves the original problem.
>>
>> I was just trying to find out whether we could use DISABLE without ONESHOT :)
>>
>> My currently understanding is:
>>
>> 1. we actually want to determine the part between each epoll_wait() in a
>> while().
>>
>> 2. we can't count on epoll_wait() itself, since no info is passed to the kernel
>> to indicate whether it was invoked after another epoll_wait() in the
>> same while().
>>
>> 3. so we need epoll_ctl(MOD) to tell the kernel that the user has finished
>> processing data after epoll_wait(), and which epi that data belongs to.
>>
>> 4. since 3 we need ONESHOT to be enabled.
>>
>>
>> Is this the reason we rely on ONESHOT to be enabled?
>
> No, we need to use EPOLLONESHOT to ensure that only one worker thread at
> a time can ever be handling private data associated with a given fd.
> This constraint allows a deletion thread to coordinate with that worker
> thread via a mutex to cleanly and quickly delete the associated private
> data (provided that the deletion thread knows whether or not such a
> worker thread exists in that state, which is the question that
> EPOLL_CTL_DISABLE answers).

So using ONESHOT is a requirement of the current epoll coding design, and
DISABLE tries to solve an issue with that design?

I don't have as much experience using epoll as you guys :), so allow me
to ask one question:

Would it be very useful if we could use DISABLE without the limitation
on ONESHOT?

Which means that whatever the epoll item is, if DISABLE returns 0, we are
safe to erase that fd.

Regards,
Michael Wang

2012-11-02 04:21:33

by Eric Wong

[permalink] [raw]
Subject: [RFC/PATCH] epoll: replace EPOLL_CTL_DISABLE with EPOLL_CTL_POKE

"Paton J. Lewis" <[email protected]> wrote:
> --- /dev/null
> +++ b/tools/testing/selftests/epoll/test_epoll.c

> + /* Handle events until we encounter an error or this thread's 'stop'
> + condition is set: */
> + while (1) {
> + int result = epoll_wait(thread_data->epoll_set,
> + &event_data,
> + 1, /* Number of desired events */
> + 1000); /* Timeout in ms */

<snip>

> + /* We need the mutex here because checking for the stop
> + condition and re-enabling the epoll item need to be done
> + together as one atomic operation when EPOLL_CTL_DISABLE is
> + available: */
> + item_data = (struct epoll_item_private *)event_data.data.ptr;
> + pthread_mutex_lock(&item_data->mutex);

The per-item mutex bothers me. Using EPOLLONESHOT normally frees me
from the need to worry about concurrent access to the item in userspace.

Instead of having another thread call delete_item() on a successful
EPOLL_CTL_DISABLE, I'd normally use shutdown() (which causes
epoll_wait() to return the item and my normal error handling will kick
in once I try a read()/write()).

However, since shutdown() is limited to sockets and is irreversible,
I came up with EPOLL_CTL_POKE instead. Comments greatly appreciated
(I'm not a regular kernel hacker, either)


From 12a2d7c4584605dd763c7510140666d2a6b51d89 Mon Sep 17 00:00:00 2001
From: Eric Wong <[email protected]>
Date: Fri, 2 Nov 2012 03:47:08 +0000
Subject: [PATCH] epoll: replace EPOLL_CTL_DISABLE with EPOLL_CTL_POKE

EPOLL_CTL_POKE may be used to force an item into the epoll
ready list. Instead of disabling an item asynchronously
via EPOLL_CTL_DISABLE, this forces the threads calling
epoll_wait() to handle the item in its normal loop.

EPOLL_CTL_POKE has the advantage of _not_ requiring per-item
locking when used in conjunction with EPOLLONESHOT.
Additionally, EPOLL_CTL_POKE does not require EPOLLONESHOT to
function (though its usefulness in single-threaded or
oneshot-free scenarios is limited).

The modified epoll test demonstrates the lack of per-item
locking needed to accomplish safe deletion of items.

In a normal application, I use shutdown() for a similar effect
as calling EPOLL_CTL_POKE in a different thread. However,
shutdown() has some limitations:
a) it only works on sockets
b) irreversibly affects the socket

Signed-off-by: Eric Wong <[email protected]>
---
fs/eventpoll.c | 57 +++++++++++++++++++---------
include/uapi/linux/eventpoll.h | 2 +-
tools/testing/selftests/epoll/test_epoll.c | 36 +++++++++++++-----
3 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index da72250..1eea950 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,9 @@
#include <asm/mman.h>
#include <linux/atomic.h>

+/* not currently exposed to user space, but maybe this should be */
+#define EPOLLPOKED (EPOLLWAKEUP >> 1)
+
/*
* LOCKING:
* There are three level of locking required by epoll :
@@ -677,30 +680,41 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
}

/*
- * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY if the item
- * had no event flags set, indicating that another thread may be currently
- * handling that item's events (in the case that EPOLLONESHOT was being
- * used). Otherwise a zero result indicates that the item has been disabled
- * from receiving events. A disabled item may be re-enabled via
- * EPOLL_CTL_MOD. Must be called with "mtx" held.
+ * Inserts "struct epitem" of the eventpoll set into the ready list
+ * if it is not already on the ready list. Returns zero on success,
+ * returns -EBUSY if the item was already in the ready list. This
+ * poke action is always a one-time event and behaves like an
+ * edge-triggered wakeup.
*/
-static int ep_disable(struct eventpoll *ep, struct epitem *epi)
+static int ep_poke(struct eventpoll *ep, struct epitem *epi)
{
int result = 0;
unsigned long flags;
+ int pwake = 0;

spin_lock_irqsave(&ep->lock, flags);
- if (epi->event.events & ~EP_PRIVATE_BITS) {
- if (ep_is_linked(&epi->rdllink))
- list_del_init(&epi->rdllink);
- /* Ensure ep_poll_callback will not add epi back onto ready
- list: */
- epi->event.events &= EP_PRIVATE_BITS;
- }
- else
+
+ if (ep_is_linked(&epi->rdllink)) {
result = -EBUSY;
+ } else {
+ epi->event.events |= EPOLLPOKED;
+
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ __pm_stay_awake(epi->ws);
+
+ /* Notify waiting tasks that events are available */
+ if (waitqueue_active(&ep->wq))
+ wake_up_locked(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ pwake++;
+ }
+
spin_unlock_irqrestore(&ep->lock, flags);

+ /* We have to call this outside the lock */
+ if (pwake)
+ ep_poll_safewake(&ep->poll_wait);
+
return result;
}

@@ -768,6 +782,8 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,

init_poll_funcptr(&pt, NULL);
list_for_each_entry_safe(epi, tmp, head, rdllink) {
+ if (epi->event.events & EPOLLPOKED)
+ return POLLIN | POLLRDNORM;
pt._key = epi->event.events;
if (epi->ffd.file->f_op->poll(epi->ffd.file, &pt) &
epi->event.events)
@@ -1398,7 +1414,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
* is holding "mtx", so no operations coming from userspace
* can change the item.
*/
- if (revents) {
+ if (revents || (epi->event.events & EPOLLPOKED)) {
if (__put_user(revents, &uevent->events) ||
__put_user(epi->event.data, &uevent->data)) {
list_add(&epi->rdllink, head);
@@ -1407,6 +1423,11 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
}
eventcnt++;
uevent++;
+
+ /* each poke is a one-time event */
+ if (epi->event.events & EPOLLPOKED)
+ epi->event.events &= ~EPOLLPOKED;
+
if (epi->event.events & EPOLLONESHOT)
epi->event.events &= EP_PRIVATE_BITS;
else if (!(epi->event.events & EPOLLET)) {
@@ -1813,9 +1834,9 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
} else
error = -ENOENT;
break;
- case EPOLL_CTL_DISABLE:
+ case EPOLL_CTL_POKE:
if (epi)
- error = ep_disable(ep, epi);
+ error = ep_poke(ep, epi);
else
error = -ENOENT;
break;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8c99ce7..46292bb 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -25,7 +25,7 @@
#define EPOLL_CTL_ADD 1
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3
-#define EPOLL_CTL_DISABLE 4
+#define EPOLL_CTL_POKE 4

/*
* Request the handling of system wakeup events so as to prevent system suspends
diff --git a/tools/testing/selftests/epoll/test_epoll.c b/tools/testing/selftests/epoll/test_epoll.c
index f752539..537b53e 100644
--- a/tools/testing/selftests/epoll/test_epoll.c
+++ b/tools/testing/selftests/epoll/test_epoll.c
@@ -12,6 +12,7 @@
*
*/

+#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
@@ -99,11 +100,21 @@ void *read_thread_function(void *function_data)
together as one atomic operation when EPOLL_CTL_DISABLE is
available: */
item_data = (struct epoll_item_private *)event_data.data.ptr;
+#ifdef EPOLL_CTL_POKE
+ assert(event_data.events == 0 && "poke sets no events");
+#else
pthread_mutex_lock(&item_data->mutex);
+#endif

/* Remove the item from the epoll set if we want to stop
handling that event: */
- if (item_data->stop)
+ /*
+ * XXX
+ * Using __sync_add_and_fetch(&var, 0) should allow us
+ * to safely read without holding a mutex. Right?
+ * There's no __sync_fetch() in gcc...
+ */
+ if (__sync_add_and_fetch(&item_data->stop, 0))
delete_item(item_data->index);
else {
/* Clear the data that was written to the other end of
@@ -127,12 +138,16 @@ void *read_thread_function(void *function_data)
goto error_unlock;
}

+#ifndef EPOLL_CTL_POKE
pthread_mutex_unlock(&item_data->mutex);
+#endif
}

error_unlock:
thread_data->status = item_data->status = errno;
+#ifndef EPOLL_CTL_POKE
pthread_mutex_unlock(&item_data->mutex);
+#endif
return 0;
}

@@ -261,22 +276,23 @@ int main(int argc, char **argv)
goto error;

/* Cancel all event pollers: */
-#ifdef EPOLL_CTL_DISABLE
+#ifdef EPOLL_CTL_POKE
for (index = 0; index < n_epoll_items; ++index) {
- pthread_mutex_lock(&epoll_items[index].mutex);
- ++epoll_items[index].stop;
+ __sync_add_and_fetch(&epoll_items[index].stop, 1);
if (epoll_ctl(epoll_set,
- EPOLL_CTL_DISABLE,
+ EPOLL_CTL_POKE,
epoll_items[index].fd,
- NULL) == 0)
- delete_item(index);
- else if (errno != EBUSY) {
- pthread_mutex_unlock(&epoll_items[index].mutex);
+ NULL) == 0) {
+ /*
+ * We do NOT delete the item in this thread,
+ * the thread calling epoll_wait will do that
+ * for us.
+ */
+ } else if (errno != EBUSY) {
goto error;
}
/* EBUSY means events were being handled; allow the other thread
to delete the item. */
- pthread_mutex_unlock(&epoll_items[index].mutex);
}
#else
for (index = 0; index < n_epoll_items; ++index) {
--
Eric Wong

2012-11-03 01:17:28

by Christof Meerwald

[permalink] [raw]
Subject: Re: [RFC/PATCH] epoll: replace EPOLL_CTL_DISABLE with EPOLL_CTL_POKE

On Fri, 2 Nov 2012 04:13:12 +0000, Eric Wong wrote:
[...]
> EPOLL_CTL_POKE may be used to force an item into the epoll
> ready list. Instead of disabling an item asynchronously
> via EPOLL_CTL_DISABLE, this forces the threads calling
> epoll_wait() to handle the item in its normal loop.

That was my initial suggestion as well - see
https://lkml.org/lkml/2012/6/19/358


Christof

--

http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org

2012-11-06 21:58:56

by Eric Wong

[permalink] [raw]
Subject: Re: [RFC/PATCH] epoll: replace EPOLL_CTL_DISABLE with EPOLL_CTL_POKE

Christof Meerwald <[email protected]> wrote:
> On Fri, 2 Nov 2012 04:13:12 +0000, Eric Wong wrote:
> [...]
> > EPOLL_CTL_POKE may be used to force an item into the epoll
> > ready list. Instead of disabling an item asynchronously
> > via EPOLL_CTL_DISABLE, this forces the threads calling
> > epoll_wait() to handle the item in its normal loop.
>
> That was my initial suggestion as well - see
> https://lkml.org/lkml/2012/6/19/358

Thanks, missed that the first time around. Perhaps TRIGGER is a
better name than POKE :x

I'm hoping Paton has a chance to look at my implementation and
comment on it.