2013-10-01 17:09:07

by Jason Baron

Subject: [PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention

Hi,

Updated series, using a 'union' for the rcu callback.

Nathan Zimmer found that once we get beyond 10 cpus, the scalability of
SPECjbb falls over due to contention on the global 'epmutex', which
is taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by
using rcu to guard against any concurrent traversals.

Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for simple
topologies, i.e. when adding a link from an epoll file descriptor to a wakeup
source, where the epoll file descriptor is not nested.
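
As a purely illustrative aside (not part of the patches, error handling
omitted), the two add shapes look like this from userspace; only the second,
nested form can create loops or deep wakeup paths and thus still needs the
heavyweight checks:

#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
        int ep = epoll_create1(0);
        int inner = epoll_create1(0);   /* a second epoll instance */
        int efd = eventfd(0, 0);        /* a plain wakeup source */
        struct epoll_event ev = { .events = EPOLLIN };

        /* simple topology: ordinary fd added directly to an epoll set */
        ev.data.fd = efd;
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        /* nested topology: one epoll fd added inside another */
        ev.data.fd = inner;
        epoll_ctl(ep, EPOLL_CTL_ADD, inner, &ev);

        return 0;
}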

Performance of SPECjbb improves considerably. From Nathan Zimmer's testing
(thread: http://marc.info/?l=linux-kernel&m=137908766013329&w=2):

"
On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
In addition the benchmark went from scaling well on 10 sockets to scaling well
on just over 40 sockets.

I should also note that system responsiveness of various commands was quite
improved when under the full load of the benchmark.
...


Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck.
"

Thanks,

-Jason

Changes in v2:
-use a union for rcu callback and drop patch #3

v1:
http://marc.info/?l=linux-kernel&m=138057995316598&w=2

Jason Baron (2):
epoll: optimize EPOLL_CTL_DEL using rcu
epoll: Do not take global 'epmutex' for simple topologies

fs/eventpoll.c | 153 +++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 106 insertions(+), 47 deletions(-)

--
1.8.2


2013-10-01 17:08:20

by Jason Baron

Subject: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested. The purpose of taking
the 'epmutex' on add is to prevent complex topologies such as loops and
deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations. However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we do not need
to hold the 'epmutex'.
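
That condition can be read as a predicate (a hypothetical helper shown for
illustration only, assuming the fs/eventpoll.c context; the patch below
open-codes the test directly in epoll_ctl()):

/*
 * Sketch: does this EPOLL_CTL_ADD need the 'epmutex'-protected
 * loop/reverse-path checks?  'file' is the epoll file itself,
 * 'tfile' is the target file being added to it.
 */
static bool ep_add_needs_full_check(struct file *file, struct file *tfile)
{
        /*
         * Slow path only if the epoll fd is itself watched by other
         * epoll sets, or the target is itself an epoll fd -- the cases
         * that can form loops or deep wakeup paths.
         */
        return !list_empty(&file->f_ep_links) || is_file_epoll(tfile);
}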

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
performance:

"
On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
In addition the benchmark went from scaling well on 10 sockets to scaling well
on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck.
"

Tested-by: Nathan Zimmer <[email protected]>
Signed-off-by: Jason Baron <[email protected]>
---
fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 68 insertions(+), 26 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index dd9fae1..0f25162 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
static int ep_scan_ready_list(struct eventpoll *ep,
int (*sproc)(struct eventpoll *,
struct list_head *, void *),
- void *priv,
- int depth)
+ void *priv, int depth, int ep_locked)
{
int error, pwake = 0;
unsigned long flags;
@@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep,
* We need to lock this because we could be hit by
* eventpoll_release_file() and epoll_ctl().
*/
- mutex_lock_nested(&ep->mtx, depth);
+
+ if (!ep_locked)
+ mutex_lock_nested(&ep->mtx, depth);

/*
* Steal the ready list, and re-init the original one to the
@@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep,
}
spin_unlock_irqrestore(&ep->lock, flags);

- mutex_unlock(&ep->mtx);
+ if (!ep_locked)
+ mutex_unlock(&ep->mtx);

/* We have to call this outside the lock */
if (pwake)
@@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
return 0;
}

+static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
+ poll_table *pt);
+
+struct readyevents_arg {
+ struct eventpoll *ep;
+ int locked;
+};
+
static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
{
- return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1);
+ struct readyevents_arg *arg = (struct readyevents_arg *)priv;
+
+ return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL,
+ call_nests + 1, arg->locked);
}

static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
{
int pollflags;
struct eventpoll *ep = file->private_data;
+ struct readyevents_arg arg;
+
+ /*
+ * During ep_insert() we already hold the ep->mtx for the tfile.
+ * Prevent re-aquisition.
+ */
+ arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0);
+ arg.ep = ep;

/* Insert inside our poll wait queue */
poll_wait(file, &ep->poll_wait, wait);
@@ -849,7 +870,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
* could re-enter here.
*/
pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS,
- ep_poll_readyevents_proc, ep, ep, current);
+ ep_poll_readyevents_proc, &arg, ep, current);

return pollflags != -1 ? pollflags : 0;
}
@@ -1250,7 +1271,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
* Must be called with "mtx" held.
*/
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
- struct file *tfile, int fd)
+ struct file *tfile, int fd, int full_check)
{
int error, revents, pwake = 0;
unsigned long flags;
@@ -1318,7 +1339,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,

/* now check if we've created too many backpaths */
error = -EINVAL;
- if (reverse_path_check())
+ if (full_check && reverse_path_check())
goto error_remove_epi;

/* We have to drop the new item inside our item list to keep track of it */
@@ -1542,7 +1563,7 @@ static int ep_send_events(struct eventpoll *ep,
esed.maxevents = maxevents;
esed.events = events;

- return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0);
+ return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0);
}

static inline struct timespec ep_set_mstimeout(long ms)
@@ -1813,11 +1834,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
struct epoll_event __user *, event)
{
int error;
- int did_lock_epmutex = 0;
+ int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
struct epoll_event epds;
+ struct eventpoll *tep = NULL;

error = -EFAULT;
if (ep_op_has_event(op) &&
@@ -1866,23 +1888,40 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
* and hang them on the tfile_check_list, so we can check that we
* haven't created too many possible wakeup paths.
*
- * We need to hold the epmutex across ep_insert to prevent
- * multple adds from creating loops in parallel.
+ * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
+ * the epoll file descriptor is attaching directly to a wakeup source,
+ * unless the epoll file descriptor is nested. The purpose of taking the
+ * 'epmutex' on add is to prevent complex toplogies such as loops and
+ * deep wakeup paths from forming in parallel through multiple
+ * EPOLL_CTL_ADD operations.
*/
+ mutex_lock_nested(&ep->mtx, 0);
if (op == EPOLL_CTL_ADD) {
- mutex_lock(&epmutex);
- did_lock_epmutex = 1;
- if (is_file_epoll(tf.file)) {
- error = -ELOOP;
- if (ep_loop_check(ep, tf.file) != 0) {
- clear_tfile_check_list();
- goto error_tgt_fput;
+ if (!list_empty(&f.file->f_ep_links) ||
+ is_file_epoll(tf.file)) {
+ full_check = 1;
+ mutex_unlock(&ep->mtx);
+ mutex_lock(&epmutex);
+ if (is_file_epoll(tf.file)) {
+ error = -ELOOP;
+ if (ep_loop_check(ep, tf.file) != 0) {
+ clear_tfile_check_list();
+ goto error_tgt_fput;
+ }
+ } else
+ list_add(&tf.file->f_tfile_llink,
+ &tfile_check_list);
+ mutex_lock_nested(&ep->mtx, 0);
+ if (is_file_epoll(tf.file)) {
+ tep = tf.file->private_data;
+ mutex_lock_nested(&tep->mtx, 1);
}
- } else
- list_add(&tf.file->f_tfile_llink, &tfile_check_list);
+ }
+ }
+ if (op == EPOLL_CTL_DEL && is_file_epoll(tf.file)) {
+ tep = tf.file->private_data;
+ mutex_lock_nested(&tep->mtx, 1);
}
-
- mutex_lock_nested(&ep->mtx, 0);

/*
* Try to lookup the file inside our RB tree, Since we grabbed "mtx"
@@ -1896,10 +1935,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
case EPOLL_CTL_ADD:
if (!epi) {
epds.events |= POLLERR | POLLHUP;
- error = ep_insert(ep, &epds, tf.file, fd);
+ error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
error = -EEXIST;
- clear_tfile_check_list();
+ if (full_check)
+ clear_tfile_check_list();
break;
case EPOLL_CTL_DEL:
if (epi)
@@ -1915,10 +1955,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
error = -ENOENT;
break;
}
+ if (tep != NULL)
+ mutex_unlock(&tep->mtx);
mutex_unlock(&ep->mtx);

error_tgt_fput:
- if (did_lock_epmutex)
+ if (full_check)
mutex_unlock(&epmutex);

fdput(tf);
--
1.8.2

2013-10-01 17:08:15

by Jason Baron

Subject: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu

Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one. In this way, we can
traverse the epoll network on the add path in parallel with deletes. Since
deletes can't create loops or worse wakeup paths, this is safe.
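
For reference, a minimal sketch of the RCU-list pattern being applied to
file->f_ep_links (names below are placeholders, not the eventpoll code; in
the patch the writer side is serialized by file->f_lock and freeing goes
through epi_rcu_free()):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct item {
        struct list_head link;
        struct rcu_head rcu;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);

static void item_rcu_free(struct rcu_head *head)
{
        kfree(container_of(head, struct item, rcu));
}

static void item_add(struct item *it)
{
        spin_lock(&item_lock);                  /* writers still serialize */
        list_add_tail_rcu(&it->link, &item_list);
        spin_unlock(&item_lock);
}

static void item_del(struct item *it)
{
        spin_lock(&item_lock);
        list_del_rcu(&it->link);
        spin_unlock(&item_lock);
        call_rcu(&it->rcu, item_rcu_free);      /* free after a grace period */
}

static void item_scan(void (*fn)(struct item *))
{
        struct item *it;

        rcu_read_lock();        /* readers may run concurrently with item_del() */
        list_for_each_entry_rcu(it, &item_list, link)
                fn(it);
        rcu_read_unlock();
}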

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Tested-by: Nathan Zimmer <[email protected]>
Signed-off-by: Jason Baron <[email protected]>
---
fs/eventpoll.c | 65 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 473e09d..dd9fae1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/compat.h>
+#include <linux/rculist.h>

/*
* LOCKING:
@@ -134,8 +135,12 @@ struct nested_calls {
* of these on a server and we do not want this to take another cache line.
*/
struct epitem {
- /* RB tree node used to link this structure to the eventpoll RB tree */
- struct rb_node rbn;
+ union {
+ /* RB tree node links this structure to the eventpoll RB tree */
+ struct rb_node rbn;
+ /* Used to free the struct epitem */
+ struct rcu_head rcu;
+ };

/* List header used to link this structure to the eventpoll ready list */
struct list_head rdllink;
@@ -166,6 +171,9 @@ struct epitem {

/* The structure that describe the interested events and the source fd */
struct epoll_event event;
+
+ /* The fllink is in use. Since rcu can't do 'list_del_init()' */
+ int on_list;
};

/*
@@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eventpoll *ep,
return error;
}

+static void epi_rcu_free(struct rcu_head *head)
+{
+ struct epitem *epi = container_of(head, struct epitem, rcu);
+ kmem_cache_free(epi_cache, epi);
+}
+
/*
* Removes a "struct epitem" from the eventpoll RB tree and deallocates
* all the associated resources. Must be called with "mtx" held.
@@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)

/* Remove the current item from the list of epoll hooks */
spin_lock(&file->f_lock);
- if (ep_is_linked(&epi->fllink))
- list_del_init(&epi->fllink);
+ if (epi->on_list) {
+ list_del_rcu(&epi->fllink);
+ epi->on_list = 0;
+ }
spin_unlock(&file->f_lock);

rb_erase(&epi->rbn, &ep->rbr);
@@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
spin_unlock_irqrestore(&ep->lock, flags);

wakeup_source_unregister(ep_wakeup_source(epi));
-
- /* At this point it is safe to free the eventpoll item */
- kmem_cache_free(epi_cache, epi);
+ /*
+ * At this point it is safe to free the eventpoll item. Use the union
+ * field epi->rcu, since we are trying to minimize the size of
+ * 'struct epitem'. The 'rbn' field is no longer in use. Protected by
+ * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
+ * use of the rbn field.
+ */
+ call_rcu(&epi->rcu, epi_rcu_free);

atomic_long_dec(&ep->user->epoll_watches);

@@ -873,7 +894,6 @@ static const struct file_operations eventpoll_fops = {
*/
void eventpoll_release_file(struct file *file)
{
- struct list_head *lsthead = &file->f_ep_links;
struct eventpoll *ep;
struct epitem *epi;

@@ -891,17 +911,12 @@ void eventpoll_release_file(struct file *file)
* Besides, ep_remove() acquires the lock, so we can't hold it here.
*/
mutex_lock(&epmutex);
-
- while (!list_empty(lsthead)) {
- epi = list_first_entry(lsthead, struct epitem, fllink);
-
+ list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
ep = epi->ep;
- list_del_init(&epi->fllink);
mutex_lock_nested(&ep->mtx, 0);
ep_remove(ep, epi);
mutex_unlock(&ep->mtx);
}
-
mutex_unlock(&epmutex);
}

@@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
struct file *child_file;
struct epitem *epi;

- list_for_each_entry(epi, &file->f_ep_links, fllink) {
+ /* CTL_DEL can remove links here, but that can't increase our count */
+ rcu_read_lock();
+ list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
child_file = epi->ep->file;
if (is_file_epoll(child_file)) {
if (list_empty(&child_file->f_ep_links)) {
@@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
"file is not an ep!\n");
}
}
+ rcu_read_unlock();
return error;
}

@@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
+ epi->on_list = 0;
if (epi->event.events & EPOLLWAKEUP) {
error = ep_create_wakeup_source(epi);
if (error)
@@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,

/* Add the current item to the list of active epoll hook for this file */
spin_lock(&tfile->f_lock);
- list_add_tail(&epi->fllink, &tfile->f_ep_links);
+ list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
+ epi->on_list = 1;
spin_unlock(&tfile->f_lock);

/*
@@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,

error_remove_epi:
spin_lock(&tfile->f_lock);
- if (ep_is_linked(&epi->fllink))
- list_del_init(&epi->fllink);
+ if (epi->on_list)
+ list_del_rcu(&epi->fllink);
spin_unlock(&tfile->f_lock);

rb_erase(&epi->rbn, &ep->rbr);
@@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
* and hang them on the tfile_check_list, so we can check that we
* haven't created too many possible wakeup paths.
*
- * We need to hold the epmutex across both ep_insert and ep_remove
- * b/c we want to make sure we are looking at a coherent view of
- * epoll network.
+ * We need to hold the epmutex across ep_insert to prevent
+ * multple adds from creating loops in parallel.
*/
- if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
+ if (op == EPOLL_CTL_ADD) {
mutex_lock(&epmutex);
did_lock_epmutex = 1;
- }
- if (op == EPOLL_CTL_ADD) {
if (is_file_epoll(tf.file)) {
error = -ELOOP;
if (ep_loop_check(ep, tf.file) != 0) {
--
1.8.2

2013-10-03 21:50:57

by Andrew Morton

Subject: Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <[email protected]> wrote:

> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
> directly to a wakeup source, we do not need to take the global 'epmutex',
> unless the epoll file descriptor is nested. The purpose of taking
> the 'epmutex' on add is to prevent complex topologies such as loops and
> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
> operations. However, for the simple case of an epoll file descriptor
> attached directly to a wakeup source (with no nesting), we do not need
> to hold the 'epmutex'.
>
> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
> performance:
>
> "
> On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
> In addition the benchmark when from scaling well on 10 sockets to scaling well
> on just over 40 sockets.
>
> ...
>
> Currently the benchmark stops scaling at around 40-44 sockets but it seems like
> I found a second unrelated bottleneck.
> "

I couldn't resist fiddling. Please review:

From: Andrew Morton <[email protected]>
Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix

- use `bool' for boolean variables
- remove unneeded/undesirable cast of void*
- add missed ep_scan_ready_list() kerneldoc

Cc: "Paul E. McKenney" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Eric Wong <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Nathan Zimmer <[email protected]>
Cc: Nelson Elhage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/eventpoll.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff -puN fs/eventpoll.c~epoll-do-not-take-global-epmutex-for-simple-topologies-fix fs/eventpoll.c
--- a/fs/eventpoll.c~epoll-do-not-take-global-epmutex-for-simple-topologies-fix
+++ a/fs/eventpoll.c
@@ -589,13 +589,14 @@ static inline void ep_pm_stay_awake_rcu(
* @sproc: Pointer to the scan callback.
* @priv: Private opaque data passed to the @sproc callback.
* @depth: The current depth of recursive f_op->poll calls.
+ * @ep_locked: caller already holds ep->mtx
*
* Returns: The same integer error code returned by the @sproc callback.
*/
static int ep_scan_ready_list(struct eventpoll *ep,
int (*sproc)(struct eventpoll *,
struct list_head *, void *),
- void *priv, int depth, int ep_locked)
+ void *priv, int depth, bool ep_locked)
{
int error, pwake = 0;
unsigned long flags;
@@ -836,12 +837,12 @@ static void ep_ptable_queue_proc(struct

struct readyevents_arg {
struct eventpoll *ep;
- int locked;
+ bool locked;
};

static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
{
- struct readyevents_arg *arg = (struct readyevents_arg *)priv;
+ struct readyevents_arg *arg = priv;

return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL,
call_nests + 1, arg->locked);
@@ -857,7 +858,7 @@ static unsigned int ep_eventpoll_poll(st
* During ep_insert() we already hold the ep->mtx for the tfile.
* Prevent re-aquisition.
*/
- arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0);
+ arg.locked = wait && (wait->_qproc == ep_ptable_queue_proc);
arg.ep = ep;

/* Insert inside our poll wait queue */
@@ -1563,7 +1564,7 @@ static int ep_send_events(struct eventpo
esed.maxevents = maxevents;
esed.events = events;

- return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0);
+ return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}

static inline struct timespec ep_set_mstimeout(long ms)
_

2013-10-04 15:16:15

by Jason Baron

Subject: Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

On 10/03/2013 05:50 PM, Andrew Morton wrote:
> On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <[email protected]> wrote:
>
>> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
>> directly to a wakeup source, we do not need to take the global 'epmutex',
>> unless the epoll file descriptor is nested. The purpose of taking
>> the 'epmutex' on add is to prevent complex topologies such as loops and
>> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
>> operations. However, for the simple case of an epoll file descriptor
>> attached directly to a wakeup source (with no nesting), we do not need
>> to hold the 'epmutex'.
>>
>> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
>> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
>> performance:
>>
>> "
>> On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
>> In addition the benchmark when from scaling well on 10 sockets to scaling well
>> on just over 40 sockets.
>>
>> ...
>>
>> Currently the benchmark stops scaling at around 40-44 sockets but it seems like
>> I found a second unrelated bottleneck.
>> "
> I couldn't resist fiddling. Please review
>
> From: Andrew Morton <[email protected]>
> Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix
>
> - use `bool' for boolean variables
> - remove unneeded/undesirable cast of void*
> - add missed ep_scan_ready_list() kerneldoc

Hi Andrew,

Looks good to me. Feel free to add:

Reviewed-by: Jason Baron <[email protected]>

Thanks,

-Jason

2013-10-04 17:54:06

by Nathan Zimmer

Subject: Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

On Fri, Oct 04, 2013 at 11:16:12AM -0400, Jason Baron wrote:
> On 10/03/2013 05:50 PM, Andrew Morton wrote:
> > On Tue, 1 Oct 2013 17:08:14 +0000 (GMT) Jason Baron <[email protected]> wrote:
> >
> >> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
> >> directly to a wakeup source, we do not need to take the global 'epmutex',
> >> unless the epoll file descriptor is nested. The purpose of taking
> >> the 'epmutex' on add is to prevent complex topologies such as loops and
> >> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
> >> operations. However, for the simple case of an epoll file descriptor
> >> attached directly to a wakeup source (with no nesting), we do not need
> >> to hold the 'epmutex'.
> >>
> >> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
> >> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
> >> performance:
> >>
> >> "
> >> On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
> >> In addition the benchmark when from scaling well on 10 sockets to scaling well
> >> on just over 40 sockets.
> >>
> >> ...
> >>
> >> Currently the benchmark stops scaling at around 40-44 sockets but it seems like
> >> I found a second unrelated bottleneck.
> >> "
> > I couldn't resist fiddling. Please review
> >
> > From: Andrew Morton <[email protected]>
> > Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix
> >
> > - use `bool' for boolean variables
> > - remove unneeded/undesirable cast of void*
> > - add missed ep_scan_ready_list() kerneldoc
>
> Hi Andrew,
>
> Looks good to me. Feel free to add:
>
> Reviewed-by: Jason Baron <[email protected]>
>
> Thanks,
>
> -Jason

Just finished rerunning the benchmarks with this latest patchset, including the
fiddling, and it still looks great.

Thanks,

Nate

2013-10-24 09:00:13

by Paul E. McKenney

Subject: Re: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu

On Tue, Oct 01, 2013 at 05:08:10PM +0000, Jason Baron wrote:
> Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
> converting the file->f_ep_links list into an rcu one. In this way, we can
> traverse the epoll network on the add path in parallel with deletes.
> Since deletes can't create loops or worse wakeup paths, this is safe.
>
> This patch in combination with the patch "epoll: Do not take global 'epmutex'
> for simple topologies", shows a dramatic performance improvement in
> scalability for SPECjbb.

A few questions and comments below.

Thanx, Paul

> Signed-off-by: Jason Baron <[email protected]>
> Tested-by: Nathan Zimmer <[email protected]>
> Cc: Eric Wong <[email protected]>
> Cc: Nelson Elhage <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Davide Libenzi <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> fs/eventpoll.c | 65 +++++++++++++++++++++++++++++------------------
> 1 file changed, 41 insertions(+), 24 deletions(-)
>
> diff -puN fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu fs/eventpoll.c
> --- a/fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu
> +++ a/fs/eventpoll.c
> @@ -42,6 +42,7 @@
> #include <linux/proc_fs.h>
> #include <linux/seq_file.h>
> #include <linux/compat.h>
> +#include <linux/rculist.h>
>
> /*
> * LOCKING:
> @@ -134,8 +135,12 @@ struct nested_calls {
> * of these on a server and we do not want this to take another cache line.
> */
> struct epitem {
> - /* RB tree node used to link this structure to the eventpoll RB tree */
> - struct rb_node rbn;
> + union {
> + /* RB tree node links this structure to the eventpoll RB tree */
> + struct rb_node rbn;

And RCU readers never need to use rbn, right? (That appears to be the case,
so good.)

> + /* Used to free the struct epitem */
> + struct rcu_head rcu;
> + };
>
> /* List header used to link this structure to the eventpoll ready list */
> struct list_head rdllink;
> @@ -166,6 +171,9 @@ struct epitem {
>
> /* The structure that describe the interested events and the source fd */
> struct epoll_event event;
> +
> + /* The fllink is in use. Since rcu can't do 'list_del_init()' */
> + int on_list;
> };
>
> /*
> @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eve
> return error;
> }
>
> +static void epi_rcu_free(struct rcu_head *head)
> +{
> + struct epitem *epi = container_of(head, struct epitem, rcu);
> + kmem_cache_free(epi_cache, epi);

Sigh. I suppose that I need to create a kmem_cache_free_rcu() at some
point...

> +}
> +
> /*
> * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> * all the associated resources. Must be called with "mtx" held.
> @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *e
>
> /* Remove the current item from the list of epoll hooks */
> spin_lock(&file->f_lock);
> - if (ep_is_linked(&epi->fllink))
> - list_del_init(&epi->fllink);
> + if (epi->on_list) {
> + list_del_rcu(&epi->fllink);

OK, here is the list_del_rcu(). It does seem to precede the call_rcu()
below, as it must. Of course, if !epi->on_list, you could just free
it without going through call_rcu(), but perhaps that optimization is
not worthwhile.

> + epi->on_list = 0;
> + }
> spin_unlock(&file->f_lock);
>
> rb_erase(&epi->rbn, &ep->rbr);
> @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *e
> spin_unlock_irqrestore(&ep->lock, flags);
>
> wakeup_source_unregister(ep_wakeup_source(epi));
> -
> - /* At this point it is safe to free the eventpoll item */
> - kmem_cache_free(epi_cache, epi);
> + /*
> + * At this point it is safe to free the eventpoll item. Use the union
> + * field epi->rcu, since we are trying to minimize the size of
> + * 'struct epitem'. The 'rbn' field is no longer in use. Protected by
> + * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
> + * use of the rbn field.
> + */
> + call_rcu(&epi->rcu, epi_rcu_free);

And here is the call_rcu(), so good. At least assuming there are no other
RCU-reader-accessible lists that this thing is on. (If there are, then it
needs to be removed from these lists as well.)

>
> atomic_long_dec(&ep->user->epoll_watches);
>
> @@ -873,7 +894,6 @@ static const struct file_operations even
> */
> void eventpoll_release_file(struct file *file)
> {
> - struct list_head *lsthead = &file->f_ep_links;
> struct eventpoll *ep;
> struct epitem *epi;
>
> @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file
> * Besides, ep_remove() acquires the lock, so we can't hold it here.
> */
> mutex_lock(&epmutex);
> -
> - while (!list_empty(lsthead)) {
> - epi = list_first_entry(lsthead, struct epitem, fllink);
> -
> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
> ep = epi->ep;
> - list_del_init(&epi->fllink);
> mutex_lock_nested(&ep->mtx, 0);
> ep_remove(ep, epi);
> mutex_unlock(&ep->mtx);
> }
> -
> mutex_unlock(&epmutex);
> }
>
> @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void
> struct file *child_file;
> struct epitem *epi;
>
> - list_for_each_entry(epi, &file->f_ep_links, fllink) {
> + /* CTL_DEL can remove links here, but that can't increase our count */
> + rcu_read_lock();
> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
> child_file = epi->ep->file;
> if (is_file_epoll(child_file)) {
> if (list_empty(&child_file->f_ep_links)) {

Hmmm... It looks like ep_call_nested() acquires a global spinlock.
Perhaps that is fixed in a later patch in this series? Otherwise,
this will be a bottleneck on large systems.

> @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void
> "file is not an ep!\n");
> }
> }
> + rcu_read_unlock();
> return error;
> }
>
> @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *e
> epi->event = *event;
> epi->nwait = 0;
> epi->next = EP_UNACTIVE_PTR;
> + epi->on_list = 0;
> if (epi->event.events & EPOLLWAKEUP) {
> error = ep_create_wakeup_source(epi);
> if (error)
> @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *e
>
> /* Add the current item to the list of active epoll hook for this file */
> spin_lock(&tfile->f_lock);

Looks like ->f_lock protects updates to the RCU-protected structure.

> - list_add_tail(&epi->fllink, &tfile->f_ep_links);
> + list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
> + epi->on_list = 1;
> spin_unlock(&tfile->f_lock);
>
> /*
> @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *e
>
> error_remove_epi:
> spin_lock(&tfile->f_lock);

More evidence in favor of ->f_lock protecting updates to the RCU-protected
data structure.

> - if (ep_is_linked(&epi->fllink))
> - list_del_init(&epi->fllink);
> + if (epi->on_list)
> + list_del_rcu(&epi->fllink);

OK, this list_del_rcu() call is for handling failed inserts.

But if we had to use list_del_rcu(), don't we also need to wait a grace
period before freeing the item? Or does rb_erase() somehow do that?

Or has placing it ->on_list somehow avoided exposing it to readers?
(Of course, if this is the case, the list_del_rcu() could remain
a list_del_init()...)

> spin_unlock(&tfile->f_lock);
>
> rb_erase(&epi->rbn, &ep->rbr);
> @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in
> * and hang them on the tfile_check_list, so we can check that we
> * haven't created too many possible wakeup paths.
> *
> - * We need to hold the epmutex across both ep_insert and ep_remove
> - * b/c we want to make sure we are looking at a coherent view of
> - * epoll network.
> + * We need to hold the epmutex across ep_insert to prevent
> + * multple adds from creating loops in parallel.
> */
> - if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
> + if (op == EPOLL_CTL_ADD) {
> mutex_lock(&epmutex);
> did_lock_epmutex = 1;
> - }
> - if (op == EPOLL_CTL_ADD) {
> if (is_file_epoll(tf.file)) {
> error = -ELOOP;
> if (ep_loop_check(ep, tf.file) != 0) {
> _
>

2013-10-24 10:09:48

by Paul E. McKenney

Subject: Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

On Tue, Oct 01, 2013 at 05:08:14PM +0000, Jason Baron wrote:
> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
> directly to a wakeup source, we do not need to take the global 'epmutex',
> unless the epoll file descriptor is nested. The purpose of taking
> the 'epmutex' on add is to prevent complex topologies such as loops and
> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
> operations. However, for the simple case of an epoll file descriptor
> attached directly to a wakeup source (with no nesting), we do not need
> to hold the 'epmutex'.
>
> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
> performance:
>
> "
> On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
> In addition the benchmark when from scaling well on 10 sockets to scaling well
> on just over 40 sockets.
>
> ...
>
> Currently the benchmark stops scaling at around 40-44 sockets but it seems like
> I found a second unrelated bottleneck.
> "

Some questions and comments below.

> Tested-by: Nathan Zimmer <[email protected]>
> Signed-off-by: Jason Baron <[email protected]>
> ---
> fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 68 insertions(+), 26 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index dd9fae1..0f25162 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
> static int ep_scan_ready_list(struct eventpoll *ep,
> int (*sproc)(struct eventpoll *,
> struct list_head *, void *),
> - void *priv,
> - int depth)
> + void *priv, int depth, int ep_locked)
> {
> int error, pwake = 0;
> unsigned long flags;
> @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep,
> * We need to lock this because we could be hit by
> * eventpoll_release_file() and epoll_ctl().
> */
> - mutex_lock_nested(&ep->mtx, depth);
> +
> + if (!ep_locked)
> + mutex_lock_nested(&ep->mtx, depth);
>
> /*
> * Steal the ready list, and re-init the original one to the
> @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep,
> }
> spin_unlock_irqrestore(&ep->lock, flags);
>
> - mutex_unlock(&ep->mtx);
> + if (!ep_locked)
> + mutex_unlock(&ep->mtx);
>
> /* We have to call this outside the lock */
> if (pwake)

OK, allowing calls to ep_scan_ready_list() to be made under the lock.

Usually you would use a __ep_scan_ready_list() instead of adding the
flag, but you do have the wakeup that needs to be outside of the lock.

But doesn't that mean that you need something like?

if (pwake && !ep_locked)

And then wouldn't you also need some way to inform the caller to
do the wakeup after releasing ep->mtx?

Or is it now safe to do the wakeup under ep->mtx? If so, why?

> @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
> return 0;
> }
>
> +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
> + poll_table *pt);
> +
> +struct readyevents_arg {
> + struct eventpoll *ep;
> + int locked;
> +};
> +
> static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
> {

OK, I will bite... What is constraining the arguments of the
ep_poll_readyevents_proc()? I don't see any use of it except in this
file. Nor do I see any use of ep_call_nested() except in this file.
Might it be a bit cleaner to add a flag to the argument list?

I don't feel strongly about this, just asking the question.

> - return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1);
> + struct readyevents_arg *arg = (struct readyevents_arg *)priv;
> +
> + return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL,
> + call_nests + 1, arg->locked);
> }
>
> static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
> {
> int pollflags;
> struct eventpoll *ep = file->private_data;
> + struct readyevents_arg arg;
> +
> + /*
> + * During ep_insert() we already hold the ep->mtx for the tfile.
> + * Prevent re-aquisition.
> + */
> + arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0);
> + arg.ep = ep;
>
> /* Insert inside our poll wait queue */
> poll_wait(file, &ep->poll_wait, wait);
> @@ -849,7 +870,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
> * could re-enter here.
> */
> pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS,
> - ep_poll_readyevents_proc, ep, ep, current);
> + ep_poll_readyevents_proc, &arg, ep, current);
>
> return pollflags != -1 ? pollflags : 0;
> }
> @@ -1250,7 +1271,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
> * Must be called with "mtx" held.
> */
> static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
> - struct file *tfile, int fd)
> + struct file *tfile, int fd, int full_check)
> {
> int error, revents, pwake = 0;
> unsigned long flags;
> @@ -1318,7 +1339,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
>
> /* now check if we've created too many backpaths */
> error = -EINVAL;
> - if (reverse_path_check())
> + if (full_check && reverse_path_check())
> goto error_remove_epi;
>
> /* We have to drop the new item inside our item list to keep track of it */
> @@ -1542,7 +1563,7 @@ static int ep_send_events(struct eventpoll *ep,
> esed.maxevents = maxevents;
> esed.events = events;
>
> - return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0);
> + return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, 0);
> }
>
> static inline struct timespec ep_set_mstimeout(long ms)
> @@ -1813,11 +1834,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> struct epoll_event __user *, event)
> {
> int error;
> - int did_lock_epmutex = 0;
> + int full_check = 0;
> struct fd f, tf;
> struct eventpoll *ep;
> struct epitem *epi;
> struct epoll_event epds;
> + struct eventpoll *tep = NULL;
>
> error = -EFAULT;
> if (ep_op_has_event(op) &&
> @@ -1866,23 +1888,40 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> * and hang them on the tfile_check_list, so we can check that we
> * haven't created too many possible wakeup paths.
> *
> - * We need to hold the epmutex across ep_insert to prevent
> - * multple adds from creating loops in parallel.
> + * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
> + * the epoll file descriptor is attaching directly to a wakeup source,
> + * unless the epoll file descriptor is nested. The purpose of taking the
> + * 'epmutex' on add is to prevent complex toplogies such as loops and
> + * deep wakeup paths from forming in parallel through multiple
> + * EPOLL_CTL_ADD operations.
> */
> + mutex_lock_nested(&ep->mtx, 0);
> if (op == EPOLL_CTL_ADD) {
> - mutex_lock(&epmutex);
> - did_lock_epmutex = 1;
> - if (is_file_epoll(tf.file)) {
> - error = -ELOOP;
> - if (ep_loop_check(ep, tf.file) != 0) {
> - clear_tfile_check_list();
> - goto error_tgt_fput;
> + if (!list_empty(&f.file->f_ep_links) ||
> + is_file_epoll(tf.file)) {
> + full_check = 1;
> + mutex_unlock(&ep->mtx);
> + mutex_lock(&epmutex);
> + if (is_file_epoll(tf.file)) {
> + error = -ELOOP;
> + if (ep_loop_check(ep, tf.file) != 0) {
> + clear_tfile_check_list();
> + goto error_tgt_fput;
> + }
> + } else
> + list_add(&tf.file->f_tfile_llink,
> + &tfile_check_list);
> + mutex_lock_nested(&ep->mtx, 0);
> + if (is_file_epoll(tf.file)) {
> + tep = tf.file->private_data;
> + mutex_lock_nested(&tep->mtx, 1);
> }
> - } else
> - list_add(&tf.file->f_tfile_llink, &tfile_check_list);
> + }
> + }
> + if (op == EPOLL_CTL_DEL && is_file_epoll(tf.file)) {
> + tep = tf.file->private_data;
> + mutex_lock_nested(&tep->mtx, 1);
> }
> -
> - mutex_lock_nested(&ep->mtx, 0);
>
> /*
> * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
> @@ -1896,10 +1935,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> case EPOLL_CTL_ADD:
> if (!epi) {
> epds.events |= POLLERR | POLLHUP;
> - error = ep_insert(ep, &epds, tf.file, fd);
> + error = ep_insert(ep, &epds, tf.file, fd, full_check);
> } else
> error = -EEXIST;
> - clear_tfile_check_list();
> + if (full_check)
> + clear_tfile_check_list();
> break;
> case EPOLL_CTL_DEL:
> if (epi)
> @@ -1915,10 +1955,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> error = -ENOENT;
> break;
> }
> + if (tep != NULL)
> + mutex_unlock(&tep->mtx);
> mutex_unlock(&ep->mtx);
>
> error_tgt_fput:
> - if (did_lock_epmutex)
> + if (full_check)
> mutex_unlock(&epmutex);
>
> fdput(tf);
> --
> 1.8.2
>

2013-10-24 14:56:21

by Jason Baron

Subject: Re: [PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu

Thanks for looking at this. Comments below.

-Jason


On 10/24/2013 04:52 AM, Paul E. McKenney wrote:
> On Tue, Oct 01, 2013 at 05:08:10PM +0000, Jason Baron wrote:
>> Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
>> converting the file->f_ep_links list into an rcu one. In this way, we can
>> traverse the epoll network on the add path in parallel with deletes.
>> Since deletes can't create loops or worse wakeup paths, this is safe.
>>
>> This patch in combination with the patch "epoll: Do not take global 'epmutex'
>> for simple topologies", shows a dramatic performance improvement in
>> scalability for SPECjbb.
>
> A few questions and comments below.
>
> Thanx, Paul
>
>> Signed-off-by: Jason Baron <[email protected]>
>> Tested-by: Nathan Zimmer <[email protected]>
>> Cc: Eric Wong <[email protected]>
>> Cc: Nelson Elhage <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Davide Libenzi <[email protected]>
>> Cc: "Paul E. McKenney" <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>> ---
>>
>> fs/eventpoll.c | 65 +++++++++++++++++++++++++++++------------------
>> 1 file changed, 41 insertions(+), 24 deletions(-)
>>
>> diff -puN fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu fs/eventpoll.c
>> --- a/fs/eventpoll.c~epoll-optimize-epoll_ctl_del-using-rcu
>> +++ a/fs/eventpoll.c
>> @@ -42,6 +42,7 @@
>> #include <linux/proc_fs.h>
>> #include <linux/seq_file.h>
>> #include <linux/compat.h>
>> +#include <linux/rculist.h>
>>
>> /*
>> * LOCKING:
>> @@ -134,8 +135,12 @@ struct nested_calls {
>> * of these on a server and we do not want this to take another cache line.
>> */
>> struct epitem {
>> - /* RB tree node used to link this structure to the eventpoll RB tree */
>> - struct rb_node rbn;
>> + union {
>> + /* RB tree node links this structure to the eventpoll RB tree */
>> + struct rb_node rbn;
>
> And RCU readers never need to use rbn, right? (That appears to be the case,
> so good.)

Right, RCU readers, specifically sections under rcu_read_lock(), are not touching
rbn.

The idea here is to remove the usage of the rbn, via the call to rb_erase() in
ep_remove() before we make use of the struct rcu_head rcu. The concern I have
is that there could be a race where somebody accesses rbn, while we are using
it as rcu. However, I don't think it's possible. The idea is that all access to
rbn is guarded by ep->mtx. The only exception I found was in the top of
'ep_free()', but in that case we are in an 'fput' and I don't believe anybody
else could get a reference to the ep and hence the rbn. But please double
check.


>
>> + /* Used to free the struct epitem */
>> + struct rcu_head rcu;
>> + };
>>
>> /* List header used to link this structure to the eventpoll ready list */
>> struct list_head rdllink;
>> @@ -166,6 +171,9 @@ struct epitem {
>>
>> /* The structure that describe the interested events and the source fd */
>> struct epoll_event event;
>> +
>> + /* The fllink is in use. Since rcu can't do 'list_del_init()' */
>> + int on_list;
>> };
>>
>> /*
>> @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eve
>> return error;
>> }
>>
>> +static void epi_rcu_free(struct rcu_head *head)
>> +{
>> + struct epitem *epi = container_of(head, struct epitem, rcu);
>> + kmem_cache_free(epi_cache, epi);
>
> Sigh. I suppose that I need to create a kmem_cache_free_rcu() at some
> point...
>
>> +}
>> +
>> /*
>> * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>> * all the associated resources. Must be called with "mtx" held.
>> @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *e
>>
>> /* Remove the current item from the list of epoll hooks */
>> spin_lock(&file->f_lock);
>> - if (ep_is_linked(&epi->fllink))
>> - list_del_init(&epi->fllink);
>> + if (epi->on_list) {
>> + list_del_rcu(&epi->fllink);
>
> OK, here is the list_del_rcu(). It does seem to precede the call_rcu()
> below, as it must. Of course, if !epi->onlist, you could just free
> it without going through call_rcu(), but perhaps that optimization is
> not worthwhile.
>

hmmm...yeah I think we're ok without it.

>> + epi->on_list = 0;
>> + }
>> spin_unlock(&file->f_lock);
>>
>> rb_erase(&epi->rbn, &ep->rbr);
>> @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *e
>> spin_unlock_irqrestore(&ep->lock, flags);
>>
>> wakeup_source_unregister(ep_wakeup_source(epi));
>> -
>> - /* At this point it is safe to free the eventpoll item */
>> - kmem_cache_free(epi_cache, epi);
>> + /*
>> + * At this point it is safe to free the eventpoll item. Use the union
>> + * field epi->rcu, since we are trying to minimize the size of
>> + * 'struct epitem'. The 'rbn' field is no longer in use. Protected by
>> + * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
>> + * use of the rbn field.
>> + */
>> + call_rcu(&epi->rcu, epi_rcu_free);
>
> And here is the call_rcu(), so good. At least assuming there are no other
> RCU-reader-accessible lists that this thing is on. (If there are, then it
> needs to be removed from these lists as well.)
>

It is removed from the epi->rdllink before here as well, but my patch doesn't
alter any ordering with respect to that removal.


>>
>> atomic_long_dec(&ep->user->epoll_watches);
>>
>> @@ -873,7 +894,6 @@ static const struct file_operations even
>> */
>> void eventpoll_release_file(struct file *file)
>> {
>> - struct list_head *lsthead = &file->f_ep_links;
>> struct eventpoll *ep;
>> struct epitem *epi;
>>
>> @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file
>> * Besides, ep_remove() acquires the lock, so we can't hold it here.
>> */
>> mutex_lock(&epmutex);
>> -
>> - while (!list_empty(lsthead)) {
>> - epi = list_first_entry(lsthead, struct epitem, fllink);
>> -
>> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
>> ep = epi->ep;
>> - list_del_init(&epi->fllink);
>> mutex_lock_nested(&ep->mtx, 0);
>> ep_remove(ep, epi);
>> mutex_unlock(&ep->mtx);
>> }
>> -
>> mutex_unlock(&epmutex);
>> }
>>
>> @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void
>> struct file *child_file;
>> struct epitem *epi;
>>
>> - list_for_each_entry(epi, &file->f_ep_links, fllink) {
>> + /* CTL_DEL can remove links here, but that can't increase our count */
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
>> child_file = epi->ep->file;
>> if (is_file_epoll(child_file)) {
>> if (list_empty(&child_file->f_ep_links)) {
>
> Hmmm... It looks like ep_call_nested() acquires a global spinlock.
> Perhaps that is fixed in a later patch in this series? Otherwise,
> this will be a bottleneck on large systems.
>

Yes - the spinlock is pointed to by 'poll_loop_ncalls', which is
used by both the loop checking logic and by
'reverse_path_check_proc()', which detects if we have added too many
wakeup paths to a wakeup source.

The point of this patch is to only do these more expensive checks
when the epoll graph or nesting is 'complex'. So, in theory, the
fast paths wouldn't even take this lock. Regardless, the current
logic still requires a global lock for these checks - the 'epmutex',
so that would be the higher level lock to fix, if there were still
scaling issues here. Note that with these patches the 'epmutex' is
now only taken on ep_insert, in certain cases, and not at all on
ep_remove (due to the new rcu usage).

>> @@ -1161,6 +1178,7 @@ static int reverse_path_check_proc(void
>> "file is not an ep!\n");
>> }
>> }
>> + rcu_read_unlock();
>> return error;
>> }
>>
>> @@ -1255,6 +1273,7 @@ static int ep_insert(struct eventpoll *e
>> epi->event = *event;
>> epi->nwait = 0;
>> epi->next = EP_UNACTIVE_PTR;
>> + epi->on_list = 0;
>> if (epi->event.events & EPOLLWAKEUP) {
>> error = ep_create_wakeup_source(epi);
>> if (error)
>> @@ -1287,7 +1306,8 @@ static int ep_insert(struct eventpoll *e
>>
>> /* Add the current item to the list of active epoll hook for this file */
>> spin_lock(&tfile->f_lock);
>
> Looks like ->f_lock protects updates to the RCU-protected structure.
>

yes.


>> - list_add_tail(&epi->fllink, &tfile->f_ep_links);
>> + list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
>> + epi->on_list = 1;
>> spin_unlock(&tfile->f_lock);
>>
>> /*
>> @@ -1328,8 +1348,8 @@ static int ep_insert(struct eventpoll *e
>>
>> error_remove_epi:
>> spin_lock(&tfile->f_lock);
>
> More evidence in favor of ->f_lock protecting updates to the RCU-protected
> data structure.
>
>> - if (ep_is_linked(&epi->fllink))
>> - list_del_init(&epi->fllink);
>> + if (epi->on_list)
>> + list_del_rcu(&epi->fllink);
>
> OK, this list_del_rcu() call is for handling failed inserts.
>
> But if we had to use list_del_rcu(), don't we also need to wait a grace
> period before freeing the item? Or does rb_erase() somehow do that?
>
> Or has placing it ->on_list somehow avoided exposing it to readers?
> (Of course, it this is the case, the list_del_rcu() could remain
> a list_del_init()...)
>

Right, so the only way we can get a failed insert is if we get to the
'error_remove_epi:' label, which is only possible if 'full_check' is
set. That implies that we hold the global 'epmutex', which means
that 'reverse_path_check_proc()' cannot be running. Thus, we
could use either list_del_rcu() or list_del_init() here. I chose
list_del_rcu() to be more consistent. Perhaps I should add a comment
above to explain this?


>> spin_unlock(&tfile->f_lock);
>>
>> rb_erase(&epi->rbn, &ep->rbr);
>> @@ -1846,15 +1866,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, in
>> * and hang them on the tfile_check_list, so we can check that we
>> * haven't created too many possible wakeup paths.
>> *
>> - * We need to hold the epmutex across both ep_insert and ep_remove
>> - * b/c we want to make sure we are looking at a coherent view of
>> - * epoll network.
>> + * We need to hold the epmutex across ep_insert to prevent
>> + * multple adds from creating loops in parallel.
>> */
>> - if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
>> + if (op == EPOLL_CTL_ADD) {
>> mutex_lock(&epmutex);
>> did_lock_epmutex = 1;
>> - }
>> - if (op == EPOLL_CTL_ADD) {
>> if (is_file_epoll(tf.file)) {
>> error = -ELOOP;
>> if (ep_loop_check(ep, tf.file) != 0) {
>> _
>>

2013-10-24 16:00:43

by Jason Baron

Subject: Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies

On 10/24/2013 06:09 AM, Paul E. McKenney wrote:
> On Tue, Oct 01, 2013 at 05:08:14PM +0000, Jason Baron wrote:
>> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
>> directly to a wakeup source, we do not need to take the global 'epmutex',
>> unless the epoll file descriptor is nested. The purpose of taking
>> the 'epmutex' on add is to prevent complex topologies such as loops and
>> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
>> operations. However, for the simple case of an epoll file descriptor
>> attached directly to a wakeup source (with no nesting), we do not need
>> to hold the 'epmutex'.
>>
>> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
>> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
>> performance:
>>
>> "
>> On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
>> In addition the benchmark when from scaling well on 10 sockets to scaling well
>> on just over 40 sockets.
>>
>> ...
>>
>> Currently the benchmark stops scaling at around 40-44 sockets but it seems like
>> I found a second unrelated bottleneck.
>> "
>
> Some questions and comments below.
>
>> Tested-by: Nathan Zimmer <[email protected]>
>> Signed-off-by: Jason Baron <[email protected]>
>> ---
>> fs/eventpoll.c | 94 ++++++++++++++++++++++++++++++++++++++++++----------------
>> 1 file changed, 68 insertions(+), 26 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index dd9fae1..0f25162 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
>> static int ep_scan_ready_list(struct eventpoll *ep,
>> int (*sproc)(struct eventpoll *,
>> struct list_head *, void *),
>> - void *priv,
>> - int depth)
>> + void *priv, int depth, int ep_locked)
>> {
>> int error, pwake = 0;
>> unsigned long flags;
>> @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep,
>> * We need to lock this because we could be hit by
>> * eventpoll_release_file() and epoll_ctl().
>> */
>> - mutex_lock_nested(&ep->mtx, depth);
>> +
>> + if (!ep_locked)
>> + mutex_lock_nested(&ep->mtx, depth);
>>
>> /*
>> * Steal the ready list, and re-init the original one to the
>> @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep,
>> }
>> spin_unlock_irqrestore(&ep->lock, flags);
>>
>> - mutex_unlock(&ep->mtx);
>> + if (!ep_locked)
>> + mutex_unlock(&ep->mtx);
>>
>> /* We have to call this outside the lock */
>> if (pwake)
>
> OK, allowing calls to ep_scan_ready_list() to be made under the lock.
>
> Usually you would use a __ep_scan_ready_list() instead of adding the
> flag, but you do have the wakeup that needs to be outside of the lock.
>
> But doesn't that mean that you need something like?
>
> if (pwake && !ep_locked)
>
> And then wouldn't you also need some way to inform the caller to
> do the wakeup after releasing ep->mtx?
>
> Or is it now safe to do the wakeup under ep->mtx? If so, why?
>

All I was trying to do here was to preserve the ep->mtx locking as it was
prior to this patch series. Because the patches now take the 'destination'
ep->mtx on an epoll_ctl() operation, it might already be held, leading to
double locking (I hit this case while testing this series). So the idea
here is simply to recognize when we've already grabbed this ep->mtx.


>> @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
>> return 0;
>> }
>>
>> +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>> + poll_table *pt);
>> +
>> +struct readyevents_arg {
>> + struct eventpoll *ep;
>> + int locked;
>> +};
>> +
>> static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
>> {
>
> OK, I will bite... What is constraining the arguments of the
> ep_poll_readyevents_proc()? I don't see any use of it except in this
> file. Nor do I see any use of ep_call_nested() except in this file.
> Might it be a bit cleaner to add a flag to the argument list?
>
> I don't feel strongly about this, just asking the question.
>

I don't feel too strongly either, but other users of ep_call_nested
don't need this either...

Thanks,

-Jason