2013-03-07 11:26:41

by Eric Wong

[permalink] [raw]
Subject: epoll: possible bug from wakeup_source activation

Hi Arve, looking at commit 4d7e30d98939a0340022ccd49325a3d70f7e0238
(epoll: Add a flag, EPOLLWAKEUP, to prevent suspend ...)

I think the reason for using ep->ws instead of epi->ws in the unlikely
ovflist case applies to the likely rdllist case, too. Since epi->ws is
only protected by ep->mtx, it can also be deactivated while inside
ep_poll_callback.

So something like the following patch might be necessary
(shown here with extra context):

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -968,39 +968,45 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
if (epi->next == EP_UNACTIVE_PTR) {
epi->next = ep->ovflist;
ep->ovflist = epi;
if (epi->ws) {
/*
* Activate ep->ws since epi->ws may get
* deactivated at any time.
*/
__pm_stay_awake(ep->ws);
}

}
goto out_unlock;
}

/* If this file is already in the ready list we exit soon */
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ if (epi->ws) {
+ /*
+ * Activate ep->ws since epi->ws may get
+ * deactivated at any time.
+ */
+ __pm_stay_awake(ep->ws);
+ }
}
-----------------------------------------------------------------------
However, I think my proposed patch will also cause breakage with the
way epi->ws is handled in ep_send_events_proc:

/*
* Activate ep->ws before deactivating epi->ws to prevent
* triggering auto-suspend here (in case we reactive epi->ws
* below).
*
* This could be rearranged to delay the deactivation of epi->ws
* instead, but then epi->ws would temporarily be out of sync
* with ep_is_linked().
*/
if (epi->ws && epi->ws->active)
__pm_stay_awake(ep->ws);
__pm_relax(epi->ws);
list_del_init(&epi->rdllink);

I'm not sure, but maybe only using ep->ws is the easiest way to go.


2013-03-08 01:30:30

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Eric Wong <[email protected]> wrote:
> Hi Arve, looking at commit 4d7e30d98939a0340022ccd49325a3d70f7e0238
> (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend ...)
>
> I think the reason for using ep->ws instead of epi->ws in the unlikely
> ovflist case applies to the likely rdllist case, too. Since epi->ws is
> only protected by ep->mtx, it can also be deactivated while inside
> ep_poll_callback.
>
> So something like the following patch might be necessary
> (shown here with extra context):
>
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -968,39 +968,45 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
> if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
> if (epi->next == EP_UNACTIVE_PTR) {
> epi->next = ep->ovflist;
> ep->ovflist = epi;
> if (epi->ws) {
> /*
> * Activate ep->ws since epi->ws may get
> * deactivated at any time.
> */
> __pm_stay_awake(ep->ws);
> }
>
> }

Thinking about this more, it looks like the original ep->ovflist case of
using ep->ws is unnecessary.

ep->ovflist != EP_UNACTIVE_PTR can only happen while ep->mtx is held (in
ep_scan_ready_list); which means ep_modify+friends cannot remove epi->ws.

ep_poll_callback holding ep->lock means ep_poll_callback prevents
ep_scan_ready_list from setting ep->ovflist = EP_UNACTIVE_PTR and
releasing ep->mtx.

> goto out_unlock;
> }
>
> /* If this file is already in the ready list we exit soon */
> if (!ep_is_linked(&epi->rdllink)) {
> list_add_tail(&epi->rdllink, &ep->rdllist);
> - __pm_stay_awake(epi->ws);
> + if (epi->ws) {
> + /*
> + * Activate ep->ws since epi->ws may get
> + * deactivated at any time.
> + */
> + __pm_stay_awake(ep->ws);
> + }
> }

I still think ep->ws needs to be used in the common ep->rdllist case.

2013-03-08 04:56:51

by Arve Hjønnevåg

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

On Thu, Mar 7, 2013 at 5:30 PM, Eric Wong <[email protected]> wrote:
> Eric Wong <[email protected]> wrote:
>> Hi Arve, looking at commit 4d7e30d98939a0340022ccd49325a3d70f7e0238
>> (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend ...)
>>
>> I think the reason for using ep->ws instead of epi->ws in the unlikely
>> ovflist case applies to the likely rdllist case, too. Since epi->ws is
>> only protected by ep->mtx, it can also be deactivated while inside
>> ep_poll_callback.
>>
>> So something like the following patch might be necessary
>> (shown here with extra context):
>>
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -968,39 +968,45 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>> if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>> if (epi->next == EP_UNACTIVE_PTR) {
>> epi->next = ep->ovflist;
>> ep->ovflist = epi;
>> if (epi->ws) {
>> /*
>> * Activate ep->ws since epi->ws may get
>> * deactivated at any time.
>> */
>> __pm_stay_awake(ep->ws);
>> }
>>
>> }
>
> Thinking about this more, it looks like the original ep->ovflist case of
> using ep->ws is unnecessary.
>
> ep->ovflist != EP_UNACTIVE_PTR can only happen while ep->mtx is held (in
> ep_scan_ready_list); which means ep_modify+friends cannot remove epi->ws.
>

The callback function in ep_scan_ready_list can call __pm_relax on it though.

> ep_poll_callback holding ep->lock means ep_poll_callback prevents
> ep_scan_ready_list from setting ep->ovflist = EP_UNACTIVE_PTR and
> releasing ep->mtx.

This code is reached when ep_scan_ready_list has set ep->ovflist to
NULL before releasing ep->lock. Since the callback function can call
__pm_relax on epi->ws without holding ep->lock we call __pm_stay_awake
in ep->ws here (the callback does not call __pm_relax on that).

>
>> goto out_unlock;
>> }
>>
>> /* If this file is already in the ready list we exit soon */
>> if (!ep_is_linked(&epi->rdllink)) {
>> list_add_tail(&epi->rdllink, &ep->rdllist);
>> - __pm_stay_awake(epi->ws);
>> + if (epi->ws) {
>> + /*
>> + * Activate ep->ws since epi->ws may get
>> + * deactivated at any time.
>> + */
>> + __pm_stay_awake(ep->ws);
>> + }
>> }
>
> I still think ep->ws needs to be used in the common ep->rdllist case.

ep_scan_ready_list calls __pm_relax on ep->ws when it is done, so this
will not work. ep->ws is not a "ep->rdllist not empty wakeup_source is
is a "ep_scan_ready_list is running" wakeup_source.

--
Arve Hj?nnev?g

2013-03-08 20:49:46

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Arve Hjønnevåg <[email protected]> wrote:
> On Thu, Mar 7, 2013 at 5:30 PM, Eric Wong <[email protected]> wrote:
> > Eric Wong <[email protected]> wrote:
> >> Hi Arve, looking at commit 4d7e30d98939a0340022ccd49325a3d70f7e0238
> >> (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend ...)
> >>
> >> I think the reason for using ep->ws instead of epi->ws in the unlikely
> >> ovflist case applies to the likely rdllist case, too. Since epi->ws is
> >> only protected by ep->mtx, it can also be deactivated while inside
> >> ep_poll_callback.
> >>
> >> So something like the following patch might be necessary
> >> (shown here with extra context):
> >>
> >> --- a/fs/eventpoll.c
> >> +++ b/fs/eventpoll.c
> >> @@ -968,39 +968,45 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
> >> if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
> >> if (epi->next == EP_UNACTIVE_PTR) {
> >> epi->next = ep->ovflist;
> >> ep->ovflist = epi;
> >> if (epi->ws) {
> >> /*
> >> * Activate ep->ws since epi->ws may get
> >> * deactivated at any time.
> >> */
> >> __pm_stay_awake(ep->ws);
> >> }
> >>
> >> }
> >
> > Thinking about this more, it looks like the original ep->ovflist case of
> > using ep->ws is unnecessary.
> >
> > ep->ovflist != EP_UNACTIVE_PTR can only happen while ep->mtx is held (in
> > ep_scan_ready_list); which means ep_modify+friends cannot remove epi->ws.
> >
>
> The callback function in ep_scan_ready_list can call __pm_relax on it though.
>
> > ep_poll_callback holding ep->lock means ep_poll_callback prevents
> > ep_scan_ready_list from setting ep->ovflist = EP_UNACTIVE_PTR and
> > releasing ep->mtx.
>
> This code is reached when ep_scan_ready_list has set ep->ovflist to
> NULL before releasing ep->lock. Since the callback function can call
> __pm_relax on epi->ws without holding ep->lock we call __pm_stay_awake
> in ep->ws here (the callback does not call __pm_relax on that).

Thanks for the explanation. I got "deactivate" and "destroy"
mixed up. However, I'm still concerned about the "destroy" case:

> >
> >> goto out_unlock;
> >> }
> >>
> >> /* If this file is already in the ready list we exit soon */
> >> if (!ep_is_linked(&epi->rdllink)) {
> >> list_add_tail(&epi->rdllink, &ep->rdllist);
> >> - __pm_stay_awake(epi->ws);
> >> + if (epi->ws) {
> >> + /*
> >> + * Activate ep->ws since epi->ws may get
> >> + * deactivated at any time.
> >> + */
> >> + __pm_stay_awake(ep->ws);
> >> + }
> >> }
> >
> > I still think ep->ws needs to be used in the common ep->rdllist case.
>
> ep_scan_ready_list calls __pm_relax on ep->ws when it is done, so this
> will not work. ep->ws is not a "ep->rdllist not empty wakeup_source is
> is a "ep_scan_ready_list is running" wakeup_source.

What happens if ep_modify calls ep_destroy_wakeup_source
while __pm_stay_awake is running on the same epi->ws?

2013-03-09 04:09:42

by Arve Hjønnevåg

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
> Arve Hj?nnev?g <[email protected]> wrote:
>> On Thu, Mar 7, 2013 at 5:30 PM, Eric Wong <[email protected]> wrote:
>> > Eric Wong <[email protected]> wrote:
>> >> Hi Arve, looking at commit 4d7e30d98939a0340022ccd49325a3d70f7e0238
>> >> (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend ...)
>> >>
>> >> I think the reason for using ep->ws instead of epi->ws in the unlikely
>> >> ovflist case applies to the likely rdllist case, too. Since epi->ws is
>> >> only protected by ep->mtx, it can also be deactivated while inside
>> >> ep_poll_callback.
>> >>
>> >> So something like the following patch might be necessary
>> >> (shown here with extra context):
>> >>
>> >> --- a/fs/eventpoll.c
>> >> +++ b/fs/eventpoll.c
>> >> @@ -968,39 +968,45 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>> >> if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>> >> if (epi->next == EP_UNACTIVE_PTR) {
>> >> epi->next = ep->ovflist;
>> >> ep->ovflist = epi;
>> >> if (epi->ws) {
>> >> /*
>> >> * Activate ep->ws since epi->ws may get
>> >> * deactivated at any time.
>> >> */
>> >> __pm_stay_awake(ep->ws);
>> >> }
>> >>
>> >> }
>> >
>> > Thinking about this more, it looks like the original ep->ovflist case of
>> > using ep->ws is unnecessary.
>> >
>> > ep->ovflist != EP_UNACTIVE_PTR can only happen while ep->mtx is held (in
>> > ep_scan_ready_list); which means ep_modify+friends cannot remove epi->ws.
>> >
>>
>> The callback function in ep_scan_ready_list can call __pm_relax on it though.
>>
>> > ep_poll_callback holding ep->lock means ep_poll_callback prevents
>> > ep_scan_ready_list from setting ep->ovflist = EP_UNACTIVE_PTR and
>> > releasing ep->mtx.
>>
>> This code is reached when ep_scan_ready_list has set ep->ovflist to
>> NULL before releasing ep->lock. Since the callback function can call
>> __pm_relax on epi->ws without holding ep->lock we call __pm_stay_awake
>> in ep->ws here (the callback does not call __pm_relax on that).
>
> Thanks for the explanation. I got "deactivate" and "destroy"
> mixed up. However, I'm still concerned about the "destroy" case:
>
>> >
>> >> goto out_unlock;
>> >> }
>> >>
>> >> /* If this file is already in the ready list we exit soon */
>> >> if (!ep_is_linked(&epi->rdllink)) {
>> >> list_add_tail(&epi->rdllink, &ep->rdllist);
>> >> - __pm_stay_awake(epi->ws);
>> >> + if (epi->ws) {
>> >> + /*
>> >> + * Activate ep->ws since epi->ws may get
>> >> + * deactivated at any time.
>> >> + */
>> >> + __pm_stay_awake(ep->ws);
>> >> + }
>> >> }
>> >
>> > I still think ep->ws needs to be used in the common ep->rdllist case.
>>
>> ep_scan_ready_list calls __pm_relax on ep->ws when it is done, so this
>> will not work. ep->ws is not a "ep->rdllist not empty wakeup_source is
>> is a "ep_scan_ready_list is running" wakeup_source.
>
> What happens if ep_modify calls ep_destroy_wakeup_source
> while __pm_stay_awake is running on the same epi->ws?

Yes, that looks like a problem. I think calling
ep_destroy_wakeup_source with ep->lock held should fix that. It is not
clear how useful changing EPOLLWAKEUP in ep_modify is, so
alternatively we could remove that feature and instead only allow it
to be set in ep_insert.

--
Arve Hj?nnev?g

2013-03-09 07:10:39

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Arve Hjønnevåg <[email protected]> wrote:
> On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
> > What happens if ep_modify calls ep_destroy_wakeup_source
> > while __pm_stay_awake is running on the same epi->ws?
>
> Yes, that looks like a problem. I think calling
> ep_destroy_wakeup_source with ep->lock held should fix that. It is not
> clear how useful changing EPOLLWAKEUP in ep_modify is, so
> alternatively we could remove that feature and instead only allow it
> to be set in ep_insert.

ep->lock would work, but ep->lock is already a source of heavy
contention in my multithreaded+epoll webservers.

Perhaps RCU can be used? I've no experience with RCU, but I've been
meaning to get acquainted with RCU.

Another possible solution is to only use ep->ws and add an atomic
counter to ep; so __pm_relax(ep->ws) is only called when the atomic
counter reaches zero.

2013-03-10 01:11:38

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Eric Wong <[email protected]> wrote:
> Arve Hjønnevåg <[email protected]> wrote:
> > On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
> > > What happens if ep_modify calls ep_destroy_wakeup_source
> > > while __pm_stay_awake is running on the same epi->ws?
> >
> > Yes, that looks like a problem. I think calling
> > ep_destroy_wakeup_source with ep->lock held should fix that. It is not
> > clear how useful changing EPOLLWAKEUP in ep_modify is, so
> > alternatively we could remove that feature and instead only allow it
> > to be set in ep_insert.
>
> ep->lock would work, but ep->lock is already a source of heavy
> contention in my multithreaded+epoll webservers.
>
> Perhaps RCU can be used? I've no experience with RCU, but I've been
> meaning to get acquainted with RCU.

The following is lightly tested, at least I haven't gotten lockdep
to complain.

--------------------------------- 8< ----------------------------
>From 2bcd549893aa204bd858cc1500a70f20b28e47c1 Mon Sep 17 00:00:00 2001
From: Eric Wong <[email protected]>
Date: Sun, 10 Mar 2013 01:06:50 +0000
Subject: [PATCH] epoll: RCU protect wakeup_source in epitem

This prevents wakeup_source destruction when a user hits the
item with EPOLL_CTL_MOD while ep_poll_callback is running.

Signed-off-by: Eric Wong <[email protected]>
---
fs/eventpoll.c | 31 ++++++++++++++++++++++++-------
1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9fec183..e008d54 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -158,7 +158,7 @@ struct epitem {
struct list_head fllink;

/* wakeup_source used when EPOLLWAKEUP is set */
- struct wakeup_source *ws;
+ struct wakeup_source __rcu *ws;

/* The structure that describe the interested events and the source fd */
struct epoll_event event;
@@ -536,6 +536,17 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
}
}

+static inline void ep_pm_stay_awake(struct epitem *epi)
+{
+ struct wakeup_source *ws;
+
+ rcu_read_lock();
+ ws = rcu_dereference(epi->ws);
+ if (ws)
+ __pm_stay_awake(ws);
+ rcu_read_unlock();
+}
+
/**
* ep_scan_ready_list - Scans the ready list in a way that makes possible for
* the scan code, to call f_op->poll(). Also allows for
@@ -984,7 +995,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
/* If this file is already in the ready list we exit soon */
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);
}

/*
@@ -1146,6 +1157,7 @@ static int reverse_path_check(void)
static int ep_create_wakeup_source(struct epitem *epi)
{
const char *name;
+ struct wakeup_source *ws;

if (!epi->ep->ws) {
epi->ep->ws = wakeup_source_register("eventpoll");
@@ -1154,17 +1166,22 @@ static int ep_create_wakeup_source(struct epitem *epi)
}

name = epi->ffd.file->f_path.dentry->d_name.name;
- epi->ws = wakeup_source_register(name);
- if (!epi->ws)
+ ws = wakeup_source_register(name);
+
+ if (!ws)
return -ENOMEM;
+ rcu_assign_pointer(epi->ws, ws);

return 0;
}

static void ep_destroy_wakeup_source(struct epitem *epi)
{
- wakeup_source_unregister(epi->ws);
- epi->ws = NULL;
+ struct wakeup_source *ws = epi->ws;
+
+ rcu_assign_pointer(epi->ws, NULL);
+ synchronize_rcu(); /* wait for ep_pm_stay_awake to finish */
+ wakeup_source_unregister(ws);
}

/*
@@ -1199,7 +1216,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
if (error)
goto error_create_wakeup_source;
} else {
- epi->ws = NULL;
+ RCU_INIT_POINTER(epi->ws, NULL);
}

/* Initialize the poll table using the queue callback */
--
Eric Wong

2013-03-10 04:59:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

On Sun, 2013-03-10 at 01:11 +0000, Eric Wong wrote:
>
> static void ep_destroy_wakeup_source(struct epitem *epi)
> {
> - wakeup_source_unregister(epi->ws);
> - epi->ws = NULL;
> + struct wakeup_source *ws = epi->ws;
> +
> + rcu_assign_pointer(epi->ws, NULL);

There is no need to use a memory barrier before setting epi->ws to NULL

You should use RCU_POINTER_INIT()

> + synchronize_rcu(); /* wait for ep_pm_stay_awake to finish */

Wont this add a significant slowdown ?

> + wakeup_source_unregister(ws);
> }

Please add the following in your .config

CONFIG_SPARSE_RCU_POINTER=y

and try :

make C=2 fs/eventpoll.o

And fix all warnings ;)


2013-03-10 11:50:31

by Eric Wong

[permalink] [raw]
Subject: [PATCH] epoll: use RCU protect wakeup_source in epitem

Eric Dumazet <[email protected]> wrote:
> On Sun, 2013-03-10 at 01:11 +0000, Eric Wong wrote:
> >
> > static void ep_destroy_wakeup_source(struct epitem *epi)
> > {
> > - wakeup_source_unregister(epi->ws);
> > - epi->ws = NULL;
> > + struct wakeup_source *ws = epi->ws;
> > +
> > + rcu_assign_pointer(epi->ws, NULL);
>
> There is no need to use a memory barrier before setting epi->ws to NULL
>
> You should use RCU_POINTER_INIT()

Thanks for looking at this patch. Using RCU_POINTER_INIT in my
updated patch below.

> > + synchronize_rcu(); /* wait for ep_pm_stay_awake to finish */
>
> Wont this add a significant slowdown ?

Maybe, but this is a very rare code path (using EPOLL_CTL_MOD to remove
a wakeup source), and synchronize_rcu gets called later, too:

wakeup_source_unregister
wakeup_source_remove
synchronize_rcu

> > + wakeup_source_unregister(ws);
> > }
>
> Please add the following in your .config
>
> CONFIG_SPARSE_RCU_POINTER=y
>
> and try :
>
> make C=2 fs/eventpoll.o
>
> And fix all warnings ;)

Done.

---------------------------------8<------------------------------
>From 24e5ee6222cc91208119973248fc4f8a6d1a31da Mon Sep 17 00:00:00 2001
From: Eric Wong <[email protected]>
Date: Sun, 10 Mar 2013 01:06:50 +0000
Subject: [PATCH] epoll: use RCU protect wakeup_source in epitem

This prevents wakeup_source destruction when a user hits the item with
EPOLL_CTL_MOD while ep_poll_callback is running.

Tested with CONFIG_SPARSE_RCU_POINTER=y and "make fs/eventpoll.o C=2"

Cc: Alexander Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Eric Wong <[email protected]>
---
fs/eventpoll.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 71 insertions(+), 21 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1326409..15d0f27 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -158,7 +158,7 @@ struct epitem {
struct list_head fllink;

/* wakeup_source used when EPOLLWAKEUP is set */
- struct wakeup_source *ws;
+ struct wakeup_source __rcu *ws;

/* The structure that describe the interested events and the source fd */
struct epoll_event event;
@@ -536,6 +536,38 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
}
}

+/* call only when ep->mtx is held */
+static inline struct wakeup_source *ep_wakeup_source(struct epitem *epi)
+{
+ return rcu_dereference_check(epi->ws, lockdep_is_held(&epi->ep->mtx));
+}
+
+/* call only when ep->mtx is held */
+static inline void ep_pm_stay_awake(struct epitem *epi)
+{
+ struct wakeup_source *ws = ep_wakeup_source(epi);
+
+ if (ws)
+ __pm_stay_awake(ws);
+}
+
+static inline bool ep_has_wakeup_source(struct epitem *epi)
+{
+ return rcu_access_pointer(epi->ws) ? true : false;
+}
+
+/* call when ep->mtx cannot be held (ep_poll_callback) */
+static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
+{
+ struct wakeup_source *ws;
+
+ rcu_read_lock();
+ ws = rcu_dereference(epi->ws);
+ if (ws)
+ __pm_stay_awake(ws);
+ rcu_read_unlock();
+}
+
/**
* ep_scan_ready_list - Scans the ready list in a way that makes possible for
* the scan code, to call f_op->poll(). Also allows for
@@ -599,7 +631,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
*/
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);
}
}
/*
@@ -668,7 +700,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
list_del_init(&epi->rdllink);
spin_unlock_irqrestore(&ep->lock, flags);

- wakeup_source_unregister(epi->ws);
+ wakeup_source_unregister(ep_wakeup_source(epi));

/* At this point it is safe to free the eventpoll item */
kmem_cache_free(epi_cache, epi);
@@ -752,7 +784,7 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
* callback, but it's not actually ready, as far as
* caller requested events goes. We can remove it here.
*/
- __pm_relax(epi->ws);
+ __pm_relax(ep_wakeup_source(epi));
list_del_init(&epi->rdllink);
}
}
@@ -984,7 +1016,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
/* If this file is already in the ready list we exit soon */
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);
}

/*
@@ -1146,6 +1178,7 @@ static int reverse_path_check(void)
static int ep_create_wakeup_source(struct epitem *epi)
{
const char *name;
+ struct wakeup_source *ws;

if (!epi->ep->ws) {
epi->ep->ws = wakeup_source_register("eventpoll");
@@ -1154,17 +1187,29 @@ static int ep_create_wakeup_source(struct epitem *epi)
}

name = epi->ffd.file->f_path.dentry->d_name.name;
- epi->ws = wakeup_source_register(name);
- if (!epi->ws)
+ ws = wakeup_source_register(name);
+
+ if (!ws)
return -ENOMEM;
+ rcu_assign_pointer(epi->ws, ws);

return 0;
}

-static void ep_destroy_wakeup_source(struct epitem *epi)
+/* rare code path, only used when EPOLL_CTL_MOD removes a wakeup source */
+static noinline void ep_destroy_wakeup_source(struct epitem *epi)
{
- wakeup_source_unregister(epi->ws);
- epi->ws = NULL;
+ struct wakeup_source *ws = ep_wakeup_source(epi);
+
+ rcu_assign_pointer(epi->ws, NULL);
+
+ /*
+ * wait for ep_pm_stay_awake_rcu to finish, synchronize_rcu is
+ * used internally by wakeup_source_remove, too (called by
+ * wakeup_source_unregister), so we cannot use call_rcu
+ */
+ synchronize_rcu();
+ wakeup_source_unregister(ws);
}

/*
@@ -1199,7 +1244,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
if (error)
goto error_create_wakeup_source;
} else {
- epi->ws = NULL;
+ RCU_INIT_POINTER(epi->ws, NULL);
}

/* Initialize the poll table using the queue callback */
@@ -1247,7 +1292,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
/* If the file is already "ready" we drop it inside the ready list */
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);

/* Notify waiting tasks that events are available */
if (waitqueue_active(&ep->wq))
@@ -1288,7 +1333,7 @@ error_unregister:
list_del_init(&epi->rdllink);
spin_unlock_irqrestore(&ep->lock, flags);

- wakeup_source_unregister(epi->ws);
+ wakeup_source_unregister(ep_wakeup_source(epi));

error_create_wakeup_source:
kmem_cache_free(epi_cache, epi);
@@ -1317,9 +1362,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
pt._key = event->events;
epi->event.data = event->data; /* protected by mtx */
if (epi->event.events & EPOLLWAKEUP) {
- if (!epi->ws)
+ if (!ep_has_wakeup_source(epi))
ep_create_wakeup_source(epi);
- } else if (epi->ws) {
+ } else if (ep_has_wakeup_source(epi)) {
ep_destroy_wakeup_source(epi);
}

@@ -1357,7 +1402,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
spin_lock_irq(&ep->lock);
if (!ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);

/* Notify waiting tasks that events are available */
if (waitqueue_active(&ep->wq))
@@ -1383,6 +1428,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
unsigned int revents;
struct epitem *epi;
struct epoll_event __user *uevent;
+ struct wakeup_source *ws;
poll_table pt;

init_poll_funcptr(&pt, NULL);
@@ -1405,9 +1451,13 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
* instead, but then epi->ws would temporarily be out of sync
* with ep_is_linked().
*/
- if (epi->ws && epi->ws->active)
- __pm_stay_awake(ep->ws);
- __pm_relax(epi->ws);
+ ws = ep_wakeup_source(epi);
+ if (ws) {
+ if (ws->active)
+ __pm_stay_awake(ep->ws);
+ __pm_relax(ws);
+ }
+
list_del_init(&epi->rdllink);

pt._key = epi->event.events;
@@ -1424,7 +1474,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
if (__put_user(revents, &uevent->events) ||
__put_user(epi->event.data, &uevent->data)) {
list_add(&epi->rdllink, head);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);
return eventcnt ? eventcnt : -EFAULT;
}
eventcnt++;
@@ -1444,7 +1494,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
* poll callback will queue them in ep->ovflist.
*/
list_add_tail(&epi->rdllink, &ep->rdllist);
- __pm_stay_awake(epi->ws);
+ ep_pm_stay_awake(epi);
}
}
}
--
1.8.2.rc3.2.g89ce8d6

2013-03-11 23:37:45

by Arve Hjønnevåg

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

On Fri, Mar 8, 2013 at 11:10 PM, Eric Wong <[email protected]> wrote:
> Arve Hj?nnev?g <[email protected]> wrote:
>> On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
>> > What happens if ep_modify calls ep_destroy_wakeup_source
>> > while __pm_stay_awake is running on the same epi->ws?
>>
>> Yes, that looks like a problem. I think calling
>> ep_destroy_wakeup_source with ep->lock held should fix that. It is not
>> clear how useful changing EPOLLWAKEUP in ep_modify is, so
>> alternatively we could remove that feature and instead only allow it
>> to be set in ep_insert.
>
> ep->lock would work, but ep->lock is already a source of heavy
> contention in my multithreaded+epoll webservers.
>

This should not have any significant impact on that since you would be
adding a lock to a code path that is, as far as I know, unused.

> Perhaps RCU can be used? I've no experience with RCU, but I've been
> meaning to get acquainted with RCU.
>

That adds code to the common path however. The wakeup_source is not
touch without holding one of the locks so holding both locks before
deleting it seems like a simpler solution.

> Another possible solution is to only use ep->ws and add an atomic
> counter to ep; so __pm_relax(ep->ws) is only called when the atomic
> counter reaches zero.

--
Arve Hj?nnev?g

2013-03-12 00:17:24

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Arve Hjønnevåg <[email protected]> wrote:
> On Fri, Mar 8, 2013 at 11:10 PM, Eric Wong <[email protected]> wrote:
> > Arve Hjønnevåg <[email protected]> wrote:
> >> On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
> >> > What happens if ep_modify calls ep_destroy_wakeup_source
> >> > while __pm_stay_awake is running on the same epi->ws?
> >>
> >> Yes, that looks like a problem. I think calling
> >> ep_destroy_wakeup_source with ep->lock held should fix that. It is not
> >> clear how useful changing EPOLLWAKEUP in ep_modify is, so
> >> alternatively we could remove that feature and instead only allow it
> >> to be set in ep_insert.
> >
> > ep->lock would work, but ep->lock is already a source of heavy
> > contention in my multithreaded+epoll webservers.
>
> This should not have any significant impact on that since you would be
> adding a lock to a code path that is, as far as I know, unused.
>
> > Perhaps RCU can be used? I've no experience with RCU, but I've been
> > meaning to get acquainted with RCU.
>
> That adds code to the common path however. The wakeup_source is not
> touch without holding one of the locks so holding both locks before
> deleting it seems like a simpler solution.

True. However, I've been looking into eliminating ep->lock in more
places (maybe entirely)[1].

I don't think the current overhead of RCU in epoll is significant,
either.


[1] I'll be testing Mathieu's wait-free concurrent queue soon:
http://mid.gmane.org/20130311213602.GB9829@Krystal

2013-03-12 00:29:03

by Arve Hjønnevåg

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

On Mon, Mar 11, 2013 at 5:17 PM, Eric Wong <[email protected]> wrote:
> Arve Hj?nnev?g <[email protected]> wrote:
>> On Fri, Mar 8, 2013 at 11:10 PM, Eric Wong <[email protected]> wrote:
>> > Arve Hj?nnev?g <[email protected]> wrote:
>> >> On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
>> >> > What happens if ep_modify calls ep_destroy_wakeup_source
>> >> > while __pm_stay_awake is running on the same epi->ws?
>> >>
>> >> Yes, that looks like a problem. I think calling
>> >> ep_destroy_wakeup_source with ep->lock held should fix that. It is not
>> >> clear how useful changing EPOLLWAKEUP in ep_modify is, so
>> >> alternatively we could remove that feature and instead only allow it
>> >> to be set in ep_insert.
>> >
>> > ep->lock would work, but ep->lock is already a source of heavy
>> > contention in my multithreaded+epoll webservers.
>>
>> This should not have any significant impact on that since you would be
>> adding a lock to a code path that is, as far as I know, unused.
>>
>> > Perhaps RCU can be used? I've no experience with RCU, but I've been
>> > meaning to get acquainted with RCU.
>>
>> That adds code to the common path however. The wakeup_source is not
>> touch without holding one of the locks so holding both locks before
>> deleting it seems like a simpler solution.
>
> True. However, I've been looking into eliminating ep->lock in more
> places (maybe entirely)[1].
>
> I don't think the current overhead of RCU in epoll is significant,
> either.
>
>
> [1] I'll be testing Mathieu's wait-free concurrent queue soon:
> http://mid.gmane.org/20130311213602.GB9829@Krystal

OK, but is there any way you could use the same locking scheme for the
wakeup_source and the queue?

--
Arve Hj?nnev?g

2013-03-12 00:44:46

by Eric Wong

[permalink] [raw]
Subject: Re: epoll: possible bug from wakeup_source activation

Arve Hjønnevåg <[email protected]> wrote:
> On Mon, Mar 11, 2013 at 5:17 PM, Eric Wong <[email protected]> wrote:
> > Arve Hjønnevåg <[email protected]> wrote:
> >> On Fri, Mar 8, 2013 at 11:10 PM, Eric Wong <[email protected]> wrote:
> >> > Arve Hjønnevåg <[email protected]> wrote:
> >> >> On Fri, Mar 8, 2013 at 12:49 PM, Eric Wong <[email protected]> wrote:
> >> >> > What happens if ep_modify calls ep_destroy_wakeup_source
> >> >> > while __pm_stay_awake is running on the same epi->ws?
> >> >>
> >> >> Yes, that looks like a problem. I think calling
> >> >> ep_destroy_wakeup_source with ep->lock held should fix that. It is not
> >> >> clear how useful changing EPOLLWAKEUP in ep_modify is, so
> >> >> alternatively we could remove that feature and instead only allow it
> >> >> to be set in ep_insert.
> >> >
> >> > ep->lock would work, but ep->lock is already a source of heavy
> >> > contention in my multithreaded+epoll webservers.
> >>
> >> This should not have any significant impact on that since you would be
> >> adding a lock to a code path that is, as far as I know, unused.
> >>
> >> > Perhaps RCU can be used? I've no experience with RCU, but I've been
> >> > meaning to get acquainted with RCU.
> >>
> >> That adds code to the common path however. The wakeup_source is not
> >> touch without holding one of the locks so holding both locks before
> >> deleting it seems like a simpler solution.
> >
> > True. However, I've been looking into eliminating ep->lock in more
> > places (maybe entirely)[1].
> >
> > I don't think the current overhead of RCU in epoll is significant,
> > either.
> >
> >
> > [1] I'll be testing Mathieu's wait-free concurrent queue soon:
> > http://mid.gmane.org/20130311213602.GB9829@Krystal
>
> OK, but is there any way you could use the same locking scheme for the
> wakeup_source and the queue?

Probably, yes. I think I can just use ep->mtx and ignore the mutex
included with wfcq_head, need to protect the rbtree while dequeueing.

2013-03-14 03:09:03

by Eric Wong

[permalink] [raw]
Subject: [PATCH mm] epoll: lock ep->mtx in ep_free to silence lockdep

Technically we do not need to hold ep->mtx during ep_free since we are
certain there are no other users of ep at that point. However, lockdep
complains with a "suspicious rcu_dereference_check() usage!" message;
so lock the mutex before ep_remove to silence the warning.

Signed-off-by: Eric Wong <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: NeilBrown <[email protected]>,
Cc: Rafael J. Wysocki <[email protected]>
---
I considered using lockdep_off()/lockdep_on(), but I figure that may
hide other bugs...

fs/eventpoll.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 633e69f..d71b754 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -745,11 +745,15 @@ static void ep_free(struct eventpoll *ep)
* point we are sure no poll callbacks will be lingering around, and also by
* holding "epmutex" we can be sure that no file cleanup code will hit
* us during this operation. So we can avoid the lock on "ep->lock".
+ * We do not need to lock ep->mtx, either, we only do it to prevent
+ * a lockdep warning.
*/
+ mutex_lock(&ep->mtx);
while ((rbp = rb_first(&ep->rbr)) != NULL) {
epi = rb_entry(rbp, struct epitem, rbn);
ep_remove(ep, epi);
}
+ mutex_unlock(&ep->mtx);

mutex_unlock(&epmutex);
mutex_destroy(&ep->mtx);
--
Eric Wong