2012-02-22 17:42:02

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL before
remove_wait_queue(). We add the new ep_remove_wait_queue() helper
for this.

However this needs more locking. ep_remove_wait_queue() should take
ep->lock first to avoid the race and pin pwq->whead, then it needs
pwq->whead->lock for __remove_wait_queue().

This can obviously AB-BA deadlock with wake_up()->ep_poll_callback(),
so ep_remove_wait_queue() does the nasty lock + trylock-or-retry dance.

Of course, this also assumes that it is safe to take ep->lock in
ep_unregister_pollwait() paths, afaics this is true.

Reported-by: Maxime Bizon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/eventpoll.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 442bedb..ac8bd15 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
return !list_empty(p);
}

+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+ return container_of(p, struct eppoll_entry, wait);
+}
+
/* Get the "struct epitem" from a wait queue pointer */
static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
{
@@ -467,6 +472,33 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
put_cpu();
}

+static void ep_remove_wait_queue(struct eventpoll *ep, struct eppoll_entry *pwq)
+{
+ for (;;) {
+ unsigned long flags; /* probably unneeded */
+
+ spin_lock_irqsave(&ep->lock, flags);
+ /* can be cleared by ep_poll_callback(POLLFREE) */
+ if (!pwq->whead)
+ goto unlock;
+
+ /* _trylock to avoid the deadlock, retry if it fails */
+ if (!spin_trylock(&pwq->whead->lock))
+ goto unlock;
+
+ __remove_wait_queue(pwq->whead, &pwq->wait);
+ spin_unlock(&pwq->whead->lock);
+ pwq->whead = NULL;
+ unlock:
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ if (!pwq->whead)
+ break;
+
+ cpu_relax();
+ }
+}
+
/*
* This function unregisters poll callbacks from the associated file
* descriptor. Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +513,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
pwq = list_first_entry(lsthead, struct eppoll_entry, llink);

list_del(&pwq->llink);
- remove_wait_queue(pwq->whead, &pwq->wait);
+ ep_remove_wait_queue(ep, pwq);
kmem_cache_free(pwq_cache, pwq);
}
}
@@ -844,9 +876,12 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k

spin_lock_irqsave(&ep->lock, flags);

- /* the caller holds eppoll_entry->whead->lock */
- if ((unsigned long)key & POLLFREE)
- list_del_init(&wait->task_list);
+ if ((unsigned long)key & POLLFREE) {
+ struct eppoll_entry *pwq = ep_pwq_from_wait(wait);
+ /* the caller holds pwq->whead->lock */
+ __remove_wait_queue(pwq->whead, wait);
+ pwq->whead = NULL;
+ }

/*
* If the event mask does not contain any poll(2) event, we consider the
--
1.5.5.1


2012-02-23 15:51:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead

On 02/22, Oleg Nesterov wrote:
>
> However this needs more locking. ep_remove_wait_queue() should take
> ep->lock first to avoid the race and pin pwq->whead, then it needs
> pwq->whead->lock for __remove_wait_queue().
>
> This can obviously AB-BA deadlock with wake_up()->ep_poll_callback(),
> so ep_remove_wait_queue() does the nasty lock + trylock-or-retry dance.

Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
and assume that ->whead is always rcu-protected if it can go away.

In this case we don't need 3/4 (although it makes sense to add the
fat comment), and 4/4 can be simplified, see below.

ep_pwq_from_wait() is not really needed, it has a single caller.
Just I tried to follow the coding style.

I'd prefer the explicit locking, but I don't really mind.

Oleg.

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
return !list_empty(p);
}

+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+ return container_of(p, struct eppoll_entry, wait);
+}
+
/* Get the "struct epitem" from a wait queue pointer */
static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
{
@@ -467,6 +472,17 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
put_cpu();
}

+static void ep_remove_wait_queue(struct eppoll_entry *pwq)
+{
+ wait_queue_head_t *whead;
+
+ rcu_read_lock();
+ whead = rcu_dereference(pwq->whead);
+ if (whead)
+ remove_wait_queue(whead, &pwq->wait);
+ rcu_read_unlock();
+}
+
/*
* This function unregisters poll callbacks from the associated file
* descriptor. Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +497,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
pwq = list_first_entry(lsthead, struct eppoll_entry, llink);

list_del(&pwq->llink);
- remove_wait_queue(pwq->whead, &pwq->wait);
+ ep_remove_wait_queue(pwq);
kmem_cache_free(pwq_cache, pwq);
}
}
@@ -845,8 +861,11 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
spin_lock_irqsave(&ep->lock, flags);

/* the caller holds eppoll_entry->whead->lock */
- if ((unsigned long)key & POLLFREE)
+ if ((unsigned long)key & POLLFREE) {
+ /* can't use __remove_wait_queue(), we need list_del_init() */
list_del_init(&wait->task_list);
+ ep_pwq_from_wait(wait)->whead = NULL;
+ }

/*
* If the event mask does not contain any poll(2) event, we consider the

2012-02-23 22:17:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] epoll: ep_unregister_pollwait() can use the freed pwq->whead

On Thu, Feb 23, 2012 at 7:44 AM, Oleg Nesterov <[email protected]> wrote:
>
> Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
> and assume that ->whead is always rcu-protected if it can go away.
>
> In this case we don't need 3/4 (although it makes sense to add the
> fat comment), and 4/4 can be simplified, see below.

Ok.

Can you also get rid of 1/4, because quite frankly, adding that
BUG_ON() is just annoying. Either the thing gets fixed or not, but at
no point is it ok to say "ok, I'm going to fix it, but before I do
I'll just make it much worse".

Hmm?
Linus

2012-02-24 19:13:53

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v2 0/2] signalfd/epoll fixes

On 02/23, Linus Torvalds wrote:
>
> On Thu, Feb 23, 2012 at 7:44 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > Or we can rely on the fact that sighand_cachep is SLAB_DESTROY_BY_RCU,
> > and assume that ->whead is always rcu-protected if it can go away.
> >
> > In this case we don't need 3/4 (although it makes sense to add the
> > fat comment), and 4/4 can be simplified, see below.
>
> Ok.
>
> Can you also get rid of 1/4, because quite frankly, adding that
> BUG_ON() is just annoying. Either the thing gets fixed or not, but at
> no point is it ok to say "ok, I'm going to fix it, but before I do
> I'll just make it much worse".

OK. Please see v2.

Other changes:

- some comments

- now that we rely on rcu, ep_poll_callback() can do
remove_wait_queue() outsife of ep->lock



Davide, I see the new email, but it is too late for me to reply
today. Anyway, I think it makes sense to make the "simple" fix
before anything else. IOW, I'd suggest these changes for now in
any case (unless, of course, you see some problems).

Oleg.

2012-02-24 19:14:10

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()

This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

- we do not have wake_up_all_poll() but wake_up_poll()
is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

- signalfd_cleanup() uses POLLHUP along with POLLFREE,
we need a couple of simple changes in eventpoll.c to
make sure it can't be "lost".

Reported-by: Maxime Bizon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/eventpoll.c | 4 ++++
fs/signalfd.c | 11 +++++++++++
include/asm-generic/poll.h | 2 ++
include/linux/signalfd.h | 5 ++++-
kernel/fork.c | 5 ++++-
5 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aabdfc3..34bbfc6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -842,6 +842,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;

+ /* the caller holds eppoll_entry->whead->lock */
+ if ((unsigned long)key & POLLFREE)
+ list_del_init(&wait->task_list);
+
spin_lock_irqsave(&ep->lock, flags);

/*
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 492465b..79c1eea 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -30,6 +30,17 @@
#include <linux/signalfd.h>
#include <linux/syscalls.h>

+void signalfd_cleanup(struct sighand_struct *sighand)
+{
+ wait_queue_head_t *wqh = &sighand->signalfd_wqh;
+
+ if (likely(!waitqueue_active(wqh)))
+ return;
+
+ /* wait_queue_t->func(POLLFREE) should do remove_wait_queue() */
+ wake_up_poll(wqh, POLLHUP | POLLFREE);
+}
+
struct signalfd_ctx {
sigset_t sigmask;
};
diff --git a/include/asm-generic/poll.h b/include/asm-generic/poll.h
index 44bce83..9ce7f44 100644
--- a/include/asm-generic/poll.h
+++ b/include/asm-generic/poll.h
@@ -28,6 +28,8 @@
#define POLLRDHUP 0x2000
#endif

+#define POLLFREE 0x4000 /* currently only for epoll */
+
struct pollfd {
int fd;
short events;
diff --git a/include/linux/signalfd.h b/include/linux/signalfd.h
index 3ff4961..247399b 100644
--- a/include/linux/signalfd.h
+++ b/include/linux/signalfd.h
@@ -61,13 +61,16 @@ static inline void signalfd_notify(struct task_struct *tsk, int sig)
wake_up(&tsk->sighand->signalfd_wqh);
}

+extern void signalfd_cleanup(struct sighand_struct *sighand);
+
#else /* CONFIG_SIGNALFD */

static inline void signalfd_notify(struct task_struct *tsk, int sig) { }

+static inline void signalfd_cleanup(struct sighand_struct *sighand) { }
+
#endif /* CONFIG_SIGNALFD */

#endif /* __KERNEL__ */

#endif /* _LINUX_SIGNALFD_H */
-
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..e2cd3e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
#include <linux/user-return-notifier.h>
#include <linux/oom.h>
#include <linux/khugepaged.h>
+#include <linux/signalfd.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -935,8 +936,10 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)

void __cleanup_sighand(struct sighand_struct *sighand)
{
- if (atomic_dec_and_test(&sighand->count))
+ if (atomic_dec_and_test(&sighand->count)) {
+ signalfd_cleanup(sighand);
kmem_cache_free(sighand_cachep, sighand);
+ }
}


--
1.5.5.1

2012-02-24 19:14:26

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v2 2/2] epoll: ep_unregister_pollwait() can use the freed pwq->whead

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL under
rcu_read_lock() before remove_wait_queue(). We add the new helper,
ep_remove_wait_queue(), for this.

This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
ep_unregister_pollwait()->remove_wait_queue() can play with already
freed and potentially reused ->sighand, but this is fine. This memory
must have the valid ->signalfd_wqh until rcu_read_unlock().

Reported-by: Maxime Bizon <[email protected]>
Cc: <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/eventpoll.c | 30 +++++++++++++++++++++++++++---
fs/signalfd.c | 6 +++++-
2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 34bbfc6..ea54cde 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -320,6 +320,11 @@ static inline int ep_is_linked(struct list_head *p)
return !list_empty(p);
}

+static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p)
+{
+ return container_of(p, struct eppoll_entry, wait);
+}
+
/* Get the "struct epitem" from a wait queue pointer */
static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
{
@@ -467,6 +472,18 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
put_cpu();
}

+static void ep_remove_wait_queue(struct eppoll_entry *pwq)
+{
+ wait_queue_head_t *whead;
+
+ rcu_read_lock();
+ /* If it is cleared by POLLFREE, it should be rcu-safe */
+ whead = rcu_dereference(pwq->whead);
+ if (whead)
+ remove_wait_queue(whead, &pwq->wait);
+ rcu_read_unlock();
+}
+
/*
* This function unregisters poll callbacks from the associated file
* descriptor. Must be called with "mtx" held (or "epmutex" if called from
@@ -481,7 +498,7 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
pwq = list_first_entry(lsthead, struct eppoll_entry, llink);

list_del(&pwq->llink);
- remove_wait_queue(pwq->whead, &pwq->wait);
+ ep_remove_wait_queue(pwq);
kmem_cache_free(pwq_cache, pwq);
}
}
@@ -842,9 +859,16 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;

- /* the caller holds eppoll_entry->whead->lock */
- if ((unsigned long)key & POLLFREE)
+ if ((unsigned long)key & POLLFREE) {
+ ep_pwq_from_wait(wait)->whead = NULL;
+ /*
+ * whead = NULL above can race with ep_remove_wait_queue()
+ * which can do another remove_wait_queue() after us, so we
+ * can't use __remove_wait_queue(). whead->lock is held by
+ * the caller.
+ */
list_del_init(&wait->task_list);
+ }

spin_lock_irqsave(&ep->lock, flags);

diff --git a/fs/signalfd.c b/fs/signalfd.c
index 79c1eea..7ae2a57 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -33,7 +33,11 @@
void signalfd_cleanup(struct sighand_struct *sighand)
{
wait_queue_head_t *wqh = &sighand->signalfd_wqh;
-
+ /*
+ * The lockless check can race with remove_wait_queue() in progress,
+ * but in this case its caller should run under rcu_read_lock() and
+ * sighand_cachep is SLAB_DESTROY_BY_RCU, we can safely return.
+ */
if (likely(!waitqueue_active(wqh)))
return;

--
1.5.5.1

2012-02-24 20:23:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] signalfd/epoll fixes

On Fri, Feb 24, 2012 at 11:06 AM, Oleg Nesterov <[email protected]> wrote:
>
> OK. Please see v2.

Ok, I applied these.

And then, dammit, I unapplied them again.

And then I applied them again.

I'm *really* conflicted, because I have this really strong feeling
that it's just papering over a symptom, and we damn well shouldn't be
doing that. I really think that what we really should do is allow
"poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
will have a callback when it gets remove).

Then we could make the poll() functions actually do allocations and
crap - or at least add refcounts - and the "poll_remove()" ones would
undo them.

And then we could rip out all this, and make signalfd just do

static void poll_remove(struct file *file, struct wait_queue *wq)
{
struct sighand *sighand = container_of(wq, struct sighand, signalfd_wqh);
__cleanup_sighand(sighand);
}

and add that "poll_remove" to its file handler operations. And in
"poll()", it would just do

atomic_inc(&sighand->count);

as it does the poll_wait() thing. Sure, we need to have some way to
test "did we really add it", and only increment the count if so (so
poll_wait() would need to return a value), but this seems to be the
*real* fix. Because the real problem is that we cannot currently
refcount the poll users.

Ok, so it's just a strong feeling, and I *did* end up applying these
two patches after all, but I really wonder how hard it would be to
just add a single new callback function and be able to refcount that
sighand structure itself.

Linus

2012-02-24 23:15:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] signalfd/epoll fixes

On Fri, Feb 24, 2012 at 12:23 PM, Linus Torvalds
<[email protected]> wrote:
>
> I'm *really* conflicted, because I have this really strong feeling
> that it's just papering over a symptom, and we damn well shouldn't be
> doing that. I really think that what we really should do is allow
> "poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
> will have a callback when it gets remove).
>
> Then we could make the poll() functions actually do allocations and
> crap - or at least add refcounts - and the "poll_remove()" ones would
> undo them.

Ok, I have *NOT* tested this, and by now it's certainly not 3.3
material any more, but here's a patch to kind of lay the groundwork
and show what I mean.

UNTESTED! UNTESTED! CAVEAT! Probably horribly broken.

Linus


Attachments:
patch.diff (8.89 kB)

2012-02-25 16:15:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] signalfd/epoll fixes

On 02/24, Linus Torvalds wrote:
>
> On Fri, Feb 24, 2012 at 12:23 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > I'm *really* conflicted, because I have this really strong feeling
> > that it's just papering over a symptom, and we damn well shouldn't be
> > doing that.

Heh. This is even documented in the changelog.

> I really think that what we really should do is allow
> > "poll()" to have a "poll_remove" callback (so each "add_poll_wait()"
> > will have a callback when it gets remove).

Yes. I also thought about this. In fact I think this probably makes
sense even ignoring epoll problems. Although I was thinking about
file_operations::poll_v2(file, poll_table, poll_table_entry, mode)
which could could replace the old ->poll() eventually. But perhaps
the explicit poll_rm() is better.

However, speaking about epoll/sigfd, imho this hides the problem too.

> +static void signalfd_poll_rm(struct file *file, wait_queue_head_t *p)
> +{
> + struct sighand_struct *sighand;
> +
> + sighand = container_of(p, struct sighand_struct, signalfd_wqh);
> + __cleanup_sighand(sighand);
> +}
> +
> static unsigned int signalfd_poll(struct file *file, poll_table *wait)
> {
> struct signalfd_ctx *ctx = file->private_data;
> unsigned int events = 0;
>
> - poll_wait(file, &current->sighand->signalfd_wqh, wait);
> + if (poll_wait(file, &current->sighand->signalfd_wqh, wait))
> + atomic_inc(&current->sighand->count);

Yes, this fixes use-after-free. But suppose the the application does:

sigfd = signalfd(...);
epoll_ctl(epollfd, EPOLL_CTL_ADD, sigfd);
execve();

de_thread() unshares ->sighand because of atomic_inc() above and
epollfd no longer works after exec, and the application can't know
this. (yes, currently we have the same problem with CLONE_SIGHAND'ed
apps, but still).

We can turn ->signalfd_wqh into the pointer to the refcountable
memory, or add another counter, but this is not nice.

And in any case. If the task exits, to me it looks simply pointless
to keep its ->sighand until __fput(). This only makes poll_rm() or
whatever safe, it can't report a signal. OK, OK, I am not arguing,
POLLFREE is equally ugly or even worse.

I _think_ that the right fix should move wait_queue_head_t from
sighand_struct to signalfd_ctx (file->private_data) somehow...

Oleg.

2012-02-25 19:00:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] signalfd/epoll fixes

On Sat, Feb 25, 2012 at 8:08 AM, Oleg Nesterov <[email protected]> wrote:
>
> Yes, this fixes use-after-free. But suppose the the application does:
>
> ? ? ? ?sigfd = signalfd(...);
> ? ? ? ?epoll_ctl(epollfd, EPOLL_CTL_ADD, sigfd);
> ? ? ? ?execve();

Well, that can't work anyway. We're adding to the wrong sighand.

Sure, we can force a wakeup for it at execve(), but that's not the old
behavior either, so why would we care about the above case? It's not a
sane thing to do, and it has never worked before either, so why
bother? My patch should make it "work" as well as it ever did.

Quite frankly, if you wanted to make signalfd work sanely with
eventpoll, you'd need to change the semantics entirely, and just say
"signalfd works in the context it was creared in, no other". Those
aren't the semantics we have, though, and there's no point in trying
to make up some *new* semantics for "if you execve() we'll still
follow it"

> And in any case. If the task exits, to me it looks simply pointless
> to keep its ->sighand until __fput(). This only makes poll_rm() or
> whatever safe, it can't report a signal. OK, OK, I am not arguing,
> POLLFREE is equally ugly or even worse.

That's my point. POLLFREE really does look worse, and doesn't really
give any saner behaviors.

> I _think_ that the right fix should move wait_queue_head_t from
> sighand_struct to signalfd_ctx (file->private_data) somehow...

I looked at it - and it requires the same "poll_rm()" callback, and
then instead you have to make allocations in poll() (for the list) and
de-allocate in "poll_rm()".

I do agree that in many ways it would be the "right" thing to do, but
it does require the same infrastructure, and the advantage is dubious
for this case. The main advantage is that we can extend the signalfd +
epoll interaction in new directions (like the "across execve" or other
cases), but I'd argue that we don't *want* to do that.

So your POLLFREE patches are merged, and I think they work. The reason
I like the poll_rm() thing is that I think it's conceptually the right
solution, and while our "poll_wait()" interface has really worked very
well, the fact that you cannot do anything stateful in it (because
there is never any way to undo anything except for the "remove from
wait queue" action) is very rigid and inflexible.

Of course, aside from signalfd, that inflexibility has never really
mattered. So..

Linus

2012-02-29 19:57:16

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()

On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
> This patch is intentionally incomplete to simplify the review.
> It ignores ep_unregister_pollwait() which plays with the same wqh.
> See the next change.
>
> epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
> f_op->poll() needs. In particular it assumes that the wait queue
> can't go away until eventpoll_release(). This is not true in case
> of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
> which is not connected to the file.
>
> This patch adds the special event, POLLFREE, currently only for
> epoll. It expects that init_poll_funcptr()'ed hook should do the
> necessary cleanup. Perhaps it should be defined as EPOLLFREE in
> eventpoll.

> [lots of kernel-internal technical stuff]

I have a bunch of userspace code that uses signalfd via epoll. Does
this affect the ABI? Will epoll_wait ever set POLLFREE? Does
EPOLL_CTL_MOD accept POLLFREE?

IOW, from a userspace point of view, wtf does this do? I don't want to
end up with another POLLRDHUP-like* special case in my code.

--Andy

* IMO the right fix would have been to make EPOLLET fire POLLIN again
when the read point advances to EOF but before EOF is actually seen when
read() returns zero. Then POLLRDHUP would be unnecessary and user code
could do its thing in blissful ignorance. I hope POLLFREE isn't like that.

2012-02-29 20:13:44

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()

On 02/29, Andy Lutomirski wrote:
>
> On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
> > This patch adds the special event, POLLFREE, currently only for
> > epoll. It expects that init_poll_funcptr()'ed hook should do the
> > necessary cleanup. Perhaps it should be defined as EPOLLFREE in
> > eventpoll.
>
> I have a bunch of userspace code that uses signalfd via epoll. Does
> this affect the ABI?

I hope not ;)

> Will epoll_wait ever set POLLFREE?

No.

> Does
> EPOLL_CTL_MOD accept POLLFREE?

Yes, it doesn't check the bits. EPOLL_CTL_MOD accepts any bit.

> IOW, from a userspace point of view, wtf does this do?

I hope this is transparent to the user-space. If the application
does EPOLL_CTL_MOD(POLLFREE) by mistake then ep_poll_callback(POLLFREE)
can trigger the unnecessary wakeup and move the file to ep->rdllist
but this is harmless. The subsequent ->poll() can't return POLLFREE.

Oleg.

2012-02-29 20:16:54

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()

On Wed, Feb 29, 2012 at 12:06 PM, Oleg Nesterov <[email protected]> wrote:
> On 02/29, Andy Lutomirski wrote:
>>
>> On 02/24/2012 11:07 AM, Oleg Nesterov wrote:
>> > This patch adds the special event, POLLFREE, currently only for
>> > epoll. It expects that init_poll_funcptr()'ed hook should do the
>> > necessary cleanup. Perhaps it should be defined as EPOLLFREE in
>> > eventpoll.
>>
>> I have a bunch of userspace code that uses signalfd via epoll. ?Does
>> this affect the ABI?
>
> I hope not ;)
>

Excellent!

To avoid further confusion, would it make sense to update the comment
in poll.h to indicate that POLLFREE is only for epoll and is internal
to the kernel?

--Andy