2013-10-14 15:46:40

by Dave Jones

[permalink] [raw]
Subject: epoll oops.

Machine is wedged and I can't get to it until tomorrow, but this is what was on serial console.
kernel running was from some time last Friday, I can get exact info tomorrow, though
I don't think there's anything epoll related recently that could explain this.

Dave

Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: bridge 8021q garp stp sctp dlci mpoa snd_seq_dummy tun fuse hidp bnep ipt_ULOG rfcomm scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc bluetooth can_bcm rfkill can_raw can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds nfnetlink af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 xfs libcrc32c snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug shpchp snd_page_alloc e1000e snd_timer snd ptp pps_core soundcore
CPU: 3 PID: 449 Comm: trinity-main Not tainted 3.12.0-rc4+ #98
task: ffff88023e239560 ti: ffff880083082000 task.ti: ffff880083082000
RIP: 0010:[<ffffffff810c9f98>] [<ffffffff810c9f98>] __lock_acquire+0x58/0x1be0
RSP: 0018:ffff880083083c18 EFLAGS: 00010046
RAX: 0000000000000046 RBX: 0000000000000046 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801654cec98
RBP: ffff880083083cc8 R08: 0000000000000002 R09: 0000000000000001
R10: ffff88023e239560 R11: 0000000000000000 R12: 0000000000000002
R13: ffff8801654cec98 R14: 0000000000000002 R15: 0000000000000000
FS: 00007fe8df704740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8801654cec98 CR3: 0000000001c0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
0000000000000006 00004f44c1dd11e9 ffff880083083c38 0000000000000017
0000000000000001 ffffffff81c410e0 0000000000000002 ffff88023e239c98
ffff880083083d20 ffffffff810ca22d ffff880083083c88 ffffffff810c9b85
Call Trace:
[<ffffffff810ca22d>] ? __lock_acquire+0x2ed/0x1be0
[<ffffffff810c9b85>] ? trace_hardirqs_on_caller+0x115/0x1e0
[<ffffffff810c9c5d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810cc2d3>] lock_acquire+0x93/0x200
[<ffffffff810848d9>] ? remove_wait_queue+0x19/0x40
[<ffffffff81732c3b>] _raw_spin_lock_irqsave+0x4b/0x90
[<ffffffff810848d9>] ? remove_wait_queue+0x19/0x40
[<ffffffff810848d9>] remove_wait_queue+0x19/0x40
[<ffffffff812120eb>] ep_unregister_pollwait.isra.14+0x5b/0x1e0
[<ffffffff8121213a>] ? ep_unregister_pollwait.isra.14+0xaa/0x1e0
[<ffffffff81212786>] ep_remove+0x26/0x140
[<ffffffff81213391>] eventpoll_release_file+0x71/0xa0
[<ffffffff811c4faa>] __fput+0x2aa/0x2d0
[<ffffffff811c501e>] ____fput+0xe/0x10
[<ffffffff8107d67c>] task_work_run+0xac/0xe0
[<ffffffff81056bd7>] do_exit+0x2c7/0xcc0
[<ffffffff810c9b85>] ? trace_hardirqs_on_caller+0x115/0x1e0
[<ffffffff810c9c5d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810589cc>] do_group_exit+0x4c/0xc0
[<ffffffff81058a54>] SyS_exit_group+0x14/0x20
[<ffffffff8173bf64>] tracesys+0xdd/0xe2
Code: 85 c0 8b 05 4b d6 bc 00 45 0f 45 e0 85 c0 0f 84 07 01 00 00 8b 05 31 af 00 01 49 89 fd 41 89 f7 41 89 d3 85 c0 0f 84 08 01 00 00 <49> 8b 45 00 ba 01 00 00 00 48 3d 60 6a 13 82 44 0f 44 e2 41 83
RIP [<ffffffff810c9f98>] __lock_acquire+0x58/0x1be0
RSP <ffff880083083c18>
CR2: ffff8801654cec98
---[ end trace 044e98c2d3aab216 ]---


2013-10-14 17:31:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: epoll oops.

On Mon, Oct 14, 2013 at 8:46 AM, Dave Jones <[email protected]> wrote:
> Machine is wedged and I can't get to it until tomorrow, but this is what was on serial console.
> kernel running was from some time last Friday, I can get exact info tomorrow, though
> I don't think there's anything epoll related recently that could explain this.

It looks like it is the access to "lock->key" that takes a page fault.
The pointer looks good (%r13=ffff8801654cec98), so I'm pretty sure
this is due to DEBUG_PAGEALLOC and a free'd page.

So it looks like ep_unregister_pollwait() calls remove_wait_queue() on
a wait-queue head that has already been free'd.

I have this dim memory of us having fought this before. But maybe I'm
just remembering some of the old signalfd-vs-epoll races.

Oleg, does this trigger any memory for you? Commit 971316f0503a
("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
makes me go "Hmm, this is *exactly* that that commit is talking
about.."

Linus

---
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 3 PID: 449 Comm: trinity-main Not tainted 3.12.0-rc4+ #98
> task: ffff88023e239560 ti: ffff880083082000 task.ti: ffff880083082000
> RIP: 0010:[<ffffffff810c9f98>] [<ffffffff810c9f98>] __lock_acquire+0x58/0x1be0
> Call Trace:
> [<ffffffff810cc2d3>] lock_acquire+0x93/0x200
> [<ffffffff81732c3b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810848d9>] remove_wait_queue+0x19/0x40
> [<ffffffff812120eb>] ep_unregister_pollwait.isra.14+0x5b/0x1e0
> [<ffffffff81212786>] ep_remove+0x26/0x140
> [<ffffffff81213391>] eventpoll_release_file+0x71/0xa0
> [<ffffffff811c4faa>] __fput+0x2aa/0x2d0
> [<ffffffff811c501e>] ____fput+0xe/0x10
> [<ffffffff8107d67c>] task_work_run+0xac/0xe0
> [<ffffffff81056bd7>] do_exit+0x2c7/0xcc0
> [<ffffffff810589cc>] do_group_exit+0x4c/0xc0
> [<ffffffff81058a54>] SyS_exit_group+0x14/0x20
> [<ffffffff8173bf64>] tracesys+0xdd/0xe2
> Code: 85 c0 8b 05 4b d6 bc 00 45 0f 45 e0 85 c0 0f 84 07 01 00 00 8b 05 31 af 00 01 49 89 fd 41 89 f7 41 89 d3 85 c0 0f 84 08 01 00 00 <49> 8b 45 00 ba 01 00 00 00 48 3d 60 6a 13 82 44 0f 44 e2 41 83
> RIP [<ffffffff810c9f98>] __lock_acquire+0x58/0x1be0
> RSP <ffff880083083c18>
> CR2: ffff8801654cec98
> ---[ end trace 044e98c2d3aab216 ]---
>

2013-10-14 19:23:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: epoll oops.

On 10/14, Linus Torvalds wrote:
>
> Oleg, does this trigger any memory for you? Commit 971316f0503a
> ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
> makes me go "Hmm, this is *exactly* that that commit is talking
> about.."

Well. I have to admit I forgot everything I learned about this code
when I tried to fix that bug, and there were some subtle changes after
that... But of course I will try to investigate.

Oleg.

2013-10-14 20:57:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: epoll oops.

[ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
Hurley due to the possible tty association ]

On Mon, Oct 14, 2013 at 10:31 AM, Linus Torvalds
<[email protected]> wrote:
>
> Oleg, does this trigger any memory for you? Commit 971316f0503a
> ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
> makes me go "Hmm, this is *exactly* that that commit is talking
> about.."

Ok, Oleg, going back to that whole thread, I think that old bug went like this:

(a) normally all the wait-queues that epoll accesses are associated
with files, and as such they cannot go away for any normal file
activity. If the file exists, the waitqueue used for poll() on that
file must exist.

(b) signalfd is special, and it does a

poll_wait(file, &current->sighand->signalfd_wqh);

which means that the wait-queue isn't associated with the file
lifetime at all. It cleans it up with signalfd_cleanup() if the signal
handlers are removed. Normal (non-epoll) handling is safe, because
"current->sighand" obviously cannot go away as long as the current
thread (doing the polling) is in its poll/select handling.

(c) as a result, epoll and exit() can race, since the normal epoll
cleanup() is serialized by the file being closed, and we're missing
that for the case of sighand going away.

(d) we have this magic POLLFREE protocol to make signal handling
cleanup inform the epoll logic that "oops, this is going away", and we
depend on the underlying sighand data not going away thanks to the
eventual destruction of the slab being delayed by RCU.

(e) we are also very careful to only ever initialize the signalfd_wqh
entry in the SLAB *constructor*, because we cannot do it at every
allocation: it might still be in reused as long as it exists in the
slab cache: the SLAB_DESTROY_BY_RCU flag does *not* delay individual
slab entries, it only delays the final free of the underlying memory
allocation.

(f) to make things even more exciting, the SLAB_DESTROY_BY_RCU depend
on the slab implementation: slub and slob seem to delay each
individual allocation (and do ctor/dtor on every allocation), while
slab does that "delay only the underlying big page allocator" thing.

Agreed so far? Ugly, ugly, ugly, and I think there's exactly one
person who understands all of this. Namely you.

Anyway, so we protect the magic wait_queue_head_t from going away from
under us. We even make sure that we are very careful about only
initializing the sighand spinlock and this magic signalfd_wqh only
once per RCU lifetime. We're going to a lot of trouble to make this
all work. And it looks like it should work.

BUT.

It's enough that *one* other user of "poll_wait()" does something
similar to signalfd, and *doesn't* go through all this effort to make
it ok.

And I see a few worrisome cases. For example, look at "tty_poll()". It
ends up doing something very similar, except it uses the tty instead
of sighand. And exactly like the sighand struct, the tty allocation
lifespan can - thanks to hangup() - be shorter than the file
allocation lifespan.

Peter? Does a tty hangup end up actually possibly freeing the tty
struct? Looking at it, I'm starting to think that it only affects
f_op, and the "struct tty" stays around, in which case this is all
fine.

Hmm? There might be other cases..

Linus

2013-10-15 15:55:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: epoll oops.

> Hmm? There might be other cases..

Yes.

Dave, perhaps you have vmcore? I have no idea if this is possible or
not, but perhaps you can look at eventpoll_release_file's frame and
print file->f_op ?

On 10/14, Linus Torvalds wrote:
>
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics

Just in case, we depend on SLAB_DESTROY_BY_RCU anyway, and ->sighand
in particular. lock_task_sighand() equally depends on it.

> Ok, Oleg, going back to that whole thread, I think that old bug went like this:

Yes, yes, thanks, I do remember what this patch does and why. Just
I forgot everything about eventpoll.c, I tried to read it only once
to fix that bug.

> (b) signalfd is special, and it does a
>
> poll_wait(file, &current->sighand->signalfd_wqh);
>
> which means that the wait-queue isn't associated with the file
> lifetime at all. It cleans it up with signalfd_cleanup() if the signal
> handlers are removed. Normal (non-epoll) handling is safe, because
> "current->sighand" obviously cannot go away as long as the current
> thread (doing the polling) is in its poll/select handling.

Yes. and, just in case, the main problem is that sighand has no any
connection with the file. Unlike, say, tty which uses ->private_data.

> (c) as a result, epoll and exit() can race, since the normal epoll
> cleanup() is serialized by the file being closed, and we're missing
> that for the case of sighand going away.

Yes. Before that 971316f0503a hack epoll can't even know if the task
which did signalfd_poll() exits and frees the active signalfd_wqh.
If for example that task forked a child before exit.

And the whole RCU logic is only needed if exit/ep_remove_wait_queue
actually race with each other.

> Agreed so far? Ugly, ugly, ugly,

Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
is the hack.

> And it looks like it should work.

Yes... I tried to read this all again, and so far I do not see
anything wrong... signalfd_cleanup()->waitqueue_active() looks fine
too, afaics. We do not need to clear ->whead unconditionally, the
only caller of ep_ptable_queue_proc() is signalfd_poll(), and we are
the last thread which can use this ->sighand.

> Peter? Does a tty hangup end up actually possibly freeing the tty
> struct? Looking at it, I'm starting to think that it only affects
> f_op, and the "struct tty" stays around, in which case this is all
> fine.

Of course I can't answer, but at first glance file_tty() should not go
away in this case... If nothing else, tty_release() expects tty != NULL,
and it seems that priv->tty is never changed.

Oleg.

2013-10-15 19:28:35

by Dave Jones

[permalink] [raw]
Subject: Re: epoll oops.

On Tue, Oct 15, 2013 at 05:48:38PM +0200, Oleg Nesterov wrote:
> > Hmm? There might be other cases..
>
> Yes.
>
> Dave, perhaps you have vmcore?

afraid not. I probably should set up kdump on my test machines. I'll look into it.

thanks,

Dave

2013-10-16 22:39:25

by Eric Wong

[permalink] [raw]
Subject: Re: epoll oops.

Oleg Nesterov <[email protected]> wrote:
> Yes. Before that 971316f0503a hack epoll can't even know if the task
> which did signalfd_poll() exits and frees the active signalfd_wqh.
> If for example that task forked a child before exit.
>
> And the whole RCU logic is only needed if exit/ep_remove_wait_queue
> actually race with each other.

Is there any chance this oops is caused by (or at least more easily
exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
(epoll: add a reschedule point in ep_free())

I thought 91cf5ab would be benign, except...

> Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
> is the hack.

.. the following sentence from d80e731ecab4 caught my eye:

It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

I haven't been able to trace if cond_resched() can take tasklist_lock.

2013-10-17 14:00:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: epoll oops.

On 10/16, Eric Wong wrote:
>
> Oleg Nesterov <[email protected]> wrote:
> > Yes. Before that 971316f0503a hack epoll can't even know if the task
> > which did signalfd_poll() exits and frees the active signalfd_wqh.
> > If for example that task forked a child before exit.
> >
> > And the whole RCU logic is only needed if exit/ep_remove_wait_queue
> > actually race with each other.
>
> Is there any chance this oops is caused by (or at least more easily
> exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
> (epoll: add a reschedule point in ep_free())
>
> I thought 91cf5ab would be benign, except...
>
> > Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
> > is the hack.
>
> .. the following sentence from d80e731ecab4 caught my eye:
>
> It also assumes that nobody can take tasklist_lock under epoll
> locks, this seems to be true.

This just reminds that with this patch __wake_up/ep_poll_callback can
be called under write_lock(tasklist).

> I haven't been able to trace if cond_resched() can take tasklist_lock.

No, it can't hold the non-sleepable rwlock_t. And the sentence above
doesn't mean the locks like epmutex, it is mostlt about ep->lock.

Oleg.

2013-10-23 09:08:34

by Pekka Enberg

[permalink] [raw]
Subject: Re: epoll oops.

Hi Linus,

On Mon, Oct 14, 2013 at 10:57 PM, Linus Torvalds
<[email protected]> wrote:
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
> Hurley due to the possible tty association ]
>
> On Mon, Oct 14, 2013 at 10:31 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Oleg, does this trigger any memory for you? Commit 971316f0503a
>> ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
>> makes me go "Hmm, this is *exactly* that that commit is talking
>> about.."
>
> Ok, Oleg, going back to that whole thread, I think that old bug went like this:
>
> (a) normally all the wait-queues that epoll accesses are associated
> with files, and as such they cannot go away for any normal file
> activity. If the file exists, the waitqueue used for poll() on that
> file must exist.
>
> (b) signalfd is special, and it does a
>
> poll_wait(file, &current->sighand->signalfd_wqh);
>
> which means that the wait-queue isn't associated with the file
> lifetime at all. It cleans it up with signalfd_cleanup() if the signal
> handlers are removed. Normal (non-epoll) handling is safe, because
> "current->sighand" obviously cannot go away as long as the current
> thread (doing the polling) is in its poll/select handling.
>
> (c) as a result, epoll and exit() can race, since the normal epoll
> cleanup() is serialized by the file being closed, and we're missing
> that for the case of sighand going away.
>
> (d) we have this magic POLLFREE protocol to make signal handling
> cleanup inform the epoll logic that "oops, this is going away", and we
> depend on the underlying sighand data not going away thanks to the
> eventual destruction of the slab being delayed by RCU.
>
> (e) we are also very careful to only ever initialize the signalfd_wqh
> entry in the SLAB *constructor*, because we cannot do it at every
> allocation: it might still be in reused as long as it exists in the
> slab cache: the SLAB_DESTROY_BY_RCU flag does *not* delay individual
> slab entries, it only delays the final free of the underlying memory
> allocation.
>
> (f) to make things even more exciting, the SLAB_DESTROY_BY_RCU depend
> on the slab implementation: slub and slob seem to delay each
> individual allocation (and do ctor/dtor on every allocation), while
> slab does that "delay only the underlying big page allocator" thing.

So I'm not completely sure what you wanted me to verify Linus but yes
SLAB_DESTROY_BY_RCU only guarantees that the underlying page doesn't
go away for RCU but we're free to reuse the object. Anyone using the
object passed to kmem_cache_free() with SLAB_DESTROY_BY_RCU must check
that it's in fact the object we're interested in.

There's example code in a SLAB_DESTROY_BY_RCU comment in
<linux/slab.h> added by PeterZ.

Pekka

2013-10-23 13:43:48

by Peter Hurley

[permalink] [raw]
Subject: Re: epoll oops.

On 10/14/2013 04:57 PM, Linus Torvalds wrote:
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
> Hurley due to the possible tty association ]

> And I see a few worrisome cases. For example, look at "tty_poll()". It
> ends up doing something very similar, except it uses the tty instead
> of sighand. And exactly like the sighand struct, the tty allocation
> lifespan can - thanks to hangup() - be shorter than the file
> allocation lifespan.
>
> Peter? Does a tty hangup end up actually possibly freeing the tty
> struct? Looking at it, I'm starting to think that it only affects
> f_op, and the "struct tty" stays around, in which case this is all
> fine.

The tty_struct is only freed at the completion of the tty's
file_operations .release method (tty_release()). Further, it should
not be possible to advance past the tty_ldisc_release() call in
tty_release() while file operations such as tty_poll() -> poll_wait()
or a tty hangup are in-progress.

[Notwithstanding the above, if some kernel driver failed to acquire
a tty reference, either directly or via tty_port_tty_hangup(), before
hanging up, then the hangup could be racing with the .release(). But
I don't think that's what's happening here.]


On 10/15/2013 11:48 AM, Oleg Nesterov wrote:>> Hmm? There might be other cases..
>
> Yes.
>
> Dave, perhaps you have vmcore? I have no idea if this is possible or
> not, but perhaps you can look at eventpoll_release_file's frame and
> print file->f_op ?

I think Oleg's suggestion is the next diagnostic step.

Regards,
Peter Hurley