2013-03-07 02:16:51

by Dave Jones

[permalink] [raw]
Subject: BUG_ON(nd->inode != parent->d_inode);

Fuzz tester just hit this.

Dave


[11235.974332] ------------[ cut here ]------------
[11235.975039] kernel BUG at fs/namei.c:1441!
[11235.975661] invalid opcode: 0000 [#1] PREEMPT SMP
[11235.976433] Modules linked in: bnep vmw_vsock_vmci_transport vmw_vmci vsock fuse l2tp_ppp l2tp_core hidp 8021q garp mrp llc2 netrom phonet bridge dlci stp caif_socket caif af_key pppoe pppox ppp_generic slhc af_rxrpc ipt_ULOG can_bcm can_raw nfnetlink scsi_transport_iscsi rfcomm rose can x25 decnet ipx p8023 p8022 nfc ax25 appletalk psnap llc irda atm rds crc_ccitt lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb snd_pcm microcode bluetooth pcspkr serio_raw edac_core snd_page_alloc rfkill snd_timer usb_debug snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
[11235.988144] CPU 2
[11235.988462] Pid: 8737, comm: trinity-child0 Not tainted 3.9.0-rc1+ #69 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[11235.990193] RIP: 0010:[<ffffffff816bc622>] [<ffffffff816bc622>] lookup_slow+0x25/0xab
[11235.991392] RSP: 0018:ffff8800ad2cbc98 EFLAGS: 00010293
[11235.992196] RAX: 0000000000000001 RBX: ffff8800ad2cbde8 RCX: 0000000000000000
[11235.993266] RDX: 00000000000018ac RSI: ffff8800ad2cbd18 RDI: ffff88010b080040
[11235.994280] RBP: ffff8800ad2cbcc8 R08: 0000000000000001 R09: 0000000000000000
[11235.995315] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000011
[11235.996334] R13: ffff8800cebf9bc0 R14: ffff8800ad2cbde8 R15: ffff8800cfb13178
[11235.997379] FS: 00007f98b7830740(0000) GS:ffff88012b200000(0000) knlGS:0000000000000000
[11235.998548] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11235.999375] CR2: 0000000000000008 CR3: 00000000baffd000 CR4: 00000000000007e0
[11236.000403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11236.001421] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11236.002457] Process trinity-child0 (pid: 8737, threadinfo ffff8800ad2ca000, task ffff880104ea4920)
[11236.003761] Stack:
[11236.004091] 0000000000000092 0000000000000000 0000000000000001 0000000000000011
[11236.005337] 0000000000000000 ffff8800ad2cbde8 ffff8800ad2cbd68 ffffffff811be75e
[11236.006570] ffff880104ea5040 0000000000000002 ffff8800ad2cbd08 ffffffff8100a196
[11236.015665] Call Trace:
[11236.023527] [<ffffffff811be75e>] path_lookupat+0x71e/0x740
[11236.032165] [<ffffffff8100a196>] ? native_sched_clock+0x26/0x90
[11236.040991] [<ffffffff810b2118>] ? trace_hardirqs_off_caller+0x28/0xc0
[11236.049559] [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
[11236.058095] [<ffffffff811be8f2>] do_path_lookup+0x32/0x40
[11236.066609] [<ffffffff811beb7a>] kern_path+0x2a/0x50
[11236.074848] [<ffffffff811648f3>] ? might_fault+0x53/0xb0
[11236.082790] [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
[11236.091059] [<ffffffff8116493c>] ? might_fault+0x9c/0xb0
[11236.099271] [<ffffffff811648f3>] ? might_fault+0x53/0xb0
[11236.107711] [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
[11236.116287] [<ffffffff811d569d>] do_mount+0x8d/0xa00
[11236.124378] [<ffffffff811591ab>] ? memdup_user+0x4b/0x90
[11236.132666] [<ffffffff81159243>] ? strndup_user+0x53/0x70
[11236.140733] [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
[11236.149090] [<ffffffff811d609e>] sys_mount+0x8e/0xe0
[11236.157394] [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
[11236.166112] [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
[11236.174800] [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
[11236.183799] Code: 55 48 89 e5 0f 0b 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 10 4c 8b 6f 08 48 8b 7f 30 49 3b 7d 30 74 02 <0f> 0b 48 81 c7 e8 00 00 00 49 89 f4 31 f6 e8 7b 52 00 00 8b 53
[11236.204106] RIP [<ffffffff816bc622>] lookup_slow+0x25/0xab
[11236.212308] RSP <ffff8800ad2cbc98>
[11236.223021] ---[ end trace 9b541228773df23c ]---


2013-03-07 15:31:00

by Dave Jones

[permalink] [raw]
Subject: BUG_ON(nd->inode->i_op->follow_link);

On Wed, Mar 06, 2013 at 09:16:45PM -0500, Dave Jones wrote:

> kernel BUG at fs/namei.c:1441!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Modules linked in: bnep vmw_vsock_vmci_transport vmw_vmci vsock fuse l2tp_ppp l2tp_core hidp 8021q garp mrp llc2 netrom phonet bridge dlci stp caif_socket caif af_key pppoe pppox ppp_generic slhc af_rxrpc ipt_ULOG can_bcm can_raw nfnetlink scsi_transport_iscsi rfcomm rose can x25 decnet ipx p8023 p8022 nfc ax25 appletalk psnap llc irda atm rds crc_ccitt lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb snd_pcm microcode bluetooth pcspkr serio_raw edac_core snd_page_alloc rfkill snd_timer usb_debug snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
> CPU 2
> Pid: 8737, comm: trinity-child0 Not tainted 3.9.0-rc1+ #69 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff816bc622>] [<ffffffff816bc622>] lookup_slow+0x25/0xab
> RSP: 0018:ffff8800ad2cbc98 EFLAGS: 00010293
> RAX: 0000000000000001 RBX: ffff8800ad2cbde8 RCX: 0000000000000000
> RDX: 00000000000018ac RSI: ffff8800ad2cbd18 RDI: ffff88010b080040
> RBP: ffff8800ad2cbcc8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000011
> R13: ffff8800cebf9bc0 R14: ffff8800ad2cbde8 R15: ffff8800cfb13178
> FS: 00007f98b7830740(0000) GS:ffff88012b200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 00000000baffd000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child0 (pid: 8737, threadinfo ffff8800ad2ca000, task ffff880104ea4920)
> Stack:
> 0000000000000092 0000000000000000 0000000000000001 0000000000000011
> 0000000000000000 ffff8800ad2cbde8 ffff8800ad2cbd68 ffffffff811be75e
> ffff880104ea5040 0000000000000002 ffff8800ad2cbd08 ffffffff8100a196
> Call Trace:
> [<ffffffff811be75e>] path_lookupat+0x71e/0x740
> [<ffffffff8100a196>] ? native_sched_clock+0x26/0x90
> [<ffffffff810b2118>] ? trace_hardirqs_off_caller+0x28/0xc0
> [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
> [<ffffffff811be8f2>] do_path_lookup+0x32/0x40
> [<ffffffff811beb7a>] kern_path+0x2a/0x50
> [<ffffffff811648f3>] ? might_fault+0x53/0xb0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff8116493c>] ? might_fault+0x9c/0xb0
> [<ffffffff811648f3>] ? might_fault+0x53/0xb0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff811d569d>] do_mount+0x8d/0xa00
> [<ffffffff811591ab>] ? memdup_user+0x4b/0x90
> [<ffffffff81159243>] ? strndup_user+0x53/0x70
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff811d609e>] sys_mount+0x8e/0xe0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> Code: 55 48 89 e5 0f 0b 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 10 4c 8b 6f 08 48 8b 7f 30 49 3b 7d 30 74 02 <0f> 0b 48 81 c7 e8 00 00 00 49 89 f4 31 f6 e8 7b 52 00 00 8b 53
> RIP [<ffffffff816bc622>] lookup_slow+0x25/0xab
> RSP <ffff8800ad2cbc98>


More VFS fun, this time on something in /proc.


kernel BUG at fs/namei.c:693!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: appletalk irda can_bcm atm netrom phonet ipx p8023 psnap p8022 llc rds rose ax25 decnet caif_socket caif af_rxrpc crc_ccitt can pppoe pppox ppp_generic slhc nfc af_key x25 lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm btusb bluetooth snd_page_alloc snd_timer snd microcode edac_core usb_debug rfkill serio_raw pcspkr soundcore vhost_net r8169 tun macvtap mii macvlan kvm_amd kvm
CPU 0
Pid: 829, comm: trinity-main Not tainted 3.9.0-rc1+ #69 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff811c1414>] [<ffffffff811c1414>] nd_jump_link+0x54/0x60
RSP: 0018:ffff880112e31c78 EFLAGS: 00010286
RAX: ffffffff8181ed80 RBX: ffff880112e31dc8 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff81089183 RDI: 0000000000000001
RBP: ffff880112e31c88 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880112e31c98
R13: ffff880112e31dc8 R14: ffff880112e31dc8 R15: ffff880104efd030
FS: 00007f9c237f5740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003e29ebb750 CR3: 0000000112d59000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-main (pid: 829, threadinfo ffff880112e30000, task ffff880112c6a490)
Stack:
ffff88010a14de48 ffff880104efd030 ffff880112e31cc8 ffffffff8122475c
ffff880129284fe0 ffff880113eab780 000000005138b1bc 0000000000000000
ffff880112c6a490 ffff880112e31d28 ffff880112e31d68 ffffffff811be311
Call Trace:
[<ffffffff8122475c>] proc_pid_follow_link+0x6c/0x70
[<ffffffff811be311>] path_lookupat+0x2d1/0x740
[<ffffffff811be7b4>] filename_lookup+0x34/0xc0
[<ffffffff811c15ae>] user_path_at_empty+0x8e/0x110
[<ffffffff811b6085>] ? cp_new_stat+0x155/0x170
[<ffffffff811c1641>] user_path_at+0x11/0x20
[<ffffffff811b62f9>] vfs_fstatat+0x49/0xa0
[<ffffffff811b664a>] sys_newfstatat+0x1a/0x40
[<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
[<ffffffff81341a6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
Code: 44 24 08 48 89 43 08 48 8b 40 30 81 4b 38 00 10 00 00 48 89 43 30 48 8b 40 20 48 83 78 08 00 75 0a 48 8b 5d f0 4c 8b 65 f8 c9 c3 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 ba 10 00
RIP [<ffffffff811c1414>] nd_jump_link+0x54/0x60
RSP <ffff880112e31c78>

2013-03-07 17:31:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 7, 2013 at 7:30 AM, Dave Jones <[email protected]> wrote:
> On Wed, Mar 06, 2013 at 09:16:45PM -0500, Dave Jones wrote:
>
> > kernel BUG at fs/namei.c:1441!

Ok, that's a seriously bad error case. although I still worry that
BUG_ON() is too bug of a hammer. If we hold any other locks, we're
basically screwed, and may end up not saving the error message to
/var/log/messages etc.

So I think we should change that BUG_ON() into a

if (WARN_ON_ONCE(nd->inode != parent->d_inode))
return -ESTALE;

or something. Not because the bug isn't serious, but because doing
BUG_ON() is likely worse from a debugging standpoint than not.

> > [<ffffffff811be75e>] path_lookupat+0x71e/0x740
> > [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
> > [<ffffffff811be8f2>] do_path_lookup+0x32/0x40
> > [<ffffffff811beb7a>] kern_path+0x2a/0x50
> > [<ffffffff811d569d>] do_mount+0x8d/0xa00
> > [<ffffffff811d609e>] sys_mount+0x8e/0xe0
> > [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b

Hmm. Nothing looks all that odd in that trace. Do you have any idea
what the path was? This being trinity, I'm assuming you're doing some
kind of targeted testing. sysfs or proc, perhaps? Or some particular
concurrency test with random system calls/pathnames? Not that I see
how it could happen anyway, but maybe it could give some hint about
what triggered this.

> More VFS fun, this time on something in /proc.
>
> kernel BUG at fs/namei.c:693!
> Call Trace:
> [<ffffffff8122475c>] proc_pid_follow_link+0x6c/0x70
> [<ffffffff811be311>] path_lookupat+0x2d1/0x740
> [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
> [<ffffffff811c15ae>] user_path_at_empty+0x8e/0x110
> [<ffffffff811c1641>] user_path_at+0x11/0x20
> [<ffffffff811b62f9>] vfs_fstatat+0x49/0xa0
> [<ffffffff811b664a>] sys_newfstatat+0x1a/0x40
> [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b

Hmm. Al, is that BUG_ON() even valid any more? Can a file descriptor
opened with F_PATH contain a symlink? So doing a proc lookup of such a
thing could point to something that has a follow_link, no?

Dave, are these BUG_ON's new with current git, or is it perhaps
because you've expanded trinity with new patterns to test random
arguments for?

Linus

2013-03-07 19:35:10

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 07, 2013 at 09:30:56AM -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:30 AM, Dave Jones <[email protected]> wrote:
> > On Wed, Mar 06, 2013 at 09:16:45PM -0500, Dave Jones wrote:
> >
> > > kernel BUG at fs/namei.c:1441!
>
> Ok, that's a seriously bad error case. although I still worry that
> BUG_ON() is too bug of a hammer. If we hold any other locks, we're
> basically screwed, and may end up not saving the error message to
> /var/log/messages etc.
>
> So I think we should change that BUG_ON() into a
>
> if (WARN_ON_ONCE(nd->inode != parent->d_inode))
> return -ESTALE;

Curiously, the machine wasn't dead after hitting that.
Oh wait, it locks up that one CPU, leaving the others running right ?
That would explain it, it's got a few cores..

> > > [<ffffffff811be75e>] path_lookupat+0x71e/0x740
> > > [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
> > > [<ffffffff811be8f2>] do_path_lookup+0x32/0x40
> > > [<ffffffff811beb7a>] kern_path+0x2a/0x50
> > > [<ffffffff811d569d>] do_mount+0x8d/0xa00
> > > [<ffffffff811d609e>] sys_mount+0x8e/0xe0
> > > [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
>
> Hmm. Nothing looks all that odd in that trace. Do you have any idea
> what the path was? This being trinity, I'm assuming you're doing some
> kind of targeted testing. sysfs or proc, perhaps? Or some particular
> concurrency test with random system calls/pathnames? Not that I see
> how it could happen anyway, but maybe it could give some hint about
> what triggered this.

Basically, see the summary of a bunch of bugs I reported to Greg last night
in sysfs: https://lkml.org/lkml/2013/3/7/21
It sounds like it's just trinity finding old bugs for the first time,
though I've not actually tested yet on an older kernel.

> Dave, are these BUG_ON's new with current git, or is it perhaps
> because you've expanded trinity with new patterns to test random
> arguments for?

I suspect it's the addition of this..
http://git.codemonkey.org.uk/?p=trinity.git;a=commitdiff;h=fd46c22e967a613de73d7e51a9715717d954ec45
Which adds a bunch of negative dentry lookups when it hits a mangled pathname.

It's really hard to figure out exactly what was going on in these crashes
though, as I think they're races, and I don't have a way to figure out
exactly what was happening on other threads at the time of the crash.
Telling trinity to fuzz just 'mount' probably won't reproduce the trace
above for eg, because it's the symptom of whatever else was going on.

Hmm, could make the oopses dump all cpu stacks instead somehow ?.
Perhaps that might be more enlightening for these kinds of bugs.

I'd be surprised if these bugs aren't easily reproducible for anyone
given how easy I seem to be stumbling into them.
You can grab the code at git://github.com/kernelslacker/trinity.git

Running it with no args will use /proc, /sys and /dev as potential fd's.
You can tell it to just use a specific path/file with '-V /proc'
I've been running the 'test-random.sh' harness which runs a few instances
to really drive the load up, and get things happening faster, but you
may get (un)lucky with just a single instance.

Also recommended = -q to quieten things, and -l off if logging is
slowing things down too much to cause fun things to trigger.

Dave

2013-03-07 20:33:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 7, 2013 at 11:35 AM, Dave Jones <[email protected]> wrote:
> On Thu, Mar 07, 2013 at 09:30:56AM -0800, Linus Torvalds wrote:
> >
> > So I think we should change that BUG_ON() into a
> >
> > if (WARN_ON_ONCE(nd->inode != parent->d_inode))
> > return -ESTALE;
>
> Curiously, the machine wasn't dead after hitting that.
> Oh wait, it locks up that one CPU, leaving the others running right ?
> That would explain it, it's got a few cores..

No, it's simpler than that - we normally don't hold any extra locks
while walking pathnames.

The keyword being *normally*.

When we do things like renames etc, we often have other locks, and
then a BUG_ON() in that place is just deadly. But in your particular
case I think killing the process ends up just doing the right thing.

So a BUG_ON() can happen to do the right thing. In fact, back in the
BKL days it was pretty common, since the exit path could also undo the
BKL. It's just that these days it's not unlikely that we hold some
other lock or whatever, and then BUG_ON() can end up really screwing
things up.

> > Hmm. Nothing looks all that odd in that trace. Do you have any idea
> > what the path was? This being trinity, I'm assuming you're doing some
> > kind of targeted testing. sysfs or proc, perhaps? Or some particular
> > concurrency test with random system calls/pathnames? Not that I see
> > how it could happen anyway, but maybe it could give some hint about
> > what triggered this.
>
> Basically, see the summary of a bunch of bugs I reported to Greg last night
> in sysfs: https://lkml.org/lkml/2013/3/7/21
> It sounds like it's just trinity finding old bugs for the first time,
> though I've not actually tested yet on an older kernel.

If this is fairly repeatable, I really think it would be interesting
to see the names involved. Especially for sysfs, there are a *lot* of
random files that have odd semantics, and it depends on the file. Same
is (to a slightly lesser degree) true of /proc (which does have many
of the same issues, but tends to be more tested just for having been
around for longer - but then proc does have some issues all its own)

So for example, if you can re-create the one in nd_jump_link(), it
would be lovely if you replaced the BUG_ON() with just an if(), and
made it print out the old and the new path dentry names (ok, that
means saving the old path and doing the path_put on it afterwards).

Something like

+ const char *oldname = nd->path.dentry->d_name.name; /* Yeah,
this remembers the name pointer over the put_path(), not strictly
right */
+ const char *newname = path->dentry->d_name.name;
...
- BUG_ON(nd->inode->i_op->follow_link);
+ if (WARN_ON(nd->inode->i_op->follow_link)) {
+ printk("old=%s new=%d\n", oldname, newname);
+ }

or other..

> I suspect it's the addition of this..
> http://git.codemonkey.org.uk/?p=trinity.git;a=commitdiff;h=fd46c22e967a613de73d7e51a9715717d954ec45
> Which adds a bunch of negative dentry lookups when it hits a mangled pathname.

Yeah, it would have been nicer if we could bisect this to something,
or limit it to the changes in -rc1, but...

Linus

2013-03-07 21:38:50

by Dave Jones

[permalink] [raw]
Subject: ipc/testmsg GPF.

Trying to reproduce that nd_jump_link trace, but I keep hitting other bugs
instead. It's like whackamole. Except these are even more annoying
than moles.


general protection fault: 0000 [#1] PREEMPT SMP
Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
CPU 0
Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
Stack:
ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
Call Trace:
[<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
[<ffffffff812b78c0>] ? load_msg+0x180/0x180
[<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
[<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff812b9325>] sys_msgrcv+0x15/0x20
[<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5

0000000000000000 <.text>:
0: 48 3b 37 cmp (%rdi),%rsi
3: 75 d5 jne 0xffffffffffffffda
5: b8 01 00 00 00 mov $0x1,%eax
a: 5d pop %rbp
b: c3 retq
c: 0f 1f 40 00 nopl 0x0(%rax)
10: 48 3b 37 cmp (%rdi),%rsi
13: 74 c5 je 0xffffffffffffffda

rdi is ascii. "ived-sy�" Curious.

EIP is here in testmsg.

case SEARCH_EQUAL:
if (msg->m_type == type)
240: 48 3b 37 cmp (%rdi),%rsi
243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
{

2013-03-07 21:45:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

Adding Davem to the list, since this looks like networking.

That said, I'm also starting to wonder if you're just triggering odd
bugs because of some use-after-free corruption of random kernel data
structures? I seem to have a dim memory of you usually doing your
trinity runs with more debugging options, but based on just the oops
message you don't have DEBUG_PAGEALLOC enabled, for example. Maybe
that and SLUB debugging might turn up the culprit more quickly..

[ Just checked - yes, you often do have DEBUG_PAGEALLOC in your oops
reports, but not this time. list debugging too? ]

Linus

On Thu, Mar 7, 2013 at 1:38 PM, Dave Jones <[email protected]> wrote:
> Trying to reproduce that nd_jump_link trace, but I keep hitting other bugs
> instead. It's like whackamole. Except these are even more annoying
> than moles.
>
>
> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> CPU 0
> Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> Stack:
> ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> Call Trace:
> [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
>
> 0000000000000000 <.text>:
> 0: 48 3b 37 cmp (%rdi),%rsi
> 3: 75 d5 jne 0xffffffffffffffda
> 5: b8 01 00 00 00 mov $0x1,%eax
> a: 5d pop %rbp
> b: c3 retq
> c: 0f 1f 40 00 nopl 0x0(%rax)
> 10: 48 3b 37 cmp (%rdi),%rsi
> 13: 74 c5 je 0xffffffffffffffda
>
> rdi is ascii. "ived-sy�" Curious.
>
> EIP is here in testmsg.
>
> case SEARCH_EQUAL:
> if (msg->m_type == type)
> 240: 48 3b 37 cmp (%rdi),%rsi
> 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> {
>
>

2013-03-07 21:49:38

by David Miller

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

From: Linus Torvalds <[email protected]>
Date: Thu, 7 Mar 2013 13:45:27 -0800

> Adding Davem to the list, since this looks like networking.

I'd like to help but this looks more like sysv IPC than
networking. :-)

2013-03-07 21:51:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Thu, Mar 7, 2013 at 1:49 PM, David Miller <[email protected]> wrote:
> From: Linus Torvalds <[email protected]>
> Date: Thu, 7 Mar 2013 13:45:27 -0800
>
>> Adding Davem to the list, since this looks like networking.
>
> I'd like to help but this looks more like sysv IPC than
> networking. :-)

Yeah, I'm a f*cking moron.

msgrcv vs recvmsg. What can I say? We see so little msgrcv that I
don't have the mental associations for it at all.

Linus

2013-03-07 22:03:50

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Thu, Mar 07, 2013 at 01:45:27PM -0800, Linus Torvalds wrote:
> Adding Davem to the list, since this looks like networking.
>
> That said, I'm also starting to wonder if you're just triggering odd
> bugs because of some use-after-free corruption of random kernel data
> structures?

Plausible. The sheer volume of crashes I've started seeing in the last 24 hours
since adding those small features to trinity are making me really suspicious.
If I had this volume of crashes reported to me by a Fedora user, I'd be
suspecting their hardware ;)

Tangentally related: We have a *ton* of weird bug reports against Fedora
that have been open for a while that look like random memory corruption
(linked list corruption, 'bad page table' scribbles etc).
I'd *love* to think that the bugs I'm now seeing are related, on the off-chance
we can finally get some resolution on those.

> I seem to have a dim memory of you usually doing your
> trinity runs with more debugging options, but based on just the oops
> message you don't have DEBUG_PAGEALLOC enabled, for example. Maybe
> that and SLUB debugging might turn up the culprit more quickly..

My kernel-hacker hive-mind implant is working. Just started a rebuild with that enabled.
Had slub debug already.

> [ Just checked - yes, you often do have DEBUG_PAGEALLOC in your oops
> reports, but not this time. list debugging too? ]

Yep.

Dave

2013-03-07 22:19:21

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 07, 2013 at 12:33:46PM -0800, Linus Torvalds wrote:

> If this is fairly repeatable, I really think it would be interesting
> to see the names involved. Especially for sysfs, there are a *lot* of
> random files that have odd semantics, and it depends on the file. Same
> is (to a slightly lesser degree) true of /proc (which does have many
> of the same issues, but tends to be more tested just for having been
> around for longer - but then proc does have some issues all its own)
>
> So for example, if you can re-create the one in nd_jump_link(), it
> would be lovely if you replaced the BUG_ON() with just an if(), and
> made it print out the old and the new path dentry names (ok, that
> means saving the old path and doing the path_put on it afterwards).
>
> Something like
>
> + const char *oldname = nd->path.dentry->d_name.name; /* Yeah,
> this remembers the name pointer over the put_path(), not strictly
> right */
> + const char *newname = path->dentry->d_name.name;
> ...
> - BUG_ON(nd->inode->i_op->follow_link);
> + if (WARN_ON(nd->inode->i_op->follow_link)) {
> + printk("old=%s new=%d\n", oldname, newname);
> + }

Ok, that didn't enlightenment me so much..

[ 304.559707] WARNING: at fs/namei.c:696 nd_jump_link+0x85/0xa0()
[ 304.570426] Hardware name: GA-MA78GM-S2H
[ 304.571037] Modules linked in: fuse l2tp_ppp l2tp_core rfcomm can_raw scsi_transport_iscsi ipt_ULOG af_key netrom pppoe pppox ppp_generic slhc rose caif_socket ax25 caif can_bcm ipx llc2 p8023 appletalk irda af_rxrpc phonet psnap p8022 rds can llc nfc crc_ccitt x25 nfnetlink decnet atm lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel snd_hda_codec bluetooth vhost_net snd_pcm tun macvtap microcode macvlan snd_page_alloc usb_debug r8169 edac_core rfkill serio_raw kvm_amd snd_timer pcspkr kvm mii snd soundcore
[ 304.642098] Pid: 14550, comm: trinity-child1 Not tainted 3.9.0-rc1+ #71
[ 304.690489] Call Trace:
[ 304.690888] [<ffffffff810440e5>] warn_slowpath_common+0x75/0xa0
[ 304.691849] [<ffffffff8104412a>] warn_slowpath_null+0x1a/0x20
[ 304.692736] [<ffffffff811c1795>] nd_jump_link+0x85/0xa0
[ 304.693541] [<ffffffff81224aec>] proc_pid_follow_link+0x6c/0x70
[ 304.694445] [<ffffffff811be661>] path_lookupat+0x2d1/0x740
[ 304.695270] [<ffffffff811beb04>] filename_lookup+0x34/0xc0
[ 304.696112] [<ffffffff811c193e>] user_path_at_empty+0x8e/0x110
[ 304.697007] [<ffffffff811c19d1>] user_path_at+0x11/0x20
[ 304.697800] [<ffffffff811d944f>] sys_setxattr+0x3f/0xe0
[ 304.698609] [<ffffffff816cdbc2>] system_call_fastpath+0x16/0x1b
[ 304.722423] ---[ end trace 469ba8f58f5994e6 ]---
[ 304.723132] old=fd new=1:9


thoughts ?

Dave

2013-03-07 22:36:52

by Dave Jones

[permalink] [raw]
Subject: pipe_release oops.

On Thu, Mar 07, 2013 at 05:03:33PM -0500, Dave Jones wrote:

> > I seem to have a dim memory of you usually doing your
> > trinity runs with more debugging options, but based on just the oops
> > message you don't have DEBUG_PAGEALLOC enabled, for example. Maybe
> > that and SLUB debugging might turn up the culprit more quickly..
>
> My kernel-hacker hive-mind implant is working. Just started a rebuild with that enabled.
> Had slub debug already.

The hits keep on coming..

[ 255.609172] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
[ 255.610393] IP: [<ffffffff811bad62>] pipe_release+0x42/0xd0
[ 255.611235] PGD 0
[ 255.611552] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 255.612406] Modules linked in: nfnetlink can_bcm hidp af_rxrpc netrom pppoe pppox ppp_generic slhc appletalk phonet can_raw can decnet ipx rose irda ax25 x25 p8023 nfc psnap p8022 caif_socket caif crc_ccitt llc2 rds ipt_ULOG af_key atm llc fuse binfmt_misc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm btusb microcode bluetooth usb_debug snd_page_alloc snd_timer rfkill edac_core pcspkr snd serio_raw soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
[ 255.621785] CPU 3
[ 255.622074] Pid: 4645, comm: trinity-child3 Not tainted 3.9.0-rc1+ #71 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 255.623791] RIP: 0010:[<ffffffff811bad62>] [<ffffffff811bad62>] pipe_release+0x42/0xd0
[ 255.625008] RSP: 0018:ffff88010cad7dc8 EFLAGS: 00010296
[ 255.625797] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000006
[ 255.626840] RDX: 0000000000002630 RSI: 2222222222222222 RDI: 0000000000000001
[ 255.627881] RBP: ffff88010cad7df8 R08: 2222222222222222 R09: 2222222222222222
[ 255.628927] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88010b13cad8
[ 255.629969] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88010b13c9f0
[ 255.631020] FS: 00007fded3c44740(0000) GS:ffff88012ac00000(0000) knlGS:0000000000000000
[ 255.632204] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 255.633053] CR2: 0000000000000064 CR3: 0000000001c0b000 CR4: 00000000000007e0
[ 255.634101] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 255.635150] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 255.636189] Process trinity-child3 (pid: 4645, threadinfo ffff88010cad6000, task ffff88010ff10000)
[ 255.637495] Stack:
[ 255.637780] 0000000000000000 ffff880108b07700 0000000000000010 ffff8801143b8940
[ 255.638977] ffff88010b13c9f0 ffff880128a873e0 ffff88010cad7e08 ffffffff811bae0e
[ 255.640171] ffff88010cad7e58 ffffffff811b2825 ffff88010b13c9f0 ffff880108b07710
[ 255.641366] Call Trace:
[ 255.641743] [<ffffffff811bae0e>] pipe_rdwr_release+0x1e/0x20
[ 255.642591] [<ffffffff811b2825>] __fput+0xf5/0x2d0
[ 255.643313] [<ffffffff811b2a0e>] ____fput+0xe/0x10
[ 255.644040] [<ffffffff8106c5e5>] task_work_run+0xa5/0xd0
[ 255.644819] [<ffffffff8104ad30>] do_exit+0x2c0/0xce0
[ 255.645558] [<ffffffff816c5d8d>] ? retint_swapgs+0xe/0x13
[ 255.646347] [<ffffffff8104b7ec>] do_group_exit+0x4c/0xc0
[ 255.647128] [<ffffffff8104b877>] sys_exit_group+0x17/0x20
[ 255.647920] [<ffffffff816cdbc2>] system_call_fastpath+0x16/0x1b
[ 255.648778] Code: 00 4c 89 6d e8 4c 89 7d f8 41 89 f5 49 89 ff 31 f6 4c 89 e7 48 89 5d d8 4c 89 75 f0 41 89 d6 e8 f5 6d 50 00 49 8b 9f e0 03 00 00 <8b> 43 64 8b 4b 68 44 29 e8 44 29 f1 85 c0 89 43 64 89 4b 68 75
[ 255.652944] RIP [<ffffffff811bad62>] pipe_release+0x42/0xd0
[ 255.653805] RSP <ffff88010cad7dc8>
[ 255.654316] CR2: 0000000000000064
[ 255.655087] ---[ end trace 3b502614b99324e0 ]---

pipe->readers -= decr;
1642: 8b 43 64 mov 0x64(%rbx),%eax

I'm at the point where *I'm* starting to question what I'm seeing.
I don't have another test machine right now, so I can't see if I can
duplicate these results on another machine, so I'd be interested in
confirmation that I'm not going crazy debugging a broken machine.

Dave

2013-03-07 22:50:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 7, 2013 at 2:18 PM, Dave Jones <[email protected]> wrote:
>
> Ok, that didn't enlightenment me so much..

Ooh, I'm good.

> [ 304.559707] WARNING: at fs/namei.c:696 nd_jump_link+0x85/0xa0()
> [ 304.692736] [<ffffffff811c1795>] nd_jump_link+0x85/0xa0
> [ 304.693541] [<ffffffff81224aec>] proc_pid_follow_link+0x6c/0x70
> [ 304.694445] [<ffffffff811be661>] path_lookupat+0x2d1/0x740
> [ 304.695270] [<ffffffff811beb04>] filename_lookup+0x34/0xc0
> [ 304.696112] [<ffffffff811c193e>] user_path_at_empty+0x8e/0x110
> [ 304.697007] [<ffffffff811c19d1>] user_path_at+0x11/0x20
> [ 304.697800] [<ffffffff811d944f>] sys_setxattr+0x3f/0xe0
> [ 304.698609] [<ffffffff816cdbc2>] system_call_fastpath+0x16/0x1b
> [ 304.722423] ---[ end trace 469ba8f58f5994e6 ]---
> [ 304.723132] old=fd new=1:9
>
> thoughts ?

Yes, I can reproduce this, and I know what's going on. I don't have a
patch, though, and I think it's in Al's court on what we should do.
Al?

Anyway, so the "old=fd" thing implies that it's one of the symlinks in
/proc/xyz/fd/, so that's something. It could have been the
cwd/exe/mmap ones. But that was kind of the expectation to begin with.

The "new=1:9" is the path that the file descriptor contained, and
since you're playing games with /proc, /sys and /dev, *and* since you
hit the "it shouldn't be a symlink" test, I'm *guessing* that it might
be something like

/sys/dev/char/1:9

which is a symlink to ../../devices/virtual/mem/urandom/.

And here is how you reproduce this:

Do:

int main(int argc, char **argv)
{
int fd = open(argv[1], O_PATH | O_NOFOLLOW);
struct stat st;
fstat(fd, &st);
sleep(100);
}

and then run

./a.out /sys/dev/char/1:9 &

which will look up that symlink using O_NOFOLLOW. Take note of the
<pid>, and then do

ls -l /proc/<pid>/fd/3/

and you get the oops, because when it follows that pseudo-symlink for
fd3, it will hit that symlink we just looked up. And trigger the
BUG_ON() you see.

Al, I think the BUG_ON() is simply bogus. Should we just remove it? Or
is there some conceptual reason why we can't handle a symlink there?

Linus

2013-03-07 23:03:20

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 07, 2013 at 02:50:55PM -0800, Linus Torvalds wrote:

> Yes, I can reproduce this, and I know what's going on. I don't have a
> patch, though, and I think it's in Al's court on what we should do.
> Al?
>
> Anyway, so the "old=fd" thing implies that it's one of the symlinks in
> /proc/xyz/fd/, so that's something. It could have been the
> cwd/exe/mmap ones. But that was kind of the expectation to begin with.
>
> The "new=1:9" is the path that the file descriptor contained, and
> since you're playing games with /proc, /sys and /dev, *and* since you
> hit the "it shouldn't be a symlink" test, I'm *guessing* that it might
> be something like
>
> /sys/dev/char/1:9

Strange, I told it to *only* use /proc on that run, so it shouldn't have sucked
in anything from /dev unless something in /proc symlinks there and nftw()
ended up chasing it.

> which is a symlink to ../../devices/virtual/mem/urandom/.
>
> And here is how you reproduce this:

But this at least is one step towards restoring my belief that I'm not
going crazy here.

> Al, I think the BUG_ON() is simply bogus. Should we just remove it? Or
> is there some conceptual reason why we can't handle a symlink there?

For the time being, I'll remove that just so I can stop seeing it,
as it seems there's plenty of others to be chasing and you've got
a handle on this one.

Dave

2013-03-07 23:15:09

by Dave Jones

[permalink] [raw]
Subject: fasync_remove_entry oops

And.. More fun with pipes.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: [<ffffffff811c3c10>] fasync_remove_entry+0x30/0xd0
PGD 1225c5067 PUD 11a133067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: rfcomm bnep nfnetlink can_bcm l2tp_ppp l2tp_core scsi_transport_iscsi fuse nfc x25 llc2 can_raw can ipt_ULOG ipx p8023 p8022 phonet rds netrom decnet appletalk psnap llc af_rxrpc caif_socket caif irda crc_ccitt af_key rose ax25 atm pppoe pppox ppp_generic slhc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb snd_hda_codec_realtek bluetooth snd_hda_intel snd_hda_codec snd_pcm rfkill snd_page_alloc microcode usb_debug snd_timer vhost_net snd tun soundcore r8169 macvtap pcspkr macvlan serio_raw mii edac_core kvm_amd kvm
CPU 1
Pid: 17682, comm: trinity-child1 Not tainted 3.9.0-rc1+ #71 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff811c3c10>] [<ffffffff811c3c10>] fasync_remove_entry+0x30/0xd0
RSP: 0018:ffff8801004b7d58 EFLAGS: 00010282
RAX: ffff880087894920 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff81c64918 RDI: 0000000000000001
RBP: ffff8801004b7d78 R08: ffd9df0b804b81fd R09: 0000000000000000
R10: ffff880087894920 R11: 0000000000000000 R12: ffff880086fe2940
R13: ffff880086fe2970 R14: 0000000000000080 R15: 0000000000000000
FS: 00007feb12d57740(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000080 CR3: 0000000126e46000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child1 (pid: 17682, threadinfo ffff8801004b6000, task ffff880087894920)
Stack:
ffff880086fe2940 ffff880086fe2940 0000000000000080 0000000000000000
ffff8801004b7db8 ffffffff811c3e43 0000000028a84b60 ffff880086fe2940
ffff88007e179608 ffff880086fe2940 ffff88007e179608 00000000ffffffff
Call Trace:
[<ffffffff811c3e43>] fasync_helper+0x73/0x90
[<ffffffff811b9957>] pipe_rdwr_fasync+0x67/0xd0
[<ffffffff811b29ab>] __fput+0x27b/0x2d0
[<ffffffff811b2a0e>] ____fput+0xe/0x10
[<ffffffff8106c5e5>] task_work_run+0xa5/0xd0
[<ffffffff8104ad30>] do_exit+0x2c0/0xce0
[<ffffffff816c5d8d>] ? retint_swapgs+0xe/0x13
[<ffffffff8104b7ec>] do_group_exit+0x4c/0xc0
[<ffffffff8104b877>] sys_exit_group+0x17/0x20
[<ffffffff816cdbc2>] system_call_fastpath+0x16/0x1b
Code: 55 48 89 e5 41 56 49 89 f6 41 55 4c 8d 6f 30 41 54 49 89 fc 4c 89 ef 53 31 db e8 6c 10 50 00 48 c7 c7 00 49 c6 81 e8 60 10 50 00 <49> 8b 06 48 85 c0 74 22 4c 39 60 58 48 89 c3 75 10 eb 41 0f 1f
RIP [<ffffffff811c3c10>] fasync_remove_entry+0x30/0xd0
RSP <ffff8801004b7d58>
CR2: 0000000000000080



for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
1650: 49 8b 06 mov (%r14),%rax

So we got to fasync_remove_entry with a NULL fa struct.

Can we just add more NULL checks here, or does that need to happen
at a higher level ?

Dave

2013-03-07 23:46:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: fasync_remove_entry oops

On Thu, Mar 7, 2013 at 3:14 PM, Dave Jones <[email protected]> wrote:
> And.. More fun with pipes.
> for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
> 1650: 49 8b 06 mov (%r14),%rax
>
> So we got to fasync_remove_entry with a NULL fa struct.
>
> Can we just add more NULL checks here, or does that need to happen
> at a higher level ?

I think you'll find that it's not fapp that was NULL. The caller was
pipe_rdwr_fasync -> fasync_helper, and pipe_rdwr_fasync always passes
in

&pipe->fasync_readers

(and writers) so it looks like it is pipe that was NULL. Really odd.
How did the open of the pipe succeed with a NULL i_pipe? We do have
i_pipe == NULL, but that should happen only with a not-yet-opened
pipe, or after the last close.

In neither case should you have that pipe_rdwr_fasync() call.

The fact that this happens for a delayed __fput() makes me think it
was never a successful open to begin with, but how did the FASYNC flag
get set in that case? Do we actually allow it in the open flags..
Hmm..

So if we need new NULL pointer checks, I think they'd need to be
something like the attached patch.

But this is definitely another of those "This is our most desperate
hour. Help me, Al-biwan Ke-Viro, you're my only hope" issues.

Al? Please don't make me wear that golden bikini.

Linus


Attachments:
patch.diff (1.88 kB)

2013-03-07 23:54:27

by Dave Jones

[permalink] [raw]
Subject: Re: fasync_remove_entry oops

On Thu, Mar 07, 2013 at 03:46:24PM -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 3:14 PM, Dave Jones <[email protected]> wrote:
> > And.. More fun with pipes.
> > for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
> > 1650: 49 8b 06 mov (%r14),%rax
> >
> > So we got to fasync_remove_entry with a NULL fa struct.
> >
> > Can we just add more NULL checks here, or does that need to happen
> > at a higher level ?
>
> I think you'll find that it's not fapp that was NULL.

yeah, brainfart

> The caller was
> pipe_rdwr_fasync -> fasync_helper, and pipe_rdwr_fasync always passes
> in
>
> &pipe->fasync_readers
>
> (and writers) so it looks like it is pipe that was NULL. Really odd.
> How did the open of the pipe succeed with a NULL i_pipe? We do have
> i_pipe == NULL, but that should happen only with a not-yet-opened
> pipe, or after the last close.
>
> In neither case should you have that pipe_rdwr_fasync() call.
>
> The fact that this happens for a delayed __fput() makes me think it
> was never a successful open to begin with, but how did the FASYNC flag
> get set in that case? Do we actually allow it in the open flags..
> Hmm..
>
> So if we need new NULL pointer checks, I think they'd need to be
> something like the attached patch.

I'll give it a shot. Can't be any worse than what we have already.

> But this is definitely another of those "This is our most desperate
> hour. Help me, Al-biwan Ke-Viro, you're my only hope" issues.
>
> Al? Please don't make me wear that golden bikini.

The box hosting Al's email is down, so you might be making a new fashion statement
for a little while.

Dave

2013-03-07 23:55:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 7, 2013 at 2:50 PM, Linus Torvalds
<[email protected]> wrote:
>
> Al, I think the BUG_ON() is simply bogus. Should we just remove it? Or
> is there some conceptual reason why we can't handle a symlink there?

Hmm. Removing the BUG_ON() makes things work. Sure, you can get odd
files that are symlinks, but that's kind of what you ask for when you
do that whole O_NOFOLLOW thing, so the fact that you can then get them
through opening the /proc/<pid>/fd/<fd> file would seem to be roughly
the right thing regardless.

So I think I'll queue up that removal, although I'd still be happier
with input from Al.

Linus

2013-03-08 00:20:55

by Dave Jones

[permalink] [raw]
Subject: Re: fasync_remove_entry oops

On Thu, Mar 07, 2013 at 06:54:09PM -0500, Dave Jones wrote:
> On Thu, Mar 07, 2013 at 03:46:24PM -0800, Linus Torvalds wrote:
> > On Thu, Mar 7, 2013 at 3:14 PM, Dave Jones <[email protected]> wrote:
> > > And.. More fun with pipes.
> > > for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
> > > 1650: 49 8b 06 mov (%r14),%rax
> > >
> > > So we got to fasync_remove_entry with a NULL fa struct.
> > >
> > > Can we just add more NULL checks here, or does that need to happen
> > > at a higher level ?
> >
> > I think you'll find that it's not fapp that was NULL.
>
> yeah, brainfart
>
> > The caller was
> > pipe_rdwr_fasync -> fasync_helper, and pipe_rdwr_fasync always passes
> > in
> >
> > &pipe->fasync_readers
> >
> > (and writers) so it looks like it is pipe that was NULL. Really odd.
> > How did the open of the pipe succeed with a NULL i_pipe? We do have
> > i_pipe == NULL, but that should happen only with a not-yet-opened
> > pipe, or after the last close.
> >
> > In neither case should you have that pipe_rdwr_fasync() call.
> >
> > The fact that this happens for a delayed __fput() makes me think it
> > was never a successful open to begin with, but how did the FASYNC flag
> > get set in that case? Do we actually allow it in the open flags..
> > Hmm..
> >
> > So if we need new NULL pointer checks, I think they'd need to be
> > something like the attached patch.
>
> I'll give it a shot. Can't be any worse than what we have already.

I seem to be hitting the first pipe related oops a few times..
(Subject: pipe_release oops 3-4 mails upthread)

I don't know yet if that's just preventing me from getting to test the
pipe_rdwr_fasync path, but I haven't hit it since applying that patch.

Dave

2013-03-08 00:21:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: pipe_release oops.

On Thu, Mar 7, 2013 at 2:36 PM, Dave Jones <[email protected]> wrote:
>
> The hits keep on coming..
>
> [ 255.609172] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
> [ 255.610393] IP: [<ffffffff811bad62>] pipe_release+0x42/0xd0

Ok, I think this is the same issue as your fasync thing.

So add a "if (pipe) { }" in pipe_release() too.

It's a bit odd, becuse we always allocate an inode->i_pipe whenever we
associate the file with any of the pipe operations, so I don't see how
that NULL i_pipe happens. But we do have that "free_pipe_info() if
there are no readers or writers" logic and I bet that's what triggers
this somehow. I just don't see what the exact sequence is.

Linus

2013-03-08 14:53:15

by Dave Jones

[permalink] [raw]
Subject: Re: pipe_release oops.

On Thu, Mar 07, 2013 at 04:21:13PM -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 2:36 PM, Dave Jones <[email protected]> wrote:
> >
> > The hits keep on coming..
> >
> > [ 255.609172] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
> > [ 255.610393] IP: [<ffffffff811bad62>] pipe_release+0x42/0xd0
>
> Ok, I think this is the same issue as your fasync thing.
>
> So add a "if (pipe) { }" in pipe_release() too.

Yeah, that does the trick.
I changed your other diff a little to use a goto, which reduces a level of indentation..

Signed-off-by: Dave Jones <[email protected]>

diff --git a/fs/pipe.c b/fs/pipe.c
index 64a494c..49ba9cc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -740,6 +740,9 @@ pipe_release(struct inode *inode, int decr, int decw)

mutex_lock(&inode->i_mutex);
pipe = inode->i_pipe;
+ if (!pipe)
+ goto out_unlock;
+
pipe->readers -= decr;
pipe->writers -= decw;

@@ -750,6 +753,8 @@ pipe_release(struct inode *inode, int decr, int decw)
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
+
+out_unlock:
mutex_unlock(&inode->i_mutex);

return 0;
@@ -759,10 +764,11 @@ static int
pipe_read_fasync(int fd, struct file *filp, int on)
{
struct inode *inode = file_inode(filp);
- int retval;
+ int retval = 0;

mutex_lock(&inode->i_mutex);
- retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_readers);
+ if (inode->i_pipe)
+ retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_readers);
mutex_unlock(&inode->i_mutex);

return retval;
@@ -773,10 +779,11 @@ static int
pipe_write_fasync(int fd, struct file *filp, int on)
{
struct inode *inode = file_inode(filp);
- int retval;
+ int retval = 0;

mutex_lock(&inode->i_mutex);
- retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_writers);
+ if (inode->i_pipe)
+ retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_writers);
mutex_unlock(&inode->i_mutex);

return retval;
@@ -787,16 +794,22 @@ static int
pipe_rdwr_fasync(int fd, struct file *filp, int on)
{
struct inode *inode = file_inode(filp);
- struct pipe_inode_info *pipe = inode->i_pipe;
- int retval;
+ struct pipe_inode_info *pipe;
+ int retval = 0;

mutex_lock(&inode->i_mutex);
+ pipe = inode->i_pipe;
+ if (!pipe)
+ goto out_unlock;
+
retval = fasync_helper(fd, filp, on, &pipe->fasync_readers);
if (retval >= 0) {
retval = fasync_helper(fd, filp, on, &pipe->fasync_writers);
if (retval < 0) /* this can happen only if on == T */
fasync_helper(-1, filp, 0, &pipe->fasync_readers);
}
+
+out_unlock:
mutex_unlock(&inode->i_mutex);
return retval;
}

2013-03-08 15:04:13

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Wed, Mar 06, 2013 at 09:16:45PM -0500, Dave Jones wrote:

> kernel BUG at fs/namei.c:1441!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Modules linked in: bnep vmw_vsock_vmci_transport vmw_vmci vsock fuse l2tp_ppp l2tp_core hidp 8021q garp mrp llc2 netrom phonet bridge dlci stp caif_socket caif af_key pppoe pppox ppp_generic slhc af_rxrpc ipt_ULOG can_bcm can_raw nfnetlink scsi_transport_iscsi rfcomm rose can x25 decnet ipx p8023 p8022 nfc ax25 appletalk psnap llc irda atm rds crc_ccitt lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb snd_pcm microcode bluetooth pcspkr serio_raw edac_core snd_page_alloc rfkill snd_timer usb_debug snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
> CPU 2
> Pid: 8737, comm: trinity-child0 Not tainted 3.9.0-rc1+ #69 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff816bc622>] [<ffffffff816bc622>] lookup_slow+0x25/0xab
> RSP: 0018:ffff8800ad2cbc98 EFLAGS: 00010293
> RAX: 0000000000000001 RBX: ffff8800ad2cbde8 RCX: 0000000000000000
> RDX: 00000000000018ac RSI: ffff8800ad2cbd18 RDI: ffff88010b080040
> RBP: ffff8800ad2cbcc8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000011
> R13: ffff8800cebf9bc0 R14: ffff8800ad2cbde8 R15: ffff8800cfb13178
> FS: 00007f98b7830740(0000) GS:ffff88012b200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 00000000baffd000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child0 (pid: 8737, threadinfo ffff8800ad2ca000, task ffff880104ea4920)
> Stack:
> 0000000000000092 0000000000000000 0000000000000001 0000000000000011
> 0000000000000000 ffff8800ad2cbde8 ffff8800ad2cbd68 ffffffff811be75e
> ffff880104ea5040 0000000000000002 ffff8800ad2cbd08 ffffffff8100a196
> Call Trace:
> [<ffffffff811be75e>] path_lookupat+0x71e/0x740
> [<ffffffff8100a196>] ? native_sched_clock+0x26/0x90
> [<ffffffff810b2118>] ? trace_hardirqs_off_caller+0x28/0xc0
> [<ffffffff811be7b4>] filename_lookup+0x34/0xc0
> [<ffffffff811be8f2>] do_path_lookup+0x32/0x40
> [<ffffffff811beb7a>] kern_path+0x2a/0x50
> [<ffffffff811648f3>] ? might_fault+0x53/0xb0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff8116493c>] ? might_fault+0x9c/0xb0
> [<ffffffff811648f3>] ? might_fault+0x53/0xb0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff811d569d>] do_mount+0x8d/0xa00
> [<ffffffff811591ab>] ? memdup_user+0x4b/0x90
> [<ffffffff81159243>] ? strndup_user+0x53/0x70
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff811d609e>] sys_mount+0x8e/0xe0
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
> [<ffffffffa0001000>] ? ftrace_raw_event_kvm_age_page+0x70/0xd0 [kvm]
> Code: 55 48 89 e5 0f 0b 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 10 4c 8b 6f 08 48 8b 7f 30 49 3b 7d 30 74 02 <0f> 0b 48 81 c7 e8 00 00 00 49 89 f4 31 f6 e8 7b 52 00 00 8b 53
> RIP [<ffffffff816bc622>] lookup_slow+0x25/0xab
> RSP <ffff8800ad2cbc98>

Managed to trigger this one from a different path..

kernel BUG at fs/namei.c:1439!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: atm x25 af_key lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec snd_pcm microcode rfkill usb_debug snd_page_alloc serio_raw snd_timer pcspkr snd edac_core soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 1
Pid: 3025, comm: trinity-child1 Not tainted 3.9.0-rc1+ #77 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff816bfe24>] [<ffffffff816bfe24>] lookup_slow+0x25/0xab
RSP: 0018:ffff88010316dd18 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88010316de48 RCX: 0000000000000000
RDX: 0000000000001078 RSI: ffff88010316dd98 RDI: ffff8801053026f8
RBP: ffff88010316dd48 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000010
R13: ffff880104e95720 R14: ffff88010316de48 R15: 0000000001412d30
FS: 00007f80b9bb9740(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003e29e93880 CR3: 0000000108c39000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child1 (pid: 3025, threadinfo ffff88010316c000, task ffff8801048b2490)
Stack:
ffff8801048b2bb0 0000000000000000 0000000000000001 0000000000000010
0000000000000000 ffff88010316de48 ffff88010316dde8 ffffffff811c973e
ffff88010316dd88 00000000000000d0 00000000000000d0 ffff8801224a2300
Call Trace:
[<ffffffff811c973e>] path_lookupat+0x71e/0x740
[<ffffffff8119dbb1>] ? kmem_cache_alloc+0x111/0x330
[<ffffffff811c7dd0>] ? getname_flags.part.33+0x30/0x150
[<ffffffff810b32df>] ? lock_release_holdtime.part.24+0x8f/0x130
[<ffffffff811c9794>] filename_lookup+0x34/0xc0
[<ffffffff811cc58e>] user_path_at_empty+0x8e/0x110
[<ffffffff810b2868>] ? trace_hardirqs_off_caller+0x28/0xc0
[<ffffffff811cc621>] user_path_at+0x11/0x20
[<ffffffff811e4347>] sys_lgetxattr+0x37/0xa0
[<ffffffff8134b6ce>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff816d11c2>] system_call_fastpath+0x16/0x1b
Code: 55 48 89 e5 0f 0b 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 10 4c 8b 6f 08 48 8b 7f 30 49 3b 7d 30 74 02 <0f> 0b 48 81 c7 e8 00 00 00 49 89 f4 31 f6 e8 d9 52 00 00 8b 53
RIP [<ffffffff816bfe24>] lookup_slow+0x25/0xab
RSP <ffff88010316dd18>

What can I dump here that might give us more clues ?

The only clue I have on this one so far is that it's something in /proc.
(I have to run trinity with -V /proc right now, because of all those
weird sysfs corruption bugs until someone gets a handle on those).

Dave

2013-03-08 18:30:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: pipe_release oops.

On Fri, Mar 8, 2013 at 6:53 AM, Dave Jones <[email protected]> wrote:
>
> Yeah, that does the trick.
> I changed your other diff a little to use a goto, which reduces a level of indentation..

Hmm. So I've been trying to figure this out, and I really don't see
it. Every single pipe open routine *should* make sure that the inode
has an inode->i_pipe field. So if the open() has succeeded and you
have a valid file descriptor, the inode->i_pipe thing should be there.

I must be missing something, and I wonder if the thing I'm missing is
that with OPEN_PATH we may now have open calls that don't actually
have FMODE_READ or FMODE_WRITE set at all.

So suddenly we end up with these pipe openers that don't update the
counts, and I could imagine that really confusing us...

So I'm wondering if you could apply this patch, and see if that
triggers. In fact, please un-apply the other changes to fs/pipe.c too,
to see if it also obviates the need for checking i_pipe for NULL. You
should get the new warning (once), but you should not get any oopses..

Anyway, this would explain why the actual read/write paths don't need
to check for i_pipe - if FMODE_READ/WRITE aren't set, we'll never get
that far. But the release() and the fasync functions do get called
even for non-readable and non-writable files...

And Al, please get your email fixed. Is somebody usually on irc with
him or something?

Linus


Attachments:
patch.diff (479.00 B)

2013-03-08 18:51:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 7:04 AM, Dave Jones <[email protected]> wrote:
>
> Managed to trigger this one from a different path..
>
> kernel BUG at fs/namei.c:1439!
> Call Trace:
> [<ffffffff811c973e>] path_lookupat+0x71e/0x740
> [<ffffffff811c9794>] filename_lookup+0x34/0xc0
> [<ffffffff811cc58e>] user_path_at_empty+0x8e/0x110
> [<ffffffff811cc621>] user_path_at+0x11/0x20
> [<ffffffff811e4347>] sys_lgetxattr+0x37/0xa0
> [<ffffffff816d11c2>] system_call_fastpath+0x16/0x1b
>
> What can I dump here that might give us more clues ?

I think we should do the same thing and print out dentry names to give
us clues about where in /proc the problem happens.

And it doesn't really have to be proc - because of the symlinks in
/proc, we could easily get to /sys through processes like udev etc..

So how about just replaving that BUG_ON() with a

if (WARN_ON(nd->inode != parent->d_inode)) {
printk("%s -> %s (%s)\n", parent->d_name.name,
path->dentry->d_name.name, nd->last.name);
return -EINVAL;
}

or something like that.

Hopefully that gives some clue about whare it happens, and some idea
about what migth be going on..

Linus

2013-03-08 19:18:12

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 10:51:03AM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 7:04 AM, Dave Jones <[email protected]> wrote:
> >
> > Managed to trigger this one from a different path..
> >
> > kernel BUG at fs/namei.c:1439!
> > Call Trace:
> > [<ffffffff811c973e>] path_lookupat+0x71e/0x740
> > [<ffffffff811c9794>] filename_lookup+0x34/0xc0
> > [<ffffffff811cc58e>] user_path_at_empty+0x8e/0x110
> > [<ffffffff811cc621>] user_path_at+0x11/0x20
> > [<ffffffff811e4347>] sys_lgetxattr+0x37/0xa0
> > [<ffffffff816d11c2>] system_call_fastpath+0x16/0x1b
> >
> > What can I dump here that might give us more clues ?
>
> I think we should do the same thing and print out dentry names to give
> us clues about where in /proc the problem happens.
>
> And it doesn't really have to be proc - because of the symlinks in
> /proc, we could easily get to /sys through processes like udev etc..
>
> So how about just replaving that BUG_ON() with a
>
> if (WARN_ON(nd->inode != parent->d_inode)) {
> printk("%s -> %s (%s)\n", parent->d_name.name,
> path->dentry->d_name.name, nd->last.name);
> return -EINVAL;
> }
>
> or something like that.

Ugh, something in there is NULL..

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD 10883f067 PUD 10883e067 PMD 0
Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: decnet ipx p8023 p8022 af_rxrpc appletalk psnap can rose af_key ax25 caif_socket caif crc_ccitt nfc atm llc2 phonet llc x25 lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm usb_debug edac_core microcode serio_raw pcspkr rfkill snd_page_alloc snd_timer snd r8169 soundcore mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 2
Pid: 942, comm: trinity-child2 Not tainted 3.9.0-rc1+ #79 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<0000000000000000>] [< (null)>] (null)
RSP: 0018:ffff88010885ddc0 EFLAGS: 00010246
RAX: ffffffff8181f540 RBX: ffff88010b3eb090 RCX: 0000000000000000
RDX: 0000000000000600 RSI: ffff88010b3eb090 RDI: ffff88010b359be8
RBP: ffff88010885dde8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88010b3e9720
R13: ffff88010885df38 R14: 0000000000000000 R15: 000000006a59949a
FS: 00007feff7939740(0000) GS:ffff88012aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000108840000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child2 (pid: 942, threadinfo ffff88010885c000, task ffff88011939a490)
Stack:
ffffffff811c62fd ffff88010b3e9720 ffff88010885df38 0000000000000000
0000000000000600 ffff88010885de18 ffffffff811c6518 ffff88010885de28
0100000000000000 0000000000000000 0000000000000000 ffff88010885de28
Call Trace:
[<ffffffff811c62fd>] ? lookup_real+0x1d/0x60
[<ffffffff811c6518>] __lookup_hash+0x38/0x50
[<ffffffff811c6549>] lookup_hash+0x19/0x20
[<ffffffff811c9983>] kern_path_create+0x93/0x170
[<ffffffff811c7e36>] ? getname_flags.part.33+0x86/0x150
[<ffffffff811c9aaa>] user_path_create+0x4a/0x70
[<ffffffff811cc87c>] sys_mknodat+0xac/0x1d0
[<ffffffff816d1242>] system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff88010885ddc0>
CR2: 0000000000000000


I'll turn off DEBUG_PAGEALLOC again, so I can get the RIP and figure out which deref it is.

Dave

2013-03-08 19:20:47

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 02:18:04PM -0500, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 10:51:03AM -0800, Linus Torvalds wrote:
> > On Fri, Mar 8, 2013 at 7:04 AM, Dave Jones <[email protected]> wrote:
> > >
> > > Managed to trigger this one from a different path..
> > >
> > > kernel BUG at fs/namei.c:1439!
> > > Call Trace:
> > > [<ffffffff811c973e>] path_lookupat+0x71e/0x740
> > > [<ffffffff811c9794>] filename_lookup+0x34/0xc0
> > > [<ffffffff811cc58e>] user_path_at_empty+0x8e/0x110
> > > [<ffffffff811cc621>] user_path_at+0x11/0x20
> > > [<ffffffff811e4347>] sys_lgetxattr+0x37/0xa0
> > > [<ffffffff816d11c2>] system_call_fastpath+0x16/0x1b
> > >
> > > What can I dump here that might give us more clues ?
> >
> > I think we should do the same thing and print out dentry names to give
> > us clues about where in /proc the problem happens.
> >
> > And it doesn't really have to be proc - because of the symlinks in
> > /proc, we could easily get to /sys through processes like udev etc..
> >
> > So how about just replaving that BUG_ON() with a
> >
> > if (WARN_ON(nd->inode != parent->d_inode)) {
> > printk("%s -> %s (%s)\n", parent->d_name.name,
> > path->dentry->d_name.name, nd->last.name);
> > return -EINVAL;
> > }
> >
> > or something like that.
>
> Ugh, something in there is NULL..
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [< (null)>] (null)
> PGD 10883f067 PUD 10883e067 PMD 0
> Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Modules linked in: decnet ipx p8023 p8022 af_rxrpc appletalk psnap can rose af_key ax25 caif_socket caif crc_ccitt nfc atm llc2 phonet llc x25 lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm usb_debug edac_core microcode serio_raw pcspkr rfkill snd_page_alloc snd_timer snd r8169 soundcore mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
> CPU 2
> Pid: 942, comm: trinity-child2 Not tainted 3.9.0-rc1+ #79 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<0000000000000000>] [< (null)>] (null)
> RSP: 0018:ffff88010885ddc0 EFLAGS: 00010246
> RAX: ffffffff8181f540 RBX: ffff88010b3eb090 RCX: 0000000000000000
> RDX: 0000000000000600 RSI: ffff88010b3eb090 RDI: ffff88010b359be8
> RBP: ffff88010885dde8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: ffff88010b3e9720
> R13: ffff88010885df38 R14: 0000000000000000 R15: 000000006a59949a
> FS: 00007feff7939740(0000) GS:ffff88012aa00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000108840000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child2 (pid: 942, threadinfo ffff88010885c000, task ffff88011939a490)
> Stack:
> ffffffff811c62fd ffff88010b3e9720 ffff88010885df38 0000000000000000
> 0000000000000600 ffff88010885de18 ffffffff811c6518 ffff88010885de28
> 0100000000000000 0000000000000000 0000000000000000 ffff88010885de28
> Call Trace:
> [<ffffffff811c62fd>] ? lookup_real+0x1d/0x60
> [<ffffffff811c6518>] __lookup_hash+0x38/0x50
> [<ffffffff811c6549>] lookup_hash+0x19/0x20
> [<ffffffff811c9983>] kern_path_create+0x93/0x170
> [<ffffffff811c7e36>] ? getname_flags.part.33+0x86/0x150
> [<ffffffff811c9aaa>] user_path_create+0x4a/0x70
> [<ffffffff811cc87c>] sys_mknodat+0xac/0x1d0
> [<ffffffff816d1242>] system_call_fastpath+0x16/0x1b
> Code: Bad RIP value.
> RIP [< (null)>] (null)
> RSP <ffff88010885ddc0>
> CR2: 0000000000000000
>
> I'll turn off DEBUG_PAGEALLOC again, so I can get the RIP and figure out which deref it is.

Wait, it has to be that path->dentry->d_name.name.
The other derefs are the same as before.

Dave

2013-03-08 19:36:57

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 02:20:42PM -0500, Dave Jones wrote:

> > I'll turn off DEBUG_PAGEALLOC again, so I can get the RIP and figure out which deref it is.
>
> Wait, it has to be that path->dentry->d_name.name.
> The other derefs are the same as before.

I changed it to do this..

diff --git a/fs/namei.c b/fs/namei.c
index 961bc12..c1ca29e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -689,8 +689,6 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
-
- BUG_ON(nd->inode->i_op->follow_link);
}

static inline void put_link(struct nameidata *nd, struct path *link, void *cookie)
@@ -1438,7 +1436,13 @@ static int lookup_slow(struct nameidata *nd, struct path *path)
int err;

parent = nd->path.dentry;
- BUG_ON(nd->inode != parent->d_inode);
+
+ if (WARN_ON(nd->inode != parent->d_inode)) {
+ printk("%s -> %p (%s)\n", parent->d_name.name, path->dentry, nd->last.name);
+ return -EINVAL;
+ }
+
+

mutex_lock(&parent->d_inode->i_mutex);
dentry = __lookup_hash(&nd->last, parent, nd->flags);


And now I'm getting a different BUG_ON

[ 170.860317] kernel BUG at fs/namei.c:2737!
[ 170.860855] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 170.861719] Modules linked in: irda can phonet ipx p8023 p8022 af_rxrpc atm x25 af_802154 llc2 pppoe pppox ppp_generic slhc netrom decnet af_key rose nfc ax25 appletalk psnap llc caif_socket caif crc_ccitt rds lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_page_alloc microcode usb_debug snd_timer vhost_net snd rfkill serio_raw tun edac_core pcspkr macvtap soundcore macvlan kvm_amd r8169 mii kvm radeon backlight drm_kms_helper ttm
[ 170.869795] CPU 1
[ 170.870063] Pid: 998, comm: trinity-child1 Not tainted 3.9.0-rc1+ #80 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 170.871541] RIP: 0010:[<ffffffff811cbe34>] [<ffffffff811cbe34>] do_last+0xda4/0xe30
[ 170.872561] RSP: 0018:ffff880108577cc8 EFLAGS: 00010206
[ 170.873250] RAX: 0000000000000001 RBX: ffff880108577e58 RCX: 0000000000000000
[ 170.874162] RDX: 000000000000006d RSI: ffff880108819660 RDI: 0000000000000001
[ 170.875073] RBP: ffff880108577d98 R08: 0000000000000001 R09: 0000000000000000
[ 170.875986] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000008000
[ 170.876898] R13: ffff880108577df8 R14: ffff880108577f28 R15: ffff88010875c840
[ 170.877810] FS: 00007f5651395740(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
[ 170.878838] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 170.879580] CR2: 0000000000fe1418 CR3: 0000000108655000 CR4: 00000000000007e0
[ 170.880491] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 170.881402] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 170.882314] Process trinity-child1 (pid: 998, threadinfo ffff880108576000, task ffff880108578000)
[ 170.883436] Stack:
[ 170.883696] ffff8801159b5e48 ffff880108578000 ffff880108577ce8 ffffffff811c8088
[ 170.884746] ffff880108577d98 ffffffff811c8305 ffff880122444620 ffff880122444600
[ 170.885806] ffff880108577d28 ffff880108578000 ffff880108578000 ffff880108577d58
[ 170.886840] Call Trace:
[ 170.887181] [<ffffffff811c8088>] ? inode_permission+0x18/0x50
[ 170.887939] [<ffffffff811c8305>] ? link_path_walk+0x245/0x880
[ 170.888697] [<ffffffff811cbf7a>] path_openat+0xba/0x500
[ 170.889392] [<ffffffff810b2868>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 170.890245] [<ffffffff810b2792>] ? get_lock_stats+0x22/0x70
[ 170.890980] [<ffffffff810b2bfe>] ? put_lock_stats.isra.23+0xe/0x40
[ 170.891790] [<ffffffff811cc681>] do_filp_open+0x41/0xa0
[ 170.892485] [<ffffffff811dbe99>] ? __alloc_fd+0x179/0x230
[ 170.893200] [<ffffffff811bb764>] do_sys_open+0xf4/0x1e0
[ 170.893893] [<ffffffff811bb894>] sys_openat+0x14/0x20
[ 170.894565] [<ffffffff816d1202>] system_call_fastpath+0x16/0x1b
[ 170.895342] Code: 66 f8 ff ff 41 ba e6 ff ff ff e9 b2 f8 ff ff 0f 8c eb f4 ff ff 48 8b 75 90 c6 45 ac 00 48 8b 76 30 48 39 73 30 0f 84 8a f5 ff ff <0f> 0b be e9 09 00 00 48 c7 c7 37 8c a0 81 4c 89 9d 48 ff ff ff

BUG_ON(nd->inode != dir->d_inode);


It also looks like that printk didn't fire. Hmm.. Different bug ?
Christ, I'm having a hard time keeping track of all these.

Dave

2013-03-08 19:47:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 11:36 AM, Dave Jones <[email protected]> wrote:
>
> I changed it to do this..
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 961bc12..c1ca29e 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -689,8 +689,6 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
> nd->path = *path;
> nd->inode = nd->path.dentry->d_inode;
> nd->flags |= LOOKUP_JUMPED;
> -
> - BUG_ON(nd->inode->i_op->follow_link);
> }
>
> static inline void put_link(struct nameidata *nd, struct path *link, void *cookie)
> @@ -1438,7 +1436,13 @@ static int lookup_slow(struct nameidata *nd, struct path *path)
> int err;
>
> parent = nd->path.dentry;
> - BUG_ON(nd->inode != parent->d_inode);
> +
> + if (WARN_ON(nd->inode != parent->d_inode)) {
> + printk("%s -> %p (%s)\n", parent->d_name.name, path->dentry, nd->last.name);
> + return -EINVAL;
> + }

Ok, it might be nice to print out the path dentry name if it has one,
but it may well be that this only happens with negative dentries in
proc or sysfs, since you said that you just added testing of that..

> And now I'm getting a different BUG_ON

Heh. It's the same BUG_ON(), it's just replicated (and "parent" is
called "dir" here).

Maybe you can make the WARN_ON_ONCE() version be a macro, because that
test exists in multiple places: unlazy_walk, complete_walk,
lookup_slow and do_last (and walk_component in a different guise).

Linus

2013-03-08 19:51:01

by Jörn Engel

[permalink] [raw]
Subject: Re: pipe_release oops.

On Fri, 8 March 2013 10:30:01 -0800, Linus Torvalds wrote:
>
> Hmm. So I've been trying to figure this out, and I really don't see
> it. Every single pipe open routine *should* make sure that the inode
> has an inode->i_pipe field. So if the open() has succeeded and you
> have a valid file descriptor, the inode->i_pipe thing should be there.

Ok, here is a wild idea that is very likely wrong. But some
background first. I've had problems with process exit times and one
of the culprits turned out to be exit_files() where one device driver
went awol for several seconds. Fixing the device driver is hard, I
didn't see a good reason not to call exit_files() earlier and
exit_mm() was the other big offender, so the idea was to run both in
parallel and I applied the patch below.

As a result I've gotten a bunch of NULL pointer dereferences that only
happen in virtual machines, never on real hardware. For example
[<ffffffff81164bf8>] alloc_fd+0x38/0x130
[<ffffffff8114857e>] do_sys_open+0xee/0x1f0
[<ffffffff811486a1>] sys_open+0x21/0x30
[<ffffffff815bea29>] system_call_fastpath+0x16/0x1b

Now I can easily see how current->files being NULL will result in such
backtraces. I can also see how my patch moves the NULLing of
current->files a bit back in time. But I could never figure out how
my patch could have introduced a race that didn't exist before.

So the wild idea is that we have always had a very unlikely race with
current->files being NULL and trinity happens to hit it somehow.

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.


diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f9..5886799 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -4,6 +4,7 @@
* Copyright (C) 1991, 1992 Linus Torvalds
*/

+#include <linux/async.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/interrupt.h>
@@ -559,6 +560,11 @@ void exit_files(struct task_struct *tsk)
}
}

+static void exit_files_async(void *data, async_cookie_t cookie)
+{
+ exit_files(data);
+}
+
#ifdef CONFIG_MM_OWNER
/*
* A task is exiting. If it owned this mm, find a new owner for the mm.
@@ -905,6 +911,7 @@ static inline void check_stack_usage(void) {}
void do_exit(long code)
{
struct task_struct *tsk = current;
+ async_cookie_t files_cookie;
int group_dead;

profile_task_exit(tsk);
@@ -982,6 +989,7 @@ void do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

+ files_cookie = async_schedule(exit_files_async, tsk);
exit_mm(tsk);

if (group_dead)
@@ -990,7 +998,7 @@ void do_exit(long code)

exit_sem(tsk);
exit_shm(tsk);
- exit_files(tsk);
+ async_synchronize_cookie(files_cookie);
exit_fs(tsk);
exit_task_work(tsk);
check_stack_usage();
--
1.7.10.4

2013-03-08 21:04:38

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 11:47:54AM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 11:36 AM, Dave Jones <[email protected]> wrote:
> >
> > I changed it to do this..
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 961bc12..c1ca29e 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -689,8 +689,6 @@ void nd_jump_link(struct nameidata *nd, struct path *path)
> > nd->path = *path;
> > nd->inode = nd->path.dentry->d_inode;
> > nd->flags |= LOOKUP_JUMPED;
> > -
> > - BUG_ON(nd->inode->i_op->follow_link);
> > }
> >
> > static inline void put_link(struct nameidata *nd, struct path *link, void *cookie)
> > @@ -1438,7 +1436,13 @@ static int lookup_slow(struct nameidata *nd, struct path *path)
> > int err;
> >
> > parent = nd->path.dentry;
> > - BUG_ON(nd->inode != parent->d_inode);
> > +
> > + if (WARN_ON(nd->inode != parent->d_inode)) {
> > + printk("%s -> %p (%s)\n", parent->d_name.name, path->dentry, nd->last.name);
> > + return -EINVAL;
> > + }
>
> Ok, it might be nice to print out the path dentry name if it has one,
> but it may well be that this only happens with negative dentries in
> proc or sysfs, since you said that you just added testing of that..
>
> > And now I'm getting a different BUG_ON
>
> Heh. It's the same BUG_ON(), it's just replicated (and "parent" is
> called "dir" here).
>
> Maybe you can make the WARN_ON_ONCE() version be a macro, because that
> test exists in multiple places: unlazy_walk, complete_walk,
> lookup_slow and do_last (and walk_component in a different guise).

queue up the sad trombone noises.

One of the things trinity passes syscalls is a page of deformed unicode.
Apparently this page is so fucked up, that it crashes *printk*.

Dave


[ 131.811418] WARNING: at fs/namei.c:2746 do_last+0xdb5/0xec0()
[ 131.812156] Hardware name: GA-MA78GM-S2H
[ 131.812659] Modules linked in: netrom(+) ax25 caif_socket caif irda crc_ccitt ipx af_802154 p8023 p8022 decnet appletalk psnap x25 llc af_rxrpc rds atm pppoe pppox ppp_generic slhc phonet nfc can_raw can lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb bluetooth snd_pcm microcode snd_page_alloc rfkill snd_timer snd serio_raw edac_core vhost_net usb_debug pcspkr tun macvtap macvlan soundcore kvm_amd r8169 kvm mii radeon backlight drm_kms_helper ttm
[ 131.838316] Pid: 742, comm: trinity-child2 Not tainted 3.9.0-rc1+ #82
[ 131.848590] Call Trace:
[ 131.848969] [<ffffffff81045115>] warn_slowpath_common+0x75/0xa0
[ 131.849742] [<ffffffff8104515a>] warn_slowpath_null+0x1a/0x20
[ 131.850495] [<ffffffff811cbb35>] do_last+0xdb5/0xec0
[ 131.851150] [<ffffffff811c7d78>] ? inode_permission+0x18/0x50
[ 131.851900] [<ffffffff811c7ff5>] ? link_path_walk+0x245/0x880
[ 131.852651] [<ffffffff811cbcfa>] path_openat+0xba/0x500
[ 131.853340] [<ffffffff810b27f8>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 131.854186] [<ffffffff810b2722>] ? get_lock_stats+0x22/0x70
[ 131.854915] [<ffffffff810b2b8e>] ? put_lock_stats.isra.23+0xe/0x40
[ 131.855718] [<ffffffff811cc401>] do_filp_open+0x41/0xa0
[ 131.856407] [<ffffffff811dbc19>] ? __alloc_fd+0x179/0x230
[ 131.857116] [<ffffffff811bb414>] do_sys_open+0xf4/0x1e0
[ 131.857804] [<ffffffff811bb521>] sys_open+0x21/0x30
[ 131.858517] [<ffffffff816d1082>] system_call_fastpath+0x16/0x1b
[ 131.859303] [<ffffffffa0001001>] ? ttm_dma_tt_fini+0x71/0xa0 [ttm]
[ 131.937423] ---[ end trace cfbe25dc62f850d2 ]---
[ 131.938049] ->
[ 131.938274] general protection fault: 0000 [#1]
[ 131.938896] PREEMPT SMP
[ 131.939097] Modules linked in: netrom ax25 caif_socket caif irda crc_ccitt ipx af_802154 p8023 p8022 decnet appletalk psnap x25 llc af_rxrpc rds atm pppoe pppox ppp_generic slhc phonet nfc can_raw can lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb bluetooth snd_pcm microcode snd_page_alloc rfkill snd_timer snd serio_raw edac_core vhost_net usb_debug pcspkr tun macvtap macvlan soundcore kvm_amd r8169 kvm mii radeon backlight drm_kms_helper ttm
[ 131.947457] CPU 2
[ 131.947733] Pid: 742, comm: trinity-child2 Tainted: G W 3.9.0-rc1+ #82 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 131.949341] RIP: 0010:[<ffffffff81345fdd>] [<ffffffff81345fdd>] strnlen+0xd/0x40
[ 131.950362] RSP: 0018:ffff88011084bae8 EFLAGS: 00010086
[ 131.951062] RAX: ffffffff819e980c RBX: ffffffff82074da0 RCX: fffffffffffffffe
[ 131.951985] RDX: af0f48ef7bdef7bd RSI: ffffffffffffffff RDI: af0f48ef7bdef7bd
[ 131.952907] RBP: ffff88011084bae8 R08: 000000000000ffff R09: 000000000000ffff
[ 131.953829] R10: 0000000000000001 R11: 0000000000000000 R12: af0f48ef7bdef7bd
[ 131.954751] R13: ffffffff82075180 R14: 000000000000ffff R15: 0000000000000000
[ 131.955674] FS: 00007f376da9b740(0000) GS:ffff88012b200000(0000) knlGS:0000000000000000
[ 131.956714] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 131.957466] CR2: 00007f772904d000 CR3: 000000011082e000 CR4: 00000000000007e0
[ 131.958388] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 131.959310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 131.960234] Process trinity-child2 (pid: 742, threadinfo ffff88011084a000, task ffff880113ccc920)
[ 131.961370] Stack:
[ 131.961629] ffff88011084bb28 ffffffff813479ce ffffffff81c2c600 ffffffff82074da0
[ 131.962730] ffffffff82075180 ffff88011084bc70 ffffffff819e7b02 ffffffff819e7b02
[ 131.963848] ffff88011084bba8 ffffffff81348ba9 ffffffff81c2c5a0 ffffffff810b27f8
[ 131.966843] Call Trace:
[ 131.969052] [<ffffffff813479ce>] string.isra.3+0x3e/0xc0
[ 131.971642] [<ffffffff81348ba9>] vsnprintf+0x1f9/0x610
[ 131.974208] [<ffffffff810b27f8>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 131.976927] [<ffffffff81349081>] vscnprintf+0x11/0x30
[ 131.979441] [<ffffffff810478f1>] vprintk_emit+0x111/0x590
[ 131.981987] [<ffffffff811cbb35>] ? do_last+0xdb5/0xec0
[ 131.984503] [<ffffffff816bb79b>] printk+0x61/0x63
[ 131.986960] [<ffffffff811cbb6b>] do_last+0xdeb/0xec0
[ 131.989446] [<ffffffff811c7d78>] ? inode_permission+0x18/0x50
[ 131.992041] [<ffffffff811c7ff5>] ? link_path_walk+0x245/0x880
[ 131.994631] [<ffffffff811cbcfa>] path_openat+0xba/0x500
[ 131.997141] [<ffffffff810b27f8>] ? trace_hardirqs_off_caller+0x28/0xc0
[ 131.999804] [<ffffffff810b2722>] ? get_lock_stats+0x22/0x70
[ 132.002334] [<ffffffff810b2b8e>] ? put_lock_stats.isra.23+0xe/0x40
[ 132.004941] [<ffffffff811cc401>] do_filp_open+0x41/0xa0
[ 132.007428] [<ffffffff811dbc19>] ? __alloc_fd+0x179/0x230
[ 132.009899] [<ffffffff811bb414>] do_sys_open+0xf4/0x1e0
[ 132.012297] [<ffffffff811bb521>] sys_open+0x21/0x30
[ 132.014603] [<ffffffff816d1082>] system_call_fastpath+0x16/0x1b
[ 132.016993] [<ffffffffa0001001>] ? ttm_dma_tt_fini+0x71/0xa0 [ttm]
[ 132.019385] Code: c0 80 38 00 75 f8 48 29 f8 5d c3 31 c0 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 28 <80> 3f 00 74 23 48 89 f8 31 d2 eb 0f 0f 1f 80 00 00 00 00 48 ff
[ 132.026917] RIP [<ffffffff81345fdd>] strnlen+0xd/0x40
[ 132.029301] RSP <ffff88011084bae8>


It then goes into a death spiral recursing over the same trace.

Dave

2013-03-08 22:41:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 1:04 PM, Dave Jones <[email protected]> wrote:
>
> queue up the sad trombone noises.
>
> One of the things trinity passes syscalls is a page of deformed unicode.
> Apparently this page is so fucked up, that it crashes *printk*.

It's probably my debug stuff that is bogus. One of the string pointers
passed to printk %s seems bad, and instead of being a proper kernel
pointer it's "0xaf0f48ef7bdef7bd". So trying to access it causes a GP
fault (it's not a validly formed pointer)

I'm not seeing what the problem is, but I'll mull on it..

Linus

2013-03-08 23:07:40

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 02:41:19PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 1:04 PM, Dave Jones <[email protected]> wrote:
> >
> > queue up the sad trombone noises.
> >
> > One of the things trinity passes syscalls is a page of deformed unicode.
> > Apparently this page is so fucked up, that it crashes *printk*.
>
> It's probably my debug stuff that is bogus. One of the string pointers
> passed to printk %s seems bad, and instead of being a proper kernel
> pointer it's "0xaf0f48ef7bdef7bd". So trying to access it causes a GP
> fault (it's not a validly formed pointer)
>
> I'm not seeing what the problem is, but I'll mull on it..

Ok, got something more meaningful out of the lookup_slow trace.

[ 66.082984] parent->dname.name (06b6b6b6b6b6b6b)
[ 66.083637] parent =

At first I thought AH-HA! SLAB POISON!
But look closer.. it's shifted by 8 bits.

Also, this isn't a pointer, that's the output of..

printk("parent->dname.name %s (%x%x%x%x%x%x%x%x)\n", parent->d_name.name,
parent->d_name.name[0],
parent->d_name.name[1],
parent->d_name.name[2],
parent->d_name.name[3],
parent->d_name.name[4],
parent->d_name.name[5],
parent->d_name.name[6],
parent->d_name.name[7]);

Interestingly, that pattern always seems to be the same across different reboots.

parent seems to be a pointer to "\0".


Another oddball crash..



BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD 10f366067 PUD 10f36c067 PMD 0
Oops: 0010 [#1] PREEMPT SMP
Modules linked in: can_bcm irda pppoe pppox ppp_generic can slhc af_802154 atm rds af_key phonet nfc ipx p8023 p8022 af_rxrpc caif_socket caif crc_ccitt decnet netrom appletalk x25 psnap llc rose ax25 lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm btusb bluetooth usb_debug snd_page_alloc microcode snd_timer rfkill snd serio_raw pcspkr edac_core soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 0
Pid: 822, comm: trinity-child0 Not tainted 3.9.0-rc1+ #85 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<0000000000000000>] [< (null)>] (null)
RSP: 0018:ffff88010f033db0 EFLAGS: 00010246
RAX: ffffffff8181f540 RBX: ffff8800cf49c940 RCX: 0000000000000000
RDX: 0000000000000600 RSI: ffff8800cf49c940 RDI: ffff8800cf492170
RBP: ffff88010f033dd8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8800cf49f530
R13: ffff88010f033f28 R14: 0000000000000000 R15: ffffffffa0000000
FS: 00007f29dffed740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010f365000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child0 (pid: 822, threadinfo ffff88010f032000, task ffff8801103ec920)
Stack:
ffffffff811c5fad ffff8800cf49f530 ffff88010f033f28 0000000000000000
0000000000000600 ffff88010f033e08 ffffffff811c61c8 ffff88010f033e18
0100000000000000 0000000000000000 0000000000000000 ffff88010f033e18
Call Trace:
[<ffffffff811c5fad>] ? lookup_real+0x1d/0x60
[<ffffffff811c61c8>] __lookup_hash+0x38/0x50
[<ffffffff811c61f9>] lookup_hash+0x19/0x20
[<ffffffff811c96a5>] kern_path_create+0x95/0x170
[<ffffffff811c7b56>] ? getname_flags.part.33+0x86/0x150
[<ffffffffa0000000>] ? 0xffffffff9fffffff
[<ffffffff811c97ca>] user_path_create+0x4a/0x70
[<ffffffff811cc66c>] sys_mknodat+0xac/0x1d0
[<ffffffff8134b4be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffffa0000000>] ? 0xffffffff9fffffff
[<ffffffff811cc7ad>] sys_mknod+0x1d/0x20
[<ffffffff816d1142>] system_call_fastpath+0x16/0x1b
[<ffffffffa0000000>] ? 0xffffffff9fffffff
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff88010f033db0>
CR2: 0000000000000000
---[ end trace f74c72ebf05e8c15 ]---

I'll add some similar printk's to lookup_real and see what falls out.

Dave

2013-03-08 23:14:35

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 06:07:34PM -0500, Dave Jones wrote:
> parent seems to be a pointer to "\0".

Ignore that last part, it's wrong.

Dave

2013-03-08 23:20:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 3:07 PM, Dave Jones <[email protected]> wrote:
>
> Ok, got something more meaningful out of the lookup_slow trace.
>
> [ 66.082984] parent->dname.name (06b6b6b6b6b6b6b)
> [ 66.083637] parent =
>
> At first I thought AH-HA! SLAB POISON!
> But look closer.. it's shifted by 8 bits.

Or just the high byte has been cleared.

But yeah, if the parent has been free'd then that certainly explains
why the "impossible" test of

nd->inode != parent->d_inode

would trigger. And it would explain any odd crashes at lookup time
too. In particular, the NULL pointer one you reference seems to be
dir->i_op->lookup being NULL, so calling it (understandable) ends up
doing bad things.

I really don't understand how the parent could be free'd early.
Dentries are freed by RCU, and the dentry lookup code is some of the
most well-tested out there. I don't see how /proc could mess that up,
unless it just completely screws up some refcounting thing or other.

Linus

2013-03-08 23:28:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

Oh, btw, I'm assuming you're testing current git like you usually do.

If so, just to humor me, can you try plain 3.8 (or 3.8.2 or whatever
the current stable kernel is). Because while the oopses may be due to
your extensions to trinity, let's make sure. Maybe something bad
happened in the merge window, and we're just wasting time trying to
figure it out, when it could be just bisected...

The pipe and symlink BUG_ON() bugs you've found are likely real and
old, so keep the patches for those around, but let's just verify that
the BUG_ON(nd->inode != parent->d_inode);" is old too.

I suspect it's a real and old bug, and that you'll be able to recreate
it with the stable kernels too, but just to make sure..

Linus

2013-03-08 23:30:45

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 03:20:40PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 3:07 PM, Dave Jones <[email protected]> wrote:
> >
> > Ok, got something more meaningful out of the lookup_slow trace.
> >
> > [ 66.082984] parent->dname.name (06b6b6b6b6b6b6b)
> > [ 66.083637] parent =
> >
> > At first I thought AH-HA! SLAB POISON!
> > But look closer.. it's shifted by 8 bits.
>
> Or just the high byte has been cleared.

Duh, yes. I wasn't zero padding, so that's 006b6b6b.

Here's another run, with something similar in path->dentry..

[ 100.729395] parent->dname.name (06 b6 b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b)
[ 100.729397] path->dentry: 6d6e740 6b6b6b6b
[ 100.729401] nd->last.name = (\xffffffd2.W._.N.".\xfffffffe.\xffffff80.^N.?.\xffffffe4.E.8.g.\xffffffd2.N.\xffffffb6.^G.\xfffffff1.\xffffffcc.U.\xffffffda.^_.h.^M.1.\xffffffc5.\xffffff82.%.B.\xffffffe0.\xffffffad.^U.8.^L.c.Z.^K.\xffffffe4.h.J.\xffffffc8.\xffffffad.\xffffff83./.\xffffff80.\xffffffd1.\xffffffe5.\xffffff87.\xffffffc3.\xffffffb2.\xffffffdc.\xffffff9d.\xffffffd1.E.\xffffffab.^B.
(garbage continues..)

2013-03-08 23:34:31

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 03:28:01PM -0800, Linus Torvalds wrote:
> Oh, btw, I'm assuming you're testing current git like you usually do.

yes.

> If so, just to humor me, can you try plain 3.8 (or 3.8.2 or whatever
> the current stable kernel is). Because while the oopses may be due to
> your extensions to trinity, let's make sure. Maybe something bad
> happened in the merge window, and we're just wasting time trying to
> figure it out, when it could be just bisected...

Sure, will do.

> The pipe and symlink BUG_ON() bugs you've found are likely real and
> old, so keep the patches for those around, but let's just verify that
> the BUG_ON(nd->inode != parent->d_inode);" is old too.
>
> I suspect it's a real and old bug, and that you'll be able to recreate
> it with the stable kernels too, but just to make sure..

Hopefully I won't uncover anything new.

Dave

2013-03-08 23:45:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 3:30 PM, Dave Jones <[email protected]> wrote:
> [ 100.729401] nd->last.name = (\xffffffd2.W._.N.".\xfffffffe.\xffffff80.^N.?.\xffffffe4.E.8.g.\xffffffd2.N.\xffffffb6.^G.\xfffffff1.\xffffffcc.U.\xffffffda.^_.h.^M.1.\xffffffc5.\xffffff82.%.B.\xffffffe0.\xffffffad.^U.8.^L.c.Z.^K.\xffffffe4.h.J.\xffffffc8.\xffffffad.\xffffff83./.\xffffff80.\xffffffd1.\xffffffe5.\xffffff87.\xffffffc3.\xffffffb2.\xffffffdc.\xffffff9d.\xffffffd1.E.\xffffffab.^B.

You're printing out a 'char' with %02x, aren't you? And then the sign
extensions gives you six extra 'f' character every time the char is
negative.

If using %02x, make sure you use 'unsigned char', or "& 255".

Depending on just what is going on, the "nd->last.name" thing really
is probably a valid string. Strictly speaking it's not necessarily
NUL-terminated, though, and should be printed out to a maximum of
"nd->last.len" characters (but the whole path will be NUL-terminated,
so it's all ok, you just might print out more than one component)

Linus

2013-03-08 23:47:20

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 03:28:01PM -0800, Linus Torvalds wrote:
> Oh, btw, I'm assuming you're testing current git like you usually do.
>
> If so, just to humor me, can you try plain 3.8 (or 3.8.2 or whatever
> the current stable kernel is). Because while the oopses may be due to
> your extensions to trinity, let's make sure. Maybe something bad
> happened in the merge window, and we're just wasting time trying to
> figure it out, when it could be just bisected...
>
> The pipe and symlink BUG_ON() bugs you've found are likely real and
> old, so keep the patches for those around, but let's just verify that
> the BUG_ON(nd->inode != parent->d_inode);" is old too.
>
> I suspect it's a real and old bug, and that you'll be able to recreate
> it with the stable kernels too, but just to make sure..
>
> Linus

That didn't take long..

[ 163.807774] kernel BUG at fs/namei.c:1446!
[ 163.807837] invalid opcode: 0000 [#1] SMP
[ 163.807910] Modules linked in: ipt_ULOG af_key irda crc_ccitt nfc pppoe pppox ppp_generic slhc atm lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep btusb snd_seq snd_seq_device snd_pcm bluetooth ppdev snd_page_alloc rfkill microcode usb_debug serio_raw k10temp edac_core edac_mce_amd snd_timer sp5100_tco snd soundcore i2c_piix4 r8169 lm63 mii parport_pc parport vhost_net tun macvtap macvlan kvm_amd kvm ata_generic pata_acpi firewire_ohci firewire_core pata_atiixp crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core wmi
[ 163.809090] CPU 1
[ 163.809126] Pid: 5831, comm: trinity-child1 Not tainted 3.8.1-201.fc18.x86_64.debug #1 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 163.809303] RIP: 0010:[<ffffffff8171232d>] [<ffffffff8171232d>] lookup_slow+0x25/0xad
[ 163.809424] RSP: 0018:ffff8800cf895d08 EFLAGS: 00010212
[ 163.809499] RAX: ffff8801129db790 RBX: ffff8800cf895e48 RCX: 00000000000002e0
[ 163.809595] RDX: ffff8800cf895d98 RSI: ffff8800cf895e58 RDI: ffff8800cf895e48
[ 163.809691] RBP: ffff8800cf895d38 R08: 0000000000000001 R09: 0000000000000001
[ 163.809787] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
[ 163.809883] R13: ffff8800cf4e46f0 R14: 00000000ffffff9c R15: ffff8800cf895e48
[ 163.809981] FS: 00007f2416788740(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
[ 163.810091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 163.810170] CR2: 0000000002222000 CR3: 0000000036eda000 CR4: 00000000000007e0
[ 163.810269] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 163.810367] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 163.810464] Process trinity-child1 (pid: 5831, threadinfo ffff8800cf894000, task ffff880111ff8000)
[ 163.810582] Stack:
[ 163.810614] ffffea000324ac00 0000000000000000 0000000000000001 0000000000000010
[ 163.810737] 0000000000000000 00000000ffffff9c ffff8800cf895de8 ffffffff811e8ced
[ 163.810859] 0000000000000000 ffffffff81a0bb2c ffff8800cf895d88 00000000000000d0
[ 163.810982] Call Trace:
[ 163.811026] [<ffffffff811e8ced>] path_lookupat+0x74d/0x770
[ 163.811108] [<ffffffff811bcfe8>] ? kmem_cache_alloc+0xe8/0x350
[ 163.811195] [<ffffffff810d69cf>] ? lock_release_holdtime.part.26+0xf/0x180
[ 163.811294] [<ffffffff811e8d44>] filename_lookup+0x34/0xc0
[ 163.811374] [<ffffffff811ebb9e>] user_path_at_empty+0x8e/0x110
[ 163.811507] [<ffffffff81021e49>] ? sched_clock+0x9/0x10
[ 163.811586] [<ffffffff810ae555>] ? sched_clock_cpu+0xc5/0x120
[ 163.811694] [<ffffffff810d606d>] ? trace_hardirqs_off+0xd/0x10
[ 163.811779] [<ffffffff810ae61f>] ? local_clock+0x6f/0x80
[ 163.811857] [<ffffffff811ebc31>] user_path_at+0x11/0x20
[ 163.811935] [<ffffffff81203e58>] sys_lgetxattr+0x38/0x90
[ 163.812014] [<ffffffff81724799>] system_call_fastpath+0x16/0x1b
[ 163.812096] Code: 55 48 89 e5 0f 0b 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 10 4c 8b 6f 08 48 8b 47 30 49 3b 45 30 74 02 <0f> 0b 48 8d b8 e8 00 00 00 49 89 f6 31 f6 49 89 d4 e8 3d 4d 00
[ 163.812748] RIP [<ffffffff8171232d>] lookup_slow+0x25/0xad
[ 163.812833] RSP <ffff8800cf895d08>
[ 163.813040] ---[ end trace f67521cc7acbb145 ]---

2013-03-08 23:51:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 3:47 PM, Dave Jones <[email protected]> wrote:
>
> That didn't take long..

Ok, thanks, so it's not something new to this merge window. Not that I
expected it to be, but better safe than sorry.

Linus

2013-03-08 23:55:20

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 03:45:37PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 3:30 PM, Dave Jones <[email protected]> wrote:
> > [ 100.729401] nd->last.name = (\xffffffd2.W._.N.".\xfffffffe.\xffffff80.^N.?.\xffffffe4.E.8.g.\xffffffd2.N.\xffffffb6.^G.\xfffffff1.\xffffffcc.U.\xffffffda.^_.h.^M.1.\xffffffc5.\xffffff82.%.B.\xffffffe0.\xffffffad.^U.8.^L.c.Z.^K.\xffffffe4.h.J.\xffffffc8.\xffffffad.\xffffff83./.\xffffff80.\xffffffd1.\xffffffe5.\xffffff87.\xffffffc3.\xffffffb2.\xffffffdc.\xffffff9d.\xffffffd1.E.\xffffffab.^B.
>
> You're printing out a 'char' with %02x, aren't you? And then the sign
> extensions gives you six extra 'f' character every time the char is
> negative.
>
> If using %02x, make sure you use 'unsigned char', or "& 255".
>
> Depending on just what is going on, the "nd->last.name" thing really
> is probably a valid string. Strictly speaking it's not necessarily
> NUL-terminated, though, and should be printed out to a maximum of
> "nd->last.len" characters (but the whole path will be NUL-terminated,
> so it's all ok, you just might print out more than one component)
>
> Linus

That one was printed out with %s

Dave

2013-03-09 00:02:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 3:55 PM, Dave Jones <[email protected]> wrote:
>
> That one was printed out with %s

Ok, so those random pathnames you generate? They're funky.

Linus

2013-03-09 00:19:23

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 04:02:02PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 3:55 PM, Dave Jones <[email protected]> wrote:
> >
> > That one was printed out with %s
>
> Ok, so those random pathnames you generate? They're funky.

I just did a test with just a page of 'A's, and got the same result,
so while that unicode stuff looks fancy, it isn't necessary to tickle
these bugs.

Trying to figure out exactly what change I made that started triggering this.

Dave

2013-03-09 00:27:14

by Peter Hurley

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

[ +Andrew Morton ]

On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> Trying to reproduce that nd_jump_link trace, but I keep hitting other bugs
> instead. It's like whackamole. Except these are even more annoying
> than moles.

Dave,
I thought I copied you on the 'ipc MSG_COPY fixes' patchset that fixes
this. Or is this gp fault happening with that patchset?

Linus,
The fixes should be in your inbox (from Andrew) titled:
[patch 01/11] ipc: fix potential oops when src msg > 4k w/ MSG_COPY
[patch 02/11] ipc: don't allocate a copy larger than max

> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> CPU 0
> Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> Stack:
> ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> Call Trace:
> [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
>
> 0000000000000000 <.text>:
> 0: 48 3b 37 cmp (%rdi),%rsi
> 3: 75 d5 jne 0xffffffffffffffda
> 5: b8 01 00 00 00 mov $0x1,%eax
> a: 5d pop %rbp
> b: c3 retq
> c: 0f 1f 40 00 nopl 0x0(%rax)
> 10: 48 3b 37 cmp (%rdi),%rsi
> 13: 74 c5 je 0xffffffffffffffda
>
> rdi is ascii. "ived-sy�" Curious.
>
> EIP is here in testmsg.
>
> case SEARCH_EQUAL:
> if (msg->m_type == type)
> 240: 48 3b 37 cmp (%rdi),%rsi
> 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> {
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-03-09 00:30:18

by Raymond Jennings

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 4:19 PM, Dave Jones <[email protected]> wrote:
> On Fri, Mar 08, 2013 at 04:02:02PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 8, 2013 at 3:55 PM, Dave Jones <[email protected]> wrote:
> > >
> > > That one was printed out with %s
> >
> > Ok, so those random pathnames you generate? They're funky.
>
> I just did a test with just a page of 'A's, and got the same result,
> so while that unicode stuff looks fancy, it isn't necessary to tickle
> these bugs.
>
> Trying to figure out exactly what change I made that started triggering this.

Just curious but is this bug present in 3.7? 3.6? etc?

I'm actually a bit curious how "old" old is in this case.


> Dave
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-03-09 00:32:55

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Fri, Mar 08, 2013 at 07:27:01PM -0500, Peter Hurley wrote:
> [ +Andrew Morton ]
>
> On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> > Trying to reproduce that nd_jump_link trace, but I keep hitting other bugs
> > instead. It's like whackamole. Except these are even more annoying
> > than moles.
>
> Dave,
> I thought I copied you on the 'ipc MSG_COPY fixes' patchset that fixes
> this. Or is this gp fault happening with that patchset?

I hadn't gotten around to applying them to my testing tree.
Worth noting that of the dozens of oopses I've seen the last few days
I think I only saw this one once.

(Though I've narrowed my testing scope right now to try and tickle
those fs/namei.c bugs, so it could just be that I'm not exercising
the codepaths that caused the ipc bug)

Dave

2013-03-09 00:36:22

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 07:19:17PM -0500, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 04:02:02PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 8, 2013 at 3:55 PM, Dave Jones <[email protected]> wrote:
> > >
> > > That one was printed out with %s
> >
> > Ok, so those random pathnames you generate? They're funky.
>
> I just did a test with just a page of 'A's, and got the same result,
> so while that unicode stuff looks fancy, it isn't necessary to tickle
> these bugs.
>
> Trying to figure out exactly what change I made that started triggering this.

Ok, it's definitly the 'append something on the end of a valid pathname'
changeset. 'something' can be anything it seems.

Dave

2013-03-09 01:18:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 4:36 PM, Dave Jones <[email protected]> wrote:
>
> Ok, it's definitly the 'append something on the end of a valid pathname'
> changeset. 'something' can be anything it seems.

Ok. so maybe the way to "bisect" this is to play with that.

For example, does it happen even if the "something" does not have a
slash in it? IOW, you just append, say, a single 'a' character to a
name that doesn't already end in a slash?

And if it still does happen with that, perhaps you could have some
logic that only appends the 'a' to names that end with numbers. Does
it stop happening?

The machine keeps running, right? So you can try this out without
rebooting, just changing when you append the character?

Linus

2013-03-09 02:03:23

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 05:18:29PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 4:36 PM, Dave Jones <[email protected]> wrote:
> >
> > Ok, it's definitly the 'append something on the end of a valid pathname'
> > changeset. 'something' can be anything it seems.
>
> Ok. so maybe the way to "bisect" this is to play with that.
>
> For example, does it happen even if the "something" does not have a
> slash in it? IOW, you just append, say, a single 'a' character to a
> name that doesn't already end in a slash?

existing pathname + 'a' = fine.

existing pathname + '/' + 'a' = boom.

> And if it still does happen with that, perhaps you could have some
> logic that only appends the 'a' to names that end with numbers. Does
> it stop happening?

Looks like if I do this..

if (isdigit(newpath[len]) != 0) {
newpath[len] = '/';
newpath[len+1] = 'A';
newpath[len+2] = 0;

no bug.

If I change that to == 0, I get the bug.

> The machine keeps running, right? So you can try this out without
> rebooting, just changing when you append the character?

Sometimes it gets wedged somewhere, so not always.
Quick to reboot though, so no biggie.

Now trying to lower the frequency at which it does the mangling
to pinpoint the affected file(s)

I need to run at least 8 child processes (-C8) to get this to
happen (machine has 4 cores). Unfortunatly that means the logging
gets a bit spewy if they're all mangling at the same time.

Dave

2013-03-09 02:08:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 6:03 PM, Dave Jones <[email protected]> wrote:
>
> existing pathname + 'a' = fine.
>
> existing pathname + '/' + 'a' = boom.

Good.

> Looks like if I do this..
>
> if (isdigit(newpath[len]) != 0) {
> newpath[len] = '/';
> newpath[len+1] = 'A';
> newpath[len+2] = 0;
>
> no bug.

Well, but that will never trigger. newpath[len] will always be NUL, so
you just disabled things entirely. Use "len-1".

So I don't think that did what you meant it to do.

Linus

2013-03-09 02:26:28

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 06:08:52PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 6:03 PM, Dave Jones <[email protected]> wrote:
> >
> > existing pathname + 'a' = fine.
> >
> > existing pathname + '/' + 'a' = boom.
>
> Good.
>
> > Looks like if I do this..
> >
> > if (isdigit(newpath[len]) != 0) {
> > newpath[len] = '/';
> > newpath[len+1] = 'A';
> > newpath[len+2] = 0;
> >
> > no bug.
>
> Well, but that will never trigger. newpath[len] will always be NUL, so
> you just disabled things entirely. Use "len-1".
>
> So I don't think that did what you meant it to do.

Fixed that up, and even double checked my sanity by printing stuff out.

Confirmed that it's something that doesn't end in a number.

Dave

2013-03-09 02:56:37

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 09:26:23PM -0500, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 06:08:52PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 8, 2013 at 6:03 PM, Dave Jones <[email protected]> wrote:
> > >
> > > existing pathname + 'a' = fine.
> > >
> > > existing pathname + '/' + 'a' = boom.
> >
> > Good.
> >
> > > Looks like if I do this..
> > >
> > > if (isdigit(newpath[len]) != 0) {
> > > newpath[len] = '/';
> > > newpath[len+1] = 'A';
> > > newpath[len+2] = 0;
> > >
> > > no bug.
> >
> > Well, but that will never trigger. newpath[len] will always be NUL, so
> > you just disabled things entirely. Use "len-1".
> >
> > So I don't think that did what you meant it to do.
>
> Fixed that up, and even double checked my sanity by printing stuff out.
>
> Confirmed that it's something that doesn't end in a number.

I've got a hunch that it's /proc/$$/ns/uts.

After 3-4 runs, that's the only common file in the last few that got mangled.

I'll do some more tests, but this might be the one.

Dave

2013-03-09 02:57:57

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 09:56:31PM -0500, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 09:26:23PM -0500, Dave Jones wrote:
> > On Fri, Mar 08, 2013 at 06:08:52PM -0800, Linus Torvalds wrote:
> > > On Fri, Mar 8, 2013 at 6:03 PM, Dave Jones <[email protected]> wrote:
> > > >
> > > > existing pathname + 'a' = fine.
> > > >
> > > > existing pathname + '/' + 'a' = boom.
> > >
> > > Good.
> > >
> > > > Looks like if I do this..
> > > >
> > > > if (isdigit(newpath[len]) != 0) {
> > > > newpath[len] = '/';
> > > > newpath[len+1] = 'A';
> > > > newpath[len+2] = 0;
> > > >
> > > > no bug.
> > >
> > > Well, but that will never trigger. newpath[len] will always be NUL, so
> > > you just disabled things entirely. Use "len-1".
> > >
> > > So I don't think that did what you meant it to do.
> >
> > Fixed that up, and even double checked my sanity by printing stuff out.
> >
> > Confirmed that it's something that doesn't end in a number.
>
> I've got a hunch that it's /proc/$$/ns/uts.
>
> After 3-4 runs, that's the only common file in the last few that got mangled.
>
> I'll do some more tests, but this might be the one.

confirmed. A simple

mkdir /proc/self/ns/uts/A

will trigger it.

Eric ?

Dave

2013-03-09 03:25:35

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 07:16:09PM -0800, Linus Torvalds wrote:
> Goodie. Your bug reports gave me heartburn. But it sounds like we have an
> angle on all of the ones you've seen now?
>
> Or have I forgotten about some case?

To be honest I've lost track of the whole collection.
Let me repull your latest tree, and see what falls out.
(I'll turn off CONFIG_USER_NS for now too until that gets fixed)

I have a feeling there were some sysfs ones that may still be unfixed.

Dave

2013-03-09 03:27:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

Dave Jones <[email protected]> writes:

> On Fri, Mar 08, 2013 at 09:56:31PM -0500, Dave Jones wrote:
> > On Fri, Mar 08, 2013 at 09:26:23PM -0500, Dave Jones wrote:
> > > On Fri, Mar 08, 2013 at 06:08:52PM -0800, Linus Torvalds wrote:
> > > > On Fri, Mar 8, 2013 at 6:03 PM, Dave Jones <[email protected]> wrote:
> > > > >
> > > > > existing pathname + 'a' = fine.
> > > > >
> > > > > existing pathname + '/' + 'a' = boom.
> > > >
> > > > Good.
> > > >
> > > > > Looks like if I do this..
> > > > >
> > > > > if (isdigit(newpath[len]) != 0) {
> > > > > newpath[len] = '/';
> > > > > newpath[len+1] = 'A';
> > > > > newpath[len+2] = 0;
> > > > >
> > > > > no bug.
> > > >
> > > > Well, but that will never trigger. newpath[len] will always be NUL, so
> > > > you just disabled things entirely. Use "len-1".
> > > >
> > > > So I don't think that did what you meant it to do.
> > >
> > > Fixed that up, and even double checked my sanity by printing stuff out.
> > >
> > > Confirmed that it's something that doesn't end in a number.
> >
> > I've got a hunch that it's /proc/$$/ns/uts.
> >
> > After 3-4 runs, that's the only common file in the last few that got mangled.
> >
> > I'll do some more tests, but this might be the one.
>
> confirmed. A simple
>
> mkdir /proc/self/ns/uts/A
>
> will trigger it.

This is a magic symlink similar to the other magic symlinks in proc so I
don't know if the test is wrong or my code is doing something to clever.

But I can reproduce it so I will look at it and see if I can make sense
of what is going on.

Eric

2013-03-09 03:38:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

Dave Jones <[email protected]> writes:

> On Fri, Mar 08, 2013 at 07:16:09PM -0800, Linus Torvalds wrote:
> > Goodie. Your bug reports gave me heartburn. But it sounds like we have an
> > angle on all of the ones you've seen now?
> >
> > Or have I forgotten about some case?
>
> To be honest I've lost track of the whole collection.
> Let me repull your latest tree, and see what falls out.
> (I'll turn off CONFIG_USER_NS for now too until that gets fixed)

It was CONFIG_UTS_NS that tripped you.

Since I can trigger this with /proc/self/ns/mnt/a you are going to be
able to compile this one out.

Eric

2013-03-09 03:51:01

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 07:36:34PM -0800, Linus Torvalds wrote:
> Note that my tree does not have the pipe changes. Um still not sure about
> the cause, see the patch with a warn-on-once in it. I don't like just
> adding the NULL pointer checks willy nilly without understanding why they
> made...

Yeah, understood. I still have that patched.

> > To be honest I've lost track of the whole collection.
> > Let me repull your latest tree, and see what falls out.
> > (I'll turn off CONFIG_USER_NS for now too until that gets fixed)
> >
> > I have a feeling there were some sysfs ones that may still be unfixed.

I was right..

[ 425.836722] general protection fault: 0000 [#1] PREEMPT SMP
[ 425.837529] Modules linked in: hidp l2tp_ppp l2tp_netlink l2tp_core cmtp kernelcapi bnep rfcomm scsi_transport_iscsi can_raw ipt_ULOG nfnetlink irda can_bcm ipx caif_socket appletalk x25 p8023 af_key caif psnap p8022 crc_ccitt phonet netrom rose rds llc2 nfc llc af_802154 can ax25 pppoe decnet af_rxrpc pppox ppp_generic slhc atm lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb bluetooth snd_pcm snd_page_alloc snd_timer snd usb_debug microcode rfkill pcspkr serio_raw soundcore edac_core vhost_net tun macvtap macvlan kvm_amd r8169 kvm mii radeon backlight drm_kms_helper ttm
[ 425.846148] CPU 2
[ 425.846387] Pid: 15263, comm: trinity-child3 Not tainted 3.9.0-rc1+ #88 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 425.847859] RIP: 0010:[<ffffffff8123f657>] [<ffffffff8123f657>] sysfs_find_dirent+0x47/0xf0
[ 425.848919] RSP: 0018:ffff88011aab5d38 EFLAGS: 00010202
[ 425.849581] RAX: 000000000ad55669 RBX: 6b6b6b6b6b6b6b6b RCX: 000000006b6b6b6b
[ 425.850468] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 425.851356] RBP: ffff88011aab5d58 R08: 0000000000000000 R09: 0000000000000000
[ 425.852237] R10: 0000000000000000 R11: 0000000000000000 R12: 000000007640c1d4
[ 425.853119] R13: ffff880105b41e48 R14: 0000000000000000 R15: ffff880127110000
[ 425.854001] FS: 00007f7cdb71d740(0000) GS:ffff88012b200000(0000) knlGS:0000000000000000
[ 425.855000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 425.855712] CR2: 0000000000000001 CR3: 0000000102df4000 CR4: 00000000000007e0
[ 425.856599] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 425.857486] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 425.858368] Process trinity-child3 (pid: 15263, threadinfo ffff88011aab4000, task ffff880127110000)
[ 425.859490] Stack:
[ 425.859747] ffff880113228d58 ffff880105b41e10 fffffffffffffffe ffff88010887eee8
[ 425.860718] ffff88011aab5d88 ffffffff8123f76d ffff880105b41e10 ffff8801088804a0
[ 425.861683] ffff88011aab5f28 0000000000000000 ffff88011aab5db8 ffffffff811c5b7d
[ 425.862648] Call Trace:
[ 425.862960] [<ffffffff8123f76d>] sysfs_lookup+0x6d/0xe0
[ 425.863623] [<ffffffff811c5b7d>] lookup_real+0x1d/0x60
[ 425.864274] [<ffffffff811c5d98>] __lookup_hash+0x38/0x50
[ 425.864951] [<ffffffff811c5dc9>] lookup_hash+0x19/0x20
[ 425.865607] [<ffffffff811c9275>] kern_path_create+0x95/0x170
[ 425.866322] [<ffffffff811c7726>] ? getname_flags.part.33+0x86/0x150
[ 425.868909] [<ffffffff811c939a>] user_path_create+0x4a/0x70
[ 425.871401] [<ffffffff811cc668>] sys_linkat+0x88/0x230
[ 425.873841] [<ffffffff8134b06e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 425.876433] [<ffffffff811cc82e>] sys_link+0x1e/0x20
[ 425.878839] [<ffffffff816d0d42>] system_call_fastpath+0x16/0x1b
[ 425.881364] Code: 00 48 8b 9f 88 00 00 00 f6 c4 0f 0f 95 c0 48 85 f6 0f 95 c2 38 d0 75 79 4c 89 ee 4c 89 f7 e8 91 ef ff ff 41 89 c4 48 85 db 74 1d <8b> 4b 28 41 39 cc 74 21 44 89 e0 29 c8 83 f8 00 7c 2c 74 45 48


I'll try some more divide and conquer debugging in a while.

Dave

2013-03-09 04:27:05

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 07:38:33PM -0800, Eric W. Biederman wrote:
> Dave Jones <[email protected]> writes:
>
> > On Fri, Mar 08, 2013 at 07:16:09PM -0800, Linus Torvalds wrote:
> > > Goodie. Your bug reports gave me heartburn. But it sounds like we have an
> > > angle on all of the ones you've seen now?
> > >
> > > Or have I forgotten about some case?
> >
> > To be honest I've lost track of the whole collection.
> > Let me repull your latest tree, and see what falls out.
> > (I'll turn off CONFIG_USER_NS for now too until that gets fixed)
>
> It was CONFIG_UTS_NS that tripped you.
>
> Since I can trigger this with /proc/self/ns/mnt/a you are going to be
> able to compile this one out.

fwiw, the other namespace procfs files look like they have the same bug

I just triggered it again on /proc/571/task/571/ns/net

Dave

2013-03-09 04:31:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 8, 2013 at 7:50 PM, Dave Jones <[email protected]> wrote:
> > >
> > > I have a feeling there were some sysfs ones that may still be unfixed.
>
> I was right..
>
> [ 425.836722] general protection fault: 0000 [#1] PREEMPT SMP

You forgot to enable DEBUG_PAGE_ALLOC again, but I don't think it much
matters. It's another slab free poison thing.

The faulting instruction is

mov 0x28(%rbx),%ecx

with %rbx having the value 6b6b6b6b6b6b6b6b.

> [ 425.847859] RIP: 0010:[<ffffffff8123f657>] [<ffffffff8123f657>] sysfs_find_dirent+0x47/0xf0

That seems to be

if (hash != sd->s_hash)

from sysfs_name_compare() that has been inlined into
sysfs_find_dirent(). And where "sd" is the corrupted value. If I read
things right.

So it looks like the sysfs rbtree is corrupted or something. Adding
Greg to the cc.

Linus

2013-03-09 04:39:23

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Fri, Mar 08, 2013 at 08:31:48PM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 7:50 PM, Dave Jones <[email protected]> wrote:
> > > >
> > > > I have a feeling there were some sysfs ones that may still be unfixed.
> >
> > I was right..
> >
> > [ 425.836722] general protection fault: 0000 [#1] PREEMPT SMP
>
> You forgot to enable DEBUG_PAGE_ALLOC again, but I don't think it much
> matters. It's another slab free poison thing.
>
> The faulting instruction is
>
> mov 0x28(%rbx),%ecx
>
> with %rbx having the value 6b6b6b6b6b6b6b6b.
>
> > [ 425.847859] RIP: 0010:[<ffffffff8123f657>] [<ffffffff8123f657>] sysfs_find_dirent+0x47/0xf0
>
> That seems to be
>
> if (hash != sd->s_hash)
>
> from sysfs_name_compare() that has been inlined into
> sysfs_find_dirent(). And where "sd" is the corrupted value. If I read
> things right.
>
> So it looks like the sysfs rbtree is corrupted or something. Adding
> Greg to the cc.

oh , that rings a bell. ISTR this had something to do with Sasha's idr changes.

Dave

2013-03-09 05:13:01

by Sasha Levin

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On 03/08/2013 11:39 PM, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 08:31:48PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 8, 2013 at 7:50 PM, Dave Jones <[email protected]> wrote:
> > > > >
> > > > > I have a feeling there were some sysfs ones that may still be unfixed.
> > >
> > > I was right..
> > >
> > > [ 425.836722] general protection fault: 0000 [#1] PREEMPT SMP
> >
> > You forgot to enable DEBUG_PAGE_ALLOC again, but I don't think it much
> > matters. It's another slab free poison thing.
> >
> > The faulting instruction is
> >
> > mov 0x28(%rbx),%ecx
> >
> > with %rbx having the value 6b6b6b6b6b6b6b6b.
> >
> > > [ 425.847859] RIP: 0010:[<ffffffff8123f657>] [<ffffffff8123f657>] sysfs_find_dirent+0x47/0xf0
> >
> > That seems to be
> >
> > if (hash != sd->s_hash)
> >
> > from sysfs_name_compare() that has been inlined into
> > sysfs_find_dirent(). And where "sd" is the corrupted value. If I read
> > things right.
> >
> > So it looks like the sysfs rbtree is corrupted or something. Adding
> > Greg to the cc.
>
> oh , that rings a bell. ISTR this had something to do with Sasha's idr changes.

I believe the issue you're referring to is my report of a panic in sysfs keys
where the device has gone away? If that is it, I don't think that this issue
is related to that one.

If not, I'm not sure which change you're referring to.


Thanks,
Sasha

2013-03-09 05:16:52

by Dave Jones

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

On Sat, Mar 09, 2013 at 12:13:16AM -0500, Sasha Levin wrote:
> On 03/08/2013 11:39 PM, Dave Jones wrote:
> > On Fri, Mar 08, 2013 at 08:31:48PM -0800, Linus Torvalds wrote:
> > > On Fri, Mar 8, 2013 at 7:50 PM, Dave Jones <[email protected]> wrote:
> > > > > >
> > > > > > I have a feeling there were some sysfs ones that may still be unfixed.
> > > >
> > > > I was right..
> > > >
> > > > [ 425.836722] general protection fault: 0000 [#1] PREEMPT SMP
> > >
> > > You forgot to enable DEBUG_PAGE_ALLOC again, but I don't think it much
> > > matters. It's another slab free poison thing.
> > >
> > > The faulting instruction is
> > >
> > > mov 0x28(%rbx),%ecx
> > >
> > > with %rbx having the value 6b6b6b6b6b6b6b6b.
> > >
> > > > [ 425.847859] RIP: 0010:[<ffffffff8123f657>] [<ffffffff8123f657>] sysfs_find_dirent+0x47/0xf0
> > >
> > > That seems to be
> > >
> > > if (hash != sd->s_hash)
> > >
> > > from sysfs_name_compare() that has been inlined into
> > > sysfs_find_dirent(). And where "sd" is the corrupted value. If I read
> > > things right.
> > >
> > > So it looks like the sysfs rbtree is corrupted or something. Adding
> > > Greg to the cc.
> >
> > oh , that rings a bell. ISTR this had something to do with Sasha's idr changes.
>
> I believe the issue you're referring to is my report of a panic in sysfs keys
> where the device has gone away? If that is it, I don't think that this issue
> is related to that one.
>
> If not, I'm not sure which change you're referring to.

Hmm I might have been thinking of the hlist changes rather than idr.
Though looking at those oopses, they were in find_pid_ns, so different.

So many bugs.

Dave

2013-03-09 08:28:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode != parent->d_inode);

Dave Jones <[email protected]> writes:

> On Fri, Mar 08, 2013 at 07:38:33PM -0800, Eric W. Biederman wrote:
> > Dave Jones <[email protected]> writes:
> >
> > > On Fri, Mar 08, 2013 at 07:16:09PM -0800, Linus Torvalds wrote:
> > > > Goodie. Your bug reports gave me heartburn. But it sounds like we have an
> > > > angle on all of the ones you've seen now?
> > > >
> > > > Or have I forgotten about some case?
> > >
> > > To be honest I've lost track of the whole collection.
> > > Let me repull your latest tree, and see what falls out.
> > > (I'll turn off CONFIG_USER_NS for now too until that gets fixed)
> >
> > It was CONFIG_UTS_NS that tripped you.
> >
> > Since I can trigger this with /proc/self/ns/mnt/a you are going to be
> > able to compile this one out.
>
> fwiw, the other namespace procfs files look like they have the same bug
>
> I just triggered it again on /proc/571/task/571/ns/net

Yep same code.

I will do a proper submission of this after I have slept. But
here is the fix.

From: "Eric W. Biederman" <[email protected]>
Date: Sat, 9 Mar 2013 00:14:45 -0800
Subject: [PATCH] proc: Use nd_jump_link in proc_ns_follow_link

Update proc_ns_follow_link to use nd_jump_link instead of just
manually updating nd.path.dentry.

This fixes the BUG_ON(nd->inode != parent->d_inode) reported by Dave
Jones and reproduced trivially with mkdir /proc/self/ns/uts/a.

Sigh it looks like the VFS change to require use of nd_jump_link
happend while proc_ns_follow_link was baking and since the common case
of proc_ns_follow_link continued to work without problems the need for
making this change was overlooked.

Cc: [email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/namespaces.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index b7a4719..66b51c0 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -118,7 +118,7 @@ static void *proc_ns_follow_link(struct dentry *dentry, struct nameidata *nd)
struct super_block *sb = inode->i_sb;
struct proc_inode *ei = PROC_I(inode);
struct task_struct *task;
- struct dentry *ns_dentry;
+ struct path ns_path;
void *error = ERR_PTR(-EACCES);

task = get_proc_task(inode);
@@ -128,14 +128,14 @@ static void *proc_ns_follow_link(struct dentry *dentry, struct nameidata *nd)
if (!ptrace_may_access(task, PTRACE_MODE_READ))
goto out_put_task;

- ns_dentry = proc_ns_get_dentry(sb, task, ei->ns_ops);
- if (IS_ERR(ns_dentry)) {
- error = ERR_CAST(ns_dentry);
+ ns_path.dentry = proc_ns_get_dentry(sb, task, ei->ns_ops);
+ if (IS_ERR(ns_path.dentry)) {
+ error = ERR_CAST(ns_path.dentry);
goto out_put_task;
}

- dput(nd->path.dentry);
- nd->path.dentry = ns_dentry;
+ ns_path.mnt = mntget(nd->path.mnt);
+ nd_jump_link(nd, &ns_path);
error = NULL;

out_put_task:
--
1.7.5.4

2013-03-10 22:10:53

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Fri, Mar 08, 2013 at 10:30:01AM -0800, Linus Torvalds wrote:

> I must be missing something, and I wonder if the thing I'm missing is
> that with OPEN_PATH we may now have open calls that don't actually
> have FMODE_READ or FMODE_WRITE set at all.

With OPEN_PATH we don't call ->open() (or anything in the driver, for that
matter) at all. I really don't see how that could trigger...

> So suddenly we end up with these pipe openers that don't update the
> counts, and I could imagine that really confusing us...
>
> So I'm wondering if you could apply this patch, and see if that
> triggers. In fact, please un-apply the other changes to fs/pipe.c too,
> to see if it also obviates the need for checking i_pipe for NULL. You
> should get the new warning (once), but you should not get any oopses..
>
> Anyway, this would explain why the actual read/write paths don't need
> to check for i_pipe - if FMODE_READ/WRITE aren't set, we'll never get
> that far. But the release() and the fasync functions do get called
> even for non-readable and non-writable files...
>
> And Al, please get your email fixed. Is somebody usually on irc with
> him or something?

Back and looking through all that fun...

2013-03-10 23:04:46

by Al Viro

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 07, 2013 at 10:30:52AM -0500, Dave Jones wrote:

> Code: 44 24 08 48 89 43 08 48 8b 40 30 81 4b 38 00 10 00 00 48 89 43 30 48 8b 40 20 48 83 78 08 00 75 0a 48 8b 5d f0 4c 8b 65 f8 c9 c3 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 ba 10 00
> RIP [<ffffffff811c1414>] nd_jump_link+0x54/0x60
> RSP <ffff880112e31c78>

Attempt to follow procfs symlink resolving in a symlink inode. Interesting...
Oh, hell...

OK, I can easily reproduce that one: open a symlink with O_PATH|O_NOFOLLOW,
then try to open /proc/self/fd/<fd>. Boom...

The interesting part is what to do with it; it's not enough to make that
BUG_ON() to STFU, unfortunately. I'm not sure what's the right behaviour
here - the thing is, following a symlink depends on which directory it's
in and with procfs ones we really don't know that; don't even know if the
bugger is still linked, for that matter.

Basically, the target of symlink is a function of symlink contents *and*
the directory it's in. Consider somebody doing O_PATH open of a symlink
(with O_NOFOLLOW). Then unlink it. What should following
/proc/self/fd/<that descriptor> mean?

FWIW, I'm seriously tempted to turn that BUG_ON() into return -ELOOP, rather
than attempt to invent a semantics for following those guys. I think we'd
discussed that corner case when doing O_PATH patchset, and IIRC decided to
go with "fail with -ELOOP"; looks like that had slipped through the cracks
and never got done. IOW, how about the following?

make nd_jump_link() fail sanely when asked to jump into a symlink

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..9029d60 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -682,13 +682,18 @@ static inline void path_to_nameidata(const struct path *path,
* Helper to directly jump to a known parsed path from ->follow_link,
* caller must have taken a reference to path beforehand.
*/
-void nd_jump_link(struct nameidata *nd, struct path *path)
+void *nd_jump_link(struct nameidata *nd, struct path *path)
{
+ if (unlikely(path->dentry->d_inode->i_op->follow_link)) {
+ path_put(path);
+ return ERR_PTR(-ELOOP);
+ }
path_put(&nd->path);

nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
+ return NULL;
}

static inline void put_link(struct nameidata *nd, struct path *link, void *cookie)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..b323ee6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1437,8 +1437,7 @@ static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
if (error)
goto out;

- nd_jump_link(nd, &path);
- return NULL;
+ return nd_jump_link(nd, &path);
out:
return ERR_PTR(error);
}
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 66b51c0..11c7b97 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -135,8 +135,7 @@ static void *proc_ns_follow_link(struct dentry *dentry, struct nameidata *nd)
}

ns_path.mnt = mntget(nd->path.mnt);
- nd_jump_link(nd, &ns_path);
- error = NULL;
+ error = nd_jump_link(nd, &ns_path);

out_put_task:
put_task_struct(task);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5a5ff57..bf55ed9 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -82,7 +82,7 @@ extern int follow_up(struct path *);
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);

-extern void nd_jump_link(struct nameidata *nd, struct path *path);
+extern void *nd_jump_link(struct nameidata *nd, struct path *path);

static inline void nd_set_link(struct nameidata *nd, char *path)
{

2013-03-10 23:33:22

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Fri, Mar 08, 2013 at 01:26:49PM -0500, J??rn Engel wrote:
> On Fri, 8 March 2013 10:30:01 -0800, Linus Torvalds wrote:
> >
> > Hmm. So I've been trying to figure this out, and I really don't see
> > it. Every single pipe open routine *should* make sure that the inode
> > has an inode->i_pipe field. So if the open() has succeeded and you
> > have a valid file descriptor, the inode->i_pipe thing should be there.
>
> Ok, here is a wild idea that is very likely wrong. But some
> background first. I've had problems with process exit times and one
> of the culprits turned out to be exit_files() where one device driver
> went awol for several seconds. Fixing the device driver is hard, I
> didn't see a good reason not to call exit_files() earlier and
> exit_mm() was the other big offender, so the idea was to run both in
> parallel and I applied the patch below.
>
> As a result I've gotten a bunch of NULL pointer dereferences that only
> happen in virtual machines, never on real hardware. For example
> [<ffffffff81164bf8>] alloc_fd+0x38/0x130
> [<ffffffff8114857e>] do_sys_open+0xee/0x1f0
> [<ffffffff811486a1>] sys_open+0x21/0x30
> [<ffffffff815bea29>] system_call_fastpath+0x16/0x1b
>
> Now I can easily see how current->files being NULL will result in such
> backtraces. I can also see how my patch moves the NULLing of
> current->files a bit back in time. But I could never figure out how
> my patch could have introduced a race that didn't exist before.
>
> So the wild idea is that we have always had a very unlikely race with
> current->files being NULL and trinity happens to hit it somehow.
>
> J??rn

> + files_cookie = async_schedule(exit_files_async, tsk);
> exit_mm(tsk);
>
> if (group_dead)
> @@ -990,7 +998,7 @@ void do_exit(long code)
>
> exit_sem(tsk);
> exit_shm(tsk);
> - exit_files(tsk);
> + async_synchronize_cookie(files_cookie);

That doesn't do what you seem to think it's doing. It does *not* wait
for the completion of that sucker's execution - only the ones scheduled
before it. IOW, your exit_files_async() might very well be executed
*after* do_exit() completes and tsk gets reused.

2013-03-11 00:02:11

by Al Viro

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Thu, Mar 07, 2013 at 02:50:55PM -0800, Linus Torvalds wrote:

> and you get the oops, because when it follows that pseudo-symlink for
> fd3, it will hit that symlink we just looked up. And trigger the
> BUG_ON() you see.
>
> Al, I think the BUG_ON() is simply bogus. Should we just remove it? Or
> is there some conceptual reason why we can't handle a symlink there?

BUG_ON() is bogus these days, but failing ->follow_link() is not. The thing
is, what would you do with that sucker? Suppose you have ->follow_link()
leaving you in a symlink. What would you do with it? Either you were
not going to follow it (in which case you wouldn't have called ->follow_link()
in the first place) *or* you need to know which directory it's in. Which
we don't have in this case...

It really used to be an impossible case, until we'd introduced O_PATH. And
IIRC, back when O_PATH had been developed, the plan was to fail attempts
to follow these suckers with -ELOOP; looks like we forgot to do that. Mea
culpa...

And no, just removing BUG_ON() is not enough - callers of follow_link() will
do rather weird stuff (in nested symlink case it'll fail with ENOTDIR, in
path_lookupat() with LOOKUP_FOLLOW it'll give you a symlink, in case of
path_openat() with LOOKUP_FOLLOW fail with ELOOP).

See the patch I'd posted upthread. Let's make it fail consistently...

2013-03-11 00:35:34

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Sun, Mar 10, 2013 at 10:10:47PM +0000, Al Viro wrote:
> On Fri, Mar 08, 2013 at 10:30:01AM -0800, Linus Torvalds wrote:
>
> > I must be missing something, and I wonder if the thing I'm missing is
> > that with OPEN_PATH we may now have open calls that don't actually
> > have FMODE_READ or FMODE_WRITE set at all.
>
> With OPEN_PATH we don't call ->open() (or anything in the driver, for that
> matter) at all. I really don't see how that could trigger...

Hmm... How the devil would things like pipe_read_open() get called, anyway?
pipe_rdwr_open() can be called, all right - that happens if you do pipe()
and then open() via /proc/self/fd/<n>. But how could pipe_read_open() and
pipe_write_open() be called? They are accessible only as ->open() of
read_pipefifo_fops/write_pipefifo_fops. And those are only used by
fifo_open(), which does reassign file->f_op to them, but does *not* call
their ->open()...

2013-03-11 15:10:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: pipe_release oops.

On Sun, Mar 10, 2013 at 5:35 PM, Al Viro <[email protected]> wrote:
>
> Hmm... How the devil would things like pipe_read_open() get called, anyway?
> pipe_rdwr_open() can be called, all right - that happens if you do pipe()
> and then open() via /proc/self/fd/<n>. But how could pipe_read_open() and
> pipe_write_open() be called? They are accessible only as ->open() of
> read_pipefifo_fops/write_pipefifo_fops. And those are only used by
> fifo_open(), which does reassign file->f_op to them, but does *not* call
> their ->open()...

.. same deal? Open the resulting fd from /proc/self/fd/<n> afterwards,
which will call file->f_op->open(), no?

Linus

2013-03-11 18:05:48

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Mon, Mar 11, 2013 at 08:10:10AM -0700, Linus Torvalds wrote:
> On Sun, Mar 10, 2013 at 5:35 PM, Al Viro <[email protected]> wrote:
> >
> > Hmm... How the devil would things like pipe_read_open() get called, anyway?
> > pipe_rdwr_open() can be called, all right - that happens if you do pipe()
> > and then open() via /proc/self/fd/<n>. But how could pipe_read_open() and
> > pipe_write_open() be called? They are accessible only as ->open() of
> > read_pipefifo_fops/write_pipefifo_fops. And those are only used by
> > fifo_open(), which does reassign file->f_op to them, but does *not* call
> > their ->open()...
>
> .. same deal? Open the resulting fd from /proc/self/fd/<n> afterwards,
> which will call file->f_op->open(), no?

Not really - that would call inode->i_fop, not file->f_op. You get dentry
of opened file, but file->f_op is set from scratch - not copied from the
original struct file.

2013-03-11 18:26:29

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Fri, Mar 08, 2013 at 07:27:01PM -0500, Peter Hurley wrote:

> On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> Dave,
> I thought I copied you on the 'ipc MSG_COPY fixes' patchset that fixes
> this. Or is this gp fault happening with that patchset?
>
> Linus,
> The fixes should be in your inbox (from Andrew) titled:
> [patch 01/11] ipc: fix potential oops when src msg > 4k w/ MSG_COPY
> [patch 02/11] ipc: don't allocate a copy larger than max
>
> > general protection fault: 0000 [#1] PREEMPT SMP
> > Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> > CPU 0
> > Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> > RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> > RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> > RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> > R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> > R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> > FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> > Stack:
> > ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> > ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> > 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> > Call Trace:
> > [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> > [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> > [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> > [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> > [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> > Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
> >
> > 0000000000000000 <.text>:
> > 0: 48 3b 37 cmp (%rdi),%rsi
> > 3: 75 d5 jne 0xffffffffffffffda
> > 5: b8 01 00 00 00 mov $0x1,%eax
> > a: 5d pop %rbp
> > b: c3 retq
> > c: 0f 1f 40 00 nopl 0x0(%rax)
> > 10: 48 3b 37 cmp (%rdi),%rsi
> > 13: 74 c5 je 0xffffffffffffffda
> >
> > rdi is ascii. "ived-sy�" Curious.
> >
> > EIP is here in testmsg.
> >
> > case SEARCH_EQUAL:
> > if (msg->m_type == type)
> > 240: 48 3b 37 cmp (%rdi),%rsi
> > 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> > {

I just hit this again on rc2 which looks like it has the fixes that
Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b

Dave

[ 106.010425] general protection fault: 0000 [#1] PREEMPT SMP
[ 106.010521] Modules linked in: pppoe pppox ppp_generic slhc scsi_transport_iscsi appletalk rose decnet af_rxrpc nfc netrom can_raw af_802154 can llc2 phonet irda atm rds x25 ax25 caif_socket ipx p8023 caif psnap p8022 llc crc_ccitt af_key lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel snd_hda_codec btusb bluetooth snd_pcm microcode usb_debug rfkill serio_raw pcspkr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
[ 106.011028] CPU 0
[ 106.011045] Pid: 794, comm: trinity-child0 Not tainted 3.9.0-rc2+ #90 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[ 106.011115] RIP: 0010:[<ffffffff812c15b0>] [<ffffffff812c15b0>] testmsg.isra.5+0x30/0x60
[ 106.011172] RSP: 0018:ffff880114cf9e78 EFLAGS: 00010246
[ 106.011205] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000004 RCX: 7ffffffffffff001
[ 106.011248] RDX: 0000000000000004 RSI: 6b6b6b6b6b6b6b6a RDI: 6b6b6b6b6b6b6b7b
[ 106.011290] RBP: ffff880114cf9e78 R08: 0000000000000004 R09: 0000000000000000
[ 106.011333] R10: ffff880114f62490 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6a
[ 106.011375] R13: ffff88010fd69300 R14: ffff88010fd69250 R15: 6b6b6b6b6b6b6b6b
[ 106.011417] FS: 00007f9dd9637740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
[ 106.011465] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.011500] CR2: 0000000002851320 CR3: 0000000114cea000 CR4: 00000000000007f0
[ 106.011542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.011584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 106.011627] Process trinity-child0 (pid: 794, threadinfo ffff880114cf8000, task ffff880114f62490)
[ 106.011680] Stack:
[ 106.011695] ffff880114cf9f68 ffffffff812c28b6 ffff880114cf9eb8 ffff880114f62490
[ 106.011752] ffff880114f62490 ffff880114f62490 0000000000000000 ffffffff812c13c0
[ 106.011808] 0000000000000001 0000000000000000 ffffffff81c7aa40 0000000000000001
[ 106.011862] Call Trace:
[ 106.011881] [<ffffffff812c28b6>] do_msgrcv+0x1d6/0x660
[ 106.011916] [<ffffffff812c13c0>] ? load_msg+0x180/0x180
[ 106.011951] [<ffffffff810b8d55>] ? trace_hardirqs_on_caller+0x115/0x1a0
[ 106.011993] [<ffffffff8134a8de>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 106.012034] [<ffffffff812c2d55>] sys_msgrcv+0x15/0x20
[ 106.012070] [<ffffffff816d05c2>] system_call_fastpath+0x16/0x1b
[ 106.012107] Code: 55 83 fa 02 48 89 e5 74 32 7e 10 83 fa 03 74 3b 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 <48> 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 48 3b 37 75 d5
[ 106.012405] RIP [<ffffffff812c15b0>] testmsg.isra.5+0x30/0x60
[ 106.012444] RSP <ffff880114cf9e78>

2013-03-11 19:03:35

by Peter Hurley

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> On Fri, Mar 08, 2013 at 07:27:01PM -0500, Peter Hurley wrote:
>
> > On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> > Dave,
> > I thought I copied you on the 'ipc MSG_COPY fixes' patchset that fixes
> > this. Or is this gp fault happening with that patchset?
> >
> > Linus,
> > The fixes should be in your inbox (from Andrew) titled:
> > [patch 01/11] ipc: fix potential oops when src msg > 4k w/ MSG_COPY
> > [patch 02/11] ipc: don't allocate a copy larger than max
> >
> > > general protection fault: 0000 [#1] PREEMPT SMP
> > > Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> > > CPU 0
> > > Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > > RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> > > RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> > > RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> > > RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> > > RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> > > R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> > > R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> > > FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> > > Stack:
> > > ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> > > ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> > > 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> > > Call Trace:
> > > [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> > > [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> > > [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> > > [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> > > [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> > > Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
> > >
> > > 0000000000000000 <.text>:
> > > 0: 48 3b 37 cmp (%rdi),%rsi
> > > 3: 75 d5 jne 0xffffffffffffffda
> > > 5: b8 01 00 00 00 mov $0x1,%eax
> > > a: 5d pop %rbp
> > > b: c3 retq
> > > c: 0f 1f 40 00 nopl 0x0(%rax)
> > > 10: 48 3b 37 cmp (%rdi),%rsi
> > > 13: 74 c5 je 0xffffffffffffffda
> > >
> > > rdi is ascii. "ived-sy�" Curious.
> > >
> > > EIP is here in testmsg.
> > >
> > > case SEARCH_EQUAL:
> > > if (msg->m_type == type)
> > > 240: 48 3b 37 cmp (%rdi),%rsi
> > > 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> > > {
>
> I just hit this again on rc2 which looks like it has the fixes that
> Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
>
> Dave

Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
want to apply. I run trinity 10 or more times a day and can't get this
to trigger with the whole series.

Regards,
Peter Hurley

2013-03-12 01:27:08

by Dave Jones

[permalink] [raw]
Subject: Re: pipe_release oops.

On Fri, Mar 08, 2013 at 10:30:01AM -0800, Linus Torvalds wrote:
> On Fri, Mar 8, 2013 at 6:53 AM, Dave Jones <[email protected]> wrote:
> >
> > Yeah, that does the trick.
> > I changed your other diff a little to use a goto, which reduces a level of indentation..
>
> Hmm. So I've been trying to figure this out, and I really don't see
> it. Every single pipe open routine *should* make sure that the inode
> has an inode->i_pipe field. So if the open() has succeeded and you
> have a valid file descriptor, the inode->i_pipe thing should be there.
>
> I must be missing something, and I wonder if the thing I'm missing is
> that with OPEN_PATH we may now have open calls that don't actually
> have FMODE_READ or FMODE_WRITE set at all.
>
> So suddenly we end up with these pipe openers that don't update the
> counts, and I could imagine that really confusing us...
>
> So I'm wondering if you could apply this patch, and see if that
> triggers. In fact, please un-apply the other changes to fs/pipe.c too,
> to see if it also obviates the need for checking i_pipe for NULL. You
> should get the new warning (once), but you should not get any oopses..
>
> Anyway, this would explain why the actual read/write paths don't need
> to check for i_pipe - if FMODE_READ/WRITE aren't set, we'll never get
> that far. But the release() and the fasync functions do get called
> even for non-readable and non-writable files...


finally managed to trigger this now that all the other stuff got fixed.

WARNING: at fs/pipe.c:867 pipe_rdwr_open+0xb8/0xd0()
Hardware name: GA-MA78GM-S2H
Pid: 23339, comm: trinity-child0 Not tainted 3.9.0-rc2+ #91
Call Trace:
[<ffffffff81045115>] warn_slowpath_common+0x75/0xa0
[<ffffffff8104515a>] warn_slowpath_null+0x1a/0x20
[<ffffffff811c4208>] pipe_rdwr_open+0xb8/0xd0
[<ffffffff811b9c5b>] do_dentry_open+0x25b/0x300
[<ffffffff811c4150>] ? pipe_unlock+0x30/0x30
[<ffffffff811b9d49>] finish_open+0x49/0x60
[<ffffffff811cb119>] do_last+0x6d9/0xf10
[<ffffffff8107a722>] ? lg_local_unlock+0x42/0x70
[<ffffffff811de629>] ? mntput_no_expire+0x49/0x130
[<ffffffff811de736>] ? mntput+0x26/0x40
[<ffffffff811c5ca2>] ? path_put+0x22/0x30
[<ffffffff811cbc93>] path_openat+0x343/0x500
[<ffffffff811cc101>] do_filp_open+0x41/0xa0
[<ffffffff811bb0a4>] do_sys_open+0xf4/0x1e0
[<ffffffff811bb1d4>] sys_openat+0x14/0x20
[<ffffffff816d08c2>] system_call_fastpath+0x16/0x1b
---[ end trace 8290ea887bf97579 ]---

Dave

2013-03-12 13:06:46

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Mon, Mar 11, 2013 at 06:05:43PM +0000, Al Viro wrote:
> On Mon, Mar 11, 2013 at 08:10:10AM -0700, Linus Torvalds wrote:
> > On Sun, Mar 10, 2013 at 5:35 PM, Al Viro <[email protected]> wrote:
> > >
> > > Hmm... How the devil would things like pipe_read_open() get called, anyway?
> > > pipe_rdwr_open() can be called, all right - that happens if you do pipe()
> > > and then open() via /proc/self/fd/<n>. But how could pipe_read_open() and
> > > pipe_write_open() be called? They are accessible only as ->open() of
> > > read_pipefifo_fops/write_pipefifo_fops. And those are only used by
> > > fifo_open(), which does reassign file->f_op to them, but does *not* call
> > > their ->open()...
> >
> > .. same deal? Open the resulting fd from /proc/self/fd/<n> afterwards,
> > which will call file->f_op->open(), no?
>
> Not really - that would call inode->i_fop, not file->f_op. You get dentry
> of opened file, but file->f_op is set from scratch - not copied from the
> original struct file.

While we are at it, I don't see any reason for having separate file_operations
for r/o, w/o and r/w cases; the only differences are in EBADF-returning
->read() and ->write() (and ->f_mode checks in vfs_read() et.al. take care of
that) and micro-optimizations in ->release() and ->fasync().

Frankly, I really wonder if we should simply use def_fifo_fops for ->i_fops
in get_pipe_inode() and let open() via /proc/<pid>/fd/<n> act as it would for
FIFOs, O_NONBLOCK and all. IOW, how about we simply merge all those
file_operations in one, folding fifo.c into pipe.c? And to hell with any
reassignments of ->f_op.

I'm probably missing something subtle here...

2013-03-12 15:31:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: pipe_release oops.

On Tue, Mar 12, 2013 at 6:06 AM, Al Viro <[email protected]> wrote:
>
> While we are at it, I don't see any reason for having separate file_operations
> for r/o, w/o and r/w cases; the only differences are in EBADF-returning
> ->read() and ->write() (and ->f_mode checks in vfs_read() et.al. take care of
> that) and micro-optimizations in ->release() and ->fasync().
>
> Frankly, I really wonder if we should simply use def_fifo_fops for ->i_fops
> in get_pipe_inode() and let open() via /proc/<pid>/fd/<n> act as it would for
> FIFOs, O_NONBLOCK and all. IOW, how about we simply merge all those
> file_operations in one, folding fifo.c into pipe.c? And to hell with any
> reassignments of ->f_op.
>
> I'm probably missing something subtle here...

Probably not missing anything subtle. I think all of this code is very
old, and related to previous /proc/<pid>/fd/<n> escapades. And the
semantics for those files were in flux some time long long ago (the
whole "dup vs new struct file" issue), it's all just duct-tape, I
think.

Linus

2013-03-12 18:31:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON(nd->inode->i_op->follow_link);

On Sun, Mar 10, 2013 at 4:04 PM, Al Viro <[email protected]> wrote:
>
> The interesting part is what to do with it; it's not enough to make that
> BUG_ON() to STFU, unfortunately.

Why? That's what I did last week, it seemed to be the RightThing(tm)
to do. It somebody has opened a symlink, opening that again through
/proc gives you the same symlink. That would be the natural semantics,
no? You do *not* want to follow it any further, afaik.

And yes, I tested the ENOTDIR semantics too while doing that, and it
seems to be the right thing again. It's not a directory. It's a
symlink.

But hey, if there is some deeper reason for why we'd need to do
anything more that I'm missing, and not just your dislike of the
semantics, thats' fine too.

That said, the thing I think is the *real* bug is that we expose this
O_NOFOLLOW|O_PATH file descriptor through /proc at all. How does that
happen to begin with? Normal O_PATH file descriptors don't do that,
afaik, and they get instantiated as some very special file descriptors
that don't show up as such for most operations. So I think there is
some O_NOFOLLOW magic going on there...

Linus

2013-03-12 19:43:58

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Tue, Mar 12, 2013 at 08:31:50AM -0700, Linus Torvalds wrote:

> Probably not missing anything subtle. I think all of this code is very
> old, and related to previous /proc/<pid>/fd/<n> escapades. And the
> semantics for those files were in flux some time long long ago (the
> whole "dup vs new struct file" issue), it's all just duct-tape, I
> think.

Umm... How about the following, then? I think it makes the whole thing
simpler and saner... NOTE: this got only a light beating and we'd
just seen an example of long-standing breakage in that area; I'd really
like to see it tortured by Dave's scripts before it gets merged into
mainline.

unify pipe and fifo file_operations

merge all variants of file_operations for pipes and fifos together,
get rid of modifying ->f_op in fifo_open(), fold remains of fifo.c
into pipe.c, kill dead code.

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/Makefile b/fs/Makefile
index 9d53192..b691a96 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -7,7 +7,7 @@

obj-y := open.o read_write.o file_table.o super.o \
char_dev.o stat.o exec.o pipe.o namei.o fcntl.o \
- ioctl.o readdir.o select.o fifo.o dcache.o inode.o \
+ ioctl.o readdir.o select.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
diff --git a/fs/fifo.c b/fs/fifo.c
deleted file mode 100644
index cf6f434..0000000
--- a/fs/fifo.c
+++ /dev/null
@@ -1,153 +0,0 @@
-/*
- * linux/fs/fifo.c
- *
- * written by Paul H. Hargrove
- *
- * Fixes:
- * 10-06-1999, AV: fixed OOM handling in fifo_open(), moved
- * initialization there, switched to external
- * allocation of pipe_inode_info.
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/sched.h>
-#include <linux/pipe_fs_i.h>
-
-static int wait_for_partner(struct inode* inode, unsigned int *cnt)
-{
- int cur = *cnt;
-
- while (cur == *cnt) {
- pipe_wait(inode->i_pipe);
- if (signal_pending(current))
- break;
- }
- return cur == *cnt ? -ERESTARTSYS : 0;
-}
-
-static void wake_up_partner(struct inode* inode)
-{
- wake_up_interruptible(&inode->i_pipe->wait);
-}
-
-static int fifo_open(struct inode *inode, struct file *filp)
-{
- struct pipe_inode_info *pipe;
- int ret;
-
- mutex_lock(&inode->i_mutex);
- pipe = inode->i_pipe;
- if (!pipe) {
- ret = -ENOMEM;
- pipe = alloc_pipe_info(inode);
- if (!pipe)
- goto err_nocleanup;
- inode->i_pipe = pipe;
- }
- filp->f_version = 0;
-
- /* We can only do regular read/write on fifos */
- filp->f_mode &= (FMODE_READ | FMODE_WRITE);
-
- switch (filp->f_mode) {
- case FMODE_READ:
- /*
- * O_RDONLY
- * POSIX.1 says that O_NONBLOCK means return with the FIFO
- * opened, even when there is no process writing the FIFO.
- */
- filp->f_op = &read_pipefifo_fops;
- pipe->r_counter++;
- if (pipe->readers++ == 0)
- wake_up_partner(inode);
-
- if (!pipe->writers) {
- if ((filp->f_flags & O_NONBLOCK)) {
- /* suppress POLLHUP until we have
- * seen a writer */
- filp->f_version = pipe->w_counter;
- } else {
- if (wait_for_partner(inode, &pipe->w_counter))
- goto err_rd;
- }
- }
- break;
-
- case FMODE_WRITE:
- /*
- * O_WRONLY
- * POSIX.1 says that O_NONBLOCK means return -1 with
- * errno=ENXIO when there is no process reading the FIFO.
- */
- ret = -ENXIO;
- if ((filp->f_flags & O_NONBLOCK) && !pipe->readers)
- goto err;
-
- filp->f_op = &write_pipefifo_fops;
- pipe->w_counter++;
- if (!pipe->writers++)
- wake_up_partner(inode);
-
- if (!pipe->readers) {
- if (wait_for_partner(inode, &pipe->r_counter))
- goto err_wr;
- }
- break;
-
- case FMODE_READ | FMODE_WRITE:
- /*
- * O_RDWR
- * POSIX.1 leaves this case "undefined" when O_NONBLOCK is set.
- * This implementation will NEVER block on a O_RDWR open, since
- * the process can at least talk to itself.
- */
- filp->f_op = &rdwr_pipefifo_fops;
-
- pipe->readers++;
- pipe->writers++;
- pipe->r_counter++;
- pipe->w_counter++;
- if (pipe->readers == 1 || pipe->writers == 1)
- wake_up_partner(inode);
- break;
-
- default:
- ret = -EINVAL;
- goto err;
- }
-
- /* Ok! */
- mutex_unlock(&inode->i_mutex);
- return 0;
-
-err_rd:
- if (!--pipe->readers)
- wake_up_interruptible(&pipe->wait);
- ret = -ERESTARTSYS;
- goto err;
-
-err_wr:
- if (!--pipe->writers)
- wake_up_interruptible(&pipe->wait);
- ret = -ERESTARTSYS;
- goto err;
-
-err:
- if (!pipe->readers && !pipe->writers)
- free_pipe_info(inode);
-
-err_nocleanup:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-
-/*
- * Dummy default file-operations: the only thing this does
- * is contain the open that then fills in the correct operations
- * depending on the access mode of the file...
- */
-const struct file_operations def_fifo_fops = {
- .open = fifo_open, /* will set read_ or write_pipefifo_fops */
- .llseek = noop_llseek,
-};
diff --git a/fs/inode.c b/fs/inode.c
index f5f7c06..5b76d9b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1803,7 +1803,7 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
inode->i_fop = &def_blk_fops;
inode->i_rdev = rdev;
} else if (S_ISFIFO(mode))
- inode->i_fop = &def_fifo_fops;
+ inode->i_fop = &pipefifo_fops;
else if (S_ISSOCK(mode))
inode->i_fop = &bad_sock_fops;
else
diff --git a/fs/internal.h b/fs/internal.h
index 507141f..3af89c3 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -125,3 +125,8 @@ extern int invalidate_inodes(struct super_block *, bool);
* dcache.c
*/
extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+
+/*
+ * pipe.c
+ */
+extern const struct file_operations pipefifo_fops;
diff --git a/fs/pipe.c b/fs/pipe.c
index 2234f3f..79f9f38 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -25,6 +25,8 @@
#include <asm/uaccess.h>
#include <asm/ioctls.h>

+#include "internal.h"
+
/*
* The max size that a non-root user is allowed to grow the pipe. Can
* be set by root in /proc/sys/fs/pipe-max-size
@@ -662,19 +664,6 @@ out:
return ret;
}

-static ssize_t
-bad_pipe_r(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
-{
- return -EBADF;
-}
-
-static ssize_t
-bad_pipe_w(struct file *filp, const char __user *buf, size_t count,
- loff_t *ppos)
-{
- return -EBADF;
-}
-
static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct inode *inode = file_inode(filp);
@@ -734,14 +723,16 @@ pipe_poll(struct file *filp, poll_table *wait)
}

static int
-pipe_release(struct inode *inode, int decr, int decw)
+pipe_release(struct inode *inode, struct file *file)
{
struct pipe_inode_info *pipe;

mutex_lock(&inode->i_mutex);
pipe = inode->i_pipe;
- pipe->readers -= decr;
- pipe->writers -= decw;
+ if (file->f_mode & FMODE_READ)
+ pipe->readers--;
+ if (file->f_mode & FMODE_WRITE)
+ pipe->writers--;

if (!pipe->readers && !pipe->writers) {
free_pipe_info(inode);
@@ -756,35 +747,7 @@ pipe_release(struct inode *inode, int decr, int decw)
}

static int
-pipe_read_fasync(int fd, struct file *filp, int on)
-{
- struct inode *inode = file_inode(filp);
- int retval;
-
- mutex_lock(&inode->i_mutex);
- retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_readers);
- mutex_unlock(&inode->i_mutex);
-
- return retval;
-}
-
-
-static int
-pipe_write_fasync(int fd, struct file *filp, int on)
-{
- struct inode *inode = file_inode(filp);
- int retval;
-
- mutex_lock(&inode->i_mutex);
- retval = fasync_helper(fd, filp, on, &inode->i_pipe->fasync_writers);
- mutex_unlock(&inode->i_mutex);
-
- return retval;
-}
-
-
-static int
-pipe_rdwr_fasync(int fd, struct file *filp, int on)
+pipe_fasync(int fd, struct file *filp, int on)
{
struct inode *inode = file_inode(filp);
struct pipe_inode_info *pipe = inode->i_pipe;
@@ -801,129 +764,6 @@ pipe_rdwr_fasync(int fd, struct file *filp, int on)
return retval;
}

-
-static int
-pipe_read_release(struct inode *inode, struct file *filp)
-{
- return pipe_release(inode, 1, 0);
-}
-
-static int
-pipe_write_release(struct inode *inode, struct file *filp)
-{
- return pipe_release(inode, 0, 1);
-}
-
-static int
-pipe_rdwr_release(struct inode *inode, struct file *filp)
-{
- int decr, decw;
-
- decr = (filp->f_mode & FMODE_READ) != 0;
- decw = (filp->f_mode & FMODE_WRITE) != 0;
- return pipe_release(inode, decr, decw);
-}
-
-static int
-pipe_read_open(struct inode *inode, struct file *filp)
-{
- int ret = -ENOENT;
-
- mutex_lock(&inode->i_mutex);
-
- if (inode->i_pipe) {
- ret = 0;
- inode->i_pipe->readers++;
- }
-
- mutex_unlock(&inode->i_mutex);
-
- return ret;
-}
-
-static int
-pipe_write_open(struct inode *inode, struct file *filp)
-{
- int ret = -ENOENT;
-
- mutex_lock(&inode->i_mutex);
-
- if (inode->i_pipe) {
- ret = 0;
- inode->i_pipe->writers++;
- }
-
- mutex_unlock(&inode->i_mutex);
-
- return ret;
-}
-
-static int
-pipe_rdwr_open(struct inode *inode, struct file *filp)
-{
- int ret = -ENOENT;
-
- if (!(filp->f_mode & (FMODE_READ|FMODE_WRITE)))
- return -EINVAL;
-
- mutex_lock(&inode->i_mutex);
-
- if (inode->i_pipe) {
- ret = 0;
- if (filp->f_mode & FMODE_READ)
- inode->i_pipe->readers++;
- if (filp->f_mode & FMODE_WRITE)
- inode->i_pipe->writers++;
- }
-
- mutex_unlock(&inode->i_mutex);
-
- return ret;
-}
-
-/*
- * The file_operations structs are not static because they
- * are also used in linux/fs/fifo.c to do operations on FIFOs.
- *
- * Pipes reuse fifos' file_operations structs.
- */
-const struct file_operations read_pipefifo_fops = {
- .llseek = no_llseek,
- .read = do_sync_read,
- .aio_read = pipe_read,
- .write = bad_pipe_w,
- .poll = pipe_poll,
- .unlocked_ioctl = pipe_ioctl,
- .open = pipe_read_open,
- .release = pipe_read_release,
- .fasync = pipe_read_fasync,
-};
-
-const struct file_operations write_pipefifo_fops = {
- .llseek = no_llseek,
- .read = bad_pipe_r,
- .write = do_sync_write,
- .aio_write = pipe_write,
- .poll = pipe_poll,
- .unlocked_ioctl = pipe_ioctl,
- .open = pipe_write_open,
- .release = pipe_write_release,
- .fasync = pipe_write_fasync,
-};
-
-const struct file_operations rdwr_pipefifo_fops = {
- .llseek = no_llseek,
- .read = do_sync_read,
- .aio_read = pipe_read,
- .write = do_sync_write,
- .aio_write = pipe_write,
- .poll = pipe_poll,
- .unlocked_ioctl = pipe_ioctl,
- .open = pipe_rdwr_open,
- .release = pipe_rdwr_release,
- .fasync = pipe_rdwr_fasync,
-};
-
struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
{
struct pipe_inode_info *pipe;
@@ -996,7 +836,7 @@ static struct inode * get_pipe_inode(void)
inode->i_pipe = pipe;

pipe->readers = pipe->writers = 1;
- inode->i_fop = &rdwr_pipefifo_fops;
+ inode->i_fop = &pipefifo_fops;

/*
* Mark the inode dirty from the very beginning,
@@ -1039,13 +879,13 @@ int create_pipe_files(struct file **res, int flags)
d_instantiate(path.dentry, inode);

err = -ENFILE;
- f = alloc_file(&path, FMODE_WRITE, &write_pipefifo_fops);
+ f = alloc_file(&path, FMODE_WRITE, &pipefifo_fops);
if (IS_ERR(f))
goto err_dentry;

f->f_flags = O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT));

- res[0] = alloc_file(&path, FMODE_READ, &read_pipefifo_fops);
+ res[0] = alloc_file(&path, FMODE_READ, &pipefifo_fops);
if (IS_ERR(res[0]))
goto err_file;

@@ -1144,6 +984,144 @@ SYSCALL_DEFINE1(pipe, int __user *, fildes)
return sys_pipe2(fildes, 0);
}

+static int wait_for_partner(struct inode* inode, unsigned int *cnt)
+{
+ int cur = *cnt;
+
+ while (cur == *cnt) {
+ pipe_wait(inode->i_pipe);
+ if (signal_pending(current))
+ break;
+ }
+ return cur == *cnt ? -ERESTARTSYS : 0;
+}
+
+static void wake_up_partner(struct inode* inode)
+{
+ wake_up_interruptible(&inode->i_pipe->wait);
+}
+
+static int fifo_open(struct inode *inode, struct file *filp)
+{
+ struct pipe_inode_info *pipe;
+ int ret;
+
+ mutex_lock(&inode->i_mutex);
+ pipe = inode->i_pipe;
+ if (!pipe) {
+ ret = -ENOMEM;
+ pipe = alloc_pipe_info(inode);
+ if (!pipe)
+ goto err_nocleanup;
+ inode->i_pipe = pipe;
+ }
+ filp->f_version = 0;
+
+ /* We can only do regular read/write on fifos */
+ filp->f_mode &= (FMODE_READ | FMODE_WRITE);
+
+ switch (filp->f_mode) {
+ case FMODE_READ:
+ /*
+ * O_RDONLY
+ * POSIX.1 says that O_NONBLOCK means return with the FIFO
+ * opened, even when there is no process writing the FIFO.
+ */
+ pipe->r_counter++;
+ if (pipe->readers++ == 0)
+ wake_up_partner(inode);
+
+ if (!pipe->writers) {
+ if ((filp->f_flags & O_NONBLOCK)) {
+ /* suppress POLLHUP until we have
+ * seen a writer */
+ filp->f_version = pipe->w_counter;
+ } else {
+ if (wait_for_partner(inode, &pipe->w_counter))
+ goto err_rd;
+ }
+ }
+ break;
+
+ case FMODE_WRITE:
+ /*
+ * O_WRONLY
+ * POSIX.1 says that O_NONBLOCK means return -1 with
+ * errno=ENXIO when there is no process reading the FIFO.
+ */
+ ret = -ENXIO;
+ if ((filp->f_flags & O_NONBLOCK) && !pipe->readers)
+ goto err;
+
+ pipe->w_counter++;
+ if (!pipe->writers++)
+ wake_up_partner(inode);
+
+ if (!pipe->readers) {
+ if (wait_for_partner(inode, &pipe->r_counter))
+ goto err_wr;
+ }
+ break;
+
+ case FMODE_READ | FMODE_WRITE:
+ /*
+ * O_RDWR
+ * POSIX.1 leaves this case "undefined" when O_NONBLOCK is set.
+ * This implementation will NEVER block on a O_RDWR open, since
+ * the process can at least talk to itself.
+ */
+
+ pipe->readers++;
+ pipe->writers++;
+ pipe->r_counter++;
+ pipe->w_counter++;
+ if (pipe->readers == 1 || pipe->writers == 1)
+ wake_up_partner(inode);
+ break;
+
+ default:
+ ret = -EINVAL;
+ goto err;
+ }
+
+ /* Ok! */
+ mutex_unlock(&inode->i_mutex);
+ return 0;
+
+err_rd:
+ if (!--pipe->readers)
+ wake_up_interruptible(&pipe->wait);
+ ret = -ERESTARTSYS;
+ goto err;
+
+err_wr:
+ if (!--pipe->writers)
+ wake_up_interruptible(&pipe->wait);
+ ret = -ERESTARTSYS;
+ goto err;
+
+err:
+ if (!pipe->readers && !pipe->writers)
+ free_pipe_info(inode);
+
+err_nocleanup:
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+}
+
+const struct file_operations pipefifo_fops = {
+ .open = fifo_open,
+ .llseek = no_llseek,
+ .read = do_sync_read,
+ .aio_read = pipe_read,
+ .write = do_sync_write,
+ .aio_write = pipe_write,
+ .poll = pipe_poll,
+ .unlocked_ioctl = pipe_ioctl,
+ .release = pipe_release,
+ .fasync = pipe_fasync,
+};
+
/*
* Allocate a new array of pipe buffers and copy the info over. Returns the
* pipe size if successful, or return -ERROR on error.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..bf9f500 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2080,7 +2080,6 @@ extern int sync_filesystem(struct super_block *);
extern const struct file_operations def_blk_fops;
extern const struct file_operations def_chr_fops;
extern const struct file_operations bad_sock_fops;
-extern const struct file_operations def_fifo_fops;
#ifdef CONFIG_BLOCK
extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
@@ -2152,10 +2151,6 @@ extern void init_special_inode(struct inode *, umode_t, dev_t);
extern void make_bad_inode(struct inode *);
extern int is_bad_inode(struct inode *);

-extern const struct file_operations read_pipefifo_fops;
-extern const struct file_operations write_pipefifo_fops;
-extern const struct file_operations rdwr_pipefifo_fops;
-
#ifdef CONFIG_BLOCK
/*
* return READ, READA, or WRITE

2013-03-12 19:56:53

by Dave Jones

[permalink] [raw]
Subject: Re: pipe_release oops.

On Tue, Mar 12, 2013 at 07:43:53PM +0000, Al Viro wrote:
> On Tue, Mar 12, 2013 at 08:31:50AM -0700, Linus Torvalds wrote:
>
> > Probably not missing anything subtle. I think all of this code is very
> > old, and related to previous /proc/<pid>/fd/<n> escapades. And the
> > semantics for those files were in flux some time long long ago (the
> > whole "dup vs new struct file" issue), it's all just duct-tape, I
> > think.
>
> Umm... How about the following, then? I think it makes the whole thing
> simpler and saner... NOTE: this got only a light beating and we'd
> just seen an example of long-standing breakage in that area; I'd really
> like to see it tortured by Dave's scripts before it gets merged into
> mainline.

I can do that. I'll holler if I see anything that looks out of place.

Dave

2013-03-12 20:09:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: pipe_release oops.

On Tue, Mar 12, 2013 at 12:43 PM, Al Viro <[email protected]> wrote:
>
> Umm... How about the following, then? I think it makes the whole thing
> simpler and saner... NOTE: this got only a light beating and we'd
> just seen an example of long-standing breakage in that area; I'd really
> like to see it tortured by Dave's scripts before it gets merged into
> mainline.

Looks ok to me.

But it's very hard to see the changes when they are joined by code
movement, so either I'd almost like to see it split into two ("pure
movement" followed by "clean up"), or I'd like to feel a lot safer by
having somebody beat on named pipes with some app that actually uses
them. They are rather seldom used, it's easy to break them and not
even notice. For example, we have that whole "r/w_counter" logic that
is subtle (and mis-documented, I notice). It's not a "count of
readers/writers", it's a "*sequence* count of readers/writers having
come in", and it's needed for the whole "oh, we're waiting for a
writer, and one came in, but disappeared before we noticed, but we can
see that it was there from the sequence number".

So the whole FIFO code is trivial from the standpoint of sharing all
the IO code with pipes, but it's nontrivial in having some very
specific semantics at open time, and it's seldom actually used, and
easy to get wrong.

So anything like this needs to be either "obviously no semantic
changes", or needs some real fifo testing.

Linus

2013-03-12 20:34:24

by Jörn Engel

[permalink] [raw]
Subject: Re: pipe_release oops.

On Sun, 10 March 2013 23:33:18 +0000, Al Viro wrote:
> On Fri, Mar 08, 2013 at 01:26:49PM -0500, J??rn Engel wrote:
>
> > + files_cookie = async_schedule(exit_files_async, tsk);
> > exit_mm(tsk);
> >
> > if (group_dead)
> > @@ -990,7 +998,7 @@ void do_exit(long code)
> >
> > exit_sem(tsk);
> > exit_shm(tsk);
> > - exit_files(tsk);
> > + async_synchronize_cookie(files_cookie);
>
> That doesn't do what you seem to think it's doing. It does *not* wait
> for the completion of that sucker's execution - only the ones scheduled
> before it. IOW, your exit_files_async() might very well be executed
> *after* do_exit() completes and tsk gets reused.

Indeed. Maybe async_barrier() would have been a less dangerous name
in the presence of people like me. Oh well!

I have kernel/async.c on my list anyway. We've had a number of cases
where a single bad scsi device turned the "do the scanning
asynchronously" idea into a "make completely unrelated stuff
synchonously wait for a broken disk" reality.

Thanks for spotting this!

Jörn

--
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC labratories

2013-03-12 20:51:08

by Al Viro

[permalink] [raw]
Subject: Re: pipe_release oops.

On Tue, Mar 12, 2013 at 01:09:16PM -0700, Linus Torvalds wrote:
> On Tue, Mar 12, 2013 at 12:43 PM, Al Viro <[email protected]> wrote:
> >
> > Umm... How about the following, then? I think it makes the whole thing
> > simpler and saner... NOTE: this got only a light beating and we'd
> > just seen an example of long-standing breakage in that area; I'd really
> > like to see it tortured by Dave's scripts before it gets merged into
> > mainline.
>
> Looks ok to me.
>
> But it's very hard to see the changes when they are joined by code
> movement, so either I'd almost like to see it split into two ("pure
> movement" followed by "clean up"),

FWIW, that delta came from git diff HEAD^^ ;-) Internally, it's a couple
of commits; see vfs.git#pipe for details.

> or I'd like to feel a lot safer by
> having somebody beat on named pipes with some app that actually uses
> them.

Definitely. Well, one such app is sysvinit, so aforementioned light beating
did include that to some extent, but it certainly needs more testing.

> They are rather seldom used, it's easy to break them and not
> even notice. For example, we have that whole "r/w_counter" logic that
> is subtle (and mis-documented, I notice). It's not a "count of
> readers/writers", it's a "*sequence* count of readers/writers having
> come in", and it's needed for the whole "oh, we're waiting for a
> writer, and one came in, but disappeared before we noticed, but we can
> see that it was there from the sequence number".

... and moreover, it's interacting with poll()/select(). I _think_ it's
OK (note that we only step into that thing when we open an extra descriptor
via /proc/<pid>/fd/<pipe_fd>), but...

> So the whole FIFO code is trivial from the standpoint of sharing all
> the IO code with pipes, but it's nontrivial in having some very
> specific semantics at open time, and it's seldom actually used, and
> easy to get wrong.
>
> So anything like this needs to be either "obviously no semantic
> changes", or needs some real fifo testing.

Agreed.

2013-03-12 22:02:39

by Andrew Morton

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Mon, 11 Mar 2013 15:03:22 -0400 Peter Hurley <[email protected]> wrote:

> On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> > > > {
> >
> > I just hit this again on rc2 which looks like it has the fixes that
> > Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
> >
> > Dave
>
> Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
> want to apply. I run trinity 10 or more times a day and can't get this
> to trigger with the whole series.

Are you able to reproduce the bug with current mainline?

2013-03-12 22:33:54

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Tue, Mar 12, 2013 at 03:02:36PM -0700, Andrew Morton wrote:
> On Mon, 11 Mar 2013 15:03:22 -0400 Peter Hurley <[email protected]> wrote:
>
> > On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> > > > > {
> > >
> > > I just hit this again on rc2 which looks like it has the fixes that
> > > Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
> > >
> > > Dave
> >
> > Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
> > want to apply. I run trinity 10 or more times a day and can't get this
> > to trigger with the whole series.
>
> Are you able to reproduce the bug with current mainline?

Yep.
I'll give this patchset a try tomorrow.

Dave

2013-03-15 21:21:38

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Tue, Mar 12, 2013 at 03:02:36PM -0700, Andrew Morton wrote:
> On Mon, 11 Mar 2013 15:03:22 -0400 Peter Hurley <[email protected]> wrote:
>
> > On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> > > > > {
> > >
> > > I just hit this again on rc2 which looks like it has the fixes that
> > > Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
> > >
> > > Dave
> >
> > Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
> > want to apply. I run trinity 10 or more times a day and can't get this
> > to trigger with the whole series.
>
> Are you able to reproduce the bug with current mainline?

So with Peters 40-odd patches, I've not been able to reproduce this
after beating on it for a few days. So the fix is in there somewhere.

Dave

2013-03-25 16:37:14

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Mon, Mar 11, 2013 at 03:03:22PM -0400, Peter Hurley wrote:
> On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> > On Fri, Mar 08, 2013 at 07:27:01PM -0500, Peter Hurley wrote:
> >
> > > On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> > >
> > > > general protection fault: 0000 [#1] PREEMPT SMP
> > > > Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> > > > CPU 0
> > > > Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > > > RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> > > > RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> > > > RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> > > > RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> > > > RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> > > > R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> > > > R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> > > > FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> > > > Stack:
> > > > ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> > > > ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> > > > 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> > > > Call Trace:
> > > > [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> > > > [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> > > > [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> > > > [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > > [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> > > > [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> > > > Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
> > > >
> > > > 0000000000000000 <.text>:
> > > > 0: 48 3b 37 cmp (%rdi),%rsi
> > > > 3: 75 d5 jne 0xffffffffffffffda
> > > > 5: b8 01 00 00 00 mov $0x1,%eax
> > > > a: 5d pop %rbp
> > > > b: c3 retq
> > > > c: 0f 1f 40 00 nopl 0x0(%rax)
> > > > 10: 48 3b 37 cmp (%rdi),%rsi
> > > > 13: 74 c5 je 0xffffffffffffffda
> > > >
> > > > rdi is ascii. "ived-sy�" Curious.
> > > >
> > > > EIP is here in testmsg.
> > > >
> > > > case SEARCH_EQUAL:
> > > > if (msg->m_type == type)
> > > > 240: 48 3b 37 cmp (%rdi),%rsi
> > > > 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> > > > {
> >
> > I just hit this again on rc2 which looks like it has the fixes that
> > Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
> >
> > Dave
>
> Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
> want to apply. I run trinity 10 or more times a day and can't get this
> to trigger with the whole series.

bad news: Turns out my recent testing where I thought your patches fixed
this was incorrect. I had excluded fuzz testing of msgrcv, so it was never
getting exercised.

When I put that back, rc4 + your patches still crashes.
However the crash looks slightly different..
(That may be because I upgraded the compiler from 4.7->4.8 last week)


general protection fault: 0000 [#1] PREEMPT SMP
Modules linked in: af_key phonet cmtp rose kernelcapi l2tp_ppp l2tp_netlink l2tp_core pppoe pppox hidp can_raw ppp_generic slhc nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can irda appletalk ipx rds p8023 p8022 atm decnet crc_ccitt x25 psnap af_802154 ax25 llc nfc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_page_alloc rfkill vhost_net tun macvtap microcode macvlan edac_core snd_timer kvm_amd snd serio_raw kvm r8169 pcspkr soundcore mii radeon backlight drm_kms_helper ttm
CPU 1
Pid: 1067, comm: trinity-child1 Not tainted 3.9.0-rc4+ #120 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff812c2e9e>] [<ffffffff812c2e9e>] do_msgrcv+0x1ee/0x5f0
RSP: 0018:ffff8800be6dde98 EFLAGS: 00010297
RAX: 0000000001868405 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff8800be68c920 RSI: 0000000000004000 RDI: ffff8800cfb206e8
RBP: ffff8800be6ddf68 R08: 0000000000000001 R09: 0000000000000001
R10: ffff8800be68c920 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8800cfb20798 R14: ffff8800cfb206e8 R15: 6b6b6b6b6b6b6b6b
FS: 00007f1bf1243740(0000) GS:ffff88012b000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bf1206068 CR3: 00000000be5e0000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child1 (pid: 1067, threadinfo ffff8800be6dc000, task ffff8800be68c920)
Stack:
ffff8800bb674f50 ffff8800be68c920 ffff8800be68c920 ffff8800be68c920
ffffffff812c1b40 0000000000afb000 ffffffff81c7ad20 0000000001868405
0000000000000000 000040002e16f9d4 0000000000000001 ffff8800be68c920
Call Trace:
[<ffffffff812c1b40>] ? msg_security+0x10/0x10
[<ffffffff810b6bc5>] ? trace_hardirqs_on_caller+0x115/0x1a0
[<ffffffff8134aa6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff812c32b5>] sys_msgrcv+0x15/0x20
[<ffffffff816cda02>] system_call_fastpath+0x16/0x1b
Code: cc 83 fb 04 0f 84 f3 00 00 00 8b 74 24 4c 85 f6 0f 84 18 02 00 00 48 8b 44 24 38 48 39 44 24 50 0f 84 12 02 00 00 4c 89 7c 24 60 <4d> 8b 3f 48 ff 44 24 50 4d 39 ef 75 9d 0f 1f 44 00 00 48 81 7c


2b:* 4d 8b 3f mov (%r15),%r15 <-- trapping instruction
2e: 48 ff 44 24 50 incq 0x50(%rsp)
33: 4d 39 ef cmp %r13,%r15
36: 75 9d jne 0xffffffffffffffd5
38: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
3d: 48 rex.W
3e: 81 .byte 0x81
3f: 7c .byte 0x7c

objdump -S output shows that this is here in do_msgrcv()

875 } else
876 break;
877 msg_counter++;
878 }
879 tmp = tmp->next;
880 }
881 if (!IS_ERR(msg)) {

the tmp->next deref goes chasing a freed pointer.

Dave

2013-03-25 18:29:04

by Peter Hurley

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Mon, 2013-03-25 at 12:37 -0400, Dave Jones wrote:
> On Mon, Mar 11, 2013 at 03:03:22PM -0400, Peter Hurley wrote:
> > On Mon, 2013-03-11 at 14:26 -0400, Dave Jones wrote:
> > > On Fri, Mar 08, 2013 at 07:27:01PM -0500, Peter Hurley wrote:
> > >
> > > > On Thu, 2013-03-07 at 16:38 -0500, Dave Jones wrote:
> > > >
> > > > > general protection fault: 0000 [#1] PREEMPT SMP
> > > > > Modules linked in: rose ax25 phonet lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth snd_pcm snd_page_alloc snd_timer snd vhost_net rfkill tun macvtap usb_debug macvlan microcode serio_raw pcspkr kvm_amd soundcore edac_core r8169 mii kvm
> > > > > CPU 0
> > > > > Pid: 845, comm: trinity-child14 Not tainted 3.9.0-rc1+ #70 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > > > > RIP: 0010:[<ffffffff812b7b00>] [<ffffffff812b7b00>] testmsg.isra.1+0x40/0x60
> > > > > RSP: 0018:ffff880122b0fe78 EFLAGS: 00010246
> > > > > RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
> > > > > RDX: 0000000000000002 RSI: 000000002c24a9b2 RDI: 697665642d737983
> > > > > RBP: ffff880122b0fe78 R08: fffffff3f14b03ae R09: 0000000000000000
> > > > > R10: ffff880127bd8000 R11: 0000000000000000 R12: 000000002c24a9b2
> > > > > R13: ffff880123360798 R14: ffff8801233606e8 R15: 697665642d737973
> > > > > FS: 00007f2672bd3740(0000) GS:ffff88012ae00000(0000) knlGS:0000000000000000
> > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f2672b96068 CR3: 0000000127bc1000 CR4: 00000000000007f0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > Process trinity-child14 (pid: 845, threadinfo ffff880122b0e000, task ffff880127bd8000)
> > > > > Stack:
> > > > > ffff880122b0ff68 ffffffff812b8e7e ffff8801276d5b90 ffff880127bd8000
> > > > > ffff880127bd8000 ffff880127bd8000 0000000000000000 ffffffff812b78c0
> > > > > 0000000000000000 ffffffff81c7a260 0000000000000000 0000000000001000
> > > > > Call Trace:
> > > > > [<ffffffff812b8e7e>] do_msgrcv+0x1de/0x670
> > > > > [<ffffffff812b78c0>] ? load_msg+0x180/0x180
> > > > > [<ffffffff810b8685>] ? trace_hardirqs_on_caller+0x115/0x1a0
> > > > > [<ffffffff81341aae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > > > [<ffffffff812b9325>] sys_msgrcv+0x15/0x20
> > > > > [<ffffffff816cd982>] system_call_fastpath+0x16/0x1b
> > > > > Code: 83 fa 04 74 16 31 c0 5d c3 66 90 ff ca b8 01 00 00 00 74 f3 31 c0 eb ef 0f 1f 00 48 39 37 b8 01 00 00 00 7e e2 31 c0 eb de 66 90 <48> 3b 37 75 d5 b8 01 00 00 00 5d c3 0f 1f 40 00 48 3b 37 74 c5
> > > > >
> > > > > 0000000000000000 <.text>:
> > > > > 0: 48 3b 37 cmp (%rdi),%rsi
> > > > > 3: 75 d5 jne 0xffffffffffffffda
> > > > > 5: b8 01 00 00 00 mov $0x1,%eax
> > > > > a: 5d pop %rbp
> > > > > b: c3 retq
> > > > > c: 0f 1f 40 00 nopl 0x0(%rax)
> > > > > 10: 48 3b 37 cmp (%rdi),%rsi
> > > > > 13: 74 c5 je 0xffffffffffffffda
> > > > >
> > > > > rdi is ascii. "ived-sy�" Curious.
> > > > >
> > > > > EIP is here in testmsg.
> > > > >
> > > > > case SEARCH_EQUAL:
> > > > > if (msg->m_type == type)
> > > > > 240: 48 3b 37 cmp (%rdi),%rsi
> > > > > 243: 75 d5 jne 21a <testmsg.isra.1+0x1a>
> > > > > {
> > >
> > > I just hit this again on rc2 which looks like it has the fixes that
> > > Peter mentions above. This time rdi was 6b6b6b6b6b6b6b7b
> > >
> > > Dave
> >
> > Sorry, Dave. The fix must be in the other 8 patches that Andrew didn't
> > want to apply. I run trinity 10 or more times a day and can't get this
> > to trigger with the whole series.
>
> bad news: Turns out my recent testing where I thought your patches fixed
> this was incorrect. I had excluded fuzz testing of msgrcv, so it was never
> getting exercised.

I think you might have applied the wrong series. See below.

> When I put that back, rc4 + your patches still crashes.
> However the crash looks slightly different..
> (That may be because I upgraded the compiler from 4.7->4.8 last week)
>
>
> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: af_key phonet cmtp rose kernelcapi l2tp_ppp l2tp_netlink l2tp_core pppoe pppox hidp can_raw ppp_generic slhc nfnetlink scsi_transport_iscsi can_bcm ipt_ULOG can irda appletalk ipx rds p8023 p8022 atm decnet crc_ccitt x25 psnap af_802154 ax25 llc nfc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_page_alloc rfkill vhost_net tun macvtap microcode macvlan edac_core snd_timer kvm_amd snd serio_raw kvm r8169 pcspkr soundcore mii radeon backlight drm_kms_helper ttm
> CPU 1
> Pid: 1067, comm: trinity-child1 Not tainted 3.9.0-rc4+ #120 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff812c2e9e>] [<ffffffff812c2e9e>] do_msgrcv+0x1ee/0x5f0
> RSP: 0018:ffff8800be6dde98 EFLAGS: 00010297
> RAX: 0000000001868405 RBX: 0000000000000001 RCX: 0000000000000000
> RDX: ffff8800be68c920 RSI: 0000000000004000 RDI: ffff8800cfb206e8
> RBP: ffff8800be6ddf68 R08: 0000000000000001 R09: 0000000000000001
> R10: ffff8800be68c920 R11: 0000000000000001 R12: 0000000000000000
> R13: ffff8800cfb20798 R14: ffff8800cfb206e8 R15: 6b6b6b6b6b6b6b6b
> FS: 00007f1bf1243740(0000) GS:ffff88012b000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1bf1206068 CR3: 00000000be5e0000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child1 (pid: 1067, threadinfo ffff8800be6dc000, task ffff8800be68c920)
> Stack:
> ffff8800bb674f50 ffff8800be68c920 ffff8800be68c920 ffff8800be68c920
> ffffffff812c1b40 0000000000afb000 ffffffff81c7ad20 0000000001868405
> 0000000000000000 000040002e16f9d4 0000000000000001 ffff8800be68c920
> Call Trace:
> [<ffffffff812c1b40>] ? msg_security+0x10/0x10
> [<ffffffff810b6bc5>] ? trace_hardirqs_on_caller+0x115/0x1a0
> [<ffffffff8134aa6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff812c32b5>] sys_msgrcv+0x15/0x20
> [<ffffffff816cda02>] system_call_fastpath+0x16/0x1b
> Code: cc 83 fb 04 0f 84 f3 00 00 00 8b 74 24 4c 85 f6 0f 84 18 02 00 00 48 8b 44 24 38 48 39 44 24 50 0f 84 12 02 00 00 4c 89 7c 24 60 <4d> 8b 3f 48 ff 44 24 50 4d 39 ef 75 9d 0f 1f 44 00 00 48 81 7c
>
>
> 2b:* 4d 8b 3f mov (%r15),%r15 <-- trapping instruction
> 2e: 48 ff 44 24 50 incq 0x50(%rsp)
> 33: 4d 39 ef cmp %r13,%r15
> 36: 75 9d jne 0xffffffffffffffd5
> 38: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 3d: 48 rex.W
> 3e: 81 .byte 0x81
> 3f: 7c .byte 0x7c
>
> objdump -S output shows that this is here in do_msgrcv()
>
> 875 } else
> 876 break;
> 877 msg_counter++;
> 878 }
> 879 tmp = tmp->next;

This source line does not exist with the 'ipc IPC_MSGCOPY fixes'
patchset applied.

> 880 }
> 881 if (!IS_ERR(msg)) {

2013-03-25 18:40:01

by Dave Jones

[permalink] [raw]
Subject: Re: ipc/testmsg GPF.

On Mon, Mar 25, 2013 at 02:28:53PM -0400, Peter Hurley wrote:

> > bad news: Turns out my recent testing where I thought your patches fixed
> > this was incorrect. I had excluded fuzz testing of msgrcv, so it was never
> > getting exercised.
>
> I think you might have applied the wrong series. See below.

The last lot I applied from you was 30-40 or so patches, then some of them
got applied upstream, so I lost track of just what's left outstanding.

> > 879 tmp = tmp->next;
>
> This source line does not exist with the 'ipc IPC_MSGCOPY fixes'
> patchset applied.

Hmm, I don't see that in my lkml folder for this month.
Can you bounce me a copy ?

Dave

2013-03-27 13:51:33

by Dave Jones

[permalink] [raw]
Subject: Yet another pipe related oops.

On Tue, Mar 12, 2013 at 01:09:16PM -0700, Linus Torvalds wrote:
> On Tue, Mar 12, 2013 at 12:43 PM, Al Viro <[email protected]> wrote:
> >
> > Umm... How about the following, then? I think it makes the whole thing
> > simpler and saner... NOTE: this got only a light beating and we'd
> > just seen an example of long-standing breakage in that area; I'd really
> > like to see it tortured by Dave's scripts before it gets merged into
> > mainline.
>
> Looks ok to me.
>
> But it's very hard to see the changes when they are joined by code
> movement, so either I'd almost like to see it split into two ("pure
> movement" followed by "clean up"), or I'd like to feel a lot safer by
> having somebody beat on named pipes with some app that actually uses
> them. They are rather seldom used, it's easy to break them and not
> even notice. For example, we have that whole "r/w_counter" logic that
> is subtle (and mis-documented, I notice). It's not a "count of
> readers/writers", it's a "*sequence* count of readers/writers having
> come in", and it's needed for the whole "oh, we're waiting for a
> writer, and one came in, but disappeared before we noticed, but we can
> see that it was there from the sequence number".
>
> So the whole FIFO code is trivial from the standpoint of sharing all
> the IO code with pipes, but it's nontrivial in having some very
> specific semantics at open time, and it's seldom actually used, and
> easy to get wrong.
>
> So anything like this needs to be either "obviously no semantic
> changes", or needs some real fifo testing.

So here's something interesting..

I had Al's changes in my local tree since the 12th, without anything
pipe related turning up at all. Yesterday I switched back to using
an unpatched[*] tree based on rc4.

Today I woke up to this..

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD 11be33067 PUD 116800067 PMD 0
Oops: 0010 [#2] PREEMPT SMP
Modules linked in: dlci bridge 8021q garp stp mrp vmw_vsock_vmci_transport vmw_vmci vsock bnep rfcomm fuse hidp llc2 af_key caif_socket caif af_rxrpc netrom phonet rose cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core can_raw nfnetlink can_bcm pppoe can scsi_transport_iscsi pppox ppp_generic slhc ipt_ULOG appletalk ipx irda p8023 p8022 atm decnet rds psnap llc crc_ccitt x25 af_802154 ax25 nfc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth raid0 snd_hda_codec_realtek snd_hda_intel snd_hda_codec microcode serio_raw pcspkr snd_pcm rfkill snd_page_alloc snd_timer snd edac_core r8169 soundcore mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 1
Pid: 3935, comm: trinity-child9 Tainted: G D 3.9.0-rc4+ #2 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<0000000000000000>] [< (null)>] (null)
RSP: 0018:ffff880100bcdb98 EFLAGS: 00010286
RAX: ffffffffa01600e0 RBX: ffff880100bcdbb0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff880100bcdba0 RDI: ffff880100bcdbb0
RBP: ffff880100bcdca8 R08: 0000000000000000 R09: ffffffff811efbc0
R10: 0000000000000001 R11: ffffea0000713600 R12: ffff88004b45fc80
R13: ffff880100bcdce8 R14: ffff880025e1f000 R15: ffffffff811efbc0
FS: 00007f61b835c740(0000) GS:ffff88012b000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000105e07000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child9 (pid: 3935, threadinfo ffff880100bcc000, task ffff880064bb8000)
Stack:
ffffffff811bc0f8 ffff880025e1f000 0000000000000068 0000000000000000
0000000000000000 0000000000000000 ffffffff00000001 ffff88004b45fc80
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffff811bc0f8>] ? do_sync_write+0x98/0xd0
[<ffffffff81198c9b>] ? set_track+0x8b/0x190
[<ffffffff811bc995>] __kernel_write+0x115/0x120
[<ffffffff811efc12>] write_pipe_buf+0x52/0x80
[<ffffffff811ef153>] splice_from_pipe_feed+0x83/0x120
[<ffffffff811efbc0>] ? direct_splice_actor+0x30/0x30
[<ffffffff811ef496>] __splice_from_pipe+0x66/0x80
[<ffffffff811efbc0>] ? direct_splice_actor+0x30/0x30
[<ffffffff811f1126>] splice_from_pipe+0x86/0xa0
[<ffffffff811f1179>] default_file_splice_write+0x19/0x30
[<ffffffff811efb64>] do_splice_from+0x84/0xb0
[<ffffffff811efbb3>] direct_splice_actor+0x23/0x30
[<ffffffff811ef9c4>] splice_direct_to_actor+0xc4/0x1e0
[<ffffffff811efb90>] ? do_splice_from+0xb0/0xb0
[<ffffffff811f1209>] do_splice_direct+0x79/0x90
[<ffffffff811bd659>] do_sendfile+0x199/0x300
[<ffffffff8104d102>] ? do_setitimer+0x1c2/0x310
[<ffffffff811bd902>] sys_sendfile64+0x92/0xb0
[<ffffffff816cd242>] system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff880100bcdb98>
CR2: 0000000000000000

Could be that Al's patches refactored this bug away, or it could just be
that I've been lucky the last few weeks, and just haven't had the right
entropy to get the sequence of events right..

thoughts ?

Dave


[*] Asides from some pipe-unrelated patches fixing other oopses

2013-03-27 15:20:34

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Wed, Mar 27, 2013 at 09:51:27AM -0400, Dave Jones wrote:

> Could be that Al's patches refactored this bug away, or it could just be
> that I've been lucky the last few weeks, and just haven't had the right
> entropy to get the sequence of events right..
>
> thoughts ?

Actually, that's my fault - check lost in patch reordering. My apologies ;-/
Eventually, we want that in fs/splice.c side of things (no point repeating it
for every buffer, after all), but for now this is the obvious minimal fix.

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/read_write.c b/fs/read_write.c
index f7b5a23..e6ddc8d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -424,6 +424,9 @@ ssize_t __kernel_write(struct file *file, const char *buf, size_t count, loff_t
const char __user *p;
ssize_t ret;

+ if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+ return -EINVAL;
+
old_fs = get_fs();
set_fs(get_ds());
p = (__force const char __user *)buf;

2013-03-27 16:33:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Wed, Mar 27, 2013 at 8:20 AM, Al Viro <[email protected]> wrote:
>
> Actually, that's my fault - check lost in patch reordering. My apologies ;-/
> Eventually, we want that in fs/splice.c side of things (no point repeating it
> for every buffer, after all), but for now this is the obvious minimal fix.

Applied.

Do we actually have files with NULL f_ops pointers? Should we? What
could we possibly do with a file descriptor that doesn't have any
fops?

Also, perhaps we should do something more akin to what we do for
dentry functions where we validate them on registration, and we could
fix up or validate read/write pointers, with semantics something like

if (!fop->write)
fop->write = fop->aio_write ? do_sync_write : EINVAL_write;
if (!fop->read)
fop->read = fop->aio_read ? do_sync_read : EINVAL_read;

kind of things?

Not a big deal, perhaps.

Linus

2013-03-27 16:54:12

by Raymond Jennings

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Wed, Mar 27, 2013 at 9:33 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Mar 27, 2013 at 8:20 AM, Al Viro <[email protected]> wrote:
>>
>> Actually, that's my fault - check lost in patch reordering. My apologies ;-/
>> Eventually, we want that in fs/splice.c side of things (no point repeating it
>> for every buffer, after all), but for now this is the obvious minimal fix.
>
> Applied.
>
> Do we actually have files with NULL f_ops pointers? Should we? What
> could we possibly do with a file descriptor that doesn't have any
> fops?

For the sake of the curious including myself:

How would such a NULL f_ops file get created in the first place?

> Also, perhaps we should do something more akin to what we do for
> dentry functions where we validate them on registration, and we could
> fix up or validate read/write pointers, with semantics something like
>
> if (!fop->write)
> fop->write = fop->aio_write ? do_sync_write : EINVAL_write;
> if (!fop->read)
> fop->read = fop->aio_read ? do_sync_read : EINVAL_read;
>
> kind of things?
>
> Not a big deal, perhaps.
>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-03-27 17:45:15

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Wed, Mar 27, 2013 at 09:33:35AM -0700, Linus Torvalds wrote:

> Applied.
>
> Do we actually have files with NULL f_ops pointers? Should we? What
> could we possibly do with a file descriptor that doesn't have any
> fops?

We shouldn't, at least not for something that has been successfully
opened. I've a patch series cleaning that up a bit in the local
queue; will check for bitrot and throw into for-next.

Another thing that is a definite for-next fodder - we really have no
reason to put anything non-regular or opened not for write into ->s_files.
And since read-only opens outnumber write-only/read-write ones by far
(two orders of magnitude for something like kernel build), that gives
a nice reduction of files_lglock accesses. OTOH, the only remaining
user of those lists is forced remount to read-only, and I'm not at all
sure we wouldn't be better off by leaving those opened files alone and
just teaching file_start_write() to fail with EROFS on such fs. Then
we could get rid of files_lglock and ->s_files completely...

> Also, perhaps we should do something more akin to what we do for
> dentry functions where we validate them on registration, and we could
> fix up or validate read/write pointers, with semantics something like
>
> if (!fop->write)
> fop->write = fop->aio_write ? do_sync_write : EINVAL_write;
> if (!fop->read)
> fop->read = fop->aio_read ? do_sync_read : EINVAL_read;
>
> kind of things?

As it is, file_operations instances are const, and it's a good idea, IMO...

2013-04-01 20:34:50

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Wed, Mar 27, 2013 at 05:45:06PM +0000, Al Viro wrote:
> We shouldn't, at least not for something that has been successfully
> opened. I've a patch series cleaning that up a bit in the local
> queue; will check for bitrot and throw into for-next.

Egads... OK, that has gone more than slightly out of control - right now
vfs.git#for-next is at 106 commits, -3.6KLoC balance and *still* hadn't
reached the ->f_op part of that work ;-/ OTOH, procfs-related code got
a lot cleaner and we actually have a chance to make the guts of proc_dir_entry
private to fs/proc now... I'll cull the outright bug fixes into for-linus
and push it your way...

The thing that really worries me is debugfs; that fscker sets whatever
file_operations it's got from the driver that registered a file there
and sticks that into ->i_fop. When we try to open() that, we get
try_module_get() ->i_fop->owner; so far, so good, but what if the driver
has just been removed *and* file_operations instance we are looking at
has already been freed?

IOW, how do we deal with a race between attempt to open a debugfs file and
its removal on driver unload? Greg?

2013-04-01 21:00:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Mon, Apr 01, 2013 at 09:34:46PM +0100, Al Viro wrote:
> On Wed, Mar 27, 2013 at 05:45:06PM +0000, Al Viro wrote:
> > We shouldn't, at least not for something that has been successfully
> > opened. I've a patch series cleaning that up a bit in the local
> > queue; will check for bitrot and throw into for-next.
>
> Egads... OK, that has gone more than slightly out of control - right now
> vfs.git#for-next is at 106 commits, -3.6KLoC balance and *still* hadn't
> reached the ->f_op part of that work ;-/ OTOH, procfs-related code got
> a lot cleaner and we actually have a chance to make the guts of proc_dir_entry
> private to fs/proc now... I'll cull the outright bug fixes into for-linus
> and push it your way...
>
> The thing that really worries me is debugfs; that fscker sets whatever
> file_operations it's got from the driver that registered a file there
> and sticks that into ->i_fop. When we try to open() that, we get
> try_module_get() ->i_fop->owner; so far, so good, but what if the driver
> has just been removed *and* file_operations instance we are looking at
> has already been freed?
>
> IOW, how do we deal with a race between attempt to open a debugfs file and
> its removal on driver unload? Greg?

Hm, I thought the i_fop->owner thing would be the needed protection, but
I guess you are right, it will not. I guess we need to do what
character devices do and have an "intermediate" fops in order to protect
this. Would that work?

As module removal never happens unless an admin does it by hand, it's
not a "real" issue that can be triggered by anyone easily, thankfully.

thanks,

greg k-h

2013-04-01 21:21:49

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Mon, Apr 01, 2013 at 02:00:29PM -0700, Greg Kroah-Hartman wrote:

> > IOW, how do we deal with a race between attempt to open a debugfs file and
> > its removal on driver unload? Greg?
>
> Hm, I thought the i_fop->owner thing would be the needed protection, but

It will be, if you manage to fetch it...

> I guess you are right, it will not. I guess we need to do what
> character devices do and have an "intermediate" fops in order to protect
> this. Would that work?

You mean, with reassigning ->f_op in ->open()? That'll work, as long as
we have exclusion between removal and fetching the sucker in primary
->open()... Where would you prefer to stash fops?

2013-04-01 21:44:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Mon, Apr 01, 2013 at 10:21:42PM +0100, Al Viro wrote:
> On Mon, Apr 01, 2013 at 02:00:29PM -0700, Greg Kroah-Hartman wrote:
>
> > > IOW, how do we deal with a race between attempt to open a debugfs file and
> > > its removal on driver unload? Greg?
> >
> > Hm, I thought the i_fop->owner thing would be the needed protection, but
>
> It will be, if you manage to fetch it...

I agree.

> > I guess you are right, it will not. I guess we need to do what
> > character devices do and have an "intermediate" fops in order to protect
> > this. Would that work?
>
> You mean, with reassigning ->f_op in ->open()? That'll work, as long as
> we have exclusion between removal and fetching the sucker in primary
> ->open()... Where would you prefer to stash fops?

Ick, that's not going to work as the current api just uses a fops and
debugfs doesn't keep anything else hanging around that referes to
something "before" that, like 'struct cdev' does.

And, it's even worse, look at the use of DEFINE_SIMPLE_ATTRIBUTE(),
those take a pointer from a random module to read/write from, and use
the fops for the debugfs module. Hopefully no other user of that macro
has the same problem, and at first glance, I think that's true, but I
might be wrong...

Am I allowed to "punt" and say, "removing a module that uses debugfs is
not recommended?" :)

greg k-h

2013-04-01 23:27:22

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Mon, Apr 01, 2013 at 02:44:36PM -0700, Greg Kroah-Hartman wrote:
> > > I guess you are right, it will not. I guess we need to do what
> > > character devices do and have an "intermediate" fops in order to protect
> > > this. Would that work?
> >
> > You mean, with reassigning ->f_op in ->open()? That'll work, as long as
> > we have exclusion between removal and fetching the sucker in primary
> > ->open()... Where would you prefer to stash fops?
>
> Ick, that's not going to work as the current api just uses a fops and
> debugfs doesn't keep anything else hanging around that referes to
> something "before" that, like 'struct cdev' does.

Er? How about just sticking it into dentry->d_fsdata and letting
debugfs_remove() zero that out? What am I missing here?

2013-04-02 00:22:11

by Al Viro

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Tue, Apr 02, 2013 at 12:27:18AM +0100, Al Viro wrote:
> On Mon, Apr 01, 2013 at 02:44:36PM -0700, Greg Kroah-Hartman wrote:
> > > > I guess you are right, it will not. I guess we need to do what
> > > > character devices do and have an "intermediate" fops in order to protect
> > > > this. Would that work?
> > >
> > > You mean, with reassigning ->f_op in ->open()? That'll work, as long as
> > > we have exclusion between removal and fetching the sucker in primary
> > > ->open()... Where would you prefer to stash fops?
> >
> > Ick, that's not going to work as the current api just uses a fops and
> > debugfs doesn't keep anything else hanging around that referes to
> > something "before" that, like 'struct cdev' does.
>
> Er? How about just sticking it into dentry->d_fsdata and letting
> debugfs_remove() zero that out? What am I missing here?

Hrm... For what it's worth, how do debugfs entries associated with
dynamic objects deal with debugfs_remove() vs. method calls? I don't
see _anything_ in {,__}debugfs_remove() that would looks like "wait
for ongoing write(2) attempts to complete". IOW, forget rmmod - WTF
protects us from access-after-free for any kind of data that isn't
permanently allocated?

2013-04-02 01:55:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Yet another pipe related oops.

On Tue, Apr 02, 2013 at 01:22:04AM +0100, Al Viro wrote:
> On Tue, Apr 02, 2013 at 12:27:18AM +0100, Al Viro wrote:
> > On Mon, Apr 01, 2013 at 02:44:36PM -0700, Greg Kroah-Hartman wrote:
> > > > > I guess you are right, it will not. I guess we need to do what
> > > > > character devices do and have an "intermediate" fops in order to protect
> > > > > this. Would that work?
> > > >
> > > > You mean, with reassigning ->f_op in ->open()? That'll work, as long as
> > > > we have exclusion between removal and fetching the sucker in primary
> > > > ->open()... Where would you prefer to stash fops?
> > >
> > > Ick, that's not going to work as the current api just uses a fops and
> > > debugfs doesn't keep anything else hanging around that referes to
> > > something "before" that, like 'struct cdev' does.
> >
> > Er? How about just sticking it into dentry->d_fsdata and letting
> > debugfs_remove() zero that out? What am I missing here?

Nothing, you are right, that would work just fine. Want me to fix it
up, or do you want to?

> Hrm... For what it's worth, how do debugfs entries associated with
> dynamic objects deal with debugfs_remove() vs. method calls? I don't
> see _anything_ in {,__}debugfs_remove() that would looks like "wait
> for ongoing write(2) attempts to complete". IOW, forget rmmod - WTF
> protects us from access-after-free for any kind of data that isn't
> permanently allocated?

Nothing protects you from that, that's what I was trying to get at with
the dynamic attributes comment.

greg k-h