2012-05-01 11:00:34

by Joel Becker

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Mon, Apr 30, 2012 at 02:27:47PM +0200, Jana Saout wrote:
> Hello,
>
> I've been trying out the latest kernel and ran into an occasional oops
> on a machine with OCFS2 and another machine with autofs. (on x86_64)
>
> I've attached one of those as full log excerpt at the end of the mail
> for completeness.
>
> What the crashes have in common is that they always occur in fs/namei.c
> hash_name (inlined into link_path_walk):
>
>        [...]
>
>                 hash = (hash + a) * 9;
>                 len += sizeof(unsigned long);
>  here --->      a = *(unsigned long *)(name+len);
>                 /* Do we have any NUL or '/' bytes in this word? */
>                 mask = has_zero(a) | has_zero(a ^ REPEAT_BYTE('/'));
>        [...]
>
> The line got compiled into "mov 0(%rbp,%rcx,1),%rax" with rbp being
> "name" and "rcx" being len.
>
> Now, it seems ocfs2 and autofs both manage to call into link_path_walk
> with "name" not being word-aligned.
>
> In the first example oops rbp ends with 0x...ff9, which is not
> word-aligned, and in this particular case, the read goes one byte over
> the end of the page, hence the rare, but occasional oops. (similar issue
> for the autofs oops)

ocfs2 copies a fast symlink into a len+1 buffer, allocated with
kzalloc. I'm not sure kzalloc is required to provide word-aligned
allocs, but I think it does. And while you could easily walk off the
end of len+1 if you are adding sizeof(ulong), that new pointer should be
aligned. Am I missing something?

> Force-disabling CONFIG_DCACHE_WORD_ACCESS makes the oopses go away on
> those machines.
>
> Now, I guess, since the check is for dcache, and the name being passed
> in is from filesystem code and not dcache, that there is something weird
> going on here, or a case that has been missed, or something is happening
> that is not supposed to happen in OCFS2 or autofs.
>
> For the OCFS2 case I have a couple of oopses, always with almost
> identical backtraces with "ocfs2_fast_follow_link" in them. The autofs
> oops is the only one I ran into so far.

Do you have any ocfs2 OOPSen that are *not* in
fast_follow_link()? Where are they?

Joel

>
> Cheers,
> Jana
>
> OCFS2 oops:
>
> Apr 30 14:02:46 web5 kernel: PGD 180c067 PUD bf5f5067 PMD bf635067 PTE 0
> Apr 30 14:02:46 web5 kernel: Oops: 0000 [#8] PREEMPT SMP
> Apr 30 14:02:46 web5 kernel: CPU 0
> Apr 30 14:02:46 web5 kernel: Modules linked in: nfs lockd auth_rpcgss nfs_acl sunrpc autofs4 ocfs2 jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs
> Apr 30 14:02:46 web5 kernel:
> Apr 30 14:02:46 web5 kernel: Pid: 18880, comm: apache2 Tainted: G D 3.4.0-js1 #1
> Apr 30 14:02:46 web5 kernel: RIP: e030:[<ffffffff8113c29b>] [<ffffffff8113c29b>] link_path_walk+0xab/0x890
> Apr 30 14:02:46 web5 kernel: RSP: e02b:ffff88001e7a3bc8 EFLAGS: 00010257
> Apr 30 14:02:46 web5 kernel: RAX: 0000000000000000 RBX: ffff88001e7a3e08 RCX: 0000000000000000
> Apr 30 14:02:46 web5 kernel: RDX: 0000000000000000 RSI: 0000000000003230 RDI: 8080808080808080
> Apr 30 14:02:46 web5 kernel: RBP: ffff880147e6dff9 R08: fefefefefefefeff R09: 2f2f2f2f2f2f2f2f
> Apr 30 14:02:46 web5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800646c7878
> Apr 30 14:02:46 web5 kernel: R13: ffff880012103c00 R14: 0000000000000000 R15: ffff880012103c00
> Apr 30 14:02:46 web5 kernel: FS: 00007f9940f51750(0000) GS:ffff8800bff0c000(0000) knlGS:0000000000000000
> Apr 30 14:02:46 web5 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr 30 14:02:46 web5 kernel: CR2: ffff880147e6e000 CR3: 00000000051a8000 CR4: 0000000000000660
> Apr 30 14:02:46 web5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 30 14:02:46 web5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 30 14:02:46 web5 kernel: Process apache2 (pid: 18880, threadinfo ffff88001e7a2000, task ffff880012103c00)
> Apr 30 14:02:46 web5 kernel: Stack:
> Apr 30 14:02:46 web5 kernel: ffff880012103c00 ffffffff8112538c 0000000000000020 ffffffffa014f7d5
> Apr 30 14:02:46 web5 kernel: ffff88001e7a3c40 ffff880012103c00 ffff88001e7a3e08 ffff8800a115ed20
> Apr 30 14:02:46 web5 kernel: ffff8800646f33c0 000000094e96972a ffff880147e6dfef ffffffffa014f808
> Apr 30 14:02:46 web5 kernel: Call Trace:
> Apr 30 14:02:46 web5 kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
> Apr 30 14:02:46 web5 kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
> Apr 30 14:02:46 web5 kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
> Apr 30 14:02:46 web5 kernel: [<ffffffff8113c670>] ? link_path_walk+0x480/0x890
> Apr 30 14:02:46 web5 kernel: [<ffffffff8113cbe2>] ? path_lookupat+0x52/0x740
> Apr 30 14:02:46 web5 kernel: [<ffffffffa00fe05f>] ? ocfs2_wait_for_recovery+0x2f/0xc0 [ocfs2]
> Apr 30 14:02:46 web5 kernel: [<ffffffff810056c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
> Apr 30 14:02:46 web5 kernel: [<ffffffff8113d2fc>] ? do_path_lookup+0x2c/0xc0
> Apr 30 14:02:46 web5 kernel: [<ffffffff8113a94d>] ? getname_flags+0xed/0x260
> Apr 30 14:02:46 web5 kernel: [<ffffffff8113ed0e>] ? user_path_at_empty+0x5e/0xb0
> Apr 30 14:02:46 web5 kernel: [<ffffffff8141d251>] ? _raw_spin_lock_irqsave+0x11/0x60
> Apr 30 14:02:46 web5 kernel: [<ffffffffa00e2c7d>] ? __ocfs2_cluster_unlock.isra.28+0x2d/0xe0 [ocfs2]
> Apr 30 14:02:46 web5 kernel: [<ffffffff81420a30>] ? do_page_fault+0x2d0/0x540
> Apr 30 14:02:46 web5 kernel: [<ffffffff811342f0>] ? cp_new_stat+0xe0/0x100
> Apr 30 14:02:46 web5 kernel: [<ffffffff81134482>] ? vfs_fstatat+0x32/0x60
> Apr 30 14:02:46 web5 kernel: [<ffffffff81134622>] ? sys_newlstat+0x12/0x30
> Apr 30 14:02:46 web5 kernel: [<ffffffff814242f9>] ? system_call_fastpath+0x16/0x1b
> Apr 30 14:02:46 web5 kernel: Code: 49 b9 2f 2f 2f 2f 2f 2f 2f 2f 49 b8 ff fe fe fe fe fe fe fe 48 bf 80 80 80 80 80 80 80 80 66 90 4c 01 d0 48 83 c1 08 4c 8d 14 c0 <48> 8b 44 0d 00 48 89 c6 4e 8d 24 00 4c 31 ce 4a 8d 14 06 48 f7
> Apr 30 14:02:46 web5 kernel: RSP <ffff88001e7a3bc8>
> Apr 30 14:02:46 web5 kernel: CR2: ffff880147e6e000
> Apr 30 14:02:46 web5 kernel: ---[ end trace d2be4a7423d225ba ]---
>
>
> autofs oops:
>
> Apr 30 01:46:52 www2 kernel: PGD 180c067 PUD 1810067 PMD 8d5067 PTE 0
> Apr 30 01:46:52 www2 kernel: Oops: 0000 [#1] PREEMPT SMP
> Apr 30 01:46:52 www2 kernel: CPU 4
> Apr 30 01:46:52 www2 kernel: Modules linked in: autofs4 nfsd exportfs nfs lockd auth_rpcgss nfs_acl sunrpc ext4 jbd2 crc16
> Apr 30 01:46:52 www2 kernel:
> Apr 30 01:46:52 www2 kernel: Pid: 30128, comm: automount Not tainted 3.4.0-js1 #1
> Apr 30 01:46:52 www2 kernel: RIP: e030:[<ffffffff8113c38b>] [<ffffffff8113c38b>] link_path_walk+0xab/0x890
> Apr 30 01:46:52 www2 kernel: RSP: e02b:ffff8800023abbb8 EFLAGS: 00010206
> Apr 30 01:46:52 www2 kernel: RAX: 234f31435a3c3650 RBX: ffff8800023abd38 RCX: 0000000000000018
> Apr 30 01:46:52 www2 kernel: RDX: 0107010303010000 RSI: 9a989e8c8c9e8f91 RDI: 8080808080808080
> Apr 30 01:46:52 www2 kernel: RBP: ffff88001e1effe7 R08: fefefefefefefeff R09: 2f2f2f2f2f2f2f2f
> Apr 30 01:46:52 www2 kernel: R10: 3dc8bb5e2c1de8d0 R11: ffff8800023abb74 R12: 0000000000000000
> Apr 30 01:46:52 www2 kernel: R13: ffff8800751ff200 R14: 0000000000000000 R15: ffff8800751ff200
> Apr 30 01:46:52 www2 kernel: FS: 00007f241eb55750(0063) GS:ffff88007ff42000(0000) knlGS:0000000000000000
> Apr 30 01:46:52 www2 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr 30 01:46:52 www2 kernel: CR2: ffff88001e1f0000 CR3: 0000000065c76000 CR4: 0000000000000660
> Apr 30 01:46:52 www2 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 30 01:46:52 www2 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 30 01:46:52 www2 kernel: Process automount (pid: 30128, threadinfo ffff8800023aa000, task ffff8800751ff200)
> Apr 30 01:46:52 www2 kernel: Stack:
> Apr 30 01:46:52 www2 kernel: ffff8800023abcb0 ffff8800023abcb0 ffff8800023abce0 ffff8800023abe08
> Apr 30 01:46:52 www2 kernel: ffff8800751ff200 ffff8800751ff200 ffff8800751ff200 ffff880075024720
> Apr 30 01:46:52 www2 kernel: ffff880056423000 0000000300777777 ffff88001e1effe3 ffffffff8113b42a
> Apr 30 01:46:52 www2 kernel: Call Trace:
> Apr 30 01:46:52 www2 kernel: [<ffffffff8113b42a>] ? path_init+0x2fa/0x3c0
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a9580>] ? find_autofs_mount+0xb0/0xb0 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffff8113ccd2>] ? path_lookupat+0x52/0x740
> Apr 30 01:46:52 www2 kernel: [<ffffffff811461cf>] ? __d_alloc+0x11f/0x180
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a9580>] ? find_autofs_mount+0xb0/0xb0 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffff8113d3ec>] ? do_path_lookup+0x2c/0xc0
> Apr 30 01:46:52 www2 kernel: [<ffffffff81152a34>] ? dcache_dir_open+0x14/0x30
> Apr 30 01:46:52 www2 kernel: [<ffffffff8113d61d>] ? kern_path+0x1d/0x40
> Apr 30 01:46:52 www2 kernel: [<ffffffff811455ce>] ? dput+0x1e/0x190
> Apr 30 01:46:52 www2 kernel: [<ffffffff8114c40e>] ? mntput_no_expire+0x1e/0x140
> Apr 30 01:46:52 www2 kernel: [<ffffffff811270ce>] ? __kmalloc_track_caller+0x3e/0x1d0
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a9b7b>] ? _autofs_dev_ioctl+0xab/0x360 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a96a0>] ? autofs_dev_ioctl_ismountpoint+0x120/0x190 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a9cca>] ? _autofs_dev_ioctl+0x1fa/0x360 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffffa01a9e3e>] ? autofs_dev_ioctl+0xe/0x20 [autofs4]
> Apr 30 01:46:52 www2 kernel: [<ffffffff81140b5e>] ? do_vfs_ioctl+0x8e/0x4f0
> Apr 30 01:46:52 www2 kernel: [<ffffffff811455ce>] ? dput+0x1e/0x190
> Apr 30 01:46:52 www2 kernel: [<ffffffff81131708>] ? fput+0x198/0x260
> Apr 30 01:46:52 www2 kernel: [<ffffffff81141009>] ? sys_ioctl+0x49/0x90
> Apr 30 01:46:52 www2 kernel: [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b
> Apr 30 01:46:52 www2 kernel: Code: 49 b9 2f 2f 2f 2f 2f 2f 2f 2f 49 b8 ff fe fe fe fe fe fe fe 48 bf 80 80 80 80 80 80 80 80 66 90 4c 01 d0 48 83 c1 08 4c 8d 14 c0 <48> 8b 44 0d 00 48 89 c6 4e 8d 24 00 4c 31 ce 4a 8d 14 06 48 f7
> Apr 30 01:46:52 www2 kernel: RSP <ffff8800023abbb8>
> Apr 30 01:46:52 www2 kernel: CR2: ffff88001e1f0000
> Apr 30 01:46:52 www2 kernel: ---[ end trace b65a19b637bb67fb ]---

--

Life's Little Instruction Book #20

"Be forgiving of yourself and others."

http://www.jlbec.org/
[email protected]


2012-05-01 12:28:45

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hi Joel,

> > I've been trying out the latest kernel and ran into an occasional oops
> > on a machine with OCFS2 and another machine with autofs. (on x86_64)
> >
> > I've attached one of those as full log excerpt at the end of the mail
> > for completeness.
> >
> > What the crashes have in common is that they always occur in fs/namei.c
> > hash_name (inlined into link_path_walk):
> >
> >        [...]
> >
> >                 hash = (hash + a) * 9;
> >                 len += sizeof(unsigned long);
> >  here --->      a = *(unsigned long *)(name+len);
> >                 /* Do we have any NUL or '/' bytes in this word? */
> >                 mask = has_zero(a) | has_zero(a ^ REPEAT_BYTE('/'));
> >        [...]
> >
> > The line got compiled into "mov 0(%rbp,%rcx,1),%rax" with rbp being
> > "name" and "rcx" being len.
> >
> > Now, it seems ocfs2 and autofs both manage to call into link_path_walk
> > with "name" not being word-aligned.
> >
> > In the first example oops rbp ends with 0x...ff9, which is not
> > word-aligned, and in this particular case, the read goes one byte over
> > the end of the page, hence the rare, but occasional oops. (similar issue
> > for the autofs oops)
>
> ocfs2 copies a fast symlink into a len+1 buffer, allocated with
> kzalloc. I'm not sure kzalloc is required to provide word-aligned
> allocs, but I think it does.

I thought so too... maybe the backtrace is slightly misleading, they
sometimes are. I'm not too clear on what the exact callchain is here. Do
you want me to investigate some more?

> > Force-disabling CONFIG_DCACHE_WORD_ACCESS makes the oopses go away on
> > those machines.
> >
> > Now, I guess, since the check is for dcache, and the name being passed
> > in is from filesystem code and not dcache, that there is something weird
> > going on here, or a case that has been missed, or something is happening
> > that is not supposed to happen in OCFS2 or autofs.
> >
> > For the OCFS2 case I have a couple of oopses, always with almost
> > identical backtraces with "ocfs2_fast_follow_link" in them. The autofs
> > oops is the only one I ran into so far.
>
> Do you have any ocfs2 OOPSen that are *not* in
> fast_follow_link()? Where are they?

No, all (about 15) that I saw from ocfs2 have ocfs2_fast_follow_link in
them, the first few lines of the backtrace are always identical.

Here is another one:

Apr 29 21:00:22 web5 kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
Apr 29 21:00:22 web5 kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
Apr 29 21:00:22 web5 kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
Apr 29 21:00:22 web5 kernel: [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
Apr 29 21:00:22 web5 kernel: [<ffffffff8113ccd2>] ? path_lookupat+0x52/0x740
Apr 29 21:00:22 web5 kernel: [<ffffffff81320049>] ? sock_aio_read.part.22+0xd9/0x100
Apr 29 21:00:22 web5 kernel: [<ffffffff8113d3ec>] ? do_path_lookup+0x2c/0xc0
Apr 29 21:00:22 web5 kernel: [<ffffffff8113a9fd>] ? getname_flags+0xed/0x260
Apr 29 21:00:22 web5 kernel: [<ffffffff8113ebae>] ? user_path_at_empty+0x5e/0xb0
Apr 29 21:00:22 web5 kernel: [<ffffffff8112f4d8>] ? do_sync_read+0xb8/0xf0
Apr 29 21:00:22 web5 kernel: [<ffffffff81036c52>] ? pvclock_clocksource_read+0x52/0xf0
Apr 29 21:00:22 web5 kernel: [<ffffffff81134482>] ? vfs_fstatat+0x32/0x60
Apr 29 21:00:22 web5 kernel: [<ffffffff8100b4dd>] ? xen_clocksource_read+0x3d/0x70
Apr 29 21:00:22 web5 kernel: [<ffffffff811345f2>] ? sys_newstat+0x12/0x30

(all others are also coming from either sys_netstatat or sys_newlstat,
except for this one - still, the top part looks the same again):

Apr 30 10:07:30 web5 kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
Apr 30 10:07:30 web5 kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
Apr 30 10:07:30 web5 kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
Apr 30 10:07:30 web5 kernel: [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
Apr 30 10:07:30 web5 kernel: [<ffffffff8113e7fe>] ? path_openat+0xbe/0x3f0
Apr 30 10:07:30 web5 kernel: [<ffffffffa00e1617>] ? ocfs2_lock_res_free+0x77/0x730 [ocfs2]
Apr 30 10:07:30 web5 kernel: [<ffffffff8113ec55>] ? do_filp_open+0x45/0xb0
Apr 30 10:07:30 web5 kernel: [<ffffffff8114a98b>] ? alloc_fd+0xcb/0x110
Apr 30 10:07:30 web5 kernel: [<ffffffff8112f0e6>] ? do_sys_open+0xf6/0x1d0
Apr 30 10:07:30 web5 kernel: [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b

Jana



2012-05-03 05:02:52

by Nicholas Piggin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Linus did you see this thread? Any ideas what is going on?

On 1 May 2012 22:28, Jana Saout <[email protected]> wrote:
> Hi Joel,
>
>> > I've been trying out the latest kernel and ran into an occasional oops
>> > on a machine with OCFS2 and another machine with autofs. (on x86_64)
>> >
>> > I've attached one of those as full log excerpt at the end of the mail
>> > for completeness.
>> >
>> > What the crashes have in common is that they always occur in fs/namei.c
>> > hash_name (inlined into link_path_walk):
>> >
>> >        [...]
>> >
>> >                 hash = (hash + a) * 9;
>> >                 len += sizeof(unsigned long);
>> >  here --->      a = *(unsigned long *)(name+len);
>> >                 /* Do we have any NUL or '/' bytes in this word? */
>> >                 mask = has_zero(a) | has_zero(a ^ REPEAT_BYTE('/'));
>> >        [...]
>> >
>> > The line got compiled into "mov 0(%rbp,%rcx,1),%rax" with rbp being
>> > "name" and "rcx" being len.
>> >
>> > Now, it seems ocfs2 and autofs both manage to call into link_path_walk
>> > with "name" not being word-aligned.
>> >
>> > In the first example oops rbp ends with 0x...ff9, which is not
>> > word-aligned, and in this particular case, the read goes one byte over
>> > the end of the page, hence the rare, but occasional oops. (similar issue
>> > for the autofs oops)
>>
>>       ocfs2 copies a fast symlink into a len+1 buffer, allocated with
>> kzalloc.  I'm not sure kzalloc is required to provide word-aligned
>> allocs, but I think it does.
>
> I thought so too... maybe the backtrace is slightly misleading, they
> sometimes are. I'm not too clear on what the exact callchain is here. Do
> you want me to investigate some more?
>
>> > Force-disabling CONFIG_DCACHE_WORD_ACCESS makes the oopses go away on
>> > those machines.
>> >
>> > Now, I guess, since the check is for dcache, and the name being passed
>> > in is from filesystem code and not dcache, that there is something weird
>> > going on here, or a case that has been missed, or something is happening
>> > that is not supposed to happen in OCFS2 or autofs.
>> >
>> > For the OCFS2 case I have a couple of oopses, always with almost
>> > identical backtraces with "ocfs2_fast_follow_link" in them.  The autofs
>> > oops is the only one I ran into so far.
>>
>>       Do you have any ocfs2 OOPSen that are *not* in
>> fast_follow_link()?  Where are they?
>
> No, all (about 15) that I saw from ocfs2 have ocfs2_fast_follow_link in
> them, the first few lines of the backtrace are always identical.
>
> Here is another one:
>
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
> Apr 29 21:00:22 web5 kernel:  [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
> Apr 29 21:00:22 web5 kernel:  [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8113ccd2>] ? path_lookupat+0x52/0x740
> Apr 29 21:00:22 web5 kernel:  [<ffffffff81320049>] ? sock_aio_read.part.22+0xd9/0x100
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8113d3ec>] ? do_path_lookup+0x2c/0xc0
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8113a9fd>] ? getname_flags+0xed/0x260
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8113ebae>] ? user_path_at_empty+0x5e/0xb0
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8112f4d8>] ? do_sync_read+0xb8/0xf0
> Apr 29 21:00:22 web5 kernel:  [<ffffffff81036c52>] ? pvclock_clocksource_read+0x52/0xf0
> Apr 29 21:00:22 web5 kernel:  [<ffffffff81134482>] ? vfs_fstatat+0x32/0x60
> Apr 29 21:00:22 web5 kernel:  [<ffffffff8100b4dd>] ? xen_clocksource_read+0x3d/0x70
> Apr 29 21:00:22 web5 kernel:  [<ffffffff811345f2>] ? sys_newstat+0x12/0x30
>
> (all others are also coming from either sys_netstatat or sys_newlstat,
> except for this one - still, the top part looks the same again):
>
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
> Apr 30 10:07:30 web5 kernel:  [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
> Apr 30 10:07:30 web5 kernel:  [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8113e7fe>] ? path_openat+0xbe/0x3f0
> Apr 30 10:07:30 web5 kernel:  [<ffffffffa00e1617>] ? ocfs2_lock_res_free+0x77/0x730 [ocfs2]
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8113ec55>] ? do_filp_open+0x45/0xb0
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8114a98b>] ? alloc_fd+0xcb/0x110
> Apr 30 10:07:30 web5 kernel:  [<ffffffff8112f0e6>] ? do_sys_open+0xf6/0x1d0
> Apr 30 10:07:30 web5 kernel:  [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b
>
>        Jana

2012-05-03 05:57:23

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Wed, May 2, 2012 at 10:02 PM, Nick Piggin <[email protected]> wrote:
> Linus did you see this thread?

I did not..

> Any ideas what is going on?

Note that the discussion about aligned allocations is irrelevant. It
doesn't matter at all if the pathname allocation is aligned - what
matters is whether the last *component* of the pathname is aligned or
not, and that is not going to depend on the allocation alignment.

The word-at-a-time code assumes that no allocation will be the last
page (whether kmalloc or normal page allocation), which was always
somewhat optimistic but I thought it would be true on PC's.

And that %rbp value does *not* look like end-of-memory, but maybe
there is something else than just the CONFIG_DEBUG_PAGEALLOC that
causes us to punch holes even in the kernel memory map.
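
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code) that checks whether an 8-byte load at a given address spills
into the following page. The helper names and the 4 KiB page size are
assumptions made for this illustration; the addresses to feed it are the
RBP, RCX and CR2 values from the two oopses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, as on x86_64. */
#define PAGE_SIZE_ASSUMED 4096u

/* Address of the first byte of the page following the one that holds addr. */
static uint64_t next_page_start(uint64_t addr)
{
    return (addr | (PAGE_SIZE_ASSUMED - 1)) + 1;
}

/* True if an 8-byte load starting at addr touches the following page. */
static bool word_load_crosses_page(uint64_t addr)
{
    uint64_t last_byte = addr + sizeof(uint64_t) - 1;
    return last_byte >= next_page_start(addr);
}

/* Number of bytes of that 8-byte load that land in the following page. */
static unsigned bytes_over_page(uint64_t addr)
{
    uint64_t end = addr + sizeof(uint64_t);
    uint64_t next = next_page_start(addr);
    return end > next ? (unsigned)(end - next) : 0;
}
```

For the OCFS2 oops (RBP ffff880147e6dff9, RCX 0) the load runs one byte
over, and next_page_start() gives ffff880147e6e000, which is the reported
CR2. For the autofs oops (RBP ffff88001e1effe7, RCX 0x18) the load starts
at ffff88001e1effff, so seven bytes land in the page at ffff88001e1f0000,
again the reported CR2.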

Peter, Ingo - do we unmap kernel pages for PAT etc attributes?

Jana, can you send me the whole dmesg for the bootup up to and
including the oops?

There are multiple ways to fix this, including just marking that
unaligned word access as being able to take an exception, but I had
hoped to avoid having to do that. There are alternatives, like always
padding allocations up by 7 bytes, but those are nasty too. So I'd
like to understand what triggers this for Jana, it's possible we can
just work around that particular issue.
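
As a rough picture of what any such fix has to deliver, the sketch below
returns the word at p with every byte beyond the end of the current page
read as zero. This is a userspace illustration only: it uses an explicit
page-boundary check rather than an exception fixup or padded allocations,
and the function name and the 4 KiB page size are assumptions of the
sketch, not existing kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: 4 KiB pages, as on x86_64. */
#define PAGE_SIZE_ASSUMED 4096u

/*
 * Hypothetical helper: load one word starting at p, but never touch the
 * following page. Bytes that would come from the next page are left as
 * zero, so a NUL terminator search on the result still terminates.
 */
static unsigned long load_word_zeropad(const char *p)
{
    unsigned long word = 0;
    uintptr_t addr = (uintptr_t)p;
    /* Bytes available from p up to (not including) the next page. */
    size_t to_page_end = PAGE_SIZE_ASSUMED - (addr & (PAGE_SIZE_ASSUMED - 1));
    size_t n = to_page_end < sizeof(word) ? to_page_end : sizeof(word);

    /* Copy only the bytes that live in the current page. */
    memcpy(&word, p, n);
    return word;
}
```

The explicit check costs a branch on every load, which is the kind of
overhead the word-at-a-time code is trying to avoid; it is shown here only
because it is easy to reason about.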

Linus

2012-05-03 05:58:21

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Forgot to actually add Peter and Ingo to the cc..

Linus

On Wed, May 2, 2012 at 10:57 PM, Linus Torvalds
<[email protected]> wrote:
>
> The word-at-a-time code assumes that no allocation will be the last
> page (whether kmalloc or normal page allocation), which was always
> somewhat optimistic but I thought it would be true on PC's.
>
> And that %rbp value does *not* look like end-of-memory, but maybe
> there is something else than just the CONFIG_DEBUG_PAGEALLOC that
> causes us to punch holes even in the kernel memory map.
>
> Peter, Ingo - do we unmap kernel pages for PAT etc attributes?

2012-05-03 06:23:43

by Nicholas Piggin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 3 May 2012 15:57, Linus Torvalds <[email protected]> wrote:
> On Wed, May 2, 2012 at 10:02 PM, Nick Piggin <[email protected]> wrote:
>> Linus did you see this thread?
>
> I did not..
>
>> Any ideas what is going on?
>
> Note that the discussion about aligned allocations is irrelevant. It
> doesn't matter at all if the pathname allocation is aligned - what
> matters is whether the last *component* of the pathname is aligned or
> not, and that is not going to depend on the allocation alignment.
>
> The word-at-a-time code assumes that no allocation will be the last
> page (whether kmalloc or normal page allocation), which was always
> somewhat optimistic but I thought it would be true on PC's.
>
> And that %rbp value does *not* look like end-of-memory, but maybe
> there is something else than just the CONFIG_DEBUG_PAGEALLOC that
> causes us to punch holes even in the kernel memory map.
>
> Peter, Ingo - do we unmap kernel pages for PAT etc attributes?
>
> Jana, can you send me the whole dmesg for the bootup up to and
> including the oops?
>
> There are multiple ways to fix this, including just marking that
> unaligned word access as being able to take an exception, but I had
> hoped to avoid having to do that. There are alternatives, like always
> padding allocations up by 7 bytes, but those are nasty too. So I'd
> like to understand what triggers this for Jana, it's possible we can
> just work around that particular issue.

Ah, I see what you mean. kmalloc is padded to 8 bytes, but that's
irrelevant if the full string is an exact multiple of 8 bytes long and
the last component starts inside the last 8 bytes.

That seems to exonerate OCFS2 and autofs.
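
To make the arithmetic concrete, here is a minimal sketch (plain C, not
kernel code) that checks whether an 8-byte load starting at a given address
spills into the following page. It assumes 4 KiB pages; load_crosses_page(),
next_page_start() and SKETCH_PAGE_SIZE are names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on the x86_64 machine in this thread. */
#define SKETCH_PAGE_SIZE 4096ULL

/* True if a `width`-byte load starting at `addr` touches the next page. */
bool load_crosses_page(uint64_t addr, unsigned int width)
{
    uint64_t offset = addr & (SKETCH_PAGE_SIZE - 1);

    assert(width > 0 && width <= SKETCH_PAGE_SIZE);
    return offset + width > SKETCH_PAGE_SIZE;
}

/* First byte of the page after the one containing `addr`. */
uint64_t next_page_start(uint64_t addr)
{
    return (addr | (SKETCH_PAGE_SIZE - 1)) + 1;
}
```

With the %rbp value from the first oops in this thread, 0xffff880064de8ffb,
an 8-byte load crosses the boundary and the next page starts at
0xffff880064de9000, which is the CR2 value reported for that oops.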

vmalloc of course does guard pages, and that creeps into percpu
data and other things. It's not the case here, but would it be worth
putting a check in to catch that, or is it just a totally insane thing
to pass a vmalloc()/percpu_alloc()/etc. name string?

Any other strange possible corner cases? If we put a string on the stack,
do any architectures use vmalloc or anything strange for stacks?

2012-05-03 06:26:29

by Nicholas Piggin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 3 May 2012 16:23, Nick Piggin wrote:
> On 3 May 2012 15:57, Linus Torvalds wrote:
>> On Wed, May 2, 2012 at 10:02 PM, Nick Piggin wrote:
>>> Linus did you see this thread?
>>
>> I did not..
>>
>>>Any ideas what is going on?
>>
>> Note that the discussion about aligned allocations is irrelevant. It
>> doesn't matter at all if the pathname allocation is aligned - what
>> matters if whether the last *component* of the pathname is aligned or
>> not, and that is not going to depend on the allocation alignment.
>>
>> The word-at-a-time code assumes that no allocation will be the last
>> page (whether kmalloc or normal page allocation), which was always
>> somewhat optimistic but I thought it would be true on PC's.
>>
>> And that %rbp value does *not* look like end-of-memory, but maybe
>> there is something else than just the CONFIG_DEBUG_PAGEALLOC that
>> causes us to punch holes even in the kernel memory map.
>>
>> Peter, Ingo - do we unmap kernel pages for PAT etc attributes?
>>
>> Jana, can you send me the whole dmesg for the bootup up to and
>> including the oops?
>>
>> There are multiple ways to fix this, including just marking that
>> unaligned word access as being able to take an exception, but I had
>> hoped to avoid having to do that. There are alternatives, like always
>> padding allocations up by 7 bytes, but those are nasty too. So I'd
>> like to understand what triggers this for Jana, it's possible we can
>> just work around that particular issue.
>
> Ah, I see what you mean. kmalloc is padded to 8 bytes, but that's
> irrelevant if the full string is exactly a multiple of 8 bytes long and
> the last component starts inside the last 8 bytes.
>
> That seems to exonerate OCFS2 and autofs.
>
> vmalloc of course does guard pages, and that creeps into percpu
> data and other things. It's not the case here, but would it be worth
> putting a check in to catch that, or is it just a totally insane thing
> to pass a vmalloc()/percpu_alloc()/etc. name string?
>
> Any other strange possible corner cases? If we put a string on the stack,
> do any architectures use vmalloc or anything strange for stacks?

(I guess in practice the stack hardly matters, because you're not going
to get within 8 bytes of either end, unless stack overflow is imminent)

2012-05-03 06:39:16

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/02/2012 10:57 PM, Linus Torvalds wrote:
>
> There are multiple ways to fix this, including just marking that
> unaligned word access as being able to take an exception, but I had
> hoped to avoid having to do that. There are alternatives, like always
> padding allocations up by 7 bytes, but those are nasty too. So I'd
> like to understand what triggers this for Jana, it's possible we can
> just work around that particular issue.
>

Can we do the trick of aligning the pointer and ignoring the start?
That would allow even architectures that don't have unaligned accesses
to work, too.
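
For readers unfamiliar with the trick, a rough sketch follows. It is not the
kernel's hash_name(): it only finds the terminating NUL, it assumes a
little-endian machine, and word_has_zero() and aligned_word_strlen() are
hypothetical names. The pointer is rounded down to a word boundary and the
bytes in front of the string are forced to nonzero so they cannot be
mistaken for a terminator. Because every load is aligned, no load can
straddle a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_ONES  0x0101010101010101ULL
#define WORD_HIGHS 0x8080808080808080ULL

/* Nonzero if any byte of v is zero. */
static uint64_t word_has_zero(uint64_t v)
{
    return (v - WORD_ONES) & ~v & WORD_HIGHS;
}

/*
 * Length of the NUL-terminated string at s, using only aligned 8-byte
 * loads. Assumes little-endian byte order and that every aligned word
 * overlapping the string is readable.
 */
size_t aligned_word_strlen(const char *s)
{
    unsigned int skip = (unsigned int)((uintptr_t)s & 7);
    const char *p = s - skip;           /* round down to a word boundary */
    uint64_t w;
    size_t i = 0;

    assert(s != NULL);
    memcpy(&w, p, sizeof(w));
    if (skip)
        w |= (1ULL << (8 * skip)) - 1;  /* ignore the bytes before s */

    while (!word_has_zero(w)) {
        p += sizeof(w);
        memcpy(&w, p, sizeof(w));
    }
    while ((w >> (8 * i)) & 0xff)       /* locate the first zero byte */
        i++;
    return (size_t)((p + i) - s);
}
```

The sketch only detects the end of the string. A hash accumulated over these
aligned words would be built from different word values than one accumulated
over words starting at the string itself.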

> Apr 30 14:02:46 web5 kernel: RIP: e030:[<ffffffff8113c29b>] [<ffffffff8113c29b>] link_path_walk+0xab/0x890
> Apr 30 14:02:46 web5 kernel: RSP: e02b:ffff88001e7a3bc8 EFLAGS: 00010257
> Apr 30 14:02:46 web5 kernel: CS: e033

These segment values look odd in the extreme...

> Apr 30 14:02:46 web5 kernel: [<ffffffff810056c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e

... because he's running under Xen-PV. So his memory map can be
arbitrarily screwed seven ways to Sunday.

-hpa

2012-05-03 06:40:18

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/02/2012 11:38 PM, H. Peter Anvin wrote:
> On 05/02/2012 10:57 PM, Linus Torvalds wrote:
>>
>> There are multiple ways to fix this, including just marking that
>> unaligned word access as being able to take an exception, but I had
>> hoped to avoid having to do that. There are alternatives, like always
>> padding allocations up by 7 bytes, but those are nasty too. So I'd
>> like to understand what triggers this for Jana, it's possible we can
>> just work around that particular issue.
>>
>
> Can we do the trick of aligning the pointer and ignoring the start?
> That would allow even architectures that don't have unaligned accesses
> to work, too.
>
>> Apr 30 14:02:46 web5 kernel: RIP: e030:[<ffffffff8113c29b>] [<ffffffff8113c29b>] link_path_walk+0xab/0x890
>> Apr 30 14:02:46 web5 kernel: RSP: e02b:ffff88001e7a3bc8 EFLAGS: 00010257
>> Apr 30 14:02:46 web5 kernel: CS: e033
>
> These segment values look odd in the extreme...
>
>> Apr 30 14:02:46 web5 kernel: [<ffffffff810056c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
>
> ... because he's running under Xen-PV. So his memory map can be
> arbitrarily screwed seven ways to Sunday.
>

This almost makes me want to suggest adding a taint flag for PV.

-hpa

2012-05-03 06:47:28

by Al Viro

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Wed, May 02, 2012 at 10:57:00PM -0700, Linus Torvalds wrote:

> There are multiple ways to fix this, including just marking that
> unaligned word access as being able to take an exception, but I had
> hoped to avoid having to do that. There are alternatives, like always
> padding allocations up by 7 bytes, but those are nasty too. So I'd
> like to understand what triggers this for Jana, it's possible we can
> just work around that particular issue.

What I'd really like to know is whether we can hit the same kind of "steps
off the end of page" crap on pagecache-based symlinks with very long bodies.
The upper limit is 4K, which allows that sucker to reach the end of the page
on most architectures. And that can be done on just about any fs
supporting symlinks at all - create a symlink with a long body (up to the
limit), something like (("./"x2047)."a").
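
The Perl-style expression above yields a 4095-byte body: "./" repeated 2047
times, then "a". A rough C equivalent for building such a test symlink might
look like the sketch below. build_long_target() and make_long_symlink() are
hypothetical helper names, and whether symlink() accepts a body this long
depends on the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Fill buf with "./" repeated `repeats` times followed by "a".
 * Returns the string length, or 0 if buf is too small.
 */
size_t build_long_target(char *buf, size_t bufsize, size_t repeats)
{
    size_t len = 2 * repeats + 1;   /* "./" pairs plus the final "a" */
    size_t i;

    assert(buf != NULL);
    if (bufsize < len + 1)          /* need room for the NUL as well */
        return 0;
    for (i = 0; i < repeats; i++)
        memcpy(buf + 2 * i, "./", 2);
    buf[len - 1] = 'a';
    buf[len] = '\0';
    return len;
}

/* Create a symlink at linkpath whose body is 4095 bytes long. */
int make_long_symlink(const char *linkpath)
{
    char target[4096];

    if (build_long_target(target, sizeof(target), 2047) == 0)
        return -1;
    return symlink(target, linkpath);
}
```

Resolving a path through such a link on a kernel with
CONFIG_DCACHE_WORD_ACCESS enabled would exercise the case described above,
where the last component sits in the final bytes of a page.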

2012-05-03 06:55:42

by David Miller

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

From: "H. Peter Anvin"
Date: Wed, 02 May 2012 23:38:51 -0700

> On 05/02/2012 10:57 PM, Linus Torvalds wrote:
>>
>> There are multiple ways to fix this, including just marking that
>> unaligned word access as being able to take an exception, but I had
>> hoped to avoid having to do that. There are alternatives, like always
>> padding allocations up by 7 bytes, but those are nasty too. So I'd
>> like to understand what triggers this for Jana, it's possible we can
>> just work around that particular issue.
>>
>
> Can we do the trick of aligning the pointer and ignoring the start?
> That would allow even architectures that don't have unaligned accesses
> to work, too.

Doing that would flub the hash computation.

2012-05-03 06:58:12

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/02/2012 11:54 PM, David Miller wrote:
>>
>> Can we do the trick of aligning the pointer and ignoring the start?
>> That would allow even architectures that don't have unaligned accesses
>> to work, too.
>
> Doing that would flub the hash computation.

I guess the shifts would be too expensive?

-hpa

2012-05-03 07:03:53

by David Miller

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

From: "H. Peter Anvin"
Date: Wed, 02 May 2012 23:57:57 -0700

> On 05/02/2012 11:54 PM, David Miller wrote:
>>>
>>> Can we do the trick of aligning the pointer and ignoring the start?
>>> That would allow even architectures that don't have unaligned accesses
>>> to work, too.
>>
>> Doing that would flub the hash computation.
>
> I guess the shifts would be too expensive?

Yes, barrel-shifting (if that's your idea) would negate much of the
gain from the optimization.

Actually, thinking some more, a barrel-shifting loop would have the
same problem the current code has. You don't know if you are at the
end of the string until you do the tests on the word. But if you're
at the end of the page, you need to somehow elide that extra load
to get the word you're going to barrel-shift into the previous word.
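
As an illustration of the barrel-shifting step discussed here (a sketch
only, with the hypothetical name combine_words() and a little-endian
assumption), the unaligned word is rebuilt from two neighbouring aligned
words:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebuild the 8 bytes that start byte_off bytes into `lo`, given the
 * aligned word `lo` and the aligned word `hi` that follows it in memory.
 * Little-endian layout is assumed.
 */
uint64_t combine_words(uint64_t lo, uint64_t hi, unsigned int byte_off)
{
    assert(byte_off < 8);
    if (byte_off == 0)
        return lo;      /* shifting a 64-bit value by 64 is undefined */
    return (lo >> (8 * byte_off)) | (hi << (64 - 8 * byte_off));
}
```

The function needs `hi` as an input, so the caller has to have loaded the
following aligned word before it can test the combined word for a NUL or
'/'. When `lo` is the last word of a page and the string ends inside it,
that load of `hi` is the extra access that has to be avoided somehow.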

2012-05-03 08:02:13

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hi Linus,

> Jana, can you send me the whole dmesg for the bootup up to and
> including the oops?

[ note: that kernel has the Xen frontswap/selfballooning patch applied,
but the same issue also happened with vanilla 3.4.0-rc5; I tried that
just to make sure ]

BTW, the Oops is not fatal, the kernel keeps running for hours
afterwards.

Jana


Apr 30 00:19:35 web kernel: Initializing cgroup subsys cpu
Apr 30 00:19:35 web kernel: Linux version 3.4.0-js1 ([email protected]) (gcc version 4.6.3) #1 SMP PREEMPT Sun Apr 29 22:13:22 CEST 2012
Apr 30 00:19:35 web kernel: Command line: root=/dev/mapper/xvda console=hvc0 tmem reboot=10 automap=xvda
Apr 30 00:19:35 web kernel: KERNEL supported cpus:
Apr 30 00:19:35 web kernel: Intel GenuineIntel
Apr 30 00:19:35 web kernel: AMD AuthenticAMD
Apr 30 00:19:35 web kernel: Centaur CentaurHauls
Apr 30 00:19:35 web kernel: BIOS-provided physical RAM map:
Apr 30 00:19:35 web kernel: Xen: 0000000000000000 - 00000000000a0000 (usable)
Apr 30 00:19:35 web kernel: Xen: 00000000000a0000 - 0000000000100000 (reserved)
Apr 30 00:19:35 web kernel: Xen: 0000000000100000 - 0000000100800000 (usable)
Apr 30 00:19:35 web kernel: NX (Execute Disable) protection: active
Apr 30 00:19:35 web kernel: DMI not present or invalid.
Apr 30 00:19:35 web kernel: e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
Apr 30 00:19:35 web kernel: e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
Apr 30 00:19:35 web kernel: last_pfn = 0x100800 max_arch_pfn = 0x400000000
Apr 30 00:19:35 web kernel: last_pfn = 0x100000 max_arch_pfn = 0x400000000
Apr 30 00:19:35 web kernel: initial memory mapped : 0 - 01f4e000
Apr 30 00:19:35 web kernel: Base memory trampoline at [ffff88000009e000] 9e000 size 8192
Apr 30 00:19:35 web kernel: init_memory_mapping: 0000000000000000-0000000100000000
Apr 30 00:19:35 web kernel: 0000000000 - 0100000000 page 4k
Apr 30 00:19:35 web kernel: kernel direct mapping tables up to 100000000 @ 7fb000-1000000
Apr 30 00:19:35 web kernel: xen: setting RW the range fe8000 - 1000000
Apr 30 00:19:35 web kernel: init_memory_mapping: 0000000100000000-0000000100800000
Apr 30 00:19:35 web kernel: 0100000000 - 0100800000 page 4k
Apr 30 00:19:35 web kernel: kernel direct mapping tables up to 100800000 @ 7f7f6000-80000000
Apr 30 00:19:35 web kernel: xen: setting RW the range 7f7fb000 - 80000000
Apr 30 00:19:35 web kernel: No NUMA configuration found
Apr 30 00:19:35 web kernel: Faking a node at 0000000000000000-0000000100800000
Apr 30 00:19:35 web kernel: Initmem setup node 0 0000000000000000-0000000100800000
Apr 30 00:19:35 web kernel: NODE_DATA [000000007fffe000 - 000000007fffffff]
Apr 30 00:19:35 web kernel: Zone PFN ranges:
Apr 30 00:19:35 web kernel: DMA 0x00000010 -> 0x00001000
Apr 30 00:19:35 web kernel: DMA32 0x00001000 -> 0x00100000
Apr 30 00:19:35 web kernel: Normal 0x00100000 -> 0x00100800
Apr 30 00:19:35 web kernel: Movable zone start PFN for each node
Apr 30 00:19:35 web kernel: Early memory PFN ranges
Apr 30 00:19:35 web kernel: 0: 0x00000010 -> 0x000000a0
Apr 30 00:19:35 web kernel: 0: 0x00000100 -> 0x00100800
Apr 30 00:19:35 web kernel: On node 0 totalpages: 1050512
Apr 30 00:19:35 web kernel: DMA zone: 64 pages used for memmap
Apr 30 00:19:35 web kernel: DMA zone: 2031 pages reserved
Apr 30 00:19:35 web kernel: DMA zone: 1889 pages, LIFO batch:0
Apr 30 00:19:35 web kernel: DMA32 zone: 16320 pages used for memmap
Apr 30 00:19:35 web kernel: DMA32 zone: 1028160 pages, LIFO batch:31
Apr 30 00:19:35 web kernel: Normal zone: 32 pages used for memmap
Apr 30 00:19:35 web kernel: Normal zone: 2016 pages, LIFO batch:0
Apr 30 00:19:35 web kernel: SMP: Allowing 6 CPUs, 0 hotplug CPUs
Apr 30 00:19:35 web kernel: No local APIC present
Apr 30 00:19:35 web kernel: APIC: disable apic facility
Apr 30 00:19:35 web kernel: APIC: switched to apic NOOP
Apr 30 00:19:35 web kernel: nr_irqs_gsi: 16
Apr 30 00:19:35 web kernel: PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
Apr 30 00:19:35 web kernel: PCI: Warning: Cannot find a gap in the 32bit address range
Apr 30 00:19:35 web kernel: PCI: Unassigned devices with 32bit resource registers may break!
Apr 30 00:19:35 web kernel: Allocating PCI resources starting at 100900000 (gap: 100900000:400000)
Apr 30 00:19:35 web kernel: Booting paravirtualized kernel on Xen
Apr 30 00:19:35 web kernel: Xen version: 4.1.2 (preserve-AD)
Apr 30 00:19:35 web kernel: setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:6 nr_node_ids:1
Apr 30 00:19:35 web kernel: PERCPU: Embedded 27 pages/cpu @ffff88007ff0c000 s81536 r8192 d20864 u110592
Apr 30 00:19:35 web kernel: pcpu-alloc: s81536 r8192 d20864 u110592 alloc=27*4096
Apr 30 00:19:35 web kernel: pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 [0] 4 [0] 5
Apr 30 00:19:35 web kernel: Built 1 zonelists in Node order, mobility grouping on. Total pages: 1032065
Apr 30 00:19:35 web kernel: Policy zone: Normal
Apr 30 00:19:35 web kernel: Kernel command line: root=/dev/mapper/xvda console=hvc0 tmem reboot=10 automap=xvda
Apr 30 00:19:35 web kernel: PID hash table entries: 4096 (order: 3, 32768 bytes)
Apr 30 00:19:35 web kernel: Memory: 1938152k/4202496k available (4255k kernel code, 448k absent, 2263896k reserved, 4389k data, 2032k init)
Apr 30 00:19:35 web kernel: SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
Apr 30 00:19:35 web kernel: Preemptible hierarchical RCU implementation.
Apr 30 00:19:35 web kernel: ^IAdditional per-CPU info printed with stalls.
Apr 30 00:19:35 web kernel: NR_IRQS:4352 nr_irqs:64 16
Apr 30 00:19:35 web kernel: Console: colour dummy device 80x25
Apr 30 00:19:35 web kernel: console [tty0] enabled
Apr 30 00:19:35 web kernel: console [hvc0] enabled
Apr 30 00:19:35 web kernel: Xen: using vcpuop timer interface
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 0
Apr 30 00:19:35 web kernel: Detected 2100.082 MHz processor.
Apr 30 00:19:35 web kernel: Marking TSC unstable due to TSCs unsynchronized
Apr 30 00:19:35 web kernel: Calibrating delay loop (skipped), value calculated using timer frequency.. 4201.17 BogoMIPS (lpj=7000273)
Apr 30 00:19:35 web kernel: pid_max: default: 32768 minimum: 301
Apr 30 00:19:35 web kernel: Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Apr 30 00:19:35 web kernel: Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Apr 30 00:19:35 web kernel: Mount-cache hash table entries: 256
Apr 30 00:19:35 web kernel: tseg: 00d7f00000
Apr 30 00:19:35 web kernel: CPU: Physical Processor ID: 1
Apr 30 00:19:35 web kernel: CPU: Processor Core ID: 4
Apr 30 00:19:35 web kernel: SMP alternatives: switching to UP code
Apr 30 00:19:35 web kernel: cpu 0 spinlock event irq 17
Apr 30 00:19:35 web kernel: Performance Events:
Apr 30 00:19:35 web kernel: no APIC, boot with the "lapic" boot parameter to force-enable it.
Apr 30 00:19:35 web kernel: no hardware sampling interrupt available.
Apr 30 00:19:35 web kernel: Broken PMU hardware detected, using software events only.
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 1
Apr 30 00:19:35 web kernel: cpu 1 spinlock event irq 24
Apr 30 00:19:35 web kernel: SMP alternatives: switching to SMP code
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu1): hardware events not enabled
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 2
Apr 30 00:19:35 web kernel: cpu 2 spinlock event irq 31
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu2): hardware events not enabled
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 3
Apr 30 00:19:35 web kernel: cpu 3 spinlock event irq 38
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu3): hardware events not enabled
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 4
Apr 30 00:19:35 web kernel: cpu 4 spinlock event irq 45
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu4): hardware events not enabled
Apr 30 00:19:35 web kernel: installing Xen timer for CPU 5
Apr 30 00:19:35 web kernel: cpu 5 spinlock event irq 52
Apr 30 00:19:35 web kernel: NMI watchdog: disabled (cpu5): hardware events not enabled
Apr 30 00:19:35 web kernel: Brought up 6 CPUs
Apr 30 00:19:35 web kernel: devtmpfs: initialized
Apr 30 00:19:35 web kernel: Grant tables using version 2 layout.
Apr 30 00:19:35 web kernel: Grant table initialized
Apr 30 00:19:35 web kernel: NET: Registered protocol family 16
Apr 30 00:19:35 web kernel: bio: create slab <bio-0> at 0
Apr 30 00:19:35 web kernel: xen/balloon: Initialising balloon driver.
Apr 30 00:19:35 web kernel: xen-balloon: Initialising balloon driver.
Apr 30 00:19:35 web kernel: xen/balloon: Initializing Xen selfballooning driver.
Apr 30 00:19:35 web kernel: xen/balloon: Initializing frontswap selfshrinking driver.
Apr 30 00:19:35 web kernel: SCSI subsystem initialized
Apr 30 00:19:35 web kernel: Switching to clocksource xen
Apr 30 00:19:35 web kernel: NET: Registered protocol family 2
Apr 30 00:19:35 web kernel: IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
Apr 30 00:19:35 web kernel: TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
Apr 30 00:19:35 web kernel: TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
Apr 30 00:19:35 web kernel: TCP: Hash tables configured (established 524288 bind 65536)
Apr 30 00:19:35 web kernel: TCP: reno registered
Apr 30 00:19:35 web kernel: UDP hash table entries: 2048 (order: 4, 65536 bytes)
Apr 30 00:19:35 web kernel: UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
Apr 30 00:19:35 web kernel: NET: Registered protocol family 1
Apr 30 00:19:35 web kernel: PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Apr 30 00:19:35 web kernel: Placing 64MB software IO TLB between ffff880077400000 - ffff88007b400000
Apr 30 00:19:35 web kernel: software IO TLB at phys 0x77400000 - 0x7b400000
Apr 30 00:19:35 web kernel: platform rtc_cmos: registered platform RTC device (no PNP device found)
Apr 30 00:19:35 web kernel: microcode: CPU0: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: CPU1: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: CPU2: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: CPU3: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: CPU4: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: CPU5: patch_level=0x010000c4
Apr 30 00:19:35 web kernel: microcode: Microcode Update Driver: v2.00 <[email protected]>, Peter Oruba
Apr 30 00:19:35 web kernel: sha1_ssse3: Neither AVX nor SSSE3 is available/usable.
Apr 30 00:19:35 web kernel: audit: initializing netlink socket (disabled)
Apr 30 00:19:35 web kernel: type=2000 audit(1335737960.553:1): initialized
Apr 30 00:19:35 web kernel: HugeTLB registered 2 MB page size, pre-allocated 0 pages
Apr 30 00:19:35 web kernel: VFS: Disk quotas dquot_6.5.2
Apr 30 00:19:35 web kernel: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Apr 30 00:19:35 web kernel: msgmni has been set to 3785
Apr 30 00:19:35 web kernel: io scheduler noop registered
Apr 30 00:19:35 web kernel: io scheduler deadline registered
Apr 30 00:19:35 web kernel: io scheduler cfq registered (default)
Apr 30 00:19:35 web kernel: Event-channel device installed.
Apr 30 00:19:35 web kernel: frontswap enabled, RAM provided by Xen Transcendent Memory
Apr 30 00:19:35 web kernel: cleancache enabled, RAM provided by Xen Transcendent Memory
Apr 30 00:19:35 web kernel: Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
Apr 30 00:19:35 web kernel: brd: module loaded
Apr 30 00:19:35 web kernel: blkfront: xvda: flush diskcache: enabled
Apr 30 00:19:35 web kernel: Initialising Xen virtual ethernet driver.
Apr 30 00:19:35 web kernel: xvda: unknown partition table
Apr 30 00:19:35 web kernel: blkfront: xvdb: flush diskcache: enabled
Apr 30 00:19:35 web kernel: xvdb: unknown partition table
Apr 30 00:19:35 web kernel: Setting capacity to 8388608
Apr 30 00:19:35 web kernel: xvdb: detected capacity change from 0 to 4294967296
Apr 30 00:19:35 web kernel: blkfront: xvde: flush diskcache: enabled
Apr 30 00:19:35 web kernel: xvde: unknown partition table
Apr 30 00:19:35 web kernel: Setting capacity to 9856614400
Apr 30 00:19:35 web kernel: xvde: detected capacity change from 0 to 5046586572800
Apr 30 00:19:35 web kernel: i8042: No controller found
Apr 30 00:19:35 web kernel: mousedev: PS/2 mouse device common for all mice
Apr 30 00:19:35 web kernel: md: linear personality registered for level -1
Apr 30 00:19:35 web kernel: md: raid0 personality registered for level 0
Apr 30 00:19:35 web kernel: md: raid1 personality registered for level 1
Apr 30 00:19:35 web kernel: md: raid10 personality registered for level 10
Apr 30 00:19:35 web kernel: md: multipath personality registered for level -4
Apr 30 00:19:35 web kernel: device-mapper: ioctl: 4.22.0-ioctl (2011-10-19) initialised: [email protected]
Apr 30 00:19:35 web kernel: IPv4 over IPv4 tunneling driver
Apr 30 00:19:35 web kernel: TCP: cubic registered
Apr 30 00:19:35 web kernel: Initializing XFRM netlink socket
Apr 30 00:19:35 web kernel: NET: Registered protocol family 10
Apr 30 00:19:35 web kernel: IPv6 over IPv4 tunneling driver
Apr 30 00:19:35 web kernel: NET: Registered protocol family 17
Apr 30 00:19:35 web kernel: NET: Registered protocol family 15
Apr 30 00:19:35 web kernel: Registering the dns_resolver key type
Apr 30 00:19:35 web kernel: BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
Apr 30 00:19:35 web kernel: EDD information not available.
Apr 30 00:19:35 web kernel: Freeing unused kernel memory: 2032k freed
Apr 30 00:19:35 web kernel: Write protecting the kernel read-only data: 8192k
Apr 30 00:19:35 web kernel: Freeing unused kernel memory: 1872k freed
Apr 30 00:19:35 web kernel: Freeing unused kernel memory: 832k freed
Apr 30 00:19:35 web kernel: PM: Starting manual resume from disk
Apr 30 00:19:35 web kernel: kjournald starting. Commit interval 5 seconds
Apr 30 00:19:35 web kernel: EXT3-fs (dm-0): using internal journal
Apr 30 00:19:35 web kernel: EXT3-fs (dm-0): mounted filesystem with ordered data mode
Apr 30 00:19:35 web kernel: udevd (1070): /proc/1070/oom_adj is deprecated, please use /proc/1070/oom_score_adj instead.
Apr 30 00:19:35 web kernel: Adding 4194300k swap on /dev/xvdb. Priority:-1 extents:1 across:4194300k SSFS
Apr 30 00:19:35 web kernel: EXT3-fs (dm-0): using internal journal
Apr 30 00:19:35 web kernel: OCFS2 Node Manager 1.5.0
Apr 30 00:19:35 web kernel: OCFS2 DLM 1.5.0
Apr 30 00:19:35 web kernel: ocfs2: Registered cluster interface o2cb
Apr 30 00:19:35 web kernel: OCFS2 DLMFS 1.5.0
Apr 30 00:19:35 web kernel: OCFS2 User DLM kernel interface loaded
Apr 30 00:19:35 web kernel: warning: `rpc.statd' uses 32-bit capabilities (legacy support in use)
Apr 30 00:19:35 web kernel: o2net: Connected to node adm (num 2) at 192.168.101.101:7778
Apr 30 00:19:35 web kernel: o2net: Connected to node app (num 3) at 192.168.101.105:7778
Apr 30 00:19:35 web kernel: o2net: Accepted connection from node web-xxxxx (num 7) at 192.168.101.106:7778
Apr 30 00:19:35 web kernel: o2net: Accepted connection from node web5 (num 5) at 192.168.101.110:7778
Apr 30 00:19:35 web kernel: o2net: Accepted connection from node web-php5 (num 6) at 192.168.101.111:7778
Apr 30 00:19:35 web kernel: OCFS2 1.5.0
Apr 30 00:19:35 web kernel: o2dlm: Joining domain 94B06879B4C442399BFFD9BA45B4D52F ( 2 3 4 5 6 7 ) 6 nodes
Apr 30 00:19:35 web kernel: ocfs2: Mounting device (202,64) on (node 4, slot 3) with writeback data mode.
Apr 30 00:19:35 web kernel: eth0: no IPv6 routers present
Apr 30 00:19:35 web kernel: eth1: no IPv6 routers present
Apr 30 00:19:35 web kernel: eth2: no IPv6 routers present
Apr 30 04:55:54 web kernel: BUG: unable to handle kernel paging request at ffff880064de9000
Apr 30 04:55:54 web kernel: IP: [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 04:55:54 web kernel: PGD 180c067 PUD 9e5067 PMD b0c067 PTE 0
Apr 30 04:55:54 web kernel: Oops: 0000 [#1] PREEMPT SMP
Apr 30 04:55:54 web kernel: CPU 0
Apr 30 04:55:54 web kernel: Modules linked in: autofs4 ocfs2 jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs
Apr 30 04:55:54 web kernel:
Apr 30 04:55:54 web kernel: Pid: 23654, comm: apache2 Not tainted 3.4.0-js1 #1
Apr 30 04:55:54 web kernel: RIP: e030:[<ffffffff8113c38b>] [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 04:55:54 web kernel: RSP: e02b:ffff8800582b9c78 EFLAGS: 00010257
Apr 30 04:55:54 web kernel: RAX: 0000000000000000 RBX: ffff8800582b9e68 RCX: 0000000000000000
Apr 30 04:55:54 web kernel: RDX: 0000000000000000 RSI: 0000000000001a18 RDI: 8080808080808080
Apr 30 04:55:54 web kernel: RBP: ffff880064de8ffb R08: fefefefefefefeff R09: 2f2f2f2f2f2f2f2f
Apr 30 04:55:54 web kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007246db78
Apr 30 04:55:54 web kernel: R13: ffff8800d599e000 R14: 0000000000000000 R15: ffff8800d599e000
Apr 30 04:55:54 web kernel: FS: 00007fdb37a6c6d0(0000) GS:ffff88007ff0c000(0000) knlGS:0000000000000000
Apr 30 04:55:54 web kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 30 04:55:54 web kernel: CR2: ffff880064de9000 CR3: 0000000066be7000 CR4: 0000000000000660
Apr 30 04:55:54 web kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 04:55:54 web kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 30 04:55:54 web kernel: Process apache2 (pid: 23654, threadinfo ffff8800582b8000, task ffff8800d599e000)
Apr 30 04:55:54 web kernel: Stack:
Apr 30 04:55:54 web kernel: 0000000000008050 ffffffff8112538c 0000000000000020 ffffffffa014f7d5
Apr 30 04:55:54 web kernel: ffff8800582b9cf0 ffff8800d599e000 ffff8800582b9e68 ffff8800754c5220
Apr 30 04:55:54 web kernel: ffff88001835d3c0 000000066573e6d5 ffff880064de8ff4 ffffffffa014f808
Apr 30 04:55:54 web kernel: Call Trace:
Apr 30 04:55:54 web kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
Apr 30 04:55:54 web kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
Apr 30 04:55:54 web kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
Apr 30 04:55:54 web kernel: [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
Apr 30 04:55:54 web kernel: [<ffffffff8113e7fe>] ? path_openat+0xbe/0x3f0
Apr 30 04:55:54 web kernel: [<ffffffffa00e1617>] ? ocfs2_lock_res_free+0x77/0x730 [ocfs2]
Apr 30 04:55:54 web kernel: [<ffffffff8113ec55>] ? do_filp_open+0x45/0xb0
Apr 30 04:55:54 web kernel: [<ffffffff8114a98b>] ? alloc_fd+0xcb/0x110
Apr 30 04:55:54 web kernel: [<ffffffff8112f0e6>] ? do_sys_open+0xf6/0x1d0
Apr 30 04:55:54 web kernel: [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b
Apr 30 04:55:54 web kernel: Code: 49 b9 2f 2f 2f 2f 2f 2f 2f 2f 49 b8 ff fe fe fe fe fe fe fe 48 bf 80 80 80 80 80 80 80 80 66 90 4c 01 d0 48 83 c1 08 4c 8d 14 c0 <48> 8b 44 0d 00 48 89 c6 4e 8d 24 00 4c 31 ce 4a 8d 14 06 48 f7
Apr 30 04:55:54 web kernel: RIP [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 04:55:54 web kernel: RSP <ffff8800582b9c78>
Apr 30 04:55:54 web kernel: CR2: ffff880064de9000
Apr 30 04:55:54 web kernel: ---[ end trace a05958e307359332 ]---
Apr 30 08:21:08 web kernel: BUG: unable to handle kernel paging request at ffff88008db0e000
Apr 30 08:21:08 web kernel: IP: [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:21:08 web kernel: PGD 180c067 PUD be6067 PMD c54067 PTE 0
Apr 30 08:21:08 web kernel: Oops: 0000 [#2] PREEMPT SMP
Apr 30 08:21:08 web kernel: CPU 0
Apr 30 08:21:08 web kernel: Modules linked in: autofs4 ocfs2 jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs
Apr 30 08:21:08 web kernel:
Apr 30 08:21:08 web kernel: Pid: 17151, comm: apache2 Tainted: G D 3.4.0-js1 #1
Apr 30 08:21:08 web kernel: RIP: e030:[<ffffffff8113c38b>] [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:21:08 web kernel: RSP: e02b:ffff880004737bc8 EFLAGS: 00010257
Apr 30 08:21:08 web kernel: RAX: 0000000000000000 RBX: ffff880004737e08 RCX: 0000000000000000
Apr 30 08:21:08 web kernel: RDX: 0000000000000000 RSI: 000000000000a4a2 RDI: 8080808080808080
Apr 30 08:21:08 web kernel: RBP: ffff88008db0dffc R08: fefefefefefefeff R09: 2f2f2f2f2f2f2f2f
Apr 30 08:21:08 web kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800047ca178
Apr 30 08:21:08 web kernel: R13: ffff880017b10600 R14: 0000000000000000 R15: ffff880017b10600
Apr 30 08:21:08 web kernel: FS: 00007f7eadb206d0(0000) GS:ffff88007ff0c000(0000) knlGS:0000000000000000
Apr 30 08:21:08 web kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 30 08:21:08 web kernel: CR2: ffff88008db0e000 CR3: 000000003dbcd000 CR4: 0000000000000660
Apr 30 08:21:08 web kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 08:21:08 web kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 30 08:21:08 web kernel: Process apache2 (pid: 17151, threadinfo ffff880004736000, task ffff880017b10600)
Apr 30 08:21:08 web kernel: Stack:
Apr 30 08:21:08 web kernel: ffff880017b10600 ffffffff8112538c 0000000000000020 ffffffffa014f7d5
Apr 30 08:21:08 web kernel: ffff880004737c40 ffff880017b10600 ffff880004737e08 ffff8800754c5220
Apr 30 08:21:08 web kernel: ffff880073e77000 000000066573e6d5 ffff88008db0dff5 ffffffffa014f808
Apr 30 08:21:08 web kernel: Call Trace:
Apr 30 08:21:08 web kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
Apr 30 08:21:08 web kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
Apr 30 08:21:08 web kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
Apr 30 08:21:08 web kernel: [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
Apr 30 08:21:08 web kernel: [<ffffffff8113ccd2>] ? path_lookupat+0x52/0x740
Apr 30 08:21:08 web kernel: [<ffffffff8100b4dd>] ? xen_clocksource_read+0x3d/0x70
Apr 30 08:21:08 web kernel: [<ffffffff810876ea>] ? getnstimeofday+0x4a/0xc0
Apr 30 08:21:08 web kernel: [<ffffffff8113d3ec>] ? do_path_lookup+0x2c/0xc0
Apr 30 08:21:08 web kernel: [<ffffffff8113a9fd>] ? getname_flags+0xed/0x260
Apr 30 08:21:08 web kernel: [<ffffffff8113ebae>] ? user_path_at_empty+0x5e/0xb0
Apr 30 08:21:08 web kernel: [<ffffffff814162c7>] ? __bad_area_nosemaphore+0x8e/0x1fd
Apr 30 08:21:08 web kernel: [<ffffffff81420ab2>] ? do_page_fault+0x492/0x540
Apr 30 08:21:08 web kernel: [<ffffffff81134482>] ? vfs_fstatat+0x32/0x60
Apr 30 08:21:08 web kernel: [<ffffffff811345f2>] ? sys_newstat+0x12/0x30
Apr 30 08:21:08 web kernel: [<ffffffff8112fdee>] ? vfs_read+0x11e/0x150
Apr 30 08:21:08 web kernel: [<ffffffff8141db25>] ? page_fault+0x25/0x30
Apr 30 08:21:08 web kernel: [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b
Apr 30 08:21:08 web kernel: Code: 49 b9 2f 2f 2f 2f 2f 2f 2f 2f 49 b8 ff fe fe fe fe fe fe fe 48 bf 80 80 80 80 80 80 80 80 66 90 4c 01 d0 48 83 c1 08 4c 8d 14 c0 <48> 8b 44 0d 00 48 89 c6 4e 8d 24 00 4c 31 ce 4a 8d 14 06 48 f7
Apr 30 08:21:08 web kernel: RIP [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:21:08 web kernel: RSP <ffff880004737bc8>
Apr 30 08:21:08 web kernel: CR2: ffff88008db0e000
Apr 30 08:21:08 web kernel: ---[ end trace a05958e307359333 ]---
Apr 30 08:23:59 web kernel: BUG: unable to handle kernel paging request at ffff8800b5244000
Apr 30 08:23:59 web kernel: IP: [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:23:59 web kernel: PGD 180c067 PUD be6067 PMD d90067 PTE 0
Apr 30 08:23:59 web kernel: Oops: 0000 [#3] PREEMPT SMP
Apr 30 08:23:59 web kernel: CPU 0
Apr 30 08:23:59 web kernel: Modules linked in: autofs4 ocfs2 jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs
Apr 30 08:23:59 web kernel:
Apr 30 08:23:59 web kernel: Pid: 18840, comm: apache2 Tainted: G D 3.4.0-js1 #1
Apr 30 08:23:59 web kernel: RIP: e030:[<ffffffff8113c38b>] [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:23:59 web kernel: RSP: e02b:ffff8800aaba7c28 EFLAGS: 00010257
Apr 30 08:23:59 web kernel: RAX: 0000000000000000 RBX: ffff8800aaba7e68 RCX: 0000000000000000
Apr 30 08:23:59 web kernel: RDX: 0000000000000000 RSI: 000000000000aeac RDI: 8080808080808080
Apr 30 08:23:59 web kernel: RBP: ffff8800b5243ffb R08: fefefefefefefeff R09: 2f2f2f2f2f2f2f2f
Apr 30 08:23:59 web kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007246db78
Apr 30 08:23:59 web kernel: R13: ffff880074cc6000 R14: 0000000000000000 R15: ffff880074cc6000
Apr 30 08:23:59 web kernel: FS: 00007f7eadb206d0(0000) GS:ffff88007ff0c000(0000) knlGS:0000000000000000
Apr 30 08:23:59 web kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 30 08:23:59 web kernel: CR2: ffff8800b5244000 CR3: 00000000aab8a000 CR4: 0000000000000660
Apr 30 08:23:59 web kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 08:23:59 web kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 30 08:23:59 web kernel: Process apache2 (pid: 18840, threadinfo ffff8800aaba6000, task ffff880074cc6000)
Apr 30 08:23:59 web kernel: Stack:
Apr 30 08:23:59 web kernel: 0000000000008050 ffffffff8112538c 0000000000000020 ffffffffa014f7d5
Apr 30 08:23:59 web kernel: ffff8800aaba7ca0 ffff880074cc6000 ffff8800aaba7e68 ffff8800754c5220
Apr 30 08:23:59 web kernel: ffff88001835d3c0 000000066573e6d5 ffff8800b5243ff4 ffffffffa014f808
Apr 30 08:23:59 web kernel: Call Trace:
Apr 30 08:23:59 web kernel: [<ffffffff8112538c>] ? __kmalloc+0x17c/0x1e0
Apr 30 08:23:59 web kernel: [<ffffffffa014f7d5>] ? ocfs2_fast_follow_link+0x95/0x320 [ocfs2]
Apr 30 08:23:59 web kernel: [<ffffffffa014f808>] ? ocfs2_fast_follow_link+0xc8/0x320 [ocfs2]
Apr 30 08:23:59 web kernel: [<ffffffff8113c760>] ? link_path_walk+0x480/0x890
Apr 30 08:23:59 web kernel: [<ffffffff81036c52>] ? pvclock_clocksource_read+0x52/0xf0
Apr 30 08:23:59 web kernel: [<ffffffff8113ccd2>] ? path_lookupat+0x52/0x740
Apr 30 08:23:59 web kernel: [<ffffffff8104bad1>] ? sys_gettimeofday+0x31/0x80
Apr 30 08:23:59 web kernel: [<ffffffff8101a2eb>] ? emulate_vsyscall+0x31b/0x350
Apr 30 08:23:59 web kernel: [<ffffffff8113d3ec>] ? do_path_lookup+0x2c/0xc0
Apr 30 08:23:59 web kernel: [<ffffffff8113a9fd>] ? getname_flags+0xed/0x260
Apr 30 08:23:59 web kernel: [<ffffffff8113ebae>] ? user_path_at_empty+0x5e/0xb0
Apr 30 08:23:59 web kernel: [<ffffffff8104c912>] ? local_bh_enable_ip+0x22/0xb0
Apr 30 08:23:59 web kernel: [<ffffffff81325e9b>] ? sock_setsockopt+0x8b/0x870
Apr 30 08:23:59 web kernel: [<ffffffff8112e961>] ? sys_faccessat+0xa1/0x1c0
Apr 30 08:23:59 web kernel: [<ffffffff814241b9>] ? system_call_fastpath+0x16/0x1b
Apr 30 08:23:59 web kernel: Code: 49 b9 2f 2f 2f 2f 2f 2f 2f 2f 49 b8 ff fe fe fe fe fe fe fe 48 bf 80 80 80 80 80 80 80 80 66 90 4c 01 d0 48 83 c1 08 4c 8d 14 c0 <48> 8b 44 0d 00 48 89 c6 4e 8d 24 00 4c 31 ce 4a 8d 14 06 48 f7
Apr 30 08:23:59 web kernel: RIP [<ffffffff8113c38b>] link_path_walk+0xab/0x890
Apr 30 08:23:59 web kernel: RSP <ffff8800aaba7c28>
Apr 30 08:23:59 web kernel: CR2: ffff8800b5244000
Apr 30 08:23:59 web kernel: ---[ end trace a05958e307359334 ]---

2012-05-03 16:16:06

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Wed, May 2, 2012 at 11:47 PM, Al Viro <[email protected]> wrote:
>
> What I'd really like to know is whether we can hit the same kind of "steps
> off the end of page" crap on pagecache based symlinks with very long bodies.
> The upper limit is 4K, which allows that sucker to reach the end of page
> on most of the architectures.  And that can be done on just about any fs
> supporting symlinks at all - create a symlink with the long body (up to the
> limit), something like (("./"x2047)."a").

Sure. And I've even tested that. You don't need a symlink, you can
have just a regular system call that passes a filename that is 4095
bytes long, and ends in a short component. We *will* access the next
page.

And that was always understood to be the case for the
DCACHE_WORD_ACCESS patches. It's just that accessing the next page
should be entirely harmless on any actual real x86 implementation.

Well, except for the DEBUG_PAGEALLOC case, which is why it's disabled
for that case.

And apparently except for some Xen PV case, where we have similar
issues, and which seems to be the reason Jana sees the problems.

I don't know the Xen paravirtualization code, but it looks like it is
punching holes in the kernel memory map, so you get the same issue you
get with DEBUG_PAGEALLOC.

Actually, looking at things, I think there's another case that can do
it: the AMD gart_64 code also does set_memory_np(), which can cause
problems.

So I guess I need to do the exception handling that I was hoping I
wouldn't have to. Give me a jiffy.

Linus

2012-05-03 17:30:29

by Al Viro

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 03, 2012 at 09:15:41AM -0700, Linus Torvalds wrote:
> I don't know the Xen paravirtualization code, but it looks like it is
> punching holes in the kernel memory map, so you get the same issue you
> get with DEBUG_PAGEALLOC.
>
> Actually, looking at things, I think there's another case that can do
> it: the AMD gart_64 code also does set_memory_np(), which can cause
> problems.
>
> So I guess I need to do the exception handling that I was hoping I
> wouldn't have to. Give me a jiffy.

BTW, I've looked through the ->readlink()/->follow_link() instances and
there's an interesting picture:
* "slow" ocfs2 symlinks could bloody well use generic_readlink();
page_readlink() doesn't buy us anything when we have page_follow_link_light()
as ->follow_link().
* "fast" ocfs2 symlinks would probably be better off if they just
added ->readlink() of their own and used the same inode_operations as
the rest of them. And to hell with those dances with kmalloc and special
->readlink().
* ecryptfs is *definitely* better off by switching to generic_readlink()
and having ecryptfs_follow_link() call ecryptfs_readlink_lower() directly;
we get rid of one of the rounds of kmalloc/memcpy/kfree on that, not to mention
that memcpy being killed is actually copy_to_user() wrapped into set_fs().

I've done (completely untested) patches for those - see vfs.git#symlinks;
if ocfs2 folks can live with that, I'll drop those into #for-next.

BTW, after that we have generic_readlink() for _everything_ with normal symlink
semantics. Places that are different:
* /proc/<pid>/{*,fd/*} - magical symlinks, ->follow_link() actually
does a direct jump.
* /proc/self - different target for every process; we _could_ have
switched it to generic_readlink(), but I'm actually tempted to make it
a "direct jump" kind of symlink instead - its ->follow_link() would be
nicer (and faster) if we did that.
* hppfs symlinks - those are bounced to procfs, so they inherit
the weirdness
* afs automount points. Those are not symlinks at all; we are
probably tied by avoiding userland breakage here, but readlink(2) on those
is badly abusing the syscall. It's basically "which syscall could we use
to tell what'll get mounted when we step on automount point? aha, readlink()
returns a string, let's use it".
* bad_inode - actually, we could use generic_readlink() there as well,
it'll fail with the right error ;-) Again, this one is not quite a garden
variety symlink.
And that's it...

2012-05-03 17:31:03

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 3, 2012 at 9:15 AM, Linus Torvalds
<[email protected]> wrote:
>
> So I guess I need to do the exception handling that I was hoping I
> wouldn't have to. Give me a jiffy.

Ok, that took longer than a jiffy, the asm was just nasty to get right
with all the proper suffixes for 32-bit vs 64-bit, and the fact that
gas apparently really needs %cl for the shift count, and doesn't like
%rcx. Silly assembler.

Also, the asm would have been much simpler if I didn't care so much
about the regular fast-path. I wanted the fast-path for the asm to be
a single load, with no downside, and everything fixed up in the
exception case.

And it's close. It's a single load, and the only downside is that
register '%rcx' is marked as used, because *if* the exception happens,
we want to use %rcx to do the alignment fixup.

Peter, in particular, can you double (and triple-) check my asm, to
see if I missed anything? It does that "lea" of the address into %rcx
twice, because that way we don't need any other register temporaries.

On 32-bit, this results in:

- fast-path single-instruction unaligned load (with gcc free to pick
registers and addressing modes):

movl (%edi,%edx),%eax

- with the exception fixup code becoming:

leal (%edi,%edx),%ecx
andl $-4,%ecx
movl (%ecx),%eax
leal (%edi,%edx),%ecx
andl $3,%ecx
shll $3,%ecx
shll %cl,%eax
shrl %cl,%eax
jmp 2b

which looks ok. I don't worry about the efficiency of the fixup code,
because if that code is ever entered we will have taken a page fault
etc, so the only thing to worry about is that the fixup doesn't
need/fix any unnecessary extra registers so that the fast-path case
doesn't get less flexible.

Does anybody see anything wrong with this?

Anyway, with this, I guess we could enable word-at-a-time even with
CONFIG_DEBUG_PAGEALLOC on x86, and that might even be a good idea for
coverage.

Jana - does the attached patch work for you?

Linus


Attachments:
patch.diff (3.49 kB)

2012-05-03 18:14:06

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/03/2012 10:30 AM, Linus Torvalds wrote:
> On Thu, May 3, 2012 at 9:15 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> So I guess I need to do the exception handling that I was hoping I
>> wouldn't have to. Give me a jiffy.
>
> Ok, that took longer than a jiffy, the asm was just nasty to get right
> with all the proper suffixes for 32-bit vs 64-bit, and the fact that
> gas apparently really needs %cl for the shift count, and doesn't like
> %rcx. Silly assembler.
>

Yes, although since it's a fixed register you can also just write it as %%cl.

> Also, the asm would have been much simpler if I didn't care so much
> about the regular fast-path. I wanted the fast-path for the asm to be
> a single load, with no downside, and everything fixed up in the
> exception case.
>
> And it's close. It's a single load, and the only downside is that
> register '%rcx' is marked as used, because *if* the exception happens,
> we want to use %rcx to do the alignment fixup.
>
> Peter, in particular, can you double (and triple-) check my asm, to
> see if I missed anything? It does that "lea" of the address into %rcx
> twice, because that way we don't need any other register temporaries.

Just from a cleanliness point of view, I don't think you need the
__WORDSUFFIX for any of these instructions (it is only required if it
would be ambiguous, but the register names should deal with it.)

> - fast-path single-instruction unaligned load (with gcc free to pick
> registers and addressing modes):
>
> movl (%edi,%edx),%eax
>
> - with the exception fixup code becoming:
>
> leal (%edi,%edx),%ecx
> andl $-4,%ecx
> movl (%ecx),%eax
> leal (%edi,%edx),%ecx
> andl $3,%ecx
> shll $3,%ecx
> shll %cl,%eax
> shrl %cl,%eax
> jmp 2b

I think you want to drop the shl instruction. You're loading what
should end up at the LSB end of the register into the MSB end of the
register, so shr is all you should need.

Let's say %edi+%edx points to 0xcccccffd with the values 66 77 88 99
starting at 0xcccccffc. If the next page is present and zero, you'd end
up with %eax = 0x00998877, and so you would expect the same.

lea (%edi,%edx),%ecx -> %ecx = 0xcccccffd
and $-4,%ecx -> %ecx = 0xcccccffc
mov (%ecx),%eax -> %eax = 0x99887766
lea (%edi,%edx),%ecx -> %ecx = 0xcccccffd
and $3,%ecx -> %ecx = 1
shl $3,%ecx -> %ecx = 8
shr %cl,%eax -> %eax = 0x00998877
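
The trace above can be replayed in plain C. This is only a sketch of the arithmetic, not the kernel code: it assumes little-endian byte order, takes the aligned word as an argument instead of loading it, and the function names are made up for illustration. It also shows what the extra shl in the first patch would have produced for the same input.

```c
#include <stdint.h>

/*
 * Model of the 32-bit fixup path (hypothetical helper names).
 * 'aligned_word' stands for the value "mov (%ecx),%eax" loads from
 * (addr & ~3); 'addr' is the original unaligned address.
 */
uint32_t fixup_shr_only(uint32_t aligned_word, uint32_t addr)
{
	uint32_t shift = (addr & 3) << 3;	/* and $3 ; shl $3 */

	return aligned_word >> shift;		/* shr %cl,%eax */
}

/* First version of the patch: shl followed by shr on the loaded value. */
uint32_t fixup_shl_then_shr(uint32_t aligned_word, uint32_t addr)
{
	uint32_t shift = (addr & 3) << 3;

	/* The left shift discards the top byte(s) before shifting back. */
	return (aligned_word << shift) >> shift;
}
```

With the word 0x99887766 and the address 0xcccccffd, the shr-only version gives 0x00998877 as in the trace, while the shl-then-shr version gives 0x00887766, i.e. the 0x99 byte is lost.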

-hpa



--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-05-03 18:24:17

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 3, 2012 at 11:13 AM, H. Peter Anvin <[email protected]> wrote:
>
> Just from a cleanliness point of view, I don't think you need the
> __WORDSUFFIX for any of these instructions (it is only required if it
> would be ambiguous, but the register names should deal with it.)

I get nervous about those kinds of things, but you are probably right.

> I think you want to drop the shl instruction.  You're loading what
> should end up at the LSB end of the register into the MSB end of the
> register, so shr is all you should need.

Right you are.

Jana - never mind that patch. It will avoid the page fault, but try to
use the wrong (truncated) name due to the extraneous left-shift of the
loaded value.

So use the attached one instead. It just removes the extra shift that
Peter noticed, and also allows the use of the word-at-a-time code with
DEBUG_PAGEALLOC so that I can test it myself too.

I left the instruction suffixes in place, although Peter is probably
right that the assembler will do the right thing.

Linus


Attachments:
patch.diff (3.98 kB)

2012-05-03 18:27:22

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/03/2012 11:23 AM, Linus Torvalds wrote:
>
> I left the instruction suffixes in place, although Peter is probably
> right that the assembler will do the right thing.
>

Yes, and we have <asm/asm.h> for the case where it doesn't (where there
is only a memory operand, for example.) I tend to put the suffixes in
even if they are redundant, but we already rely on the unsuffixed
instructions working in many, many places.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-05-03 18:28:57

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

+ "lea" __WORDSUFFIX " %2,%1\n\t"
+ "and" __WORDSUFFIX " %4,%1\n\t"
+ "shl" __WORDSUFFIX " $3,%1\n\t"
+ "shr" __WORDSUFFIX " %b1,%0\n\t"

Also, for this sequence of instructions using %ecx unconditionally is
actually better (avoids REX prefixes on 64 bits.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-05-03 18:50:02

by David Miller

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

From: "H. Peter Anvin" <[email protected]>
Date: Thu, 03 May 2012 11:28:44 -0700

> + "lea" __WORDSUFFIX " %2,%1\n\t"
> + "and" __WORDSUFFIX " %4,%1\n\t"
> + "shl" __WORDSUFFIX " $3,%1\n\t"
> + "shr" __WORDSUFFIX " %b1,%0\n\t"
>
> Also, for this sequence of instructions using %ecx unconditionally is
> actually better (avoids REX prefixes on 64 bits.)

If it doesn't exist already, someone should really add bits to
binutils so you guys don't have to string paste like this.

For example, a set of mnemonics that the assembler internally changes
into the 64-bit or 32-bit variant based upon what bitness it is
targeting.

We have these on sparc, for example "ldn" is transformed into "lduw"
for 32-bit and "ldx" for 64-bit.

2012-05-03 19:07:19

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 3, 2012 at 11:28 AM, H. Peter Anvin <[email protected]> wrote:
> + "lea" __WORDSUFFIX " %2,%1\n\t"
> + "and" __WORDSUFFIX " %4,%1\n\t"
> + "shl" __WORDSUFFIX " $3,%1\n\t"
> + "shr" __WORDSUFFIX " %b1,%0\n\t"
>
> Also, for this sequence of instructions using %ecx unconditionally is
> actually better (avoids REX prefixes on 64 bits.)

Ok, agreed, except for the last shr, which does need to be done in the
full word width. But yes, we could do the calculations of "how many
bits do we need to shift" in just 32 bits.

And yes, it does look like we can just rely on the register names, so
we can avoid all the WORDSUFFIX games, and make it be


static inline unsigned long load_unaligned_zeropad(const void *addr)
{
	unsigned long ret, dummy;

	asm(
		"1:\tmov %2,%0\n"
		"2:\n"
		".section .fixup,\"ax\"\n"
		"3:\t"
		"lea %2,%1\n\t"
		"and %3,%1\n\t"
		"mov (%1),%0\n\t"
		"leal %2,%%ecx\n\t"
		"andl %4,%%ecx\n\t"
		"shll $3,%%ecx\n\t"
		"shr %%cl,%0\n\t"
		"jmp 2b\n"
		".previous\n"
		_ASM_EXTABLE(1b, 3b)
		:"=&r" (ret),"=&c" (dummy)
		:"m" (*(unsigned long *)addr),
		 "i" (-sizeof(unsigned long)),
		 "i" (sizeof(unsigned long)-1));
	return ret;
}

instead.

And I'm running that with DEBUG_PAGEALLOC, and haven't seen anything
odd yet. Not that it's easy to trigger the page crosser, but I've
tried to do things like look up 4095-byte pathnames with the final
components being just one character each, so I've *tried*.
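
For anyone wanting to try the same thing, one way to build such a path from userspace might look like the sketch below. It follows the (("./"x2047)."a") pattern mentioned earlier in the thread; the function name and the LONG_PATH_LEN macro are made up for illustration, and whether the lookup really ends up straddling a page depends on where the kernel places its copy of the name.

```c
#include <stddef.h>
#include <string.h>

#define LONG_PATH_LEN 4095	/* 2047 * strlen("./") + strlen("a") */

/*
 * Fill 'buf' (at least LONG_PATH_LEN + 1 bytes) with "./" repeated
 * 2047 times followed by "a", and return the resulting string length.
 */
size_t build_long_path(char *buf)
{
	size_t i;

	for (i = 0; i < 2047; i++)
		memcpy(buf + 2 * i, "./", 2);
	buf[LONG_PATH_LEN - 1] = 'a';	/* final one-character component */
	buf[LONG_PATH_LEN] = '\0';
	return strlen(buf);
}
```

Passing the resulting buffer to stat(2) or access(2) in a directory that contains a file named "a" would then drive the path walk over a maximum-length name ending in a short component.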

Linus

2012-05-03 20:30:38

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hello,

> Jana - never mind that patch. It will avoid the page fault, but try to
> use the wrong (truncated) name due to the extraneous left-shift of the
> loaded value.
>
> So use the attached one instead. It just removes the extra shift that
> Peter noticed, and also allows the use of the word-at-a-time code with
> DEBUG_PAGEALLOC so that I can test it myself too.

Ok, I have been running with that patch for slightly over an hour and
didn't get that oops anymore.

Instead, now I got an oops on __d_lookup (fs/dcache.c line 155) (kernel
log excerpt below):

> #ifdef CONFIG_DCACHE_WORD_ACCESS
> unsigned long a,b,mask;
>
> if (unlikely(scount != tcount))
> return 1;
>
> for (;;) {
> a = *(unsigned long *)cs;
> -> b = *(unsigned long *)ct;
> if (tcount < sizeof(unsigned long))
> break;

Similar issue, ct is pointing at 0xffff880115a59ffb
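
The arithmetic behind that can be checked with a few lines of C. This is only an illustration: it assumes 4 KiB pages, and the macro and function names are made up.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ULL	/* assumption: 4 KiB pages, as on x86 */

/* Address of the first byte past the page containing 'addr'. */
uint64_t next_page_start(uint64_t addr)
{
	return (addr & ~(PAGE_SIZE_ASSUMED - 1)) + PAGE_SIZE_ASSUMED;
}

/* How many bytes of a 'width'-byte load at 'addr' fall into the next page. */
uint64_t bytes_past_page_end(uint64_t addr, uint64_t width)
{
	uint64_t end = addr + width;
	uint64_t limit = next_page_start(addr);

	return end > limit ? end - limit : 0;
}
```

For ct = 0xffff880115a59ffb, an 8-byte load reaches 3 bytes into the page starting at 0xffff880115a5a000, which is the CR2 value in the oops below.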

Jana


BUG: unable to handle kernel paging request at ffff880115a5a000
IP: [<ffffffff81146850>] __d_lookup+0xe0/0x170
PGD 180c067 PUD bf3f4067 PMD bf4a2067 PTE 0
Oops: 0000 [#1] PREEMPT SMP
CPU 0
Modules linked in: nfs lockd auth_rpcgss nfs_acl sunrpc autofs4 ocfs2 jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs

Pid: 10310, comm: apache2 Not tainted 3.4.0-js1-test #1
RIP: e030:[<ffffffff81146850>] [<ffffffff81146850>] __d_lookup+0xe0/0x170
RSP: e02b:ffff880095cc9bf8 EFLAGS: 00010293
RAX: 0000000000006461 RBX: ffff88009d3ef248 RCX: 0000000000000002
RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88009d3ef29c
RBP: 0000000000007374 R08: 0000000000000002 R09: ffff88009d3ef278
R10: ffff88009d3ef240 R11: 0000000000000000 R12: ffff88009d3ef29c
R13: ffff880065148e40 R14: ffff880115a59ffb R15: ffff880095cc9e78
FS: 00007fcb26ffe6d0(0000) GS:ffff8800bff0c000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880115a5a000 CR3: 0000000095cc2000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process apache2 (pid: 10310, threadinfo ffff880095cc8000, task ffff8800b3de1800)
Stack:
ffffffff814268da 000000000000e030 0000000000010257 0000000000000002
000000000000e02b ffff880095cc9d10 ffff880095cc9d40 ffff880095cc9e68
0000000000000001 0000000000000000 ffff8800b3de1800 ffffffff8113c135
Call Trace:
[<ffffffff814268da>] ? bad_gs+0xa34/0x197a
[<ffffffff8113c135>] ? do_lookup+0x165/0x340
[<ffffffff8113c7ba>] ? link_path_walk+0x4aa/0x890
[<ffffffff81036c52>] ? pvclock_clocksource_read+0x52/0xf0
[<ffffffff8113cd02>] ? path_lookupat+0x52/0x740
[<ffffffff8104bad1>] ? sys_gettimeofday+0x31/0x80
[<ffffffff8101a2eb>] ? emulate_vsyscall+0x31b/0x350
[<ffffffff8113d41c>] ? do_path_lookup+0x2c/0xc0
[<ffffffff8113aa2d>] ? getname_flags+0xed/0x260
[<ffffffff8113ebde>] ? user_path_at_empty+0x5e/0xb0
[<ffffffff8104c912>] ? local_bh_enable_ip+0x22/0xb0
[<ffffffff81325eeb>] ? sock_setsockopt+0x8b/0x870
[<ffffffff8112e981>] ? sys_faccessat+0xa1/0x1c0
[<ffffffff814241f9>] ? system_call_fastpath+0x16/0x1b
Code: 8d 53 f8 44 8b 43 1c 4c 8b 4b 20 75 4e 48 8b 4c 24 18 4d 63 c0 4c 39 c1 75 b0 31 f6 0f 1f 80 00 00 00 00 48 83 f9 07 49 8b 04 31 <49> 8b 14 36 76 5a 48 39 d0 75 94 48 83 c6 08 48 83 e9 08 75 e3
RIP [<ffffffff81146850>] __d_lookup+0xe0/0x170
RSP <ffff880095cc9bf8>
CR2: ffff880115a5a000
---[ end trace b53471b6e52caa9d ]---

(followed by note: apache2[10310] exited with preempt_count 1
BUG: scheduling while atomic: apache2/10310/0x10000002)

2012-05-03 21:02:21

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 3, 2012 at 1:30 PM, Jana Saout <[email protected]> wrote:
>
> Instead, now I got an oops on __d_lookup (fs/dcache.c line 155) (kernel
> log excerpt below):

Heh, forgot about that one.

Trivial enough to fix. And I'll just make both of them use the
unaligned load helper function, even if technically I think only the
'ct' access needs it.

This is just the incremental diff. The "real meat" of the change is
just making it use the helper function instead of the "direct
dereference through an unsigned long pointer cast", but the patch is
bigger than that because I decided to split the whole function up so
that we could do the nicer #include setup. And because I shouldn't
have done it with an #ifdef inside a function to begin with.

Anyway, the fact that you can trigger these things so quickly
certainly is a good sign that your setup is good at finding it. It
could have been worse. Hopefully I now actually caught all users.
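
As a rough userspace illustration of the idea (not the attached diff), a word-at-a-time name compare built on a zero-padding load could be sketched as below. Both function names are hypothetical. The stand-in load bounds its copy by the remaining length, so unlike the kernel helper it needs no exception fixup, and the tail needs no mask because the unread bytes are already zero.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for a zero-padded word load (hypothetical helper):
 * copies at most 'avail' bytes so it never reads past the buffer and
 * leaves the remaining bytes of the word zero.
 */
static unsigned long load_zeropad_model(const unsigned char *p, size_t avail)
{
	unsigned long w = 0;
	size_t n = avail < sizeof(w) ? avail : sizeof(w);

	memcpy(&w, p, n);
	return w;
}

/* Returns 0 if the two names are equal, 1 otherwise. */
int name_cmp_model(const unsigned char *cs, size_t scount,
		   const unsigned char *ct, size_t tcount)
{
	unsigned long a, b;

	if (scount != tcount)
		return 1;

	while (tcount) {
		/* Both sides go through the same helper. */
		a = load_zeropad_model(cs, tcount);
		b = load_zeropad_model(ct, tcount);
		if (a != b)
			return 1;
		if (tcount <= sizeof(unsigned long))
			break;	/* that was the last (possibly partial) word */
		cs += sizeof(unsigned long);
		ct += sizeof(unsigned long);
		tcount -= sizeof(unsigned long);
	}
	return 0;
}
```

The point of routing both loads through one helper is that the compare loop itself no longer needs to know which of the two buffers might sit at the end of a page.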

Linus


Attachments:
dentry_cmp.diff (1.86 kB)

2012-05-03 21:03:43

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hi again,

sorry for the addendum: This was actually the first patch - my mistake -
copied the new kernel into the wrong place... don't know if this makes a
difference. Running the latest patch now, just to make sure.

Jana


> > Jana - never mind that patch. It will avoid the page fault, but try to
> > use the wrong (truncated) name due to the extraneous left-shift of the
> > loaded value.
> >
> > So use the attached one instead. It just removes the extra shift that
> > Peter noticed, and also allows the use of the word-at-a-time code with
> > DEBUG_PAGEALLOC so that I can test it myself too.
>
> Ok, I have been running with that patch for slightly over an hour and
> didn't get that oops anymore.
>
> Instead, now I got an oops on __d_lookup (fs/dcache.c line 155) (kernel
> log excerpt below):
>
> > #ifdef CONFIG_DCACHE_WORD_ACCESS
> > unsigned long a,b,mask;
> >
> > if (unlikely(scount != tcount))
> > return 1;
> >
> > for (;;) {
> > a = *(unsigned long *)cs;
> > -> b = *(unsigned long *)ct;
> > if (tcount < sizeof(unsigned long))
> > break;
>
> Similar issue, ct is pointing at 0xffff880115a59ffb
>
> Jana
>
>
> BUG: unable to handle kernel paging request at ffff880115a5a000
> IP: [<ffffffff81146850>] __d_lookup+0xe0/0x170
> PGD 180c067 PUD bf3f4067 PMD bf4a2067 PTE 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU 0
> Modules linked in: nfs lockd auth_rpcgss nfs_acl sunrpc autofs4 ocfs2
> jbd2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> ocfs2_stackglue configfs
>
> Pid: 10310, comm: apache2 Not tainted 3.4.0-js1-test #1
> RIP: e030:[<ffffffff81146850>] [<ffffffff81146850>] __d_lookup
> +0xe0/0x170
> RSP: e02b:ffff880095cc9bf8 EFLAGS: 00010293
> RAX: 0000000000006461 RBX: ffff88009d3ef248 RCX: 0000000000000002
> RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88009d3ef29c
> RBP: 0000000000007374 R08: 0000000000000002 R09: ffff88009d3ef278
> R10: ffff88009d3ef240 R11: 0000000000000000 R12: ffff88009d3ef29c
> R13: ffff880065148e40 R14: ffff880115a59ffb R15: ffff880095cc9e78
> FS: 00007fcb26ffe6d0(0000) GS:ffff8800bff0c000(0000)
> knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880115a5a000 CR3: 0000000095cc2000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process apache2 (pid: 10310, threadinfo ffff880095cc8000, task
> ffff8800b3de1800)
> Stack:
> ffffffff814268da 000000000000e030 0000000000010257 0000000000000002
> 000000000000e02b ffff880095cc9d10 ffff880095cc9d40 ffff880095cc9e68
> 0000000000000001 0000000000000000 ffff8800b3de1800 ffffffff8113c135
> Call Trace:
> [<ffffffff814268da>] ? bad_gs+0xa34/0x197a
> [<ffffffff8113c135>] ? do_lookup+0x165/0x340
> [<ffffffff8113c7ba>] ? link_path_walk+0x4aa/0x890
> [<ffffffff81036c52>] ? pvclock_clocksource_read+0x52/0xf0
> [<ffffffff8113cd02>] ? path_lookupat+0x52/0x740
> [<ffffffff8104bad1>] ? sys_gettimeofday+0x31/0x80
> [<ffffffff8101a2eb>] ? emulate_vsyscall+0x31b/0x350
> [<ffffffff8113d41c>] ? do_path_lookup+0x2c/0xc0
> [<ffffffff8113aa2d>] ? getname_flags+0xed/0x260
> [<ffffffff8113ebde>] ? user_path_at_empty+0x5e/0xb0
> [<ffffffff8104c912>] ? local_bh_enable_ip+0x22/0xb0
> [<ffffffff81325eeb>] ? sock_setsockopt+0x8b/0x870
> [<ffffffff8112e981>] ? sys_faccessat+0xa1/0x1c0
> [<ffffffff814241f9>] ? system_call_fastpath+0x16/0x1b
> Code: 8d 53 f8 44 8b 43 1c 4c 8b 4b 20 75 4e 48 8b 4c 24 18 4d 63 c0 4c
> 39 c1 75 b0 31 f6 0f 1f 80 00 00 00 00 48 83 f9 07 49 8b 04 31 <49> 8b
> 14 36 76 5a 48 39 d0 75 94 48 83 c6 08 48 83 e9 08 75 e3
> RIP [<ffffffff81146850>] __d_lookup+0xe0/0x170
> RSP <ffff880095cc9bf8>
> CR2: ffff880115a5a000
> ---[ end trace b53471b6e52caa9d ]---
>
> (followed by note: apache2[10310] exited with preempt_count 1
> BUG: scheduling while atomic: apache2/10310/0x10000002)
>

2012-05-03 21:20:48

by Linus Torvalds

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On Thu, May 3, 2012 at 2:03 PM, Jana Saout <[email protected]> wrote:
> Hi again,
>
> sorry for the addendum: This was actually the first patch - my mistake -
> copied the new kernel into the wrong place... don't know if this makes a
> difference.  Running the latest patch now, just to make sure.

Both of the early patches had the same issue, and didn't cover
__d_lookup(). I had simply forgotten about it, partly because
__d_lookup() didn't use the other helper functions from
asm/word-at-a-time.h.

So run the later patch, together with the incremental change on top of
it for just the __d_lookup() case.

Or take the patch from this email, which has all the changes,
including the asm simplification that we talked about with Peter. It
also has an optimistic "tested-by" line from you already ;)

Linus


Attachments:
patch.diff (5.35 kB)

2012-05-03 21:24:16

by H. Peter Anvin

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

On 05/03/2012 11:48 AM, David Miller wrote:
> From: "H. Peter Anvin" <[email protected]>
> Date: Thu, 03 May 2012 11:28:44 -0700
>
>> + "lea" __WORDSUFFIX " %2,%1\n\t"
>> + "and" __WORDSUFFIX " %4,%1\n\t"
>> + "shl" __WORDSUFFIX " $3,%1\n\t"
>> + "shr" __WORDSUFFIX " %b1,%0\n\t"
>>
>> Also, for this sequence of instructions using %ecx unconditionally is
>> actually better (avoids REX prefixes on 64 bits.)
>
> If it doesn't exist already, someone should really add bits to
> binutils so you guys don't have to string paste like this.
>
> For example, a set of mnemonics that the assembler internally changes
> into the 64-bit or 32-bit variant based upon what bitness it is
> targeting.
>
> We have these on sparc, for example "ldn" is transformed into "lduw"
> for 32-bit and "ldx" for 64-bit.

I don't think we really need it. For the vast majority of all
instructions the size is given by the operands, and for the balance we
can generally use %z and/or <asm/asm.h>. %z would be more useful if
it weren't for the fact that some now quite old versions of gcc
incorrectly produce "ll" instead of "q".

-hpa

2012-05-03 21:47:07

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hi Linus,

> Both of the early patches had the same issue, and didn't cover
> __d_lookup(). I had simply forgotten about it, partly because
> __d_lookup() didn't use the other helper functions from
> asm/word-at-a-time.h.
>
> So run the later patch, together with the incremental change on top of
> it for just the __d_lookup() case.
>
> Or take the patch from this email, which has all the changes,
> including the asm simplification that we talked about with Peter. It
> also has an optimistic "tested-by" line from you already ;)

Ok, I'll leave the system running with that patch over night and report
back tomorrow.

Thanks,
Jana

2012-05-04 11:21:35

by Jana Saout

Subject: Re: Oops with DCACHE_WORD_ACCESS and ocfs2, autofs4

Hello,

> Both of the early patches had the same issue, and didn't cover
> __d_lookup(). I had simply forgotten about it, partly because
> __d_lookup() didn't use the other helper functions from
> asm/word-at-a-time.h.
>
> So run the later patch, together with the incremental change on top of
> it for just the __d_lookup() case.
>
> Or take the patch from this email, which has all the changes,
> including the asm simplification that we talked about with Peter. It
> also has an optimistic "tested-by" line from you already ;)

... which seems justified. No oopses or other weird behavior -
everything looks fine now. :)

Thanks,
Jana