2005-05-13 20:57:35

by Gregory Brauer

Subject: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)



I have seen some references to a similar bug in other kernel list
posts from October 2004 and am trying to figure out if this is the
same problem, or something new related to the xfs_iget_core patch
in 2.6.11. This seems to be a very hard to reproduce bug, but
we've seen this problem twice in a week of testing under Fedora
Core 3 on the following kernel:

Linux violet 2.6.10-1.741_FC3smp #1 SMP Thu Jan 13 16:53:16 EST 2005
i686 i686 i386 GNU/Linux

This is on a system with 4 3ware 8506-4 cards in a software-RAID-0
configuration to make one large volume. The first traceback was
triggered after 1 week of a production NFS load and the second
traceback was triggered after two hours of a load generated by
five NFS clients running nfs_fsstress.sh.
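
(For illustration, a software RAID-0 of four such units would be put
together roughly like this; device names and mount point below are
placeholders, not the exact commands used on this machine:)

    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd   # one unit per 3ware card
    mkfs.xfs /dev/md0                          # XFS on top of the md device
    mount /dev/md0 /export                     # path exported via NFS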

The two tracebacks follow:


May 10 15:49:56 violet kernel: xfs_iget_core: ambiguous vns:
vp/0xee02202c, invp/0xe13dddd0
May 10 15:49:56 violet kernel: ------------[ cut here ]------------
May 10 15:49:56 violet kernel: kernel BUG at fs/xfs/support/debug.c:106!
May 10 15:49:56 violet kernel: invalid operand: 0000 [#1]
May 10 15:49:56 violet kernel: SMP
May 10 15:49:56 violet kernel: Modules linked in: nfs vfat fat nfsd lockd md5 ipv6 parport_pc lp parport sunrpc xfs exportfs dm_mod video button battery ac uhci_hcd hw_random i2c_i801 i2c_core e1000 bonding floppy ext3 jbd raid0 3w_xxxx sd_mod scsi_mod
May 10 15:49:56 violet kernel: CPU: 1
May 10 15:49:56 violet kernel: EIP: 0060:[<f8aa660f>] Not tainted VLI
May 10 15:49:56 violet kernel: EFLAGS: 00010246 (2.6.11-1.14_FC3smp)
May 10 15:49:56 violet kernel: EIP is at cmn_err+0x98/0xa8 [xfs]
May 10 15:49:56 violet kernel: eax: 00000000 ebx: f8aa9a84 ecx:
f8abd364 edx: 00000000
May 10 15:49:56 violet kernel: esi: f8aa9304 edi: f8abe91e ebp:
00000000 esp: d77f4af0
May 10 15:49:56 violet kernel: ds: 007b es: 007b ss: 0068
May 10 15:49:56 violet kernel: Process nfsd (pid: 6303,
threadinfo=d77f4000 task=e7412aa0)
May 10 15:49:56 violet kernel: Stack: f8aa9304 f8aa92cb f8abe8e0
00000292 d77f4000 a1e0b035 00000000 f53891ac
May 10 15:49:56 violet kernel: f8a7932f 00000000 f8aa9a84
ee02202c e13dddd0 f7005200 f8aa4fe1 d82d0354
May 10 15:49:56 violet kernel: 000001e0 e13dddf4 d82d0350
00000000 cabd6400 e13dddd0 f53891ac e13dddf4
May 10 15:49:56 violet kernel: Call Trace:
May 10 15:49:56 violet kernel: [<f8a7932f>] xfs_iget_core+0x4f9/0x5e2 [xfs]
May 10 15:49:56 violet kernel: [<f8aa4fe1>]
linvfs_alloc_inode+0x20/0x2d [xfs]
May 10 15:49:56 violet kernel: [<f8a794ec>] xfs_iget+0xd4/0x16a [xfs]
May 10 15:49:56 violet kernel: [<f8a95193>] xfs_vget+0x61/0xce [xfs]
May 10 15:49:56 violet kernel: [<f8aa5b93>] vfs_vget+0x2d/0x30 [xfs]
May 10 15:49:56 violet kernel: [<f8a42341>] linvfs_get_dentry+0x3b/0x6f
[xfs]
May 10 15:49:56 violet kernel: [<f890302d>]
find_exported_dentry+0x2d/0x601 [exportfs]
May 10 15:49:56 violet kernel: [<f88ee45c>]
e1000_xmit_frame+0x554/0x8b3 [e1000]
May 10 15:49:56 violet kernel: [<c02a0fba>] __kfree_skb+0xab/0x155
May 10 15:49:56 violet kernel: [<f8a83425>]
xfs_log_release_iclog+0x11/0x3e [xfs]
May 10 15:49:56 violet kernel: [<c01cdd4e>] copy_to_user+0x4d/0x5f
May 10 15:49:56 violet kernel: [<c02a2b8a>] memcpy_toiovec+0x27/0x4c
May 10 15:49:56 violet kernel: [<c02a30f2>]
skb_copy_datagram_iovec+0x4d/0x1f7
May 10 15:49:56 violet kernel: [<c02a09a4>] release_sock+0x12/0x79
May 10 15:49:56 violet kernel: [<c02c8793>] tcp_recvmsg+0x2d5/0x738
May 10 15:49:56 violet kernel: [<c02a0ae5>] sock_common_recvmsg+0x41/0x57
May 10 15:49:56 violet kernel: [<c029d537>] sock_recvmsg+0x10c/0x127
May 10 15:49:56 violet kernel: [<c01727da>] d_instantiate+0x54/0x58
May 10 15:49:56 violet kernel: [<c011b6c4>] recalc_task_prio+0x8a/0x150
May 10 15:49:56 violet kernel: [<c011b814>] activate_task+0x8a/0x99
May 10 15:49:56 violet kernel: [<c011bce3>] try_to_wake_up+0x238/0x270
May 10 15:49:56 violet kernel: [<f8c08c57>]
svc_expkey_lookup+0x404/0x425 [nfsd]
May 10 15:49:56 violet kernel: [<f8c094cc>] exp_find_key+0x69/0x7e [nfsd]
May 10 15:49:56 violet kernel: [<f8a422ae>] linvfs_decode_fh+0x5e/0xb6
[xfs]
May 10 15:49:56 violet kernel: [<f8c03c10>] nfsd_acceptable+0x0/0xdd [nfsd]
May 10 15:49:56 violet kernel: [<f8a42250>] linvfs_decode_fh+0x0/0xb6 [xfs]
May 10 15:49:56 violet kernel: [<f8c03eb8>] fh_verify+0x1cb/0x578 [nfsd]
May 10 15:49:56 violet kernel: [<f8c03c10>] nfsd_acceptable+0x0/0xdd [nfsd]
May 10 15:49:56 violet kernel: [<c02ff57d>] schedule+0x33d/0xb11
May 10 15:49:56 violet kernel: [<f8c05b54>] nfsd_open+0x22/0x1bd [nfsd]
May 10 15:49:56 violet kernel: [<f8c0624a>] nfsd_write+0x31/0x316 [nfsd]
May 10 15:49:56 violet kernel: [<f89aa8b7>] ip_map_lookup+0x34c/0x397
[sunrpc]
May 10 15:49:57 violet kernel: [<f89aaad1>]
svcauth_unix_set_client+0xab/0xb2 [sunrpc]
May 10 15:49:57 violet kernel: [<c012997d>]
del_singleshot_timer_sync+0x8/0x27
May 10 15:49:57 violet kernel: [<c03005e1>] schedule_timeout+0x91/0xfd
May 10 15:49:57 violet kernel: [<f8c0caac>] nfsd3_proc_write+0xab/0x10a
[nfsd]
May 10 15:49:57 violet kernel: [<f8c0e4de>]
nfs3svc_decode_writeargs+0x0/0x175 [nfsd]
May 10 15:49:57 violet kernel: [<f8c02595>] nfsd_dispatch+0x80/0x1db [nfsd]
May 10 15:49:57 violet kernel: [<f89a9ef9>] svc_authenticate+0x79/0xb1
[sunrpc]
May 10 15:49:57 violet kernel: [<f89a7920>] svc_process+0x52e/0x60e
[sunrpc]
May 10 15:49:57 violet kernel: [<f8c0239a>] nfsd+0x185/0x300 [nfsd]
May 10 15:49:57 violet kernel: [<f8c02215>] nfsd+0x0/0x300 [nfsd]
May 10 15:49:57 violet kernel: [<c01022f9>] kernel_thread_helper+0x5/0xb
May 10 15:49:57 violet kernel: Code: f8 c7 44 24 08 e0 e8 ab f8 c7 04 24
04 93 aa f8 89 44 24 04 e8 a5 b2 67 c7 b8 64 d3 ab f8 8b 54 24 0c e8 9f
a8 85
c7 85 ed 75 08 <0f> 0b 6a 00 eb 92 aa f8 83 c4 10 5b 5e 5f 5d c3 55 89
c5 b8 07



May 12 19:42:01 violet kernel: xfs_iget_core: ambiguous vns:
vp/0xd710bc4c, invp/0xc400a7c0
May 12 19:42:01 violet kernel: ------------[ cut here ]------------
May 12 19:42:01 violet kernel: kernel BUG at fs/xfs/support/debug.c:106!
May 12 19:42:01 violet kernel: invalid operand: 0000 [#1]
May 12 19:42:01 violet kernel: SMP
May 12 19:42:01 violet kernel: Modules linked in: loop nfs nfsd lockd
md5 ipv6 parport_pc lp parport sunrpc xfs exportfs dm_mod video button
battery ac
uhci_hcd hw_random i2c_i801 i2c_core e1000 bonding floppy ext3 jbd
raid0 3w_xxxx sd_mod scsi_mod
May 12 19:42:01 violet kernel: CPU: 1
May 12 19:42:01 violet kernel: EIP: 0060:[<f8aa660f>] Not tainted VLI
May 12 19:42:01 violet kernel: EFLAGS: 00010246 (2.6.11-1.14_FC3smp)
May 12 19:42:01 violet kernel: EIP is at cmn_err+0x98/0xa8 [xfs]
May 12 19:42:01 violet kernel: eax: 00000000 ebx: f8aa9a84 ecx:
f8abd364 edx: 00000000
May 12 19:42:01 violet kernel: esi: f8aa9304 edi: f8abe91e ebp:
00000000 esp: f77adb5c
May 12 19:42:01 violet kernel: ds: 007b es: 007b ss: 0068
May 12 19:42:01 violet kernel: Process nfsd (pid: 4186,
threadinfo=f77ad000 task=f779f560)
May 12 19:42:01 violet kernel: Stack: f8aa9304 f8aa92cb f8abe8e0
00000296 f77ad000 20819027 00000000 e29c5aac
May 12 19:42:01 violet kernel: f8a7932f 00000000 f8aa9a84
d710bc4c c400a7c0 f67f7400 f8aa4fe1 f76d0274
May 12 19:42:01 violet kernel: 00000000 c400a7e4 f76d0270
00000000 f67ef000 c400a7c0 e29c5aac c400a7e4
May 12 19:42:01 violet kernel: Call Trace:
May 12 19:42:01 violet kernel: [<f8a7932f>] xfs_iget_core+0x4f9/0x5e2 [xfs]
May 12 19:42:01 violet kernel: [<f8aa4fe1>]
linvfs_alloc_inode+0x20/0x2d [xfs]
May 12 19:42:01 violet kernel: [<f8a794ec>] xfs_iget+0xd4/0x16a [xfs]
May 12 19:42:01 violet kernel: [<f8a95193>] xfs_vget+0x61/0xce [xfs]
May 12 19:42:01 violet kernel: [<c02a0fba>] __kfree_skb+0xab/0x155
May 12 19:42:01 violet kernel: [<f8aa5b93>] vfs_vget+0x2d/0x30 [xfs]
May 12 19:42:01 violet kernel: [<f8a42341>] linvfs_get_dentry+0x3b/0x6f
[xfs]
May 12 19:42:01 violet kernel: [<f88d902d>]
find_exported_dentry+0x2d/0x601 [exportfs]
May 12 19:42:01 violet kernel: [<c02a0fba>] __kfree_skb+0xab/0x155
May 12 19:42:01 violet kernel: [<c02a09a4>] release_sock+0x12/0x79
May 12 19:42:01 violet kernel: [<c02c8793>] tcp_recvmsg+0x2d5/0x738
May 12 19:42:01 violet kernel: [<c02a0ae5>] sock_common_recvmsg+0x41/0x57




processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.116
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 5275.64

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.66GHz
stepping : 7
cpu MHz : 2666.116
cache size : 512 KB
physical id : 3
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 5324.80



PCI devices found:
Bus 0, device 0, function 0:
Class 0600: PCI device 8086:254c (rev 1).
Bus 0, device 0, function 1:
Class ff00: PCI device 8086:2541 (rev 1).
Bus 0, device 2, function 0:
Class 0604: PCI device 8086:2543 (rev 1).
Master Capable. Latency=64. Min Gnt=4.
Bus 0, device 3, function 0:
Class 0604: PCI device 8086:2545 (rev 1).
Master Capable. Latency=64. Min Gnt=4.
Bus 0, device 29, function 0:
Class 0c03: PCI device 8086:2482 (rev 2).
IRQ 225.
I/O at 0x2000 [0x201f].
Bus 0, device 29, function 1:
Class 0c03: PCI device 8086:2484 (rev 2).
IRQ 233.
I/O at 0x2020 [0x203f].
Bus 0, device 29, function 2:
Class 0c03: PCI device 8086:2487 (rev 2).
IRQ 169.
I/O at 0x2040 [0x205f].
Bus 0, device 30, function 0:
Class 0604: PCI device 8086:244e (rev 66).
Master Capable. No bursts. Min Gnt=12.
Bus 0, device 31, function 0:
Class 0601: PCI device 8086:2480 (rev 2).
Bus 0, device 31, function 1:
Class 0101: PCI device 8086:248b (rev 2).
IRQ 169.
I/O at 0x2060 [0x206f].
Non-prefetchable 32 bit memory at 0xf8000000 [0xf80003ff].
Bus 0, device 31, function 3:
Class 0c05: PCI device 8086:2483 (rev 2).
I/O at 0x1100 [0x111f].
Bus 1, device 28, function 0:
Class 0800: PCI device 8086:1461 (rev 4).
Non-prefetchable 32 bit memory at 0xf9100000 [0xf9100fff].
Bus 1, device 30, function 0:
Class 0800: PCI device 8086:1461 (rev 4).
Non-prefetchable 32 bit memory at 0xf9101000 [0xf9101fff].
Bus 4, device 28, function 0:
Class 0800: PCI device 8086:1461 (rev 4).
Non-prefetchable 32 bit memory at 0xfa000000 [0xfa000fff].
Bus 4, device 30, function 0:
Class 0800: PCI device 8086:1461 (rev 4).
Non-prefetchable 32 bit memory at 0xfa001000 [0xfa001fff].
Bus 1, device 29, function 0:
Class 0604: PCI device 8086:1460 (rev 4).
Master Capable. Latency=64. Min Gnt=4.
Bus 1, device 31, function 0:
Class 0604: PCI device 8086:1460 (rev 4).
Master Capable. Latency=64. Min Gnt=4.
Bus 4, device 29, function 0:
Class 0604: PCI device 8086:1460 (rev 4).
Master Capable. Latency=64. Min Gnt=4.
Bus 4, device 31, function 0:
Class 0604: PCI device 8086:1460 (rev 4).
Master Capable. Latency=64. Min Gnt=4.
Bus 3, device 1, function 0:
Class 0104: PCI device 13c1:1001 (rev 1).
IRQ 177.
Master Capable. Latency=72. Min Gnt=9.
I/O at 0x3080 [0x308f].
Non-prefetchable 32 bit memory at 0xf9240000 [0xf924000f].
Non-prefetchable 32 bit memory at 0xf9800000 [0xf9ffffff].
Bus 5, device 1, function 0:
Class 0104: PCI device 13c1:1001 (rev 1).
IRQ 185.
Master Capable. Latency=72. Min Gnt=9.
I/O at 0x4000 [0x400f].
Non-prefetchable 32 bit memory at 0xfa100000 [0xfa10000f].
Non-prefetchable 32 bit memory at 0xfa800000 [0xfaffffff].
Bus 5, device 3, function 0:
Class 0104: PCI device 13c1:1001 (rev 1).
IRQ 193.
Master Capable. Latency=72. Min Gnt=9.
I/O at 0x4010 [0x401f].
Non-prefetchable 32 bit memory at 0xfa100400 [0xfa10040f].
Non-prefetchable 32 bit memory at 0xfb000000 [0xfb7fffff].
Bus 6, device 1, function 0:
Class 0104: PCI device 13c1:1001 (rev 1).
IRQ 201.
Master Capable. Latency=72. Min Gnt=9.
I/O at 0x5000 [0x500f].
Non-prefetchable 32 bit memory at 0xfc000000 [0xfc00000f].
Non-prefetchable 32 bit memory at 0xfb800000 [0xfbffffff].
Bus 3, device 2, function 0:
Class 0200: PCI device 8086:1010 (rev 1).
IRQ 209.
Master Capable. Latency=64. Min Gnt=255.
Non-prefetchable 64 bit memory at 0xf9200000 [0xf921ffff].
I/O at 0x3000 [0x303f].
Bus 3, device 2, function 1:
Class 0200: PCI device 8086:1010 (rev 1).
IRQ 217.
Master Capable. Latency=64. Min Gnt=255.
Non-prefetchable 64 bit memory at 0xf9220000 [0xf923ffff].
I/O at 0x3040 [0x307f].
Bus 7, device 1, function 0:
Class 0300: PCI device 1002:4752 (rev 39).
IRQ 11.
Master Capable. Latency=64. Min Gnt=8.
Non-prefetchable 32 bit memory at 0xfd000000 [0xfdffffff].
I/O at 0x6000 [0x60ff].
Non-prefetchable 32 bit memory at 0xfc100000 [0xfc100fff].


I would appreciate any information you can offer.

Greg Brauer


2005-05-14 18:47:25

by Chris Wedgwood

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Fri, May 13, 2005 at 01:45:44PM -0700, Gregory Brauer wrote:

> I have seen some references to a similar bug in other kernel list
> posts from October 2004 and am trying to figure out if this is the
> same problem, or something new related to the xfs_iget_core patch in
> 2.6.11. This seems to be a very hard to reproduce bug, but we've
> seen this problem twice in a week of testing under Fedora
> Core 3 on the following kernel:

does disabling CONFIG_4K_STACKS help?

2005-05-18 17:38:53

by Gregory Brauer

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Chris Wedgwood wrote:
> On Fri, May 13, 2005 at 01:45:44PM -0700, Gregory Brauer wrote:
>
>
>>I have seen some references to a similar bug in other kernel list
>>posts from October 2004 and am trying to figure out if this is the
>>same problem, or something new related to the xfs_iget_core patch in
>>2.6.11. This seems to be a very hard to reproduce bug, but we've
>>seen this problem twice in a week of testing under Fedora
>>Core 3 on the following kernel:
>
>
> does disabling CONFIG_4K_STACKS help?

No. Here is the oops for a kernel.org kernel 2.6.11.10 compiled
with the same config file that FC3 uses for their kernel
(2.6.11-1.14_FC3) except with the CONFIG_4KSTACKS=y line
commented out. This one died after about 8 hours of testing.
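
(For reference, the only intended change from the stock FC3 config was
the stack-size option; assuming the Fedora-style /boot/config-* files
are installed, the two kernels can be told apart like this:)

    grep CONFIG_4KSTACKS /boot/config-$(uname -r)
    # stock FC3 kernel:    CONFIG_4KSTACKS=y
    # rebuilt test kernel: # CONFIG_4KSTACKS is not set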



May 18 02:59:47 violet kernel: xfs_iget_core: ambiguous vns:
vp/0xf53f8ac8, invp/0xe49ccc4c
May 18 02:59:47 violet kernel: ------------[ cut here ]------------
May 18 02:59:47 violet kernel: kernel BUG at fs/xfs/support/debug.c:106!
May 18 02:59:47 violet kernel: invalid operand: 0000 [#1]
May 18 02:59:47 violet kernel: SMP
May 18 02:59:47 violet kernel: Modules linked in: nfsd lockd md5 ipv6 parport_pc lp parport sunrpc xfs exportfs dm_mod video button battery ac uhci_hcd hw_random i2c_i801 i2c_core e1000 bonding floppy ext3 jbd raid0 3w_xxxx sd_mod scsi_mod
May 18 02:59:47 violet kernel: CPU: 0
May 18 02:59:47 violet kernel: EIP: 0060:[<f9206de5>] Not tainted VLI
May 18 02:59:47 violet kernel: EFLAGS: 00010246 (2.6.11.10)
May 18 02:59:47 violet kernel: EIP is at cmn_err+0xa5/0xd0 [xfs]
May 18 02:59:47 violet kernel: eax: 00000000 ebx: f920a178 ecx:
f921e484 edx: 00000200
May 18 02:59:47 violet kernel: esi: f92099f8 edi: f921fa1e ebp:
00000000 esp: f643db48
May 18 02:59:47 violet kernel: ds: 007b es: 007b ss: 0068
May 18 02:59:47 violet kernel: Process nfsd (pid: 6177,
threadinfo=f643c000 task=f6ab4560)
May 18 02:59:47 violet kernel: Stack: f92099f8 f92099bf f921f9e0
00000282 f643c000 f920a178 00000000 dc8444ac
May 18 02:59:47 violet kernel: f91d5fe4 00000000 f920a178
f53f8ac8 e49ccc4c f71d7c00 f92053b3 ed7c01f4
May 18 02:59:47 violet kernel: 00000000 e49ccc70 ed7c01f0
00000000 f707f400 e49ccc4c dc8444ac e49ccc70
May 18 02:59:47 violet kernel: Call Trace:
May 18 02:59:47 violet kernel: [<f91d5fe4>] xfs_iget_core+0x514/0x610 [xfs]
May 18 02:59:47 violet kernel: [<f92053b3>]
linvfs_alloc_inode+0x23/0x30 [xfs]
May 18 02:59:47 violet kernel: [<f91d61b4>] xfs_iget+0xd4/0x170 [xfs]
May 18 02:59:47 violet kernel: [<f91f3f2b>] xfs_vget+0x6b/0xf0 [xfs]
May 18 02:59:47 violet kernel: [<c02c25d1>] alloc_skb+0x41/0xf0
May 18 02:59:47 violet kernel: [<f920613d>] vfs_vget+0x2d/0x40 [xfs]
May 18 02:59:47 violet kernel: [<f919c659>] linvfs_get_dentry+0x39/0x70
[xfs]
May 18 02:59:47 violet kernel: [<f88ea02d>]
find_exported_dentry+0x2d/0x630 [exportfs]
May 18 02:59:47 violet kernel: [<c02e890c>] ip_append_data+0x57c/0x850
May 18 02:59:47 violet kernel: [<c0305759>] udp_sendmsg+0x2f9/0x6d0
May 18 02:59:47 violet kernel: [<f897f627>]
e1000_xmit_frame+0x557/0x8e0 [e1000]
May 18 02:59:47 violet kernel: [<c03274c2>] packet_rcv_spkt+0xd2/0x260
May 18 02:59:47 violet kernel: [<c02d8330>] qdisc_restart+0x20/0x230
May 18 02:59:47 violet kernel: [<c02c85fb>] dev_queue_xmit+0x26b/0x300
May 18 02:59:47 violet kernel: [<c02e6f9a>] ip_finish_output+0xda/0x290
May 18 02:59:47 violet kernel: [<c02be4e8>] sock_sendmsg+0x118/0x150
May 18 02:59:47 violet kernel: [<c02e9280>]
ip_push_pending_frames+0x240/0x490
May 18 02:59:47 violet kernel: [<c0138730>]
autoremove_wake_function+0x0/0x50
May 18 02:59:47 violet kernel: [<c03052e9>]
udp_push_pending_frames+0x139/0x270
May 18 02:59:47 violet kernel: [<c011d158>] recalc_task_prio+0x88/0x150
May 18 02:59:47 violet kernel: [<f8ae151a>]
svc_expkey_lookup+0x40a/0x430 [nfsd]
May 18 02:59:47 violet kernel: [<f919c5b3>] linvfs_decode_fh+0x63/0xd0
[xfs]
May 18 02:59:47 violet kernel: [<f8adbf50>] nfsd_acceptable+0x0/0x100
[nfsd]
May 18 02:59:47 violet kernel: [<f919c550>] linvfs_decode_fh+0x0/0xd0 [xfs]
May 18 02:59:47 violet kernel: [<f8adc228>] fh_verify+0x1d8/0x5a0 [nfsd]
May 18 02:59:47 violet kernel: [<f8adbf50>] nfsd_acceptable+0x0/0x100
[nfsd]
May 18 02:59:47 violet kernel: [<f8add053>] nfsd_lookup+0x43/0x4d0 [nfsd]
May 18 02:59:47 violet kernel: [<f8adaf44>] nfsd_proc_lookup+0x54/0xa0
[nfsd]
May 18 02:59:47 violet kernel: [<f8ae3d30>]
nfssvc_decode_diropargs+0x0/0xd0 [nfsd]
May 18 02:59:47 violet kernel: [<f8ada5ba>] nfsd_dispatch+0x8a/0x210 [nfsd]
May 18 02:59:47 violet kernel: [<f8c1afb6>] svc_authenticate+0x86/0xd0
[sunrpc]
May 18 02:59:47 violet kernel: [<f8c18641>] svc_process+0x551/0x640
[sunrpc]
May 18 02:59:47 violet kernel: [<c01172bc>]
smp_call_function_interrupt+0x3c/0x60
May 18 02:59:47 violet kernel: [<c0104b04>]
call_function_interrupt+0x1c/0x24
May 18 02:59:47 violet kernel: [<f8ada3b0>] nfsd+0x190/0x310 [nfsd]
May 18 02:59:47 violet kernel: [<f8ada220>] nfsd+0x0/0x310 [nfsd]
May 18 02:59:47 violet kernel: [<c0102345>] kernel_thread_helper+0x5/0x10
May 18 02:59:47 violet kernel: Code: b8 e0 f9 21 f9 89 44 24 08 8b 04 ad a0 e4 21 f9 89 44 24 04 e8 ed cd f1 c6 8b 54 24 0c b8 84 e4 21 f9 e8 bf 4d 12 c7 85 ed 75 08 <0f> 0b 6a 00 df 99 20 f9 8b 5c 24 10 8b 74 24 14 8b 7c 24 18 8b


Greg

2005-05-18 18:00:30

by Chris Wedgwood

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Wed, May 18, 2005 at 10:38:07AM -0700, Gregory Brauer wrote:

> May 18 02:59:47 violet kernel: xfs_iget_core: ambiguous vns: vp/0xf53f8ac8, invp/0xe49ccc4c

I'm pretty sure it's NFS that aggravates this --- can anyone recall
why?

2005-05-18 19:53:03

by Jakob Oestergaard

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Wed, May 18, 2005 at 10:59:25AM -0700, Chris Wedgwood wrote:
> On Wed, May 18, 2005 at 10:38:07AM -0700, Gregory Brauer wrote:
>
> > May 18 02:59:47 violet kernel: xfs_iget_core: ambiguous vns: vp/0xf53f8ac8, invp/0xe49ccc4c
>
> I'm pretty sure it's NFS that aggravates this --- can anyone recall
> why?

Not why, no - but there were *major* problems with SMP+NFS+XFS up until
2.6.11.

I run 2.6.11(.8/9) on both SMP (dual athlon) and NUMA (64 bit kernel on
dual opteron) with NFS and XFS and haven't yet seen any problems (knock
the wood).

Seriously, any 2.6 earlier than .11 is *unusable* for file serving over
NFS (at least with XFS which at the moment is the only FS with
journalled quota so at least for me that's the only option).

--

/ jakob

2005-05-18 20:03:56

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Wed, 18 May 2005 at 9:52pm, Jakob Oestergaard wrote

> On Wed, May 18, 2005 at 10:59:25AM -0700, Chris Wedgwood wrote:
> > On Wed, May 18, 2005 at 10:38:07AM -0700, Gregory Brauer wrote:
> >
> > > May 18 02:59:47 violet kernel: xfs_iget_core: ambiguous vns: vp/0xf53f8ac8, invp/0xe49ccc4c
> >
> > I'm pretty sure it's NFS that aggravates this --- can anyone recall
> > why?
>
> Not why, no - but there were *major* problems with SMP+NFS+XFS up until
> 2.6.11.
>
> I run 2.6.11(.8/9) on both SMP (dual athlon) and NUMA (64 bit kernel on
> dual opteron) with NFS and XFS and haven't yet seen any problems (knock
> the wood).
>
> Seriously, any 2.6 earlier than .11 is *unusable* for file serving over
> NFS (at least with XFS which at the moment is the only FS with
> journalled quota so at least for me that's the only option).

Do you have a test case that would show this up? I've been testing a
centos-4 based server with the RH-derived 2.6.9-based kernel tweaked to
disable 4K stacks and enable XFS and haven't run into any issues yet.
This includes running the parallel IOR benchmark from 10 clients (and
getting 200MiB/s throughput on reads).

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-18 20:20:35

by Jakob Oestergaard

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Wed, May 18, 2005 at 04:00:08PM -0400, Joshua Baker-LePain wrote:
...
> >
> > Seriously, any 2.6 earlier than .11 is *unusable* for file serving over
> > NFS (at least with XFS which at the moment is the only FS with
> > journalled quota so at least for me that's the only option).
>
> Do you have a test case that would show this up? I've been testing a
> centos-4 based server with the RH-derived 2.6.9-based kernel tweaked to
> disable 4K stacks and enable XFS and haven't run into any issues yet.
> This includes running the parallel IOR benchmark from 10 clients (and
> getting 200MiB/s throughput on reads).

The server must be SMP.

Two clients; on each of them untar/cat/delete kernel trees.

You want a few million files on the FS in order to confuse the server
sufficiently for it to screw up severely.

Make sure you keep lots of things going concurrently on the clients.

And don't run as root - a common problem is also that files get the wrong
ownership/modes (a file created by one unprivileged user shows up as
belonging to another unprivileged user; files can show up with mode
d---------).
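
(A rough sketch of the client-side churn described above, run as an
unprivileged user on each client; the tarball name and mount point are
placeholders:)

    cd /mnt/xfs_export/$USER || exit 1
    while true; do
        tar xzf ~/linux-2.6.11.tar.gz                     # create the tree
        find linux-2.6.11 -type f | xargs cat > /dev/null # read it back
        rm -rf linux-2.6.11                               # delete it
    done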

I guess RH could have patched up a 2.6.9 to include whatever fixes (more
than one!) for the issues that were resolved from 2.6.9 to 2.6.11.

--

/ jakob

2005-05-18 20:50:48

by Gregory Brauer

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Joshua Baker-LePain wrote:
> Do you have a test case that would show this up? I've been testing a
> centos-4 based server with the RH-derived 2.6.9-based kernel tweaked to
> disable 4K stacks and enable XFS and haven't run into any issues yet.
> This includes running the parallel IOR benchmark from 10 clients (and
> getting 200MiB/s throughput on reads).
>

For Jakob,
Note that the last OOPS I posted was for 2.6.11.10.


For Joshua,

We first saw the problem after 5 days in production, but since then
we took the server out of production and used the script
nfs_fsstress.sh located in this package:

http://prdownloads.sourceforge.net/ltp/ltp-full-20050505.tgz?download

We run the script on 5 client machines that are running RedHat 9
with kernel-smp-2.4.20-20.9 and nfs-utils-1.0.1-3.9.1.legacy and
are NFS mounting our 2.6 kernel server. The longest time to OOPS
has been about 8 hours. We have not tried the parallel IOR
benchmark. (Where can we get that?)

You didn't mention if you are using md at all. We have a
software RAID-0 of 4 x 3ware 8506-4 controllers running the
latest 3ware driver from their site. The filesystem is XFS.
The network driver is e1000 (two interfaces, not bonded). The
system is a dual Xeon. We upped the number of NFS daemons
from 8 to 64. The nfs_fsstress.sh client mounts the servers
both UDP and TCP, and our in-production oops likely happened
with a combination of both protocols in use simultaneously as
well. We've seen the OOPS with both the default and with 32K
read and write NFS block sizes. The machine was stable for
over a year with RedHat 9 and 2.4.20.
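
(For the record, the mounts and nfsd count in question look roughly like
the following; the export path is a placeholder, and RPCNFSDCOUNT is the
Red Hat-style knob for the nfsd thread count:)

    # on each client: one UDP and one TCP mount, 32K block sizes
    mount -t nfs -o rsize=32768,wsize=32768,udp violet:/export /mnt/udp
    mount -t nfs -o rsize=32768,wsize=32768,tcp violet:/export /mnt/tcp

    # on the server: raise the nfsd thread count from 8 to 64
    echo 'RPCNFSDCOUNT=64' >> /etc/sysconfig/nfs
    service nfs restart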

I'm grasping for any subtle details that might help...

Here is our server's list of loaded modules:

Module Size Used by
nfsd 185569 65
exportfs 9921 1 nfsd
lockd 59625 2 nfsd
md5 8001 1
ipv6 236769 16
parport_pc 29701 1
lp 15405 0
parport 37129 2 parport_pc,lp
sunrpc 135077 28 nfsd,lockd
xfs 487809 1
dm_mod 57925 0
video 19653 0
button 10577 0
battery 13253 0
ac 8773 0
uhci_hcd 33497 0
hw_random 9429 0
i2c_i801 11981 0
i2c_core 24513 1 i2c_i801
e1000 84629 0
bonding 59817 0
floppy 56913 0
ext3 117961 2
jbd 57177 1 ext3
raid0 11840 1
3w_xxxx 30561 4
sd_mod 20545 4
scsi_mod 116033 2 3w_xxxx,sd_mod


Let me know if there is anything else I can provide.

Thanks.

Greg

2005-05-18 21:00:11

by Gregory Brauer

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Jakob Oestergaard wrote:
> You want a few million files on the FS in order to confuse the server
> sufficiently for it to screw up severely.

Here we reproduced the OOPS with a fresh and empty XFS volume using
the nfs_fsstress.sh script.

> And don't run as root - common problems are also that files get wrong
> ownership/modes (a file created by one unprivileged user shows up as
> belonging to another unprivileged user - files can show up with modes
> d---------)

Our nfs_fsstress.sh tests were running as root and writing only
root-owned files (with no_root_squash, of course) and reproduced
the OOPS twice. We haven't seen the privileges problem yet.

Greg

2005-05-19 19:44:07

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Wed, 18 May 2005 at 1:43pm, Gregory Brauer wrote

> Joshua Baker-LePain wrote:
> > Do you have a test case that would show this up? I've been testing a
> > centos-4 based server with the RH-derived 2.6.9-based kernel tweaked to
> > disable 4K stacks and enable XFS and haven't run into any issues yet.
> > This includes running the parallel IOR benchmark from 10 clients (and
> > getting 200MiB/s throughput on reads).
> >
>
> We first saw the problem after 5 days in production, but since then
> we took the server out of production and used the script
> nfs_fsstress.sh located in this package:
>
> http://prdownloads.sourceforge.net/ltp/ltp-full-20050505.tgz?download
>
> We run the script on 5 client machines that are running RedHat 9
> with kernel-smp-2.4.20-20.9 and nfs-utils-1.0.1-3.9.1.legacy and
> are NFS mounting our 2.6 kernel server. The longest time to OOPS

My clients are all RHEL3. I'll give the nfs_fsstress scripts a shot.

> has been about 8 hours. We have not tried the parallel IOR
> benchmark. (Where can we get that?)

http://www.llnl.gov/asci/purple/benchmarks/limited/ior/

> You didn't mention if you are using md at all. We have a
> software RAID-0 of 4 x 3ware 8506-4 controllers running the
> latest 3ware driver from their site. The filesystem is XFS.
> The network driver is e1000 (two interfaces, not bonded). The
> system is a dual Xeon. We upped the number of NFS daemons
> from 8 to 64. The nfs_fsstress.sh client mounts the servers
> both UDP and TCP, and our in-production oops likely happened
> with a combination of both protocols in use simultaneously as
> well. We've seen the OOPS with both the default and with 32K
> read and write NFS block sizes. The machine was stable for
> over a year with RedHat 9 and 2.4.20.

The server I'll be testing is dual Xeon, 2GB RAM, 2 x 3ware 9500-12
controllers running the 9.1.5.2 firmware and using the stock centos-4
drivers. (I see a huge performance hit using the latest 3ware driver set,
9.2.) The 3wares are doing hardware RAID5, and I do software RAID0 across
'em. Networking is the same -- 2 e1000s, not bonded. I'm running 256 NFS
daemons, well, just because.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:01:24

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 19 May 2005 at 3:43pm, Joshua Baker-LePain wrote

> On Wed, 18 May 2005 at 1:43pm, Gregory Brauer wrote

> > We first saw the problem after 5 days in production, but since then
> > we took the server out of production and used the script
> > nfs_fsstress.sh located in this package:
> >
> > http://prdownloads.sourceforge.net/ltp/ltp-full-20050505.tgz?download
> >
> > We run the script on 5 client machines that are running RedHat 9
> > with kernel-smp-2.4.20-20.9 and nfs-utils-1.0.1-3.9.1.legacy and
> > are NFS mounting our 2.6 kernel server. The longest time to OOPS
>
> My clients are all RHEL3 (well, centos 3 actually). I'll give the
> nfs_fsstress scripts a shot.

Hrm. That didn't take long. I've got 6 clients (3 per interface) going
with nfs_fsstress.sh, and I saw the following in the logs on the server
after about 20 minutes. Note, however, that I can still access the FS,
both locally on the server and via NFS. The scripts are still going, but
some have already reported errors (obviously).

The server error:

May 19 16:47:10 norbert kernel: xfs_da_do_buf: bno 8388608
May 19 16:47:10 norbert kernel: dir: inode 2162706
May 19 16:47:10 norbert kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2176 of file fs/xfs/xfs_da_btree.c. Caller 0xf8c90148
May 19 16:47:10 norbert kernel: [<f8c8fd5b>] xfs_da_do_buf+0x357/0x70d [xfs]
May 19 16:47:10 norbert kernel: [<f8c90148>] xfs_da_read_buf+0x19/0x1e [xfs]
May 19 16:47:10 norbert kernel: [<c013e67c>] buffered_rmqueue+0x17d/0x1a5
May 19 16:47:10 norbert kernel: [<f8c90148>] xfs_da_read_buf+0x19/0x1e [xfs]
May 19 16:47:10 norbert kernel: [<f8c8e7e7>] xfs_da_node_lookup_int+0x9d/0x2c0 [xfs]
May 19 16:47:10 norbert kernel: [<f8c8e7e7>] xfs_da_node_lookup_int+0x9d/0x2c0 [xfs]
May 19 16:47:10 norbert kernel: [<f8cc103e>] kmem_zone_alloc+0x3b/0x70 [xfs]
May 19 16:47:10 norbert kernel: [<f8c98300>] xfs_dir2_node_lookup+0x34/0x96 [xfs]
May 19 16:47:10 norbert kernel: [<f8c91a23>] xfs_dir2_lookup+0xde/0x107 [xfs]
May 19 16:47:10 norbert kernel: [<c019e09d>] avc_has_perm_noaudit+0x8d/0xda
May 19 16:47:10 norbert kernel: [<c019e09d>] avc_has_perm_noaudit+0x8d/0xda
May 19 16:47:10 norbert kernel: [<f8cb8c1f>] xfs_dir_lookup_int+0x26/0xa8 [xfs]
May 19 16:47:10 norbert kernel: [<f8cbd09b>] xfs_lookup+0x40/0x69 [xfs]
May 19 16:47:10 norbert kernel: [<c016b6c2>] wake_up_inode+0x6/0x29
May 19 16:47:10 norbert kernel: [<f8cc953d>] vfs_init_vnode+0x1e/0x22 [xfs]
May 19 16:47:10 norbert kernel: [<f8cc910f>] linvfs_get_parent+0x43/0x75 [xfs]
May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
May 19 16:47:10 norbert kernel: [<c016941e>] d_alloc+0x197/0x1a1
May 19 16:47:10 norbert kernel: [<c0169587>] d_alloc_anon+0xd1/0xee
May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
May 19 16:47:10 norbert kernel: [<f8b1c303>] find_exported_dentry+0x303/0x5e8 [exportfs]
May 19 16:47:10 norbert kernel: [<c026f9c2>] skb_copy_datagram_iovec+0x4f/0x1e1
May 19 16:47:10 norbert kernel: [<c026d423>] release_sock+0xf/0x4f
May 19 16:47:10 norbert kernel: [<c0291ab6>] tcp_recvmsg+0x64a/0x681
May 19 16:47:10 norbert kernel: [<c026d550>] sock_common_recvmsg+0x30/0x46
May 19 16:47:10 norbert kernel: [<c026a230>] sock_recvmsg+0xef/0x10c
May 19 16:47:10 norbert kernel: [<c028b67f>] dst_output+0x0/0x1a
May 19 16:47:10 norbert kernel: [<c011b3ee>] recalc_task_prio+0x128/0x133
May 19 16:47:10 norbert kernel: [<c011b481>] activate_task+0x88/0x95
May 19 16:47:10 norbert kernel: [<c011b902>] try_to_wake_up+0x222/0x22d
May 19 16:47:10 norbert kernel: [<c011cdbd>] __wake_up_common+0x36/0x51
May 19 16:47:11 norbert kernel: [<c011ce01>] __wake_up+0x29/0x3c
May 19 16:47:11 norbert kernel: [<f8b438d2>] svc_sock_enqueue+0x1d6/0x212 [sunrpc]
May 19 16:47:11 norbert kernel: [<f8b4485d>] svc_tcp_recvfrom+0x304/0x376 [sunrpc]
May 19 16:47:11 norbert kernel: [<f8c1beb5>] svc_expkey_lookup+0x1fc/0x330 [nfsd]
May 19 16:47:11 norbert kernel: [<f8b1c88e>] export_decode_fh+0x61/0x6d [exportfs]
May 19 16:47:11 norbert kernel: [<f8c17884>] nfsd_acceptable+0x0/0xba [nfsd]
May 19 16:47:11 norbert kernel: [<f8b1c82d>] export_decode_fh+0x0/0x6d [exportfs]
May 19 16:47:11 norbert kernel: [<f8c17cfa>] fh_verify+0x3bc/0x5bd [nfsd]
May 19 16:47:11 norbert kernel: [<f8c17884>] nfsd_acceptable+0x0/0xba [nfsd]
May 19 16:47:11 norbert kernel: [<f8c1ff2a>] nfsd3_proc_getattr+0x6f/0x77 [nfsd]
May 19 16:47:11 norbert kernel: [<f8c21b06>] nfs3svc_decode_fhandle+0x0/0x8d [nfsd]
May 19 16:47:11 norbert kernel: [<f8c165d7>] nfsd_dispatch+0xba/0x16f [nfsd]
May 19 16:47:11 norbert kernel: [<f8b43446>] svc_process+0x420/0x6d6 [sunrpc]
May 19 16:47:11 norbert kernel: [<f8c163b7>] nfsd+0x1cc/0x332 [nfsd]
May 19 16:47:11 norbert kernel: [<f8c161eb>] nfsd+0x0/0x332 [nfsd]
May 19 16:47:11 norbert kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
May 19 16:47:11 norbert kernel: nfsd: non-standard errno: -990


--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:09:36

by Lee Revell

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 2005-05-19 at 17:00 -0400, Joshua Baker-LePain wrote:
> May 19 16:47:10 norbert kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2176 of file fs/xfs/xfs_da_btree.c. Caller 0xf8c90148
> May 19 16:47:10 norbert kernel: [<f8c8fd5b>] xfs_da_do_buf+0x357/0x70d [xfs]
> May 19 16:47:10 norbert kernel: [<f8c90148>] xfs_da_read_buf+0x19/0x1e [xfs]
> May 19 16:47:10 norbert kernel: [<c013e67c>] buffered_rmqueue+0x17d/0x1a5
> May 19 16:47:10 norbert kernel: [<f8c90148>] xfs_da_read_buf+0x19/0x1e [xfs]
> May 19 16:47:10 norbert kernel: [<f8c8e7e7>] xfs_da_node_lookup_int+0x9d/0x2c0 [xfs]
> May 19 16:47:10 norbert kernel: [<f8c8e7e7>] xfs_da_node_lookup_int+0x9d/0x2c0 [xfs]
> May 19 16:47:10 norbert kernel: [<f8cc103e>] kmem_zone_alloc+0x3b/0x70 [xfs]
> May 19 16:47:10 norbert kernel: [<f8c98300>] xfs_dir2_node_lookup+0x34/0x96 [xfs]
> May 19 16:47:10 norbert kernel: [<f8c91a23>] xfs_dir2_lookup+0xde/0x107 [xfs]
> May 19 16:47:10 norbert kernel: [<c019e09d>] avc_has_perm_noaudit+0x8d/0xda
> May 19 16:47:10 norbert kernel: [<c019e09d>] avc_has_perm_noaudit+0x8d/0xda
> May 19 16:47:10 norbert kernel: [<f8cb8c1f>] xfs_dir_lookup_int+0x26/0xa8 [xfs]
> May 19 16:47:10 norbert kernel: [<f8cbd09b>] xfs_lookup+0x40/0x69 [xfs]
> May 19 16:47:10 norbert kernel: [<c016b6c2>] wake_up_inode+0x6/0x29
> May 19 16:47:10 norbert kernel: [<f8cc953d>] vfs_init_vnode+0x1e/0x22 [xfs]
> May 19 16:47:10 norbert kernel: [<f8cc910f>] linvfs_get_parent+0x43/0x75 [xfs]
> May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
> May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
> May 19 16:47:10 norbert kernel: [<c016941e>] d_alloc+0x197/0x1a1
> May 19 16:47:10 norbert kernel: [<c0169587>] d_alloc_anon+0xd1/0xee
> May 19 16:47:10 norbert kernel: [<c02c4022>] __cond_resched+0x14/0x39
> May 19 16:47:10 norbert kernel: [<f8b1c303>] find_exported_dentry+0x303/0x5e8 [exportfs]
> May 19 16:47:10 norbert kernel: [<c026f9c2>] skb_copy_datagram_iovec+0x4f/0x1e1
> May 19 16:47:10 norbert kernel: [<c026d423>] release_sock+0xf/0x4f
> May 19 16:47:10 norbert kernel: [<c0291ab6>] tcp_recvmsg+0x64a/0x681
> May 19 16:47:10 norbert kernel: [<c026d550>] sock_common_recvmsg+0x30/0x46
> May 19 16:47:10 norbert kernel: [<c026a230>] sock_recvmsg+0xef/0x10c
> May 19 16:47:10 norbert kernel: [<c028b67f>] dst_output+0x0/0x1a
> May 19 16:47:10 norbert kernel: [<c011b3ee>] recalc_task_prio+0x128/0x133
> May 19 16:47:10 norbert kernel: [<c011b481>] activate_task+0x88/0x95
> May 19 16:47:10 norbert kernel: [<c011b902>] try_to_wake_up+0x222/0x22d
> May 19 16:47:10 norbert kernel: [<c011cdbd>] __wake_up_common+0x36/0x51
> May 19 16:47:11 norbert kernel: [<c011ce01>] __wake_up+0x29/0x3c
> May 19 16:47:11 norbert kernel: [<f8b438d2>] svc_sock_enqueue+0x1d6/0x212 [sunrpc]
> May 19 16:47:11 norbert kernel: [<f8b4485d>] svc_tcp_recvfrom+0x304/0x376 [sunrpc]
> May 19 16:47:11 norbert kernel: [<f8c1beb5>] svc_expkey_lookup+0x1fc/0x330 [nfsd]
> May 19 16:47:11 norbert kernel: [<f8b1c88e>] export_decode_fh+0x61/0x6d [exportfs]
> May 19 16:47:11 norbert kernel: [<f8c17884>] nfsd_acceptable+0x0/0xba [nfsd]
> May 19 16:47:11 norbert kernel: [<f8b1c82d>] export_decode_fh+0x0/0x6d [exportfs]
> May 19 16:47:11 norbert kernel: [<f8c17cfa>] fh_verify+0x3bc/0x5bd [nfsd]
> May 19 16:47:11 norbert kernel: [<f8c17884>] nfsd_acceptable+0x0/0xba [nfsd]
> May 19 16:47:11 norbert kernel: [<f8c1ff2a>] nfsd3_proc_getattr+0x6f/0x77 [nfsd]
> May 19 16:47:11 norbert kernel: [<f8c21b06>] nfs3svc_decode_fhandle+0x0/0x8d [nfsd]
> May 19 16:47:11 norbert kernel: [<f8c165d7>] nfsd_dispatch+0xba/0x16f [nfsd]
> May 19 16:47:11 norbert kernel: [<f8b43446>] svc_process+0x420/0x6d6 [sunrpc]
> May 19 16:47:11 norbert kernel: [<f8c163b7>] nfsd+0x1cc/0x332 [nfsd]
> May 19 16:47:11 norbert kernel: [<f8c161eb>] nfsd+0x0/0x332 [nfsd]
> May 19 16:47:11 norbert kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb

Couldn't this be a stack overflow? That's a very large kernel stack.

Lee

2005-05-19 21:16:13

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 19 May 2005 at 5:09pm, Lee Revell wrote

> On Thu, 2005-05-19 at 17:00 -0400, Joshua Baker-LePain wrote:
> > May 19 16:47:10 norbert kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2176 of file fs/xfs/xfs_da_btree.c. Caller 0xf8c90148
*snip*
>
> Couldn't this be a stack overflow? That's a very large kernel stack.

I am using 8K stacks, and that's all the kernel messages I see.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:27:23

by Chris Wedgwood

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, May 19, 2005 at 05:00:22PM -0400, Joshua Baker-LePain wrote:

> May 19 16:47:10 norbert kernel: xfs_da_do_buf: bno 8388608
> May 19 16:47:10 norbert kernel: dir: inode 2162706
> May 19 16:47:10 norbert kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2176 of file fs/xfs/xfs_da_btree.c. Caller 0xf8c90148
> May 19 16:47:10 norbert kernel: [<f8c8fd5b>] xfs_da_do_buf+0x357/0x70d [xfs]

This isn't what was reported before. Maybe check inode 2162706 on
disk using xfs_db and make sure it looks OK?
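
(Something along these lines, assuming the filesystem is on /dev/md0 as
in the report; xfs_db's -r flag opens the device read-only:)

    xfs_db -r -c "inode 2162706" -c "print" /dev/md0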

As pointed out, it is a really long call chain --- I'm guessing it's OK,
but it could still be a stack overflow problem, as with 8K stacks you
don't have separate IRQ stacks.

If you like, you could apply the following patch and *enable*
CONFIG_4KSTACKS. In that case the option is badly named, as you will
really have 8K process stacks and 4K irq stacks after applying this.


Index: cw-current/include/asm-i386/thread_info.h
===================================================================
--- cw-current.orig/include/asm-i386/thread_info.h 2004-12-15 14:35:00.296402459 -0800
+++ cw-current/include/asm-i386/thread_info.h 2004-12-15 14:38:20.129259160 -0800
@@ -54,12 +54,12 @@

#define PREEMPT_ACTIVE 0x10000000
#ifdef CONFIG_4KSTACKS
-#define THREAD_SIZE (4096)
+#define THREAD_SIZE (8192)
#else
#define THREAD_SIZE (8192)
#endif

-#define STACK_WARN (THREAD_SIZE/8)
+#define STACK_WARN (THREAD_SIZE/2)
/*
* macros/functions for gaining access to the thread information structure
*

2005-05-19 21:29:53

by Steve Lord

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Joshua Baker-LePain wrote:
> On Thu, 19 May 2005 at 5:09pm, Lee Revell wrote
>
>
>>On Thu, 2005-05-19 at 17:00 -0400, Joshua Baker-LePain wrote:
>>
>>>May 19 16:47:10 norbert kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2176 of file fs/xfs/xfs_da_btree.c. Caller 0xf8c90148
>
> *snip*
>
>>Couldn't this be a stack overflow? That's a very large kernel stack.
>
>
> I am using 8K stacks, and that's all the kernel messages I see.
>

The stack backtrace just converts all the code addresses it sees on
the stack, so you get extra false positives in there; it is not as
large as it seems.

Try setting /proc/sys/fs/xfs/error_level to 1 and running again;
it should spout out some more information about what it thinks
is corrupted.

Does xfs_repair report anything after this has happened? It looks
like it is trying to read a directory block up from disk to satisfy
a lookup request and failing for some reason. My suspicion is that
the filesystem will look OK (unmount it, then remount to replay the
log, then unmount again before running repair).
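
(A sketch of that sequence, assuming the filesystem is md0 mounted on
/export as elsewhere in this thread; xfs_repair -n only reports problems
and does not modify anything:)

    umount /export
    mount /dev/md0 /export     # mounting replays the XFS log
    umount /export
    xfs_repair -n /dev/md0     # check only, no modifications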

Anything else in the syslog shortly before this?

Steve

2005-05-19 21:32:55

by Steve Lord

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Steve Lord wrote:

>
> Try setting /proc/sys/fs/xfs/error_level to 1 and running again,
> it should spout out some more information about what it thinks
> is corrupted.
>

Never mind, you already did that, didn't you? I will now go back to
my day job.....

Steve

2005-05-19 21:35:41

by Chris Wedgwood

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, May 19, 2005 at 04:29:36PM -0500, Steve Lord wrote:
> Joshua Baker-LePain wrote:

> Does xfs_repair report anything after this has happened, it looks
> like it is trying to read a directory block up from disk to satisfy
> a lookup request and failing for some reason.

bit corruption? bad hardware maybe?

2005-05-19 21:38:54

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 19 May 2005 at 4:32pm, Steve Lord wrote

> Steve Lord wrote:
> >
> > Try setting /proc/sys/fs/xfs/error_level to 1 and running again,
> > it should spout out some more information about what it thinks
> > is corrupted.
> >
> Never mind, you already did that didn't you. I will now go back to
> my day job.....

Actually, I hadn't. And 'cat /proc/sys/fs/xfs/error_level' says '3'. Is
1 more or less verbose?

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:42:31

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

And now I've got some OOPSes:

Unable to handle kernel NULL pointer dereference at virtual address
00000000
printing eip:
00000000
*pde = 37137001
Oops: 0000 [#1]
SMP
Modules linked in: ipt_REJECT ipt_state ip_conntrack iptable_filter
ip_tables nfsd exportfs lockd xfs md5 ipv6 parport_pc lp parport i2c_dev
i2c_core sunrpc raid0 dm_mod button battery ac uhci_hcd hw_random e1000
floppy ext3 jbd 3w_9xxx 3w_xxxx sd_mod scsi_mod
CPU: 2
EIP: 0060:[<00000000>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-5.0.5.EL.XFSsmp)
EIP is at 0x0
eax: ca9a8dac ebx: c03f03e0 ecx: 00000000 edx: d8cc689c
esi: d8cc689c edi: f2be7f04 ebp: 00000000 esp: f2be7ee8
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 3255, threadinfo=f2be6000 task=f0a5e830)
Stack: c0161332 ca9a8dac ffffffff e8f5608d 112f5575 f6b56f24 c01613a6 112f5575
00000005 e8f56088 ca9a8e1c ca9a8e1c ca9a8dac f6b56f24 e8f55804 f8c18aa8
e8f56088 f7fba800 f699c940 f7fba800 00000246 e8f4a000 e8f55800 e8f559d4
Call Trace:
[<c0161332>] __lookup_hash+0x70/0x89
[<c01613a6>] lookup_one_len+0x54/0x63
[<f8c18aa8>] nfsd_lookup+0x321/0x3ad [nfsd]
[<f8c20061>] nfsd3_proc_lookup+0xa9/0xb3 [nfsd]
[<f8c21fe4>] nfs3svc_decode_diropargs+0x0/0xfa [nfsd]
[<f8c165d7>] nfsd_dispatch+0xba/0x16f [nfsd]
[<f8b43446>] svc_process+0x420/0x6d6 [sunrpc]
[<f8c163b7>] nfsd+0x1cc/0x332 [nfsd]
[<f8c161eb>] nfsd+0x0/0x332 [nfsd]
[<c01041f1>] kernel_thread_helper+0x5/0xb
Code: Bad EIP value.
<1>Unable to handle kernel NULL pointer dereference at virtual address
00000000
printing eip:
00000000
*pde = 1d5c4001
Oops: 0000 [#2]
SMP
Modules linked in: ipt_REJECT ipt_state ip_conntrack iptable_filter
ip_tables nfsd exportfs lockd xfs md5 ipv6 parport_pc lp parport i2c_dev
i2c_core sunrpc raid0 dm_mod button battery ac uhci_hcd hw_random e1000
floppy ext3 jbd 3w_9xxx 3w_xxxx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<00000000>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-5.0.5.EL.XFSsmp)
EIP is at 0x0
eax: c325cc30 ebx: c03f03e0 ecx: 00000000 edx: f0ee79cc
esi: f0ee79cc edi: e9d99f0c ebp: 00000000 esp: e9d99ef0
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 3353, threadinfo=e9d98000 task=ea2130b0)
Stack: c0161332 c325cc30 ffffffff d61f10cf 002249be dad2017c c01613a6 002249be
00000003 d61f10cc c325cca0 c325cca0 c325cc30 dad2017c e88a9000 f8c18aa8
d61f10cc f7d74c00 f699ca00 c234b780 f7d74d60 ea3f7800 e88a9000 ea3f78e8
Call Trace:
[<c0161332>] __lookup_hash+0x70/0x89
[<c01613a6>] lookup_one_len+0x54/0x63
[<f8c18aa8>] nfsd_lookup+0x321/0x3ad [nfsd]
[<f8c16c7c>] nfsd_proc_lookup+0x5f/0x71 [nfsd]
[<f8c1e13d>] nfssvc_decode_diropargs+0x0/0xa7 [nfsd]
[<f8c165d7>] nfsd_dispatch+0xba/0x16f [nfsd]
[<f8b43446>] svc_process+0x420/0x6d6 [sunrpc]
[<f8c163b7>] nfsd+0x1cc/0x332 [nfsd]
[<f8c161eb>] nfsd+0x0/0x332 [nfsd]
[<c01041f1>] kernel_thread_helper+0x5/0xb
Code: Bad EIP value.


--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:44:26

by Steve Lord

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Joshua Baker-LePain wrote:
> On Thu, 19 May 2005 at 4:32pm, Steve Lord wrote
>
>
>>Steve Lord wrote:
>>
>>>Try setting /proc/sys/fs/xfs/error_level to 1 and running again,
>>>it should spout out some more information about what it thinks
>>>is corrupted.
>>>
>>
>>Never mind, you already did that didn't you. I will now go back to
>>my day job.....
>
>
> Actually, I hadn't. And 'cat /proc/sys/fs/xfs/error_level' says '3'. Is
> 1 more or less verbose?
>

Less, 3 is the default now that I look.

Steve

2005-05-19 21:49:20

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 19 May 2005 at 5:42pm, Joshua Baker-LePain wrote

> And now I've got some OOPSes:

But, looking at 'em (instead of just blindly sending 'em along), I don't
see XFS anywhere in them. The "interesting" thing is that, of the
nfs_fsstress logs in /tmp on all the clients, only the NFSv3 over TCP log
is showing errors on any of the clients.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 21:51:56

by Lee Revell

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 2005-05-19 at 17:42 -0400, Joshua Baker-LePain wrote:
> And now I've got some OOPSes:

Did these come after the Oops you first posted, or had you rebooted
since?

If you have not rebooted since the previous Oops, then these are not
useful.

Lee

2005-05-19 21:52:55

by Joshua Baker-LePain

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

On Thu, 19 May 2005 at 5:49pm, Lee Revell wrote

> On Thu, 2005-05-19 at 17:42 -0400, Joshua Baker-LePain wrote:
> > And now I've got some OOPSes:
>
> Did these come after the Oops you first posted, or had you rebooted
> since?
>
> If you have not rebooted since the previous Oops, then these are not
> useful.

No reboot. Bah. Too tired. Too much to do. Need caffeine.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

2005-05-19 22:04:05

by Gregory Brauer

Subject: Re: kernel OOPS for XFS in xfs_iget_core (using NFS+SMP+MD)

Joshua Baker-LePain wrote:
> On Thu, 19 May 2005 at 5:42pm, Joshua Baker-LePain wrote
>
>
>>And now I've got some OOPSes:
>
>
> But, looking at 'em (instead of just blindly sending 'em along), I don't
> see XFS anywhere in them. The "interesting" thing is that, of the
> nfs_fsstress logs in /tmp on all the clients, only the NFSv3 over TCP log
> is showing errors on any of the clients.

As noted, you seem to have stumbled upon some other bug.

We have another data point to report, this time with
kernel-smp-2.6.10-1.741_FC3. The one difference here worth noting
is that when this error was triggered with this kernel, the machine
did not lock up hard as with 2.6.11.1 and 2.6.11.10. You could
still log into the machine, but processes that attempted to access
the XFS filesystem would hang.

Greg


May 19 13:44:20 violet kernel: xfs_iget_core: ambiguous vns:
vp/0xed506a90, invp/0xed506798
May 19 13:44:20 violet kernel: ------------[ cut here ]------------
May 19 13:44:20 violet kernel: kernel BUG at fs/xfs/support/debug.c:106!
May 19 13:44:20 violet kernel: invalid operand: 0000 [#1]
May 19 13:44:20 violet kernel: SMP
May 19 13:44:20 violet kernel: Modules linked in: nfsd exportfs lockd
md5 ipv6 parport_pc lp parport sunrpc xfs dm_mod video button battery ac
uhci_hcd hw_random i2c_i801 i2c_core e1000 floppy ext3 jbd raid0 3w_xxxx
sd_mod scsi_mod
May 19 13:44:20 violet kernel: CPU: 0
May 19 13:44:20 violet kernel: EIP: 0060:[<f8a2951e>] Not tainted VLI
May 19 13:44:20 violet kernel: EFLAGS: 00010246 (2.6.10-1.741_FC3smp)
May 19 13:44:20 violet kernel: EIP is at cmn_err+0x8d/0x9b [xfs]
May 19 13:44:20 violet kernel: eax: 00000000 ebx: f8a2b743 ecx:
f8a3faa4 edx: 00000200
May 19 13:44:20 violet kernel: esi: f8a2e17d edi: f8a4205e ebp:
00000000 esp: f6bdfbb4
May 19 13:44:20 violet kernel: ds: 007b es: 007b ss: 0068
May 19 13:44:20 violet kernel: Process nfsd (pid: 4054,
threadinfo=f6bdf000 task=f6bcf530)
May 19 13:44:20 violet kernel: Stack: 00000293 ed5067bc f73dc3d0
503b6809 00000000 f8a02731 00000000 f8a2b743
May 19 13:44:20 violet kernel: ed506a90 ed506798 f747ea00
00000000 f74d7000 ed506798 cf8fbc2c ed5067bc
May 19 13:44:20 violet kernel: c03a5340 ed506798 f6bdf000
f8a02be7 503b6809 00000000 00000000 00000008
May 19 13:44:20 violet kernel: Call Trace:
May 19 13:44:20 violet kernel: [<f8a02731>] xfs_iget_core+0x12d/0x55f [xfs]
May 19 13:44:20 violet kernel: [<f8a02be7>] xfs_iget+0x84/0x139 [xfs]
May 19 13:44:20 violet kernel: [<f8a1a7cc>] xfs_vget+0x43/0xae [xfs]
May 19 13:44:20 violet kernel: [<f8a28c53>] vfs_vget+0x1a/0x1d [xfs]
May 19 13:44:20 violet kernel: [<f8a288f1>] linvfs_get_dentry+0x3b/0x6c
[xfs]
May 19 13:44:20 violet kernel: [<f892c02d>]
find_exported_dentry+0x2d/0x5a5 [exportfs]
May 19 13:44:20 violet kernel: [<c01b4ca1>] copy_to_user+0x49/0x51
May 19 13:44:20 violet kernel: [<c0269614>] memcpy_toiovec+0x27/0x47
May 19 13:44:20 violet kernel: [<c0269b2a>]
skb_copy_datagram_iovec+0x4f/0x1e1
May 19 13:44:20 violet kernel: [<c02678b0>] release_sock+0xf/0x4f
May 19 13:44:20 violet kernel: [<c028a98b>] tcp_recvmsg+0x61e/0x659
May 19 13:44:20 violet kernel: [<c02679ae>] sock_common_recvmsg+0x30/0x46
May 19 13:44:20 violet kernel: [<c0264bbc>] sock_recvmsg+0xef/0x10c
May 19 13:44:20 violet kernel: [<c0118922>] recalc_task_prio+0x128/0x133
May 19 13:44:20 violet kernel: [<c01189b5>] activate_task+0x88/0x95
May 19 13:44:20 violet kernel: [<c0118e2a>] try_to_wake_up+0x222/0x22d
May 19 13:44:20 violet kernel: [<c011a350>] __wake_up_common+0x36/0x51
May 19 13:44:20 violet kernel: [<c011a394>] __wake_up+0x29/0x3c
May 19 13:44:20 violet kernel: [<f896e3de>]
svc_sock_enqueue+0x1d6/0x212 [sunrpc]
May 19 13:44:20 violet kernel: [<f8b9ba95>]
svc_expkey_lookup+0x1fc/0x330 [nfsd]
May 19 13:44:20 violet kernel: [<f892c84b>] export_decode_fh+0x61/0x6d
[exportfs]
May 19 13:44:20 violet kernel: [<f8b977a0>] nfsd_acceptable+0x0/0xba [nfsd]
May 19 13:44:20 violet kernel: [<f892c7ea>] export_decode_fh+0x0/0x6d
[exportfs]
May 19 13:44:20 violet kernel: [<f8b97ba1>] fh_verify+0x347/0x4be [nfsd]
May 19 13:44:20 violet kernel: [<f8b977a0>] nfsd_acceptable+0x0/0xba [nfsd]
May 19 13:44:20 violet kernel: [<f8970f4b>]
svcauth_unix_accept+0x207/0x27a [sunrpc]
May 19 13:44:20 violet kernel: [<c02bafb6>] schedule_timeout+0x9a/0xae
May 19 13:44:20 violet kernel: [<f8b9914f>] nfsd_access+0x1f/0xd8 [nfsd]
May 19 13:44:20 violet kernel: [<f8b9f0f5>] nfsd3_proc_access+0x86/0x8e
[nfsd]
May 19 13:44:20 violet kernel: [<f8ba07d7>]
nfs3svc_decode_accessargs+0x0/0x7a [nfsd]
May 19 13:44:20 violet kernel: [<f8b96540>] nfsd_dispatch+0xba/0x16e [nfsd]
May 19 13:44:20 violet kernel: [<f896dfcb>] svc_process+0x331/0x56e
[sunrpc]
May 19 13:44:20 violet kernel: [<f8b9633b>] nfsd+0x190/0x2db [nfsd]
May 19 13:44:20 violet kernel: [<f8b961ab>] nfsd+0x0/0x2db [nfsd]
May 19 13:44:20 violet kernel: [<c01021f5>] kernel_thread_helper+0x5/0xb
May 19 13:44:20 violet kernel: Code: 68 20 20 a4 f8 ff 34 ad c0 fa a3 f8
68 7d e1 a2 f8 e8 54 4a 6f c7 8b 54 24 0c b8 a4 fa a3 f8 e8 f0 21 89 c7
83 c4 0c 85 ed 75 08 <0f> 0b 6a 00 64 e1 a2 f8 58 5b 5e 5f 5d c3 55 89
c5 83 e5 07 57