2006-08-15 14:27:31

by Martin Braun

[permalink] [raw]
Subject: kernel BUG at <bad filename>:50307!

Hello all,

I got this bug (see below) in my logs, the system showed with "top" an
increasing load average of 11 and more but with an cpu-idle of 99% and
no processes used mentionable resources, there were 6 zombies. A
shutdown was not possible most of the samba processes didn't respond to
a kill.
Before the exception the server was -as usual- under heavy load of samba
processes 4-5 clients, with many automated activity (batch-processes
with image processing).

What does this bug mean?

Hardware-Details:
* Device sdc (on an easy-raid system) has an XFS Filesystem.
* uname -a
Linux pers109 2.6.17.8 #1 SMP Mon Aug 7 11:04:08 CEST 2006 i686 i686
i386 GNU/Linux
* lspci
0000:00:00.0 Host bridge: Intel Corporation E7501 Memory Controller Hub
(rev 01)
0000:00:02.0 PCI bridge: Intel Corporation E7500/E7501 Hub Interface B
PCI-to-PCI Bridge (rev 01)
0000:00:1d.0 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #1)
(rev 02)
0000:00:1d.1 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #2)
(rev 02)
0000:00:1d.2 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #3)
(rev 02)
0000:00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 42)
0000:00:1f.0 ISA bridge: Intel Corporation 82801CA LPC Interface
Controller (rev 02)
0000:00:1f.1 IDE interface: Intel Corporation 82801CA Ultra ATA Storage
Controller (rev 02)
0000:01:1c.0 PIC: Intel Corporation 82870P2 P64H2 I/OxAPIC (rev 04)
0000:01:1d.0 PCI bridge: Intel Corporation 82870P2 P64H2 Hub PCI Bridge
(rev 04)
0000:01:1e.0 PIC: Intel Corporation 82870P2 P64H2 I/OxAPIC (rev 04)
0000:01:1f.0 PCI bridge: Intel Corporation 82870P2 P64H2 Hub PCI Bridge
(rev 04)
0000:02:05.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03)
0000:02:05.1 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03)
0000:03:03.0 Ethernet controller: Intel Corporation 82544GC Gigabit
Ethernet Controller (LOM) (rev 02)
0000:04:01.0 Ethernet controller: Intel Corporation 82540EM Gigabit
Ethernet Controller (rev 02)
0000:04:02.0 VGA compatible controller: ATI Technologies Inc Rage XL
(rev 27)
===

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 7
cpu MHz : 1595.130
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 3193.91
===


/usr/local/samba/sbin/smbd -V
Version 3.0.20

_________________________
/var/log/messages extract
--------------------------


Aug 15 15:01:02 pers109 kernel: Access to block zero: fs: <sdc1> inode:
254474718 start_block : 0 start_off : c0a0b0e8a099
0 blkcnt : 90000 extent-state : 0
Aug 15 15:01:02 pers109 kernel: ------------[ cut here ]------------
Aug 15 15:01:02 pers109 kernel: kernel BUG at <bad filename>:50307!
Aug 15 15:01:02 pers109 kernel: invalid opcode: 0000 [#1]
Aug 15 15:01:02 pers109 kernel: SMP
Aug 15 15:01:02 pers109 kernel: CPU: 0
Aug 15 15:01:02 pers109 kernel: EIP: 0060:[<c0257d64>] Not tainted VLI
Aug 15 15:01:02 pers109 kernel: EFLAGS: 00010246 (2.6.17.8 #1)
Aug 15 15:01:02 pers109 kernel: eax: c0479f84 ebx: c0436464 ecx:
c046c9bc edx: 00000282
Aug 15 15:01:02 pers109 kernel: esi: cea51cb0 edi: c0526120 ebp:
00000000 esp: cea51b70
Aug 15 15:01:02 pers109 kernel: ds: 007b es: 007b ss: 0068
Aug 15 15:01:02 pers109 kernel: Process smbd (pid: 18095,
threadinfo=cea50000 task=c212e0b0)
Aug 15 15:01:02 pers109 kernel: Stack: c04452ac c042855c c0526120
00000282 f7204db0 cea51cb0 00000000 e31d5b00
Aug 15 15:01:02 pers109 kernel: c01fe13d 00000000 c0436464
c49083e0 0f2af9de 00000000 00000000 00000000
Aug 15 15:01:02 pers109 kernel: 0e8a0990 000c0a0b 00090000
00000000 00000000 cea51cb0 00000000 00000000
Aug 15 15:01:02 pers109 kernel: Call Trace:
Aug 15 15:01:02 pers109 kernel: <c01fe13d> <c01ff637>
Aug 15 15:01:02 pers109 kernel: <c0115e51> <c0115e51>
Aug 15 15:01:02 pers109 kernel: <c015987b> <c015a791>
Aug 15 15:01:03 pers109 kernel: <c0140a91> <c0254ff3>
Aug 15 15:01:03 pers109 kernel: <c039c7d2> <c017187e>
Aug 15 15:01:03 pers109 kernel: <c0255653> <c0395c89>
Aug 15 15:01:03 pers109 kernel: <c01696c8> <c0288a3a>
Aug 15 15:01:03 pers109 kernel: <c025091f> <c0157383>
Aug 15 15:01:03 pers109 kernel: <c012d613> <c01574a9>
Aug 15 15:01:03 pers109 kernel: <c015774e> <c01027df>
Aug 15 15:01:03 pers109 kernel: Code: c0 c7 44 24 08 20 61 52 c0 c7 04
24 ac 52 44 c0 89 44 24 04 e8 5b 34 ec ff b8 84 9f
47 c0 8b 54 24 0c e8 bc fa 1a 00 85 ed 75 02 <0f> 0b 83 c4 10 5b 5e 5f
5d c3 55 b8 07 00 00 00 57 bf 20 61 52
Aug 15 15:01:03 pers109 kernel: EIP: [<c0257d64>] SS:ESP 0068:cea51b70


thanks in advance,
martin



2006-08-15 14:31:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: kernel BUG at <bad filename>:50307!

On Tue, 2006-08-15 at 16:27 +0200, Martin Braun wrote:
> Hello all,
>
> I got this bug (see below) in my logs, the system showed with "top" an
> increasing load average of 11 and more but with an cpu-idle of 99% and
> no processes used mentionable resources, there were 6 zombies. A
> shutdown was not possible most of the samba processes didn't respond to
> a kill.
> Before the exception the server was -as usual- under heavy load of samba
> processes 4-5 clients, with many automated activity (batch-processes
> with image processing).
>
> What does this bug mean?

Hi,

it means you don't have CONFIG_KALLSYMS enabled, so the kernel isn't
able to give a decent debugging output in the oops.. if it's a
repeatable oops turning that option on would be a great help to even
figure out which part of the kernel is involved...

Greetings,
Arjan van de Ven
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com

2006-08-16 00:11:56

by Nathan Scott

[permalink] [raw]
Subject: Re: kernel BUG at <bad filename>:50307!

Hi Martin,

On Tue, Aug 15, 2006 at 04:27:22PM +0200, Martin Braun wrote:
> ...
> What does this bug mean?
> ...
> Aug 15 15:01:02 pers109 kernel: Access to block zero: fs: <sdc1> inode:
> 254474718 start_block : 0 start_off : c0a0b0e8a099
> 0 blkcnt : 90000 extent-state : 0
> Aug 15 15:01:02 pers109 kernel: ------------[ cut here ]------------
> Aug 15 15:01:02 pers109 kernel: kernel BUG at <bad filename>:50307!

It means XFS detected ondisk corruption in inode# 254474718, and
paniced your system (stupidly; a fix for this is around, will be
merged with the next mainline update). For me, a more interesting
question is how that inode got into this state... have you had any
crashes recently (i.e. has the filesystem journal needed to be
replayed recently?) Can you send the output of:

# xfs_db -c 'inode 254474718' -c print /dev/sdc1

You'll need to run xfs_repair on that filesystem to fix this up,
but please send us that output first.

thanks.

--
Nathan

2006-08-16 09:05:29

by Martin Braun

[permalink] [raw]
Subject: Re: kernel BUG at <bad filename>:50307!

Hi Nathan,
> It means XFS detected ondisk corruption in inode# 254474718, and
> paniced your system (stupidly; a fix for this is around, will be
> merged with the next mainline update). For me, a more interesting
> question is how that inode got into this state... have you had any
> crashes recently (i.e. has the filesystem journal needed to be
> replayed recently?) Can you send the output of:

We had recently problems with our XFS partition caused by the Kernel-Bug
in 2.6.17. I updated xfsprogs-2.8.10 and repaired the partition with
xfs_repair - it found a corrupted dir-inode (254474253)


>
> # xfs_db -c 'inode 254474718' -c print /dev/sdc1
> You'll need to run xfs_repair on that filesystem to fix this up,
> but please send us that output first.

core.magic = 0x494e
core.mode = 0100774
core.version = 1
core.format = 3 (btree)
core.nlinkv1 = 1
core.uid = 1348
core.gid = 104
core.flushiter = 0
core.atime.sec = Tue Aug 15 15:00:58 2006
core.atime.nsec = 934572500
core.mtime.sec = Tue Aug 15 15:01:02 2006
core.mtime.nsec = 261116500
core.ctime.sec = Tue Aug 15 15:01:02 2006
core.ctime.nsec = 261116500
core.size = 10092544
core.nblocks = 197
core.extsize = 0
core.nextents = 182
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.gen = 9
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[1]
u.bmbt.ptrs[1] = 1:112941297



thanks.
martin