2003-06-22 17:16:53

by Nix

[permalink] [raw]
Subject: 2.4.21 reiserfs oops

An nnrpd just oopsed and took the news filesystem (the only reiserfs fs
on this box) down with it (everything touching it D-states).

Oops:

ksymoops 2.4.6 on i586 2.4.21. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.21/ (default)
-m /boot/System.map (specified)

Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001
Jun 22 13:52:42 loki kernel: c0092df4
Jun 22 13:52:42 loki kernel: c0092df4
Jun 22 13:52:42 loki kernel: *pde = 00000000
Jun 22 13:52:43 loki kernel: Oops: 0000
Jun 22 13:52:43 loki kernel: Oops: 0000
Jun 22 13:52:43 loki kernel: CPU: 0
Jun 22 13:52:43 loki kernel: CPU: 0
Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted
Jun 22 13:52:43 loki kernel: EFLAGS: 00010246
Jun 22 13:52:43 loki kernel: EFLAGS: 00010246
Jun 22 13:52:43 loki kernel: eax: 00000001 ebx: 00000064 ecx: 00000000 edx: 00000001
Jun 22 13:52:43 loki kernel: eax: 00000001 ebx: 00000064 ecx: 00000000 edx: 00000001
Jun 22 13:52:43 loki kernel: esi: c1527b28 edi: c7381ea0 ebp: ffffffff esp: c1527ae4
Jun 22 13:52:43 loki kernel: esi: c1527b28 edi: c7381ea0 ebp: ffffffff esp: c1527ae4
Jun 22 13:52:43 loki kernel: ds: 0018 es: 0018 ss: 0018
Jun 22 13:52:43 loki kernel: ds: 0018 es: 0018 ss: 0018
Jun 22 13:52:43 loki kernel: Process nnrpd (pid: 1993, stackpage=c1527000)
Jun 22 13:52:43 loki kernel: Process nnrpd (pid: 1993, stackpage=c1527000)
Jun 22 13:52:43 loki kernel: Stack: c01a1c02 c1527b28 00000064 c1527c7c 00000000 c7381ea0 c019044b c1527b28
Jun 22 13:52:43 loki kernel: Stack: c01a1c02 c1527b28 00000064 c1527c7c 00000000 c7381ea0 c019044b c1527b28
Jun 22 13:52:43 loki kernel: 00000000 00000000 00000001 ffffffff c1527c7c 00000064 00000000 c1527c38
Jun 22 13:52:43 loki kernel: 00000000 00000000 00000001 ffffffff c1527c7c 00000064 00000000 c1527c38
Jun 22 13:52:43 loki kernel: 00000000 c1527c7c c7381ea0 c23af7a0 00000008 c01907a0 c1527c7c 00000064
Jun 22 13:52:43 loki kernel: 00000000 c1527c7c c7381ea0 c23af7a0 00000008 c01907a0 c1527c7c 00000064
Jun 22 13:52:43 loki kernel: Call Trace: [<c01a1c02>] [<c019044b>] [<c01907a0>] [<c019bdec>] [<c019c59d>]
Jun 22 13:52:43 loki kernel: Call Trace: [<c01a1c02>] [<c019044b>] [<c01907a0>] [<c019bdec>] [<c019c59d>]
Jun 22 13:52:43 loki kernel: [<c019c799>] [<c0193150>] [<c01a6bdf>] [<c01a6f7b>] [<c01a72f9>] [<c01a67db>]
Jun 22 13:52:43 loki kernel: [<c019c799>] [<c0193150>] [<c01a6bdf>] [<c01a6f7b>] [<c01a72f9>] [<c01a67db>]
Jun 22 13:52:43 loki kernel: [<c01954ee>] [<c01954a0>] [<c0140d93>] [<c013e996>] [<c012f104>] [<c012e135>]
Jun 22 13:52:43 loki kernel: [<c01954ee>] [<c01954a0>] [<c0140d93>] [<c013e996>] [<c012f104>] [<c012e135>]
Jun 22 13:52:43 loki kernel: [<c0114ec8>] [<c0115467>] [<c011561f>] [<c0106c23>]
Jun 22 13:52:44 loki kernel: [<c0114ec8>] [<c0115467>] [<c011561f>] [<c0106c23>]
Jun 22 13:52:44 loki kernel: Code: 0b 08 00 00 19 00 00 00 53 b0 2f 00 00 00 00 00 00 00 00 00


>>EIP; c0092df4 Before first symbol <=====
>>EIP; c0092df4 Before first symbol <=====

>>esi; c1527b28 <_end+11e54e4/8512a1c>
>>edi; c7381ea0 <_end+703f85c/8512a1c>
>>esp; c1527ae4 <_end+11e54a0/8512a1c>
>>esi; c1527b28 <_end+11e54e4/8512a1c>
>>edi; c7381ea0 <_end+703f85c/8512a1c>
>>esp; c1527ae4 <_end+11e54a0/8512a1c>

Trace; c01a1c02 <leaf_delete_items+3e/158>
Trace; c019044b <balance_leaf_when_delete+6b/37c>
Trace; c01907a0 <balance_leaf+44/2698>
Trace; c019bdec <dc_check_balance_internal+2c8/568>
Trace; c019c59d <clear_all_dirty_bits+11/18>
Trace; c01a1c02 <leaf_delete_items+3e/158>
Trace; c019044b <balance_leaf_when_delete+6b/37c>
Trace; c01907a0 <balance_leaf+44/2698>
Trace; c019bdec <dc_check_balance_internal+2c8/568>
Trace; c019c59d <clear_all_dirty_bits+11/18>
Trace; c019c799 <wait_tb_buffers_until_unlocked+1f5/2a4>
Trace; c0193150 <do_balance+84/fc>
Trace; c01a6bdf <reiserfs_cut_from_item+97/45c>
Trace; c01a6f7b <reiserfs_cut_from_item+433/45c>
Trace; c01a72f9 <reiserfs_do_truncate+2fd/430>
Trace; c01a67db <reiserfs_delete_object+23/50>
Trace; c019c799 <wait_tb_buffers_until_unlocked+1f5/2a4>
Trace; c0193150 <do_balance+84/fc>
Trace; c01a6bdf <reiserfs_cut_from_item+97/45c>
Trace; c01a6f7b <reiserfs_cut_from_item+433/45c>
Trace; c01a72f9 <reiserfs_do_truncate+2fd/430>
Trace; c01a67db <reiserfs_delete_object+23/50>
Trace; c01954ee <reiserfs_delete_inode+4e/98>
Trace; c01954a0 <reiserfs_delete_inode+0/98>
Trace; c0140d93 <iput+113/208>
Trace; c013e996 <dput+e6/144>
Trace; c012f104 <fput+bc/e0>
Trace; c012e135 <filp_close+59/64>
Trace; c01954ee <reiserfs_delete_inode+4e/98>
Trace; c01954a0 <reiserfs_delete_inode+0/98>
Trace; c0140d93 <iput+113/208>
Trace; c013e996 <dput+e6/144>
Trace; c012f104 <fput+bc/e0>
Trace; c012e135 <filp_close+59/64>
Trace; c0114ec8 <put_files_struct+54/bc>
Trace; c0115467 <do_exit+af/240>
Trace; c011561f <sys_exit+f/10>
Trace; c0106c23 <system_call+33/40>
Trace; c0114ec8 <put_files_struct+54/bc>
Trace; c0115467 <do_exit+af/240>
Trace; c011561f <sys_exit+f/10>
Trace; c0106c23 <system_call+33/40>

Code; c0092df4 Before first symbol
00000000 <_EIP>:
Code; c0092df4 Before first symbol <=====
0: 0b 08 or (%eax),%ecx <=====
Code; c0092df6 Before first symbol
2: 00 00 add %al,(%eax)
Code; c0092df8 Before first symbol
4: 19 00 sbb %eax,(%eax)
Code; c0092dfa Before first symbol
6: 00 00 add %al,(%eax)
Code; c0092dfc Before first symbol
8: 53 push %ebx
Code; c0092dfd Before first symbol
9: b0 2f mov $0x2f,%al

Jun 22 13:52:44 loki kernel: Code: 0b 08 00 00 19 00 00 00 53 b0 2f 00 00 00 00 00 00 00 00 00


Code; c0092df4 Before first symbol
00000000 <_EIP>:
Code; c0092df4 Before first symbol
0: 0b 08 or (%eax),%ecx
Code; c0092df6 Before first symbol
2: 00 00 add %al,(%eax)
Code; c0092df8 Before first symbol
4: 19 00 sbb %eax,(%eax)
Code; c0092dfa Before first symbol
6: 00 00 add %al,(%eax)
Code; c0092dfc Before first symbol
8: 53 push %ebx
Code; c0092dfd Before first symbol
9: b0 2f mov $0x2f,%al


Modules:

Module Size Used by Not tainted
opl3 11040 0 (unused)
sb 7348 0
sb_lib 32334 0 [sb]
uart401 6020 0 [sb_lib]
sound 51692 0 [opl3 sb_lib uart401]
nls_iso8859-1 2844 1 (autoclean)
nls_cp437 4348 1 (autoclean)


.config:

CONFIG_X86=y
CONFIG_UID16=y
CONFIG_EXPERIMENTAL=y
CONFIG_MODULES=y
CONFIG_KMOD=y
CONFIG_M586MMX=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_USE_STRING_486=y
CONFIG_X86_ALIGNMENT_16=y
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_MCE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_NOHIGHMEM=y
CONFIG_X86_TSC=y
CONFIG_NET=y
CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_ISA=y
CONFIG_PCI_NAMES=y
CONFIG_SYSVIPC=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_SYSCTL=y
CONFIG_KCORE_ELF=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_PM=y
CONFIG_APM=y
CONFIG_APM_RTC_IS_GMT=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_PC_CML1=m
CONFIG_PARPORT_PC_FIFO=y
CONFIG_PNP=y
CONFIG_ISAPNP=y
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_DEV_LOOP=m
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_IDEDMA_AUTO=y
CONFIG_BLK_DEV_IDE_MODES=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SD_EXTRA_DEVS=10
CONFIG_BLK_DEV_SR=y
CONFIG_SR_EXTRA_DEVS=2
CONFIG_CHR_DEV_SG=y
CONFIG_SCSI_DEBUG_QUEUES=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_SYM53C8XX_2=y
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_NET_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
CONFIG_PLIP=m
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256
CONFIG_I2C=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_PROC=y
CONFIG_MOUSE=y
CONFIG_PSMOUSE=y
CONFIG_RTC=y
CONFIG_QUOTA=y
CONFIG_REISERFS_FS=y
CONFIG_EXT3_FS=y
CONFIG_JBD=y
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_UMSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_MINIX_FS=m
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_EXT2_FS=y
CONFIG_INTERMEZZO_FS=m
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_PARTITION_ADVANCED=y
CONFIG_MSDOS_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_15=m
CONFIG_VGA_CONSOLE=y
CONFIG_VIDEO_SELECT=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS=m
CONFIG_SOUND_SB=m
CONFIG_SOUND_AWE32_SYNTH=m
CONFIG_SOUND_YM3812=m
CONFIG_USB=y
CONFIG_USB_UHCI_ALT=y
CONFIG_USB_STORAGE=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m

--
`It is an unfortunate coincidence that the date locarchive.h was
written (in hex) matches Ritchie's birthday (in octal).'
-- Roland McGrath on the libc-alpha list


2003-06-23 09:39:52

by Oleg Drokin

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

Hello!

On Sun, Jun 22, 2003 at 03:00:20PM +0100, Nix wrote:

> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001

This is very strange address to oops on.

> Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted

And the EIP is prior to kernel start which is also very strange.
On the other hand the address c0192df4 is somewhere inside reiserfs code,
so it looks like a single bit error, I'd say.

Can you run memtest86 for some time to verify that your RAM is OK?
(hm, and the oops got twice to the logs which is pretty strange thing, too,
never seen anything like this).

Bye,
Oleg

2003-06-23 22:19:40

by Nix

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

On Mon, 23 Jun 2003, Oleg Drokin said:
> Hello!
>
> On Sun, Jun 22, 2003 at 03:00:20PM +0100, Nix wrote:
>
>> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001
>
> This is very strange address to oops on.

I'll say! Looks almost like it JMPed to a null pointer or something.

>> Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted
>
> And the EIP is prior to kernel start which is also very strange.
> On the other hand the address c0192df4 is somewhere inside reiserfs code,
> so it looks like a single bit error, I'd say.

I think it unlikely to be RAM problems given that the problem happened
shortly after upgrading to 2.4.21; this was about half a day after I
rebooted it because it threw a pile of never-seen-again, un-syslogged
SCSI abort errors at me (sym53c875); and *that* was a few minutes after
I rebooted into 2.4.21 for the first time.

All my other boxes love 2.4.21, but this one dislikes it. (Of course it
has to be my second-most-critical server... ah well, the NFS problems
in 2.4.20 bit my most critical server and my home directory both, so
I guess this is less unpleasant.)

> Can you run memtest86 for some time to verify that your RAM is OK?

Did that last night; no problems reported. (Not really surprising.)

> (hm, and the oops got twice to the logs which is pretty strange thing, too,
> never seen anything like this).

That's my weirdly broken syslog config. I've never got around to fixing
it; it only happens with kernel messages and I don't get all that many
of those.

--
`It is an unfortunate coincidence that the date locarchive.h was
written (in hex) matches Ritchie's birthday (in octal).'
-- Roland McGrath on the libc-alpha list

2003-06-24 05:17:27

by Oleg Drokin

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

Hello!

On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:

> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001
> > This is very strange address to oops on.
> I'll say! Looks almost like it JMPed to a null pointer or something.

No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

> >> Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted
> > And the EIP is prior to kernel start which is also very strange.
> > On the other hand the address c0192df4 is somewhere inside reiserfs code,
> > so it looks like a single bit error, I'd say.
> I think it unlikely to be RAM problems given that the problem happened
> shortly after upgrading to 2.4.21; this was about half a day after I
> rebooted it because it threw a pile of never-seen-again, un-syslogged
> SCSI abort errors at me (sym53c875); and *that* was a few minutes after
> I rebooted into 2.4.21 for the first time.

Hm, so first there were some scsi problems and then reiserfs oops?

Actually since the RAM is good, I see no good reason for this to happen.
(actually I see no good reason for valid code before _text, either).

I wonder if 2.4.21 constantly crashes like that for you, then?

Bye,
Oleg

2003-06-24 20:46:34

by Nix

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

On Tue, 24 Jun 2003, Oleg Drokin moaned:
> Hello!
>
> On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
>
>> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001
>> > This is very strange address to oops on.
>> I'll say! Looks almost like it JMPed to a null pointer or something.
>
> No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
not somewhere that even a memory error would make us likely to jump to.

>> >> Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted
>> > And the EIP is prior to kernel start which is also very strange.
>> > On the other hand the address c0192df4 is somewhere inside reiserfs code,
>> > so it looks like a single bit error, I'd say.
>> I think it unlikely to be RAM problems given that the problem happened
>> shortly after upgrading to 2.4.21; this was about half a day after I
>> rebooted it because it threw a pile of never-seen-again, un-syslogged
>> SCSI abort errors at me (sym53c875); and *that* was a few minutes after
>> I rebooted into 2.4.21 for the first time.
>
> Hm, so first there were some scsi problems and then reiserfs oops?

Different boots. I upgraded, the first boot crashed within five minutes
with weird SCSI errors, so I rebooted again and this happened six hours
later.

I'm willing to write off the SCSI errors to the shock effect of having
just been powered down for the first time in a year (the shutdown
scripts didn't quite work and the reset button is disconnected).

>> Actually since the RAM is good, I see no good reason for this to happen.
>> (actually I see no good reason for valid code before _text, either).
>
> I wonder if 2.4.21 constantly crashes like that for you, then?

No obvious sign of it:

9:34pm up 1 day 22:30, 14 users, load average: 0.09, 0.12, 0.16

(it is of course waiting until I am hundreds of miles away. *Then* it'll
crash.)

--
`It is an unfortunate coincidence that the date locarchive.h was
written (in hex) matches Ritchie's birthday (in octal).'
-- Roland McGrath on the libc-alpha list

2003-06-24 20:55:42

by Chris Mason

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

On Tue, 2003-06-24 at 16:34, Nix wrote:
> On Tue, 24 Jun 2003, Oleg Drokin moaned:
> > Hello!
> >
> > On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
> >
> >> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000001
> >> > This is very strange address to oops on.
> >> I'll say! Looks almost like it JMPed to a null pointer or something.
> >
> > No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.
>
> JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
> not somewhere that even a memory error would make us likely to jump to.
>
> >> >> Jun 22 13:52:43 loki kernel: EIP: 0010:[<c0092df4>] Not tainted

The EIP isn't zero or 1, you've got a bad null pinter dereference at
address 1. You get this when you do something like *(char *)1 =
some_val.

The ram is most likely bad, you're 1 bit away from zero, but you might
try a reiserfsck on any drives affected by the scsi errors.

-chris


2003-06-25 21:47:51

by Nix

[permalink] [raw]
Subject: Re: 2.4.21 reiserfs oops

[reiserfs excised as this is no longer quite on-topic]

On 24 Jun 2003, Chris Mason said:
> The EIP isn't zero or 1, you've got a bad null pinter dereference at
> address 1. You get this when you do something like *(char *)1 =
> some_val.

Ah. That does sound like either bad RAM or scattershot corruption.

> The ram is most likely bad, you're 1 bit away from zero, but you might
> try a reiserfsck on any drives affected by the scsi errors.

Done that, no problems found. (Most drives are e3fs.)

2.4.21's crashed once more since then, network driver death; I'm
probably going to revert to 2.4.20 on that box, which ran perfectly
since three days past 2.4.20's release. (However, several other boxes
had problems with 2.4.20 which were cleared up with .21... you win some,
you lose some. I'd say this has been a net gain.)

I'll see if 2.4.20 has the same problems; if it does, then I agree, the
RAM's probably had it.

--
`It is an unfortunate coincidence that the date locarchive.h was
written (in hex) matches Ritchie's birthday (in octal).'
-- Roland McGrath on the libc-alpha list