2002-01-08 19:58:43

by Ville Herva

Subject: 2.2.21pre2 oops

I got the following oops while stress testing 2.2.21pre2 and the ide subsystem.

The kernel has ide, raid and e2compr patches applied. Of those, ide and raid
were in use at the time of the oops - no e2compr'ed fs had been mounted after
boot.

It's noteworthy that I have done a LOT of testing with _very_ similar work
loads on 2.2.20 + ide + raid + e2compr and have seen no oopses. The only
difference between this kernel and the 2.2.20 one was the -pre2 patch.

So while this most likely is a merging error on my part, or just some
incompatibility between the patches, I figured you might want to take a look.

When the oops happened, I was reading two IDE drives in raid0. This was
essentially a cat /dev/md0 > /dev/null kind of test to stress the Via KT133
pci transfers.
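
For readers who want to reproduce this kind of load without a shell one-liner, here is a minimal sketch of a sequential read that discards its data, much like the cat to /dev/null described above. The function name `sequential_read` and the 1 MiB block size are assumptions made for illustration; they are not part of the original test setup.

```python
import time


def sequential_read(path, block_size=1 << 20):
    """Read `path` from start to end, discarding the data.

    Returns (bytes_read, elapsed_seconds). `path` may be a regular file
    or a block device such as /dev/md0 (reading a device needs root).
    """
    total = 0
    start = time.monotonic()
    # buffering=0 gives an unbuffered raw file object, so each read()
    # goes straight to the OS, as it would for cat.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:  # empty bytes object means end of file/device
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Dividing the returned byte count by the elapsed time gives a rough throughput figure to compare between runs.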

Rootfs is on ide cdrom, the harddrives had no fs on them.

ksymoops 0.7c on i686 2.2.21pre2-ide+e2compr+raid. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.2.21pre2-ide+e2compr+raid+gibbs+patches/ (default)
-m ./System.map (specified)

kmem_free: Bad obj addr (objp=c1a0c420, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 0000003d ebx: c1a0c420 ecx: ffffffff edx: 0000003c
esi: cffef740 edi: 00000282 ebp: e82e0b92 esp: cffd5f68
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 4, process nr: 4, stackpage=cffd5000)
Stack: c756dea0 c02bb728 c1a0c47c cffd5f80 c0126d35 cffef740 c1a0c420 c1a0c420
c756dea0 c012777b c1a0c420 c1a0c420 c02bb728 0000332f cffd4000 00000030
c011c6a7 c02bb728 00000030 00000004 00000005 00000030 0008e000 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c01eec2e>] [<c01215d3>] [<c0106000>]
[<c010749f>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c01eec2e <tvecs+182e/31a0>
Trace; c01215d3 <kswapd+63/98>
Trace; c0106000 <get_options+0/74>
Trace; c010749f <kernel_thread+23/30>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e

kmem_free: Bad obj addr (objp=c1a0ca20, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 0f073000, %cr3 = 0f073000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
EFLAGS: 00010282
eax: 0000003d ebx: c1a0ca20 ecx: c02276e8 edx: 00000021
esi: cffef740 edi: 00000282 ebp: e82e0b92 esp: cf74bc40
ds: 0018 es: 0018 ss: 0018
Process wrchk (pid: 690, process nr: 25, stackpage=cf74b000)
Stack: c7af2d40 c02c2ff0 c1a0ca7c ffffff0a c0126d35 cffef740 c1a0ca20 c1a0ca20
c7af2d40 c012777b c1a0ca20 c1a0ca20 c02c2ff0 0000332f cf74a000 00000005
c011c6a7 c02c2ff0 00000005 0000000b 00000005 00000005 00000900 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c0121fb2>] [<c01275fc>] [<c012679e>]
[<c012695a>] [<c0129ac1>] [<c0188f2d>] [<c0124ece>] [<c0108924>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c0121fb2 <__get_free_pages+9a/2ac>
Trace; c01275fc <grow_buffers+3c/fc>
Trace; c012679e <refill_freelist+a/38>
Trace; c012695a <getblk+11e/144>
Trace; c0129ac1 <block_read+2c1/4f4>
Trace; c0188f2d <md_read+41/48>
Trace; c0124ece <sys_read+ae/c4>
Trace; c0108924 <system_call+34/38>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e

kmem_free: Bad obj addr (objp=c1a0c480, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 0f97f000, %cr3 = 0f97f000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
EFLAGS: 00010282
eax: 0000003d ebx: c1a0c480 ecx: c02276e8 edx: 00000021
esi: cffef740 edi: 00000282 ebp: e82e0b92 esp: cebbbc38
ds: 0018 es: 0018 ss: 0018
Process wrchk (pid: 795, process nr: 17, stackpage=cebbb000)
Stack: c1a0c480 c02f0b80 c1a0c4dc c967df00 c0126d35 cffef740 c1a0c480 c1a0c480
c1a0c480 c012777b c1a0c480 c1a0c480 c02f0b80 0000332e cebba000 00000005
c011c6a7 c02f0b80 00000005 00000008 00000005 00000005 00000900 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c0121fb2>] [<c01275fc>] [<c012679e>]
[<c012695a>] [<c012947d>] [<c012a7f6>] [<c012afc0>] [<c011af63>] [<c011afa4>] [<c0112e37>] [<c012bf7c>]
[<c012c0ff>] [<c01424a5>] [<c012bf7c>] [<c012c0ff>] [<c012bd73>] [<c0125633>] [<c01ef8ad>] [<c01258f6>]
[<c0188f75>] [<c0124fc9>] [<c0188f34>] [<c0108924>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c0121fb2 <__get_free_pages+9a/2ac>
Trace; c01275fc <grow_buffers+3c/fc>
Trace; c012679e <refill_freelist+a/38>
Trace; c012695a <getblk+11e/144>
Trace; c012947d <block_write+1b5/538>
Trace; c012a7f6 <read_exec+c2/13c>
Trace; c012afc0 <search_binary_handler+60/120>
Trace; c011af63 <do_anonymous_page+73/84>
Trace; c011afa4 <do_no_page+30/c4>
Trace; c0112e37 <update_process_times+5b/64>
Trace; c012bf7c <do_follow_link+9c/a8>
Trace; c012c0ff <lookup_dentry+177/200>
Trace; c01424a5 <ext2_follow_link+5d/78>
Trace; c012bf7c <do_follow_link+9c/a8>
Trace; c012c0ff <lookup_dentry+177/200>
Trace; c012bd73 <permission+27/2c>
Trace; c0125633 <get_blkfops+1b/20>
Trace; c01ef8ad <tvecs+24ad/31a0>
Trace; c01258f6 <blkdev_open+32/40>
Trace; c0188f75 <md_write+41/48>
Trace; c0124fc9 <sys_write+e5/118>
Trace; c0188f34 <md_write+0/48>
Trace; c0108924 <system_call+34/38>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e

kmem_free: Bad obj addr (objp=c1a0c300, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03620000, %cr3 = 03620000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
EFLAGS: 00010286
eax: 0000003d ebx: c1a0c300 ecx: c02276e8 edx: 00000021
esi: cffef740 edi: 00000286 ebp: e82e0b92 esp: c5683d6c
ds: 0018 es: 0018 ss: 0018
Process sh (pid: 1892, process nr: 34, stackpage=c5683000)
Stack: c2d31ea0 c031cb68 c1a0c35c 00000000 c0126d35 cffef740 c1a0c300 c1a0c300
c2d31ea0 c012777b c1a0c300 c1a0c300 c031cb68 0000332e c5682000 00000013
c011c6a7 c031cb68 00000013 00000002 00000005 00000013 00000ff2 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c0121fb2>] [<c012a580>] [<c012b1cc>]
[<c0107923>] [<c0108924>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c0121fb2 <__get_free_pages+9a/2ac>
Trace; c012a580 <copy_strings+114/1c0>
Trace; c012b1cc <do_execve+14c/224>
Trace; c0107923 <sys_execve+2f/58>
Trace; c0108924 <system_call+34/38>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e

kmem_free: Bad obj addr (objp=c1a0cba0, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03620000, %cr3 = 03620000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
EFLAGS: 00010286
eax: 0000003d ebx: c1a0cba0 ecx: c02276e8 edx: 00000021
esi: cffef740 edi: 00000286 ebp: e82e0b92 esp: c5683d6c
ds: 0018 es: 0018 ss: 0018
Process sh (pid: 1912, process nr: 34, stackpage=c5683000)
Stack: c1a0cba0 c031cd70 c1a0cbfc 00000000 c0126d35 cffef740 c1a0cba0 c1a0cba0
c1a0cba0 c012777b c1a0cba0 c1a0cba0 c031cd70 0000332f c5682000 00000013
c011c6a7 c031cd70 00000013 0000001a 00000005 00000013 00000ff2 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c0121fb2>] [<c012a580>] [<c012b1cc>]
[<c0107923>] [<c0108924>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c0121fb2 <__get_free_pages+9a/2ac>
Trace; c012a580 <copy_strings+114/1c0>
Trace; c012b1cc <do_execve+14c/224>
Trace; c0107923 <sys_execve+2f/58>
Trace; c0108924 <system_call+34/38>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e

kmem_free: Bad obj addr (objp=c1a0cb40, name=buffer_head)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03620000, %cr3 = 03620000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0120871>]
EFLAGS: 00010282
eax: 0000003d ebx: c1a0cb40 ecx: c02276e8 edx: 00000021
esi: cffef740 edi: 00000282 ebp: e82e0b92 esp: c5683d6c
ds: 0018 es: 0018 ss: 0018
Process sh (pid: 1915, process nr: 34, stackpage=c5683000)
Stack: c1a0c000 c031cde8 c1a0cb9c 00000000 c0126d35 cffef740 c1a0cb40 c1a0cb40
c1a0c000 c012777b c1a0cb40 c1a0cb40 c031cde8 0000332d c5682000 00000013
c011c6a7 c031cde8 00000013 00000020 00000005 00000013 00000ff2 c012150a
Call Trace: [<c0126d35>] [<c012777b>] [<c011c6a7>] [<c012150a>] [<c0121fb2>] [<c012a580>] [<c012b1cc>]
[<c0107923>] [<c0108924>]
Code: c7 05 00 00 00 00 00 00 00 00 eb 12 8d 76 00 56 53 68 3e ea

>>EIP; c0120871 <kmem_cache_free+14d/174> <=====
Trace; c0126d35 <put_unused_buffer_head+21/4c>
Trace; c012777b <try_to_free_buffers+3b/a4>
Trace; c011c6a7 <shrink_mmap+103/160>
Trace; c012150a <try_to_free_pages+26/8c>
Trace; c0121fb2 <__get_free_pages+9a/2ac>
Trace; c012a580 <copy_strings+114/1c0>
Trace; c012b1cc <do_execve+14c/224>
Trace; c0107923 <sys_execve+2f/58>
Trace; c0108924 <system_call+34/38>
Code; c0120871 <kmem_cache_free+14d/174>
00000000 <_EIP>:
Code; c0120871 <kmem_cache_free+14d/174> <=====
0: c7 05 00 00 00 00 00 movl $0x0,0x0 <=====
Code; c0120878 <kmem_cache_free+154/174>
7: 00 00 00
Code; c012087b <kmem_cache_free+157/174>
a: eb 12 jmp 1e <_EIP+0x1e> c012088f <kmem_cache_free+16b/174>
Code; c012087d <kmem_cache_free+159/174>
c: 8d 76 00 lea 0x0(%esi),%esi
Code; c0120880 <kmem_cache_free+15c/174>
f: 56 push %esi
Code; c0120881 <kmem_cache_free+15d/174>
10: 53 push %ebx
Code; c0120882 <kmem_cache_free+15e/174>
11: 68 3e ea 00 00 push $0xea3e



3 warnings issued. Results may not be reliable.
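
One pattern in the six "Bad obj addr" messages above is easy to miss by eye, so here is a small check of it. It is only arithmetic on the objp values printed in the dumps; the 4 KiB page size is the usual i386 value and is assumed here, and nothing below says why the objects were rejected.

```python
PAGE_SIZE = 4096  # assumption: standard i386 page size

# objp values copied from the six "kmem_free: Bad obj addr" lines above
objps = [0xC1A0C420, 0xC1A0CA20, 0xC1A0C480,
         0xC1A0C300, 0xC1A0CBA0, 0xC1A0CB40]

# Page base and in-page offset of every rejected object
pages = {addr & ~(PAGE_SIZE - 1) for addr in objps}
offsets = [addr & (PAGE_SIZE - 1) for addr in objps]

for addr, off in zip(objps, offsets):
    print(f"objp={addr:08x}  page={addr - off:08x}  offset={off:#05x}")

print("distinct pages:", sorted(hex(p) for p in pages))
print("all offsets multiples of 0x60:", all(off % 0x60 == 0 for off in offsets))
```

All six addresses land in the single page at c1a0c000, at offsets that are multiples of 0x60. That would fit one damaged slab page of buffer_head objects rather than six unrelated failures, though the dumps alone do not prove it.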


-- v --



2002-01-08 20:05:13

by Alan

Subject: Re: 2.2.21pre2 oops

> essentially cat /dev/md0 > /dev/null kind of test to stress the Via KT133
> pci transfers.
>
> Rootfs is on ide cdrom, the harddrives had no fs on them.
>
> ksymoops 0.7c on i686 2.2.21pre2-ide+e2compr+raid. Options
> used

Can you repeat the test to make sure it's replicable, then repeat it again
after disabling the new VIA fixups in pci/quirks.c?

2002-01-08 20:13:54

by Ville Herva

Subject: Re: 2.2.21pre2 oops

On Tue, Jan 08, 2002 at 08:16:03PM +0000, you [Alan Cox] claimed:
> > essentially cat /dev/md0 > /dev/null kind of test to stress the Via KT133
> > pci transfers.
> >
> > Rootfs is on ide cdrom, the harddrives had no fs on them.
> >
> > ksymoops 0.7c on i686 2.2.21pre2-ide+e2compr+raid. Options
> > used
>
> Can you repeat the test to make sure its replicable, then repeat it again
> after disabling the new VIA fixups in pci/quirks.c

The test has been repeated several times even with 2.2.21pre2 (although
we've run a lot more 2.2.20 tests). This was the first time we saw an oops.
The difference between this and the former 2.2.21pre2 runs is certain bios
settings. (We are still trying to isolate the one setting that triggers the
Via pci transfer corruption on HPT reads.) We'll repeat the test with these
settings and try to see if it is via bios settings / pci/quirks.c related.

There seems to be _something_ fishy in the pre2 quirks, since there is at
least one bios setting combination with which 2.2.20 does not show the pci
corruption, but 2.2.21pre2 does. It's just that it is really tedious to
isolate. But we are working on it.


-- v --


2002-01-09 12:46:24

by Ville Herva

Subject: Re: 2.2.21pre2 oops

On Tue, Jan 08, 2002 at 10:13:15PM +0200, you [Ville Herva] claimed:
> On Tue, Jan 08, 2002 at 08:16:03PM +0000, you [Alan Cox] claimed:
> > > essentially cat /dev/md0 > /dev/null kind of test to stress the Via KT133
> > > pci transfers.
> > >
> > > Rootfs is on ide cdrom, the harddrives had no fs on them.
> > >
> > > ksymoops 0.7c on i686 2.2.21pre2-ide+e2compr+raid. Options
> > > used
> >
> > Can you repeat the test to make sure its replicable, then repeat it again
> > after disabling the new VIA fixups in pci/quirks.c
>
> The test has been repeated several times even with 2.2.21pre2 (although
> we've run a lot more 2.2.20 tests). This was the first time we saw an oops.
> The difference between this and the former 2.2.21pre2 runs is certain bios
> settings. (We are still trying to isolate the one setting that triggers the
> Via pci transfer corruption on HPT reads.) We'll repeat the test with these
> settings and try to see if it is via bios settings / pci/quirks.c related.
>
> There seems to be _something_ fishy in the pre2 quirks, since there is at
> least one bios setting combination with which 2.2.20 does not show the pci
> corruption, but 2.2.21pre2 does. It just that it is really tedious to
> isolate. But we are working on it.

Bleah.

It turned out that the mere hpt370 read/write test hadn't caused it. My
colleague had launched "ping -f" in the background, which had immediately
triggered the oops. (When I found the oops on the screen, I initially thought
he had just left the hpt370 read/write test running and left.)

We booted and tried to reproduce it. ping -f didn't immediately trigger it,
but after a while it happened. We got a number of oopses, one of which was
similar to the first one and one of which showed process table corruption
(the name of the process in the oops was a random ascii pattern).

We also got the oops with 2.2.20+patches, so this is not a pre2 thing.
Rather, the difference is that we now ran ping -f in the background.

The bad news is that all the bios setting configurations we thought stable
(that had run the hpt370 read/write test without a hitch for days) now give
oopses and corruption pretty quickly when we run ping -f in the background :(.

Also, ping -f shows "...EEE.EE.EEE.." which I gather means the packets get
corrupted somewhere.

I'm not too hopeful regarding finding a set of bios settings that would fix
this. It seems the "stable" configuration we found just hid the problem, but
when we push the board further, it appears again.

The two disks on the HPT370 read in parallel give about 60MB/s. Add the 10MB/s
from the 3c905 to that, and we are pretty close to the 75MB/s number that I've
seen referred to somewhere(1) as the maximum the Via KT133 can do.
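
The arithmetic behind "pretty close" can be spelled out in a few lines. The three figures are the ones quoted above; the variable names are made up for this sketch, and the 75MB/s ceiling is only the number from the referenced article, not something measured here.

```python
# Figures quoted in the paragraph above, in MB/s
hpt370_read = 60      # two disks on the HPT370 read in parallel
nic_traffic = 10      # 3c905 under ping -f
kt133_ceiling = 75    # maximum cited for the Via KT133

combined = hpt370_read + nic_traffic
headroom = kt133_ceiling - combined
utilisation = combined / kt133_ceiling

print(f"combined load: {combined} MB/s")
print(f"headroom:      {headroom} MB/s")
print(f"utilisation:   {utilisation:.1%}")
```

So the combined load sits at roughly 93% of the cited ceiling, leaving about 5MB/s of headroom, if that ceiling figure is to be trusted.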

My conclusion at this point is that Via KT133 / Abit KT7-RAID pci transfer
is positively FUBAR, and no sane person should touch the bugger with a ten
foot pole. I'd be happy to be proven wrong, though.


-- v --


(1) http://www.tecchannel.de/hardware/817/1.html

2002-01-09 15:26:26

by Ville Herva

Subject: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Wed, Jan 09, 2002 at 02:45:49PM +0200, you [Ville Herva] claimed:
>
> We also got the oops with 2.2.20+patches, so this is not a pre2 thing.
> Rather, the difference is that we now ran ping -f on background.
>
> The bad news is that all the bios setting configurations we thought stable
> (that had run the hpt370 read/write test without a hitch for days) now give
> oopses and corruption pretty quickly when we run ping -f on background :(.
>
> Also, ping -f shows "...EEE.EE.EEE.." which I gather means the packets get
> corrupted somewhere.

It also happens with _pristine_ 2.4.18pre2. I ran

cat /dev/hde > /dev/null& cat /dev/hdg > /dev/null& ping -f -s 64000 box2

in single user mode. (hde and hdg are Samsung 80GB disks on the HPT370, eth0 is
a 3c905.) After just a few seconds I got the following oops:

Unable to handle kernel paging request at virtual address 1d292ee9
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0131ce0>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010203
eax: 00000000 ebx: 1d292ed1 ecx: 000001d0 edx: 00000000
esi: 00000000 edi: cf12e940 ebp: c10acb80 esp: c1433f0c
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c1433000)
Stack: c10acb80 cf12ef40 cf12e940 c0131e37 cf12e940 c10acb80 000001d0 00000017
00000200 c013066a c10acb80 000001d0 00000000 c10acb80 c01282f7 c10acb80
000001d0 00000020 000001d0 00000020 00000006 00000006 000022a5 000001d0
Call Trace: [<c0131e37>] [<c013066a>] [<c01282f7>] [<c012852b>] [<c012859c>]
[<c0128653>] [<c01286c6>] [<c01287e7>] [<c0105000>] [<c0105523>]
Code: f6 43 18 06 0f 84 7f 00 00 00 b8 07 00 00 00 0f ab 43 18 19

>>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
Trace; c0131e37 <try_to_free_buffers+b7/e0>
Trace; c013066a <try_to_release_page+3a/40>
Trace; c01282f7 <shrink_cache+1b7/2c0>
Trace; c012852b <shrink_caches+5b/90>
Trace; c012859c <try_to_free_pages+3c/60>
Trace; c0128653 <kswapd_balance_pgdat+53/b0>
Trace; c01286c6 <kswapd_balance+16/30>
Trace; c01287e7 <kswapd+a7/d0>
Trace; c0105000 <_stext+0/0>
Trace; c0105523 <kernel_thread+23/30>
Code; c0131ce0 <sync_page_buffers+10/b0>
00000000 <_EIP>:
Code; c0131ce0 <sync_page_buffers+10/b0> <=====
0: f6 43 18 06 testb $0x6,0x18(%ebx) <=====
Code; c0131ce4 <sync_page_buffers+14/b0>
4: 0f 84 7f 00 00 00 je 89 <_EIP+0x89> c0131d69 <sync_page_buffers+99/b0>
Code; c0131cea <sync_page_buffers+1a/b0>
a: b8 07 00 00 00 mov $0x7,%eax
Code; c0131cef <sync_page_buffers+1f/b0>
f: 0f ab 43 18 bts %eax,0x18(%ebx)
Code; c0131cf3 <sync_page_buffers+23/b0>
13: 19 00 sbb %eax,(%eax)


Which is pretty similar to the 2.2.20 oopses (here's one):

Unable to handle kernel paging request at virtual address 4d7ebf3e
current->tss.cr3 = 0e912000, %cr3 = 0e912000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0120631>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010006
eax: ccf0efe0 ebx: ccf0efe0 ecx: 4d7ebf3e edx: ccf0ee00
esi: 00000800 edi: cffef740 ebp: 00000282 esp: cf05fc9c
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 708, process nr: 29, stackpage=cf05f000)
Stack: 00000000 00000400 c0126db5 cffef740 00000005 ccf0ee00 00000000 c0126e42
00000000 00000400 00000400 cc908000 00002100 cf05fcdc cf05fcdc cf05e000
cf05e000 00000000 c0127615 cc908000 00000400 00000000 00000000 00000400
Call Trace: [<c0126db5>] [<c0126e42>] [<c0127615>] [<c012679e>] [<c012695a>]
[<c0129ac1>] [<c0111fb6>]
[<c0111fb6>] [<c011255b>] [<c017e988>] [<c018015d>] [<c0196ad9>]
[<c0181bad>] [<c0196a7c>] [<c015dc19>]
[<c0165a18>] [<c015dc19>] [<c0165a18>] [<c01661a9>] [<c016560b>]
[<c01659d6>] [<c015f5a1>] [<c0124fc9>]
[<c0124ece>] [<c0108924>]
Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b

>>EIP; c0120631 <kmem_cache_alloc+31/124> <=====
Trace; c0126db5 <get_unused_buffer_head+55/a0>
Trace; c0126e42 <create_buffers+42/198>
Trace; c0127615 <grow_buffers+55/fc>
Trace; c012679e <refill_freelist+a/38>
Trace; c012695a <getblk+11e/144>
Trace; c0129ac1 <block_read+2c1/4f4>
Trace; c0111fb6 <wake_up_process+3a/44>
Trace; c0111fb6 <wake_up_process+3a/44>
Trace; c011255b <__wake_up+4f/6c>
Trace; c017e988 <end_that_request_last+28/2c>
Trace; c018015d <ide_end_request+61/6c>
Trace; c0196ad9 <ide_dma_intr+5d/94>
Trace; c0181bad <ide_intr+111/130>
Trace; c0196a7c <ide_dma_intr+0/94>
Trace; c015dc19 <alloc_skb+71/dc>
Trace; c0165a18 <ip_frag_create+18/60>
Trace; c015dc19 <alloc_skb+71/dc>
Trace; c0165a18 <ip_frag_create+18/60>
Trace; c01661a9 <ip_defrag+2f9/360>
Trace; c016560b <ip_local_deliver+2f/1c4>
Trace; c01659d6 <ip_rcv+236/260>
Trace; c015f5a1 <net_bh+181/1dc>
Trace; c0124fc9 <sys_write+e5/118>
Trace; c0124ece <sys_read+ae/c4>
Trace; c0108924 <system_call+34/38>
Code; c0120631 <kmem_cache_alloc+31/124>
00000000 <_EIP>:
Code; c0120631 <kmem_cache_alloc+31/124> <=====
0: 8b 01 mov (%ecx),%eax <=====
Code; c0120633 <kmem_cache_alloc+33/124>
2: 89 03 mov %eax,(%ebx)
Code; c0120635 <kmem_cache_alloc+35/124>
4: 85 c0 test %eax,%eax
Code; c0120637 <kmem_cache_alloc+37/124>
6: 74 2b je 33 <_EIP+0x33> c0120664 <kmem_cache_alloc+64/124>
Code; c0120639 <kmem_cache_alloc+39/124>
8: 8b 73 04 mov 0x4(%ebx),%esi
Code; c012063c <kmem_cache_alloc+3c/124>
b: 85 f6 test %esi,%esi
Code; c012063e <kmem_cache_alloc+3e/124>
d: 75 10 jne 1f <_EIP+0x1f> c0120650 <kmem_cache_alloc+50/124>
Code; c0120640 <kmem_cache_alloc+40/124>
f: 89 19 mov %ebx,(%ecx)
Code; c0120642 <kmem_cache_alloc+42/124>
11: 89 c8 mov %ecx,%eax
Code; c0120644 <kmem_cache_alloc+44/124>
13: 2b 00 sub (%eax),%eax
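
A side note for comparing the dumps in this thread: the number after "Oops:" is the i386 page-fault error code, printed in hex. A minimal decoder for its three low bits might look like the sketch below. The function name is invented for this example; the bit meanings are the standard i386 ones (bit 0: protection fault vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```python
def decode_oops_code(field):
    """Decode the low three bits of an i386 page-fault error code.

    `field` is the hex string printed after "Oops:", e.g. "0002".
    """
    code = int(field, 16)
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


# The two values that appear in this thread
for field in ("0002", "0000"):
    print(field, decode_oops_code(field))
```

Read this way, the 0002 oopses are kernel-mode writes to an unmapped page (matching the movl $0x0,0x0 at the faulting EIP), while the 0000 ones are kernel-mode reads through a pointer that does not map to anything.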


This is with the bios settings we thought stable.

Any ideas?


-- v --


2002-01-09 21:08:06

by Andrew Morton

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

Ville Herva wrote:
>
> >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====

Looks like a corrupted `next' pointer in the page's buffer_head
ring. Your report is identical to Todd Eigenschink's repeatable
oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
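
The register dump in the 2.4.18pre2 oops earlier in the thread can be checked against this reading with a little arithmetic. The sketch below uses only values printed in that oops (ebx, the faulting address, and the 0x18 displacement in the faulting `testb $0x6,0x18(%ebx)`). The PAGE_OFFSET constant assumes the default i386 3G/1G split, and the helper name is hypothetical.

```python
PAGE_OFFSET = 0xC0000000  # assumption: default i386 3G/1G user/kernel split


def looks_like_kernel_pointer(value):
    """True if `value` falls in the directly mapped kernel range."""
    return value >= PAGE_OFFSET


ebx = 0x1D292ED1            # ebx from the oops
displacement = 0x18         # from: testb $0x6,0x18(%ebx)
fault_address = 0x1D292EE9  # "virtual address 1d292ee9"

effective = ebx + displacement
print(f"ebx + 0x18 = {effective:08x}")
print("matches faulting address:", effective == fault_address)
print("ebx plausible kernel pointer:", looks_like_kernel_pointer(ebx))
```

The sum matches the faulting address exactly, and ebx is neither word-aligned nor in the kernel's address range. That is what one would expect if the pointer being followed had been overwritten with garbage.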

In another thread, yesterday, we were discussing the elusive
"end_request: buffer-list destroyed" crash.

I am able to trigger this in around ten minutes on 2.4.13 and
later kernels. However 2.4.13-pre6 ran the test for nine hours
and did not fail.

I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz


MAINTAINERS                      |  6 ++
Makefile                         |  2
arch/i386/kernel/smp.c           | 58 +++++++++++-----------------
drivers/message/i2o/i2o_block.c  | 44 ++++++++-------------
drivers/message/i2o/i2o_config.c |  1
drivers/message/i2o/i2o_core.c   | 39 ++++++++++++++++---
drivers/message/i2o/i2o_lan.c    |  4 +
drivers/message/i2o/i2o_pci.c    | 14 ++++++
drivers/message/i2o/i2o_proc.c   | 16 +++----
drivers/message/i2o/i2o_scsi.c   | 17 ++++++--
drivers/scsi/dpt_i2o.c           | 14 +++---
drivers/sound/ymfpci.c           | 52 +++++++++++--------------
fs/buffer.c                      | 54 ++++++++++++++++----------
fs/ntfs/fs.c                     |  1
include/linux/fs.h               |  3 -
include/linux/locks.h            |  2
include/linux/mm.h               | 17 ++++----
include/linux/slab.h             |  2
include/linux/swap.h             |  4 -
kernel/exit.c                    | 13 +-----
mm/highmem.c                     |  4 -
mm/page_alloc.c                  | 39 +++++++++----------
mm/swap.c                        |  4 -
mm/vmscan.c                      | 80 ++++++++++++++++++++++-----------------

There were VM changes, and a messy, complex and undocumented
change to sync_page_buffers(), which was the point at which
I ceased to understand that function. The patch was never Cc'ed
to the mailing list, was never explained. This sort of thing
makes it very hard for other developers to hunt down bugs.

Probably, the bug lies elsewhere and perhaps my bug is different
from yours and Todd's. It is timing-related, and the VM and
buffer changes may just have triggered it.

I have a debug patch from Jens to try tonight.

It could just be some random memory scribbler. Dunno yet. It's
awfully repeatable.

-

2002-01-09 21:57:48

by Ville Herva

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> Ville Herva wrote:
> >
> > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
>
> Looks like a corrupted `next' pointer in the page's buffer_head
> ring. Your report is identical to Todd Eigenschink's repeatable
> oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
>
> In another thread, yesterday, we were discussing the elusive
> "end_request: buffer-list destroyed" crash.

(...)

> There were VM changes, and a messy, complex and undocumented change to
> sync_page_buffers(), which was the point at which I ceased to understand
> that function.

Nice, yet one more variable to the equation ;). And I thought I could rule
out kernel bugs by reproducing this on a supposedly stable kernel (the 2.2.20
I used had all sorts of patches in it; ide, e2compr and raid to name the
largest ones.)

This could be a sync_page_buffers() bug, but what puzzles me is that I can
reproduce the oopses on 2.2 as well (although they can of course be
different oopses).

Also, I'm seeing ide and network corruption that would very much point to
pci transfer corruption. Of course, it can be that the oopses are not caused
by that.

> It could just be some random memory scribbler. Dunno yet. It's awfully
> repeatable.

Yep.


-- v --


2002-01-09 23:32:58

by Martin Josefsson

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Wed, 9 Jan 2002, Ville Herva wrote:

> On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> > Ville Herva wrote:
> > >
> > > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
> >
> > Looks like a corrupted `next' pointer in the page's buffer_head
> > ring. Your report is identical to Todd Eigenschink's repeatable
> > oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
> >
> > In another thread, yesterday, we were discussing the elusive
> > "end_request: buffer-list destroyed" crash.
>
> (...)
>
> > There were VM changes, and a messy, complex and undocumented change to
> > sync_page_buffers(), which was the point at which I ceased to understand
> > that function.
>
> Nice, yet one more variable to the equation ;). And I thought I could rule
> out kernel bugs by reproducing this on supposedly stable kernel (the 2.2.20
> I used had all sort of patches in it; ide, e2compr and raid to name the
> largest ones.)
>
> This could be a sync_page_buffers() bug, but what puzzles me is that I can
> reproduce the oopses on 2.2 as well (although they can of course be
> different oopses).
>
> Also, I'm seeing ide and network corruption that would very much point to
> pci transfer corruption. Of course, it can be that the oopses are not caused
> by that.

I haven't followed this thread, but I have a machine with an Asus A7V
motherboard with the KT133 chipset, and we had massive corruption before
Christmas: both ide and network had corrupted packets. After Christmas
we ran memtest86 on it and a 256MB module was very, very broken. We got
a lot of oopses and all kinds of strange stuff happened.

We've replaced that memory module now and it's better, but I have to
say that the KT133, or at least the Asus A7V motherboard, seems to be quite
broken. We have a lot of spurious irqs, and the ide controllers freak out when
put under some load: they start getting irq timeouts and reset the ide
channels over and over again, with some delay in between when it kind of
works, slow as hell but works.

We are going to replace the motherboard with one with VIA KT266A chipset,
hope that works better.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

2002-01-09 23:34:37

by Daniel J Blueman

Subject: RE: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

> Nice, yet one more variable to the equation ;). And I thought
> I could rule out kernel bugs by reproducing this on
> supposedly stable kernel (the 2.2.20 I used had all sort of
> patches in it; ide, e2compr and raid to name the largest ones.)
>
> This could be a sync_page_buffers() bug, but what puzzles me
> is that I can reproduce the oopses on 2.2 as well (although
> they can of course be different oopses).
>
> Also, I'm seeing ide and network corruption that would very
> much point to pci transfer corruption. Of course, it can be
> that the oopses are not caused by that.
[snip]

From what I've read, it looks like there can be issues with the VIA
KT133 PCI implementation, possibly applying to other VIA chipsets too.

Master memory reads can take 45 cycles rather than 16 (the max defined
in the PCI spec) - this sounds like it could be due to either a) bad
motherboard design with signal problems, or b) BIOS chipset
configuration (try setting 'PCI master read caching' to on?). I say this
because problems have been reported with motherboards of different makes
using the same chipset, and those are the only two factors that differ.
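
To put the two cycle counts in wall-clock terms, here is a rough conversion. It assumes a conventional 33 MHz PCI bus clock, which the thread does not state, so treat the nanosecond figures as illustrative only; the helper name is likewise made up.

```python
PCI_CLOCK_HZ = 33_000_000  # assumption: conventional 33 MHz PCI bus


def cycles_to_ns(cycles, clock_hz=PCI_CLOCK_HZ):
    """Convert a number of bus clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


observed = cycles_to_ns(45)  # latency reported for KT133 master reads
limit = cycles_to_ns(16)     # limit quoted above from the PCI spec

print(f"45 cycles: about {observed:.0f} ns")
print(f"16 cycles: about {limit:.0f} ns")
print(f"ratio: {45 / 16:.2f}x")
```

Whatever the exact clock, the ratio is what matters: the reported latency is nearly three times the quoted limit.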

Of course, this may well not help if it is genuinely a bug in the
kernel, but may solve the PCI corruption (if any).

Also, if it is a chipset issue, updating the BIOS can help at times,
with the vendor incorporating work-arounds for known chipset problems
(eg the well-publicised IDE corruption issue).

Dan
___________________
Daniel J Blueman

2002-01-10 00:20:19

by Daniel J Blueman

Subject: RE: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

> I havn't followed this thread but I have a machine with an
> Asus A7V motherboard with KT133 chipset and we had massive
> corruption before christmas, both ide and network had
> corrupted packets. and now after christmas we ran memtest86
> on it and a 256MB module was very very broken. and we got
> alot of Oopses and all kinds of strange stuff happened.
>
> We've replaved that memory module now and now it's better but
> I have to say that the KT133 or atleast the Asus A7V
> motherboard seems to be quite broken. we have a lot of
> spurious irq's and the ide controllers freak when but under
> some load and start getting irq timeouts and resets the ide
> channels over and over again with some delay in between when
> it kind of works, slow as hell but works.

There are known issues with the VIA 82C686A/B chipset south-bridge and
IDE in particular. Make sure you have the latest BIOS and latest VIA
4in1 drivers to work around the IDE corruption and other known issues
(sound problems with certain soundcards).

Dan
___________________
Daniel J Blueman

2002-01-10 00:42:36

by Martin Josefsson

Subject: RE: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Thu, 10 Jan 2002, Daniel J Blueman wrote:

> There are known issues with the VIA 82C686A/B chipset south-bridge and
> IDE in particular. Make sure you have the latest BIOS and latest VIA
> 4in1 drivers to workaround the IDE corruption and other known issues
> (sound problems with certain soundcards).

Yes, I'm aware of these problems. I thought that the VIA 4in1 drivers were
wintendo drivers. And I also thought that there are workarounds for these
bugs in the kernel.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

2002-01-10 01:04:39

by Daniel J Blueman

Subject: RE: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

> On Thu, 10 Jan 2002, Daniel J Blueman wrote:
>
> > There are known issues with the VIA 82C686A/B chipset south-bridge and
> > IDE in particular. Make sure you have the latest BIOS and latest VIA
> > 4in1 drivers to workaround the IDE corruption and other known issues
> > (sound problems with certain soundcards).
>
> Yes I'm aware of these problems, I thought that the VIA 4in1
> driver where wintendo drivers. And I also thought that there
> are workarounds for these bugs in the kernel.

Yep, the VIA 4in1 drivers are purely for Windows.

In Linux, if the chipset-fixup code is being triggered on boot (and
appears in your dmesg?), then it looks like the problem may be
elsewhere...

On the other hand, perhaps that fixup code isn't complete (or relies on
certain chipset features being on/off by default, vendor specific
defaults?)

Dan
___________________
Daniel J Blueman

2002-01-10 05:35:17

by Ville Herva

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Thu, Jan 10, 2002 at 12:30:37AM +0100, you [Martin Josefsson] claimed:
>
> I havn't followed this thread but I have a machine with an Asus A7V
> motherboard with KT133 chipset and we had massive corruption before
> christmas, both ide and network had corrupted packets. and now after
> christmas we ran memtest86 on it and a 256MB module was very very broken.
> and we got alot of Oopses and all kinds of strange stuff happened.

We ran memtest86 at one point but it showed nothing. We changed the memory
modules, and it didn't help. (It did seem like the order and placement of the
modules on the mobo made a difference, but that turned out to be a false
positive. Trying harder made the corruption happen again.)

Also, we only began to see oopses when we stress tested hpt370 ide AND
network (until then we had only stressed hpt370 and run "normal" stuff). The board
never oopsed or behaved strangely other than the hpt370 corruption and the
hpt370+3c905 stress test oopses.
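
One way to look for this kind of transfer corruption separately from the oopses is to read the same data several times and compare digests: if nothing writes to the disk in between, every pass should hash identically. The following is only a rough sketch of that idea, not the test that was actually run here. The function names, chunk size and pass count are assumptions made for illustration, and the demo uses a scratch file where a real run would point at a device such as /dev/md0.

```python
import hashlib
import os
import tempfile

def digest_of(path, chunk_size=1 << 20, limit=None):
    """SHA-256 of up to `limit` bytes read sequentially from `path`."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb", buffering=0) as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(want)
            if not block:          # end of file or device
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()

def mismatching_passes(path, passes=3, **kwargs):
    """Re-read `path` and return the pass numbers whose digest differs from pass 0."""
    reference = digest_of(path, **kwargs)
    return [n for n in range(1, passes) if digest_of(path, **kwargs) != reference]

# Demo on a scratch file (hypothetical stand-in for a real block device).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(3 * 1024 * 1024 + 123))
    scratch = tmp.name
try:
    bad = mismatching_passes(scratch, passes=3)
finally:
    os.unlink(scratch)
print("mismatching passes:", bad)
```

Note that repeated reads of a small region may be served from cache rather than from the disk, so for a check like this to exercise the PCI path the region read would need to be much larger than RAM.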

> We are going to replace the motherboard with one with VIA KT266A chipset,
> hope that works better.

If we get around to replacing the bugger, the one thing I'll make sure is that
the replacement is not Via. Even if 1000 people told me KT266A was stable.


-- v --


2002-01-10 06:35:04

by Ville Herva

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> Ville Herva wrote:
> >
> > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
>
> Looks like a corrupted `next' pointer in the page's buffer_head
> ring. Your report is identical to Todd Eigenschink's repeatable
> oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
<snip>
> I am able to trigger this in around ten minutes on 2.4.13 and
> later kernels. However 2.4.13-pre6 ran the test for nine hours
> and did not fail.

Out of curiosity: what kind of load do you use to trigger it?

> I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz

Seems your diff didn't include some bits (Maintainers changes and something
else.)

Anyhow, I compiled 2.4.13pre6 and it collapsed in just a few minutes. My
best guess is that network card pci dma is somehow fubar, and it writes
stuff to where it shouldn't.


Unable to handle kernel paging request at virtual address 86061d0e
c012f354
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c012f354>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 86061cee ebx: cb4aba40 ecx: 68158c40 edx: 0000aa25
esi: cb4aba40 edi: 00000001 ebp: 00000001 esp: ceb45b48
ds: 0018 es: 0018 ss: 0018
Process ping (pid: 4134, stackpage=ceb45000)
Stack: 00000001 cb4aba40 c012fc93 cb4aba40 ceb45ba8 00000000 c012fcca cb4aba40
c017cdc3 cb4aba40 cb4ab7c0 000003f0 cb4ab7c0 c1324000 00000008 c1405e50
0000dc2f cb4aba40 c0131766 00000001 00000001 ceb45ba8 00000000 cb4aba40
Call Trace: [<c012fc93>] [<c012fcca>] [<c017cdc3>] [<c0131766>] [<c01318a3>]
[<c0127a69>] [<c0127d4f>] [<c0127d9d>] [<c01286e5>] [<c012893f>] [<c0128688>]
[<c01289ba>] [<c0126922>] [<c0126bef>] [<c01e4c2b>] [<c01e4311>] [<c01f59c3>]
[<c01f5d3e>] [<c020c250>] [<c020c5cf>] [<c020c250>] [<c0212f70>] [<c0212fa9>]
[<c01e1f01>] [<c0212f70>] [<c01e3177>] [<c0193800>] [<c0193a3e>] [<c0193a20>]
[<c018f2e0>] [<c018f32a>] [<c018f696>] [<c0193a3e>] [<c018faba>] [<c0193a50>]
[<c01e360c>] [<c0106ebb>]
Code: 89 48 20 c1 e2 02 be 24 0b 2c c0 89 41 24 39 1c 32 75 0a 31

>>EIP; c012f354 <__remove_from_lru_list+14/60> <=====
Trace; c012fc93 <__refile_buffer+33/60>
Trace; c012fcca <refile_buffer+a/10>
Trace; c017cdc3 <ll_rw_block+1a3/1c0>
Trace; c0131766 <sync_page_buffers+46/a0>
Trace; c01318a3 <try_to_free_buffers+e3/110>
Trace; c0127a69 <shrink_cache+129/2b0>
Trace; c0127d4f <shrink_caches+5f/90>
Trace; c0127d9d <try_to_free_pages+1d/50>
Trace; c01286e5 <balance_classzone+55/170>
Trace; c012893f <__alloc_pages+13f/1b0>
Trace; c0128688 <_alloc_pages+18/20>
Trace; c01289ba <__get_free_pages+a/20>
Trace; c0126922 <kmem_cache_grow+a2/200>
Trace; c0126bef <kmalloc+bf/e0>
Trace; c01e4c2b <alloc_skb+cb/180>
Trace; c01e4311 <sock_alloc_send_skb+71/110>
Trace; c01f59c3 <ip_build_xmit_slow+193/4c0>
Trace; c01f5d3e <ip_build_xmit+4e/350>
Trace; c020c250 <raw_getfrag+0/30>
Trace; c020c5cf <raw_sendmsg+28f/300>
Trace; c020c250 <raw_getfrag+0/30>
Trace; c0212f70 <inet_sendmsg+0/40>
Trace; c0212fa9 <inet_sendmsg+39/40>
Trace; c01e1f01 <sock_sendmsg+81/b0>
Trace; c0212f70 <inet_sendmsg+0/40>
Trace; c01e3177 <sys_sendmsg+197/1f0>
Trace; c0193800 <hpt370_rw_proc+0/10>
Trace; c0193a3e <hpt370_dmaproc+1e/30>
Trace; c0193a20 <hpt370_dmaproc+0/30>
Trace; c018f2e0 <start_request+190/240>
Trace; c018f32a <start_request+1da/240>
Trace; c018f696 <ide_do_request+296/2e0>
Trace; c0193a3e <hpt370_dmaproc+1e/30>
Trace; c018faba <ide_intr+7a/140>
Trace; c0193a50 <ide_dma_intr+0/c0>
Trace; c01e360c <sys_socketcall+1cc/1f0>
Trace; c0106ebb <system_call+33/38>
Code; c012f354 <__remove_from_lru_list+14/60>
00000000 <_EIP>:
Code; c012f354 <__remove_from_lru_list+14/60> <=====
0: 89 48 20 mov %ecx,0x20(%eax) <=====
Code; c012f357 <__remove_from_lru_list+17/60>
3: c1 e2 02 shl $0x2,%edx
Code; c012f35a <__remove_from_lru_list+1a/60>
6: be 24 0b 2c c0 mov $0xc02c0b24,%esi
Code; c012f35f <__remove_from_lru_list+1f/60>
b: 89 41 24 mov %eax,0x24(%ecx)
Code; c012f362 <__remove_from_lru_list+22/60>
e: 39 1c 32 cmp %ebx,(%edx,%esi,1)
Code; c012f365 <__remove_from_lru_list+25/60>
11: 75 0a jne 1d <_EIP+0x1d> c012f371 <__remove_from_lru_list+31/60>
Code; c012f367 <__remove_from_lru_list+27/60>
13: 31 00 xor %eax,(%eax)
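
As a quick sanity check on the decode above (a sketch, not part of the original report): the faulting instruction is `mov %ecx,0x20(%eax)`, so the address being written should be eax plus 0x20. The helper name below is made up for illustration; the register value and the faulting address are taken from the oops.

```python
def effective_address(base: int, displacement: int) -> int:
    """Address used by an x86 `disp(%reg)` operand, wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF

# Values from the oops: eax and the reported faulting virtual address.
eax = 0x86061CEE
fault_address = 0x86061D0E

computed = effective_address(eax, 0x20)
print(hex(computed), computed == fault_address)  # prints "0x86061d0e True"
```

The two match, and eax is nowhere near the c0000000+ range that every other pointer in this trace falls in, which is at least consistent with a list pointer in the buffer_head having been overwritten with garbage.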


<1>Unable to handle kernel paging request at virtual address b8f2ed62
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c012f4f1>] Not tainted
EFLAGS: 00010282
eax: b8f2ed5e ebx: cb4ab9c0 ecx: 000002f0 edx: ff2eca38
esi: cb4abd40 edi: cb4ab9c0 ebp: c127e940 esp: ceb43d34
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 4133, stackpage=ceb43000)
Stack: c0131812 cb4ab9c0 00000000 c127e940 000002f0 00002298 c0127a69 c127e940
000002f0 00000020 000002f0 00000006 00002299 00000020 000002f0 c0127d4f
00000000 00000006 00000020 00000000 000002f0 00019f2e c0127d9d 00000000
Call Trace: [<c0131812>] [<c0127a69>] [<c0127d4f>] [<c0127d9d>] [<c01286e5>]
[<c012893f>] [<c0128688>] [<c01289ba>] [<c0126922>] [<c0126b19>] [<c012fdc3>]
[<c012fe4b>] [<c0130097>] [<c01305c8>] [<c01288d8>] [<c0121b76>] [<c0132fef>]
[<c0132f80>] [<c0121c15>] [<c01220c4>] [<c0122327>] [<c01226b5>] [<c01225e0>]
[<c012dffe>] [<c0106ebb>]
Code: 89 50 04 89 02 c3 89 f6 8d bc 27 00 00 00 00 8b 54 24 04 31

>>EIP; c012f4f1 <__remove_inode_queue+11/20> <=====
Trace; c0131812 <try_to_free_buffers+52/110>
Trace; c0127a69 <shrink_cache+129/2b0>
Trace; c0127d4f <shrink_caches+5f/90>
Trace; c0127d9d <try_to_free_pages+1d/50>
Trace; c01286e5 <balance_classzone+55/170>
Trace; c012893f <__alloc_pages+13f/1b0>
Trace; c0128688 <_alloc_pages+18/20>
Trace; c01289ba <__get_free_pages+a/20>
Trace; c0126922 <kmem_cache_grow+a2/200>
Trace; c0126b19 <kmem_cache_alloc+99/b0>
Trace; c012fdc3 <get_unused_buffer_head+33/80>
Trace; c012fe4b <create_buffers+1b/130>
Trace; c0130097 <create_empty_buffers+17/50>
Trace; c01305c8 <block_read_full_page+58/290>
Trace; c01288d8 <__alloc_pages+d8/1b0>
Trace; c0121b76 <add_to_page_cache_unique+66/70>
Trace; c0132fef <blkdev_readpage+f/20>
Trace; c0132f80 <blkdev_get_block+0/40>
Trace; c0121c15 <page_cache_read+95/c0>
Trace; c01220c4 <generic_file_readahead+104/150>
Trace; c0122327 <do_generic_file_read+1e7/4a0>
Trace; c01226b5 <generic_file_read+75/90>
Trace; c01225e0 <file_read_actor+0/60>
Trace; c012dffe <sys_read+8e/d0>
Trace; c0106ebb <system_call+33/38>
Code; c012f4f1 <__remove_inode_queue+11/20>
00000000 <_EIP>:
Code; c012f4f1 <__remove_inode_queue+11/20> <=====
0: 89 50 04 mov %edx,0x4(%eax) <=====
Code; c012f4f4 <__remove_inode_queue+14/20>
3: 89 02 mov %eax,(%edx)
Code; c012f4f6 <__remove_inode_queue+16/20>
5: c3 ret
Code; c012f4f7 <__remove_inode_queue+17/20>
6: 89 f6 mov %esi,%esi
Code; c012f4f9 <__remove_inode_queue+19/20>
8: 8d bc 27 00 00 00 00 lea 0x0(%edi,1),%edi
Code; c012f500 <inode_has_buffers+0/20>
f: 8b 54 24 04 mov 0x4(%esp,1),%edx
Code; c012f504 <inode_has_buffers+4/20>
13: 31 00 xor %eax,(%eax)
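
To illustrate why a scribbled list pointer blows up exactly here, below is a toy model of a doubly linked list unlink (a sketch under stated assumptions, not the kernel's code). The `ToyMemory` class, the node addresses 0x1000/0x2000/0x3000 and the two-word node layout (next at +0, prev at +4) are assumptions for illustration; only the garbage value 0xb8f2ed5e (eax) and the resulting fault address come from the oops.

```python
class PageFault(Exception):
    """Raised when the toy memory model is asked to touch an unmapped address."""

class ToyMemory:
    def __init__(self):
        self.words = {}              # address -> value; only mapped words exist

    def map_node(self, addr):
        # A list node is two words: next at +0, prev at +4, initially self-linked.
        self.words[addr] = addr
        self.words[addr + 4] = addr

    def load(self, addr):
        if addr not in self.words:
            raise PageFault(hex(addr))
        return self.words[addr]

    def store(self, addr, value):
        if addr not in self.words:
            raise PageFault(hex(addr))
        self.words[addr] = value

def list_add(mem, new, head):
    """Insert node `new` right after `head`."""
    nxt = mem.load(head)
    mem.store(nxt + 4, new)
    mem.store(new, nxt)
    mem.store(new + 4, head)
    mem.store(head, new)

def list_del(mem, entry):
    """Unlink `entry`: next->prev = prev, then prev->next = next."""
    nxt = mem.load(entry)
    prev = mem.load(entry + 4)
    mem.store(nxt + 4, prev)         # compare `mov %edx,0x4(%eax)` in the dump
    mem.store(prev, nxt)             # compare `mov %eax,(%edx)`

mem = ToyMemory()
for addr in (0x1000, 0x2000, 0x3000):
    mem.map_node(addr)
list_add(mem, 0x2000, 0x1000)
list_add(mem, 0x3000, 0x1000)

list_del(mem, 0x3000)                # healthy unlink works fine

mem.store(0x2000, 0xB8F2ED5E)        # something scribbles over a `next` pointer
try:
    list_del(mem, 0x2000)
    fault = None
except PageFault as exc:
    fault = str(exc)
print("faulted at", fault)           # prints "faulted at 0xb8f2ed62"
```

The unlink itself is innocent; it simply follows whatever value it finds in the node, so the fault lands at the garbage value plus 4, which matches the address reported in this oops.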



-- v --


2002-01-10 06:46:14

by Andrew Morton

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

Ville Herva wrote:
>
> On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> > Ville Herva wrote:
> > >
> > > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
> >
> > Looks like a corrupted `next' pointer in the page's buffer_head
> > ring. Your report is identical to Todd Eigenschink's repeatable
> > oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
> <snip>
> > I am able to trigger this in around ten minutes on 2.4.13 and
> > later kernels. However 2.4.13-pre6 ran the test for nine hours
> > and did not fail.
>
> Out of curiosity: what kind of load do you use to trigger it?

Massive VM load and ext3. I've found the buffer-list destroyed
bug. It's incorrect buffer locking in ext3. It used to work,
sleazily, but blockdev-in-pagecache pulled its pants down.

> > I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz
>
> Seems your diff didn't include some bits (Maintainers changes and something
> else.)
>
> Anyhow, I compiled 2.4.13pre6 and it collapsed in just a few minutes. My
> best guess is that network card pci dma is somehow fubar, and it writes
> stuff to where it shouldn't.

OK. Looks like they're different things - you have hardware problems,
I have brain problems.

-

by Henrique de Moraes Holschuh

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Thu, 10 Jan 2002, Martin Josefsson wrote:
> We've replaved that memory module now and now it's better but I have to
> say that the KT133 or atleast the Asus A7V motherboard seems to be quite
> broken. we have a lot of spurious irq's and the ide controllers freak when
> but under some load and start getting irq timeouts and resets the ide
> channels over and over again with some delay in between when it kind of
> works, slow as hell but works.

Well, my A7V is also acting up, with spurious IRQs (but not too many), and
PCI lockups if the load on the PCI bus increases too much -- this is
probably the last time I ever buy a VIA board (because they take soooo much
time to acknowledge their screw ups and help people fix it) unless they
start issuing non-binary-only fixes (heck, all it takes is a doc telling us
what to do on the PCI registers!).

The IDE corruption and lockups you can fix, just apply the latest IDE
patches, the 2.4.18pre IDE subsystem is not to be used on a KT133, it will
not work at all if you give it a slightly bigger load on the promise
controller, for example.

> We are going to replace the motherboard with one with VIA KT266A chipset,
> hope that works better.

Without the IDE patches, it will (most probably) not help.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2002-01-10 12:57:56

by Martin Josefsson

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Thu, 10 Jan 2002, Henrique de Moraes Holschuh wrote:

> On Thu, 10 Jan 2002, Martin Josefsson wrote:
> > We've replaved that memory module now and now it's better but I have to
> > say that the KT133 or atleast the Asus A7V motherboard seems to be quite
> > broken. we have a lot of spurious irq's and the ide controllers freak when
> > but under some load and start getting irq timeouts and resets the ide
> > channels over and over again with some delay in between when it kind of
> > works, slow as hell but works.
>
> Well, my A7V is also acting up, with spurious IRQs (but not too many), and
> PCI lockups if the load on the PCI bus increases too much -- this is
> probably the last time I ever buy a VIA board (because they take soooo much
> time to acknowledge their screw ups and help people fix it) unless they
> start issuing non-binary-only fixes (heck, all it takes is a doc telling us
> what to do on the PCI registers!).
>
> The IDE corruption and lockups you can fix, just apply the latest IDE
> patches, the 2.4.18pre IDE subsystem is not to be used on a KT133, it will
> not work at all if you give it a slightly bigger load on the promise
> controller, for example.
>
> > We are going to replace the motherboard with one with VIA KT266A chipset,
> > hope that works better.
>
> Without the IDE patches, it will (most probably) not help.

I am using the IDE patch. I've heard that the A7V133, which is based on the
KT133A chipset, works much better in Linux. I know people using it in a
router for a 1000 client network on a 100Mbit connection and it's working
fine, no problems at all. On our machine, if we push the networking too hard
we get a lot of spurious interrupts and it appears that we lose some
interrupts as well, as the NIC and IDE drivers sometimes start complaining,
and once it has started losing interrupts only a reboot can bring it back to
"normal" operation.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

2002-01-10 13:37:43

by Ville Herva

Subject: Re: Via KT133 pci corruption: stock 2.4.18pre2 oopses as well

On Thu, Jan 10, 2002 at 10:01:02AM -0200, you [Henrique de Moraes Holschuh] claimed:
>
> The IDE corruption and lockups you can fix, just apply the latest IDE
> patches, the 2.4.18pre IDE subsystem is not to be used on a KT133, it will
> not work at all if you give it a slightly bigger load on the promise
> controller, for example.

We just tried with 2.4.18pre2 + Hedrick ATA patch, but it oopsed just like
2.4.18pre2 vanilla. I reckon the ide corruption will also happen if we leave
the "ping -f" out of the equation.

This is probably a pci issue, not an ide issue.


-- v --
