2005-12-27 16:58:41

by Chris Stromsoe

Subject: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

I have a machine that oopsed twice in the last 3 weeks. Immediately
before each oops was a "filemap.c:2234: bad pmd" message. The first oops
happened with 2.4.30, the second with 2.4.32. The oops from 2.4.30 is
below. I don't have the oops from 2.4.32.

The machine is a usenet feeder and pushes a constant ~110 Mbit/s of traffic. I
have the tg3 and bonding modules loaded. There are 2 Adaptec controllers,
one onboard, one pci (aic7899 and 3960D). There are 5 disks off the first
channel of aic7899 (comes up as scsi2), 4 of which are in a RAID5. The
other 3 channels are unused. I have the .config for 2.4.30 available if
needed.

Pointers for where to look if/when it happens again would be appreciated.
Thanks.


-Chris

filemap.c:2234: bad pmd 00c001e3.
filemap.c:2234: bad pmd 010001e3.
Unable to handle kernel paging request at virtual address c13aef08
printing eip:
c012d7b6
*pde = 010001e3
*pte = ce919a00
Oops: 0000
CPU: 1
EIP: 0010:[mark_page_accessed+6/48] Not tainted
EFLAGS: 00010296
eax: c13aeef0 ebx: c13aeef0 ecx: 0005d800 edx: ee030900
esi: 0005d7a0 edi: 0005d8a9 ebp: f66b1c3c esp: f66b1c38
ds: 0018 es: 0018 ss: 0018
Process innfeed (pid: 526, stackpage=f66b1000)
Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
Call Trace: [filemap_nopage+472/544] [filemap_nopage+0/544] [do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
[do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160] [tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
[error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
[sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]

Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02
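
For reference, the low bits of the values printed in the "bad pmd" lines can be decoded by hand. The sketch below is not from the thread. It assumes a non-PAE i386 kernel (the 32-bit *pde values suggest this), uses the architectural x86 page-directory flag bits, and mirrors the 2.4 i386 pmd_bad() check as I understand it: the low 12 bits, ignoring the USER bit, must equal _KERNPG_TABLE (0x063). The helper names are hypothetical.

```python
# Architectural x86 (non-PAE) page-directory entry flag bits.
PDE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE",      # entry maps a 4 MB page instead of pointing at a page table
    0x100: "GLOBAL",
}

# Assumption: values as defined for 2.4 i386 (PRESENT | RW | ACCESSED | DIRTY).
KERNPG_TABLE = 0x063
PAGE_USER = 0x004
LOW_BITS = 0xFFF


def decode_pde(value):
    """Return the names of the flag bits set in a page-directory entry."""
    return [name for bit, name in sorted(PDE_FLAGS.items()) if value & bit]


def looks_bad(value):
    """Approximate pmd_bad(): low bits, USER ignored, must equal KERNPG_TABLE."""
    return (value & LOW_BITS & ~PAGE_USER) != KERNPG_TABLE


# The four values reported in this thread.
for v in (0x00C001E3, 0x010001E3, 0x020001E3, 0x024001E3):
    print(f"{v:08x}: {'|'.join(decode_pde(v))} bad={looks_bad(v)}")
```

Under these assumptions, all four reported values carry the flags 0x1e3, which decode to a present, writable, global 4 MB (PSE) mapping rather than a pointer to a page table, so the check rejects them. This only decodes the numbers; it does not say how such an entry came to be examined in filemap.c.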




2005-12-28 06:13:16

by Marcelo Tosatti

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

Hi Chris,

On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
> I have a machine that oopsed twice in the last 3 weeks. Immediately
> before each oops was a "filemap.c:2234: bad pmd" message. The first oops
> happened with 2.4.30, the second with 2.4.32. The oops from 2.4.30 is
> below. I don't have the oops from 2.4.32.
>
> The machine is a usenet feeder and pushes a constant ~110 Mbit/s of traffic. I
> have the tg3 and bonding modules loaded. There are 2 Adaptec controllers,
> one onboard, one pci (aic7899 and 3960D). There are 5 disks off the first
> channel of aic7899 (comes up as scsi2), 4 of which are in a RAID5. The
> other 3 channels are unused. I have the .config for 2.4.30 available if
> needed.
>
> Pointers for where to look if/when it happens again would be appreciated.
>
> Thanks.
>
>
> -Chris
>
> filemap.c:2234: bad pmd 00c001e3.
> filemap.c:2234: bad pmd 010001e3.

This is usually due to memory corruption. Please verify it with
memtest86.


2005-12-29 02:52:11

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Tue, 27 Dec 2005, Marcelo Tosatti wrote:
> On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
>>
>> filemap.c:2234: bad pmd 00c001e3.
>> filemap.c:2234: bad pmd 010001e3.
>
> This is usually due to memory corruption. Please verify it with
> memtest86.

I've run through three complete memtest86 passes so far with no errors.
I'll keep running, but I'm not expecting to see anything.

I caught another two bad pmd errors followed by an oops this morning.
This is with 2.4.32, bond/tg3 loaded as modules. Full .config available.


-Chris

Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.

The oops came in at 09:28:20

ksymoops 2.4.9 on i686 2.4.32. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.32/ (default)
-m /boot/System.map-2.4.32 (specified)

Unable to handle kernel paging request at virtual address c22eee80
c0259bb3
*pde = 020001e3
Oops: 0002
CPU: 2
EIP: 0010:[alloc_skb+275/480] Not tainted
EFLAGS: 00010282
eax: c22eee80 ebx: ccbdb480 ecx: 000006bc edx: 00000680
esi: 000001f0 edi: 00000000 ebp: f663bdf0 esp: f663bddc
ds: 0018 es: 0018 ss: 0018
Process innfeed (pid: 526, stackpage=f663b000)
Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c22eee80 <_end+1f0d380/38650560>
>>ebx; ccbdb480 <_end+c7f9980/38650560>
>>ebp; f663bdf0 <_end+3625a2f0/38650560>
>>esp; f663bddc <_end+3625a2dc/38650560>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: c7 00 01 00 00 00 movl $0x1,(%eax)
Code; 00000006 Before first symbol
6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
Code; 0000000c Before first symbol
c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
Code; 00000013 Before first symbol
13: 8b 00 mov (%eax),%eax
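
The "Oops:" number and the *pde line can be read together with the faulting address. The sketch below is an illustration, not part of the original report. It uses the x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and assumes the default 3G/1G split (PAGE_OFFSET = 0xC0000000) and 4 MB large pages on a non-PAE kernel. The function names are hypothetical.

```python
PAGE_OFFSET = 0xC0000000        # assumption: default i386 3G/1G split
LARGE_PAGE = 4 * 1024 * 1024    # assumption: non-PAE, so PSE pages are 4 MB


def decode_oops_code(code):
    """Decode the x86 page-fault error code printed after 'Oops:'."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def large_page_range(pde):
    """Kernel virtual range covered by a 4 MB direct-mapped PDE."""
    phys = pde & 0xFFC00000
    start = PAGE_OFFSET + phys
    return start, start + LARGE_PAGE - 1


def covers(pde, vaddr):
    """True if vaddr lies inside the range mapped by the large-page PDE."""
    lo, hi = large_page_range(pde)
    return lo <= vaddr <= hi


# Values taken from the two oops reports in this thread.
for code, pde, vaddr in ((0x0000, 0x010001E3, 0xC13AEF08),
                         (0x0002, 0x020001E3, 0xC22EEE80)):
    lo, hi = large_page_range(pde)
    print(f"Oops {code:04x}: {decode_oops_code(code)}")
    print(f"  pde {pde:08x} maps {lo:08x}-{hi:08x}, covers {vaddr:08x}: {covers(pde, vaddr)}")
```

With these assumptions, "Oops: 0002" is a kernel-mode write to a not-present page, which matches the faulting instruction `movl $0x1,(%eax)` with eax equal to the reported address. In both reports the faulting address falls inside the 4 MB range named by *pde.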

2005-12-29 05:14:09

by Willy Tarreau

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
> On Tue, 27 Dec 2005, Marcelo Tosatti wrote:
> >On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
> >>
> >>filemap.c:2234: bad pmd 00c001e3.
> >>filemap.c:2234: bad pmd 010001e3.
> >
> >This is usually due to memory corruption. Please verify it with
> >memtest86.
>
> I've run through three complete memtest86 passes so far with no errors.
> I'll keep running, but I'm not expecting to see anything.
>
> I caught another two bad pmd errors followed by an oops this morning.
> This is with 2.4.32, bond/tg3 loaded as modules. Full .config available.
>

I have some servers running tg3+bond at up to 70 Mbps with about one
year of uptime. OK, they're not on 2.4.32 yet, but that's just to say that
I don't suspect those drivers.

> -Chris
>
> Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
> Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
>
> The oops came in at 09:28:20
>
> ksymoops 2.4.9 on i686 2.4.32. Options used
> -V (default)
> -k /proc/ksyms (default)
> -l /proc/modules (default)
> -o /lib/modules/2.4.32/ (default)
> -m /boot/System.map-2.4.32 (specified)
>
> Unable to handle kernel paging request at virtual address c22eee80
> c0259bb3
> *pde = 020001e3
> Oops: 0002
> CPU: 2
^^^^^
Interesting, this machine is SMP.
memtest86 only uses CPU0 for its tests. I've already had great difficulty
detecting memory problems which occurred only when more than one CPU
was accessing the RAM. Can your machine handle its load with only one CPU?
Maybe you see more I/O load than pure CPU load. It would be interesting to
restart it with the 'nosmp' boot option.


> EIP: 0010:[alloc_skb+275/480] Not tainted

I'm somewhat surprised, because I've found neither a direct nor an indirect
call path from alloc_skb() to filemap_sync_pte_range(), where the error is
reported. I'm clearly missing something here.
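
When comparing call paths, it can help to turn the bracketed trace entries into data. The sketch below is an illustration under assumptions: it treats each entry as [symbol+offset/size] with decimal offset and size (the resulting function start addresses are 16-byte aligned, which supports this reading), and the helper names are hypothetical.

```python
import re

# Matches entries such as [alloc_skb+275/480].
ENTRY = re.compile(r"\[([A-Za-z_][A-Za-z0-9_]*)\+(\d+)/(\d+)\]")


def parse_trace(text):
    """Return (symbol, offset, size) tuples in the order they appear."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in ENTRY.finditer(text)]


def function_start(eip, offset):
    """Start address of the function, given EIP and the decimal offset."""
    return eip - offset


trace = ("[tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] "
         "[sock_readv_writev+116/176] [sock_writev+79/96]")
for symbol, offset, size in parse_trace(trace):
    print(f"{symbol}: offset {offset} of {size} bytes")

# EIP c0259bb3 was reported as alloc_skb+275/480.
print(f"alloc_skb starts at {function_start(0xC0259BB3, 275):08x}")
```

The computed start address can then be checked against the System.map used by ksymoops.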


> EFLAGS: 00010282
> eax: c22eee80 ebx: ccbdb480 ecx: 000006bc edx: 00000680
> esi: 000001f0 edi: 00000000 ebp: f663bdf0 esp: f663bddc
> ds: 0018 es: 0018 ss: 0018
> Process innfeed (pid: 526, stackpage=f663b000)
> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
> 000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
> d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
> Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
> Using defaults from ksymoops -t elf32-i386 -a i386
>
>
> >>eax; c22eee80 <_end+1f0d380/38650560>
> >>ebx; ccbdb480 <_end+c7f9980/38650560>
> >>ebp; f663bdf0 <_end+3625a2f0/38650560>
> >>esp; f663bddc <_end+3625a2dc/38650560>
>
> Code; 00000000 Before first symbol
> 00000000 <_EIP>:
> Code; 00000000 Before first symbol
> 0: c7 00 01 00 00 00 movl $0x1,(%eax)
> Code; 00000006 Before first symbol
> 6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
> Code; 0000000c Before first symbol
> c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
> Code; 00000013 Before first symbol
> 13: 8b 00 mov (%eax),%eax

Regards,
willy

2005-12-29 09:33:50

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Thu, 29 Dec 2005, Willy Tarreau wrote:
> On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
>>
>> Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
>> Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
>>
>> The oops came in at 09:28:20
>>
>> ksymoops 2.4.9 on i686 2.4.32. Options used
>> -V (default)
>> -k /proc/ksyms (default)
>> -l /proc/modules (default)
>> -o /lib/modules/2.4.32/ (default)
>> -m /boot/System.map-2.4.32 (specified)
>>
>> Unable to handle kernel paging request at virtual address c22eee80
>> c0259bb3
>> *pde = 020001e3
>> Oops: 0002
>> CPU: 2
> ^^^^^
> Interesting, this machine is SMP.
> memtest86 only uses CPU0 for its tests. I've already had great difficulty
> detecting memory problems which occurred only when more than one CPU
> was accessing the RAM. Can your machine handle its load with only one CPU?
> Maybe you see more I/O load than pure CPU load. It would be interesting to
> restart it with the 'nosmp' boot option.

The machine is a dual P4 Xeon with hyperthreading on. It can probably get
by with only one cpu enabled. If/when it goes down again, I'll boot with
nosmp. For what it's worth, I ran a Dell memory tester ("MP Memory")
which claims to test all of the CPUs for a few hours and didn't come up
with anything. The machine feeds usenet and is seeing a lot more io than
cpu. (There are two Adaptec controllers, 4 channels, aic79xx, 5 drives on
one channel, 3 unused, spool is on a 4 disk raid5, jfs formatted.)


>> EIP: 0010:[alloc_skb+275/480] Not tainted
>
> I'm somewhat surprised, because I've found neither a direct nor an indirect
> call path from alloc_skb() to filemap_sync_pte_range(), where the error is
> reported. I'm clearly missing something here.

If it helps, the oops with 2.4.30 had two "bad pmd" messages right before
it then:

Unable to handle kernel paging request at virtual address c13aef08
printing eip:
c012d7b6
*pde = 010001e3
*pte = ce919a00
Oops: 0000
CPU: 1
EIP: 0010:[mark_page_accessed+6/48] Not tainted
EFLAGS: 00010296
eax: c13aeef0 ebx: c13aeef0 ecx: 0005d800 edx: ee030900
esi: 0005d7a0 edi: 0005d8a9 ebp: f66b1c3c esp: f66b1c38
ds: 0018 es: 0018 ss: 0018
Process innfeed (pid: 526, stackpage=f66b1000)
Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
Call Trace: [filemap_nopage+472/544] [filemap_nopage+0/544] [do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
[do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160] [tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
[error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
[sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]

Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02



-Chris

2005-12-29 10:10:24

by Willy Tarreau

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Thu, Dec 29, 2005 at 01:33:47AM -0800, Chris Stromsoe wrote:
> On Thu, 29 Dec 2005, Willy Tarreau wrote:
> >On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
> >>
> >>Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
> >>Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
> >>
> >>The oops came in at 09:28:20
> >>
> >>ksymoops 2.4.9 on i686 2.4.32. Options used
> >> -V (default)
> >> -k /proc/ksyms (default)
> >> -l /proc/modules (default)
> >> -o /lib/modules/2.4.32/ (default)
> >> -m /boot/System.map-2.4.32 (specified)
> >>
> >>Unable to handle kernel paging request at virtual address c22eee80
> >>c0259bb3
> >>*pde = 020001e3
> >>Oops: 0002
> >>CPU: 2
> > ^^^^^
> >Interesting, this machine is SMP.
> >memtest86 only uses CPU0 for its tests. I've already had great difficulty
> >detecting memory problems which occurred only when more than one CPU
> >was accessing the RAM. Can your machine handle its load with only one CPU?
> >Maybe you see more I/O load than pure CPU load. It would be interesting to
> >restart it with the 'nosmp' boot option.
>
> The machine is a dual P4 Xeon with hyperthreading on. It can probably
> get by with only one cpu enabled. If/when it goes down again, I'll boot
> with nosmp. For what it's worth, I ran a Dell memory tester ("MP
> Memory") which claims to test all of the CPUs for a few hours and didn't
> come up with anything. The machine feeds usenet and is seeing a lot more
> io than cpu. (There are two Adaptec controllers, 4 channels, aic79xx, 5
> drives on one channel, 3 unused, spool is on a 4 disk raid5, jfs
> formatted.)

OK, I've found two old similar reports from people running news servers :
http://www.ussg.iu.edu/hypermail/linux/kernel/0308.1/0807.html
http://seclists.org/lists/linux-kernel/2004/Jan/5699.html

both were using an SMP server with an AIC7xxx adapter, and kernels varying
from 2.4.18 to 2.4.24. One of them used XFS and not JFS, so we will exclude
any potential JFS-related cause for now.

If you feel brave, you can try switching the AIC7xxx driver to Justin Gibbs'
more recent version. It has not evolved during the last year, but I have it
running reliably on production servers:

http://people.freebsd.org/~gibbs/linux/

I also have it rediffed for recent kernels if you prefer:

http://w.ods.org/kernel/2.4-wt/2.4.32-wt2/patches-2.4.32-wt2/pool/aic79xx-20040522-linux-2.4.30-pre3.rediff


> >>EIP: 0010:[alloc_skb+275/480] Not tainted
> >
> >I'm somewhat surprised, because I've found neither a direct nor an indirect
> >call path from alloc_skb() to filemap_sync_pte_range(), where the error is
> >reported. I'm clearly missing something here.
>
> If it helps, the oops with 2.4.30 had two "bad pmd" messages right before
> it then:
>
> Unable to handle kernel paging request at virtual address c13aef08
> printing eip:
> c012d7b6
> *pde = 010001e3
> *pte = ce919a00
> Oops: 0000
> CPU: 1
> EIP: 0010:[mark_page_accessed+6/48] Not tainted
> EFLAGS: 00010296
> eax: c13aeef0 ebx: c13aeef0 ecx: 0005d800 edx: ee030900
> esi: 0005d7a0 edi: 0005d8a9 ebp: f66b1c3c esp: f66b1c38
> ds: 0018 es: 0018 ss: 0018
> Process innfeed (pid: 526, stackpage=f66b1000)
> Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
> f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
> ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
> Call Trace: [filemap_nopage+472/544] [filemap_nopage+0/544] [do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
> [do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160] [tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
> [error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
> [sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]
>
> Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02

Out of curiosity, it would be interesting to disable swap if you have it
enabled.

> -Chris

Willy

2005-12-29 12:01:53

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Thu, 29 Dec 2005, Willy Tarreau wrote:
> On Thu, Dec 29, 2005 at 01:33:47AM -0800, Chris Stromsoe wrote:
>>
>> The machine is a dual P4 Xeon with hyperthreading on. It can probably
>> get by with only one cpu enabled. If/when it goes down again, I'll
>> boot with nosmp. For what it's worth, I ran a Dell memory tester ("MP
>> Memory") which claims to test all of the CPUs for a few hours and
>> didn't come up with anything. The machine feeds usenet and is seeing a
>> lot more io than cpu. (There are two Adaptec controllers, 4 channels,
>> aic79xx, 5 drives on one channel, 3 unused, spool is on a 4 disk raid5,
>> jfs formatted.)
>
> OK, I've found two old similar reports from people running news servers :
> http://www.ussg.iu.edu/hypermail/linux/kernel/0308.1/0807.html
> http://seclists.org/lists/linux-kernel/2004/Jan/5699.html
>
> both were using an SMP server with an AIC7xxx adapter, and kernels
> varying from 2.4.18 to 2.4.24. One of them used XFS and not JFS, so we
> will exclude any potential JFS-related cause for now.

I am also building with highmem/4Gb support, which one of the reports
mentioned. I did not have any pmd messages while running 2.4.26 or
2.4.27, built with the same set of options (make oldconfig dep clean
bzimage .... )


> If you feel brave, you can try switching the AIC7xxx driver to Justin
> Gibbs' more recent version. It has not evolved during the last year, but
> I have it running reliably on production servers:
>
> http://people.freebsd.org/~gibbs/linux/
>
> I also have it rediffed for recent kernels if you prefer:
>
> http://w.ods.org/kernel/2.4-wt/2.4.32-wt2/patches-2.4.32-wt2/pool/aic79xx-20040522-linux-2.4.30-pre3.rediff

I've pulled the patch and saved it. I don't want to change more than one
thing at a time. I'll try the alternate driver if booting with nosmp
doesn't help.

> Out of curiosity, it would be interesting to disable swap if you have it
> enabled.

I'm running with 4G of swap, but usually don't dip more than 30M or 40M
into it. I'll add disabling swap to the list of things to check.



-Chris

2005-12-31 00:13:03

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

I oopsed again last night with an identical EIP and Call Trace to the oops
from the 28th. The new oops is below, the prior below that. I'm going to
reboot the machine into UP and see if that helps.

-Chris

Unable to handle kernel paging request at virtual address c211ce80
c0259bb3
*pde = 020001e3
Oops: 0002
CPU: 2
EIP: 0010:[alloc_skb+275/480] Not tainted
EFLAGS: 00010282
eax: c211ce80 ebx: f5303680 ecx: f7eeb780 edx: 00000680
esi: 000001f0 edi: 00000000 ebp: d348ddf0 esp: d348dddc
ds: 0018 es: 0018 ss: 0018
Process innfeed (pid: 25080, stackpage=d348d000)
Stack: 000006bc 000001f0 ebabc980 eb0e64d8 eb0e6400 d348de68 c027b50b 00000680
000001f0 000005a8 00000000 d348de54 00000000 00000000 00000001 00000000
012815b5 00000000 00000000 d7a160a0 d348c000 636686ac 000c3dec 000087c0
Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c211ce80 <_end+1d3b380/38650560>
>>ebx; f5303680 <_end+34f21b80/38650560>
>>ecx; f7eeb780 <_end+37b09c80/38650560>
>>ebp; d348ddf0 <_end+130ac2f0/38650560>
>>esp; d348dddc <_end+130ac2dc/38650560>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: c7 00 01 00 00 00 movl $0x1,(%eax)
Code; 00000006 Before first symbol
6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
Code; 0000000c Before first symbol
c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
Code; 00000013 Before first symbol
13: 8b 00 mov (%eax),%eax


On Wed, 28 Dec 2005, Chris Stromsoe wrote:

> Unable to handle kernel paging request at virtual address c22eee80
> c0259bb3
> *pde = 020001e3
> Oops: 0002
> CPU: 2
> EIP: 0010:[alloc_skb+275/480] Not tainted
> EFLAGS: 00010282
> eax: c22eee80 ebx: ccbdb480 ecx: 000006bc edx: 00000680
> esi: 000001f0 edi: 00000000 ebp: f663bdf0 esp: f663bddc
> ds: 0018 es: 0018 ss: 0018
> Process innfeed (pid: 526, stackpage=f663b000)
> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
> 000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
> d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
> Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
> Using defaults from ksymoops -t elf32-i386 -a i386
>
>>> eax; c22eee80 <_end+1f0d380/38650560>
>>> ebx; ccbdb480 <_end+c7f9980/38650560>
>>> ebp; f663bdf0 <_end+3625a2f0/38650560>
>>> esp; f663bddc <_end+3625a2dc/38650560>
>
> Code; 00000000 Before first symbol
> 00000000 <_EIP>:
> Code; 00000000 Before first symbol
> 0: c7 00 01 00 00 00 movl $0x1,(%eax)
> Code; 00000006 Before first symbol
> 6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
> Code; 0000000c Before first symbol
> c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
> Code; 00000013 Before first symbol
> 13: 8b 00 mov (%eax),%eax

2005-12-31 01:48:17

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

I'm starting to suspect bad hardware. Booting is now hanging (with
2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:

.....

Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-cdrom driver.
hda: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)


If I wait several minutes (around 10 or 15 minutes), I get:

scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue a TARGET RESET message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi0:0:0:0: Is not an active device
aic7xxx_dev_reset returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset: host 0 channel 0 id 0 lun 0


The messages repeated for all 15 targets on scsi0. It's looking like it
will repeat for scsi1 as well.

How likely is it that a failing scsi controller contributed to the other
problems I was seeing?


-Chris


2005-12-31 04:00:38

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27. It
was hanging and then throwing the SCSI errors below. The machine did come
up with a vanilla 2.6.14.4 and appears to be working fine. I'm going to
leave it up over the weekend and see if it oopses. If it would help, I
can mail out the .config for the 2.4.32 and 2.6.14.4 builds, or provide
other information of interest.

-Chris

On Fri, 30 Dec 2005, Chris Stromsoe wrote:

> I'm starting to suspect bad hardware. Booting is now hanging (with
> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
>
> .....
>
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a National Semiconductor PC87306
> Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: attached ide-cdrom driver.
> hda: ATAPI 24X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> SCSI subsystem driver Revision: 1.00
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
> <Adaptec 3960D Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
> <Adaptec 3960D Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
> <Adaptec aic7899 Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)
>
>
> If I wait several minutes (around 10 or 15 minutes), I get:
>
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue a TARGET RESET message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Is not an active device
> aic7xxx_dev_reset returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi: device set offline - not ready or command retry failed after bus reset: host 0 channel 0 id 0 lun 0
>
>
> The messages repeated for all 15 targets on scsi0. It's looking like it will
> repeat for scsi1 as well.
>
> How likely is it that a failing scsi controller contributed to the other
> problems I was seeing?
>
>
> -Chris
>
> On Fri, 30 Dec 2005, Chris Stromsoe wrote:
>
>> I oopsed again last night with an identical EIP and Call Trace to the oops
>> from the 28th. The new oops is below, the prior below that. I'm going to
>> reboot the machine into UP and see if that helps.
>>
>> -Chris
>>
>> Unable to handle kernel paging request at virtual address c211ce80
>> c0259bb3
>> *pde = 020001e3
>> Oops: 0002
>> CPU: 2
>> EIP: 0010:[alloc_skb+275/480] Not tainted
>> EFLAGS: 00010282
>> eax: c211ce80 ebx: f5303680 ecx: f7eeb780 edx: 00000680
>> esi: 000001f0 edi: 00000000 ebp: d348ddf0 esp: d348dddc
>> ds: 0018 es: 0018 ss: 0018
>> Process innfeed (pid: 25080, stackpage=d348d000)
>> Stack: 000006bc 000001f0 ebabc980 eb0e64d8 eb0e6400 d348de68 c027b50b
>> 00000680
>> 000001f0 000005a8 00000000 d348de54 00000000 00000000 00000001
>> 00000000
>> 012815b5 00000000 00000000 d7a160a0 d348c000 636686ac 000c3dec
>> 000087c0
>> Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80]
>> [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
>> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
>> Using defaults from ksymoops -t elf32-i386 -a i386
>>
>>
>>>> eax; c211ce80 <_end+1d3b380/38650560>
>>>> ebx; f5303680 <_end+34f21b80/38650560>
>>>> ecx; f7eeb780 <_end+37b09c80/38650560>
>>>> ebp; d348ddf0 <_end+130ac2f0/38650560>
>>>> esp; d348dddc <_end+130ac2dc/38650560>
>>
>> Code; 00000000 Before first symbol
>> 00000000 <_EIP>:
>> Code; 00000000 Before first symbol
>> 0: c7 00 01 00 00 00 movl $0x1,(%eax)
>> Code; 00000006 Before first symbol
>> 6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
>> Code; 0000000c Before first symbol
>> c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
>> Code; 00000013 Before first symbol
>> 13: 8b 00 mov (%eax),%eax
>>
>>
>> On Wed, 28 Dec 2005, Chris Stromsoe wrote:
>>
>>> Unable to handle kernel paging request at virtual address c22eee80
>>> c0259bb3
>>> *pde = 020001e3
>>> Oops: 0002
>>> CPU: 2
>>> EIP: 0010:[alloc_skb+275/480] Not tainted
>>> EFLAGS: 00010282
>>> eax: c22eee80 ebx: ccbdb480 ecx: 000006bc edx: 00000680
>>> esi: 000001f0 edi: 00000000 ebp: f663bdf0 esp: f663bddc
>>> ds: 0018 es: 0018 ss: 0018
>>> Process innfeed (pid: 526, stackpage=f663b000)
>>> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
>>> 000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
>>> d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
>>> Call Trace: [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80]
>>> [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
>>> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
>>> Using defaults from ksymoops -t elf32-i386 -a i386
>>>
>>>>> eax; c22eee80 <_end+1f0d380/38650560>
>>>>> ebx; ccbdb480 <_end+c7f9980/38650560>
>>>>> ebp; f663bdf0 <_end+3625a2f0/38650560>
>>>>> esp; f663bddc <_end+3625a2dc/38650560>
>>>
>>> Code; 00000000 Before first symbol
>>> 00000000 <_EIP>:
>>> Code; 00000000 Before first symbol
>>> 0: c7 00 01 00 00 00 movl $0x1,(%eax)
>>> Code; 00000006 Before first symbol
>>> 6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
>>> Code; 0000000c Before first symbol
>>> c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
>>> Code; 00000013 Before first symbol
>>> 13: 8b 00 mov (%eax),%eax
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>>
>

2005-12-31 07:14:50

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32


On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
> I'm starting to suspect bad hardware. Booting is now hanging (with
> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:

And nothing changed since previous boot, except UP ?

(...)
> If I wait several minutes (around 10 or 15 minutes), I get:
>
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue a TARGET RESET message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Is not an active device
> aic7xxx_dev_reset returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi: device set offline - not ready or command retry failed after bus
> reset: host 0 channel 0 id 0 lun 0
>
>
> The messages repeated for all 15 targets on scsi0. It's looking like it
> will repeat for scsi1 as well.
(...)

This brings back bad memories of my own machine a very long time ago,
when the driver was buggy :-(
It's not necessarily bad hardware. I also had trouble with one version
of the 29160 BIOS, where it hung during device scan if there were
too many terminations. Oh, BTW, please check that you have disabled
"automatic" termination in the BIOS. Manually set it either to ON or
OFF (low/high depending on your setup).

> How likely is it that a failing scsi controller contribute to the other
> problems I was seeing?

Not much. Perhaps at worst, a failing controller could corrupt memory
by writing garbage at wrong locations, but you would not always get
the same messages. It seems to be a different problem here. To be
honest, this is where I think you should try the new driver.

Regards,
Willy

2005-12-31 07:28:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Fri, Dec 30, 2005 at 08:00:34PM -0800, Chris Stromsoe wrote:
> I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27. It
> was hanging and then throwing the SCSI errors below. The machine did
> come up with a vanilla 2.6.14.4 and appears to be working fine. I'm
> going to leave it up over the weekend and see if it oopses. If it would
> help, I can mail out the .config for the 2.4.32 and 2.6.14.4 builds, or
> provide other information of interest.

Please do post at least the 2.4.32 .config, I'll try to boot it on my
system right here. I find it amazing that it suddenly stopped working
with the same kernels as before.

> -Chris

Willy

2005-12-31 10:39:45

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
>
>> I'm starting to suspect bad hardware. Booting is now hanging (with
>> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
>
> And nothing changed since previous boot, except UP ?

All I changed was adding nosmp to the kernel boot line.

> It's not necessarily bad hardware. I also had trouble on one version of
> the 29160 bios where it hanged during device scan if there were too many
> terminations. Oh, BTW, please check that you have disabled "automatic"
> termination in the BIOS. Manually set it either to ON or OFF (low/high
> depending on your setup).

I'll have to check it tomorrow or on Monday.

>> How likely is it that a failing scsi controller contribute to the other
>> problems I was seeing?
>
> Not much. Perhaps at worst, a failing controller could corrupt memory by
> writing garbage at wrong locations, but you would not always get the
> same messages. It seems to be a different problem here. To be honest,
> it's where I think you should try the new driver.

The machine has been running 2.6.14.4 for the last 6 hours. It came up
fine. I did not try booting it with nosmp. If I have time, I will revert
back to 2.4 with the newer driver to test.


-Chris

2005-12-31 10:59:37

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sat, Dec 31, 2005 at 02:39:43AM -0800, Chris Stromsoe wrote:
> On Sat, 31 Dec 2005, Willy Tarreau wrote:
> >On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
> >
> >>I'm starting to suspect bad hardware. Booting is now hanging (with
> >>2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
> >
> >And nothing changed since previous boot, except UP ?
>
> All I changed was adding nosmp to the kernel boot line.

OK maybe interrupts don't get distributed to the remaining CPU, which
would explain your timeouts.

> >It's not necessarily bad hardware. I also had trouble on one version of
> >the 29160 bios where it hanged during device scan if there were too many
> >terminations. Oh, BTW, please check that you have disabled "automatic"
> >termination in the BIOS. Manually set it either to ON or OFF (low/high
> >depending on your setup).
>
> I'll have to check it tomorrow or on Monday.
>
> >>How likely is it that a failing scsi controller contribute to the other
> >>problems I was seeing?
> >
> >Not much. Perhaps at worst, a failing controller could corrupt memory by
> >writing garbage at wrong locations, but you would not always get the
> >same messages. It seems to be a different problem here. To be honest,
> >it's where I think you should try the new driver.
>
> The machine has been running 2.6.14.4 for the last 6 hours. It came up
> fine. I did not try booting it with nosmp. If I have time, I will
> revert back to 2.4 with the newer driver to test.

Thanks.

> -Chris

Willy

2005-12-31 11:06:18

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Fri, Dec 30, 2005 at 08:00:34PM -0800, Chris Stromsoe wrote:
>
>> I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27.
>> It was hanging and then throwing the SCSI errors below. The machine
>> did come up with a vanilla 2.6.14.4 and appears to be working fine.
>> I'm going to leave it up over the weekend and see if it oopses. If it
>> would help, I can mail out the .config for the 2.4.32 and 2.6.14.4
>> builds, or provide other information of interest.
>
> Please do post at least the 2.4.32 .config, I'll try to boot it on my
> system right here. I find it amazing that it suddenly stopped working
> with the same kernels as before.

Both configs are at <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.

I have no idea why it wouldn't come up with nosmp on the command line
(being supplied by lilo as append="nosmp"). I tried warm boot, cold boot,
removing all power from the hardware. I booted from a rescue cd that had
2.6 on it and the machine came up right away. I tried to go back to 2.4
and it hung and then had SCSI errors again, so I installed 2.6.14.4 and
left it running. I can put a copy of the 2.4.32 kernel and modules up if
that would help.


-Chris

2005-12-31 12:06:52

by Alan

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002

IRQ routing by the look of that trace. Make sure that if you are using
2.4.x you have ACPI disabled and see if it looks any better.
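
For readers decoding the log above: the first byte of each CDB is the SCSI opcode, so 0x12 is an INQUIRY and the all-zero CDB is a TEST UNIT READY. A minimal Python sketch of that decoding follows. The function name and the two-entry opcode table are illustrative assumptions, not anything taken from the aic7xxx driver.

```python
# Illustrative helper (not part of any driver): decode the 6-byte CDBs
# that appear in the aic7xxx abort messages.
OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
}


def decode_cdb(line):
    """Parse a 'CDB: 0x12 0x0 ...' log line into (opcode name, byte list)."""
    text = line.split("CDB:", 1)[1]
    cdb = [int(tok, 16) for tok in text.split()]
    name = OPCODES.get(cdb[0], "unknown opcode 0x%02x" % cdb[0])
    return name, cdb


name, cdb = decode_cdb("CDB: 0x12 0x0 0x0 0x0 0xff 0x0")
# For INQUIRY, byte 4 carries the allocation length.
print(name, "allocation length:", cdb[4])
```

Both commands are the kind of probe sent during device scan, which fits a hang right after the SCSI driver loads. The sketch only names the commands; it says nothing about why they time out.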

2005-12-31 13:05:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

Hi Alan,

On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
> On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> > scsi0:0:0:0: Attempting to queue an ABORT message
> > CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> > scsi0:0:0:0: Command already completed
> > aic7xxx_abort returns 0x2002
>
> IRQ routing by the look of that trace. Make sure that if you are using
> 2.4.x you have ACPI disabled and see if it looks any better.

Correct, and I came to the same conclusion; Chris told us he booted with
the "nosmp" option. I've checked his config, and he has CONFIG_ACPI_BOOT=y.
I've just tried the same here, and I confirm that my machine (dual athlon)
does not boot with "nosmp" unless I also add "acpi=off". Mine even stops
earlier, while scanning IDE devices.

So now we're back to the original problem, i.e. why does he get bad pmd
that often on 2.4. It leaves us with the following possible next steps
after the problem occurs again (if it still happens with 2.6.14 or if
Chris is OK for a few more tests) :
- 2.4.32 nosmp acpi=off => the easiest one
- 2.4.32 + aic7xxx+20040522 => the more interesting one

Regards,
Willy

2006-01-05 03:52:47

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
>> On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
>>> scsi0:0:0:0: Attempting to queue an ABORT message
>>> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
>>> scsi0:0:0:0: Command already completed
>>> aic7xxx_abort returns 0x2002
>>
>> IRQ routing by the look of that trace. Make sure that if you are using
>> 2.4.x you have ACPI disabled and see if it looks any better.
>
> Correct, and I came to the same conclusion ; Chris told us he booted
> with the "nosmp" option. I've checked his config, and he has
> CONFIG_ACPI_BOOT=y. I've just tried the same here, and I confirm that my
> machine (dual athlon) does not boot with "nosmp" unless I also add
> "acpi=off". Mine even stops ealier, while scanning IDE devices.

2.6.14.4 has been running stable for 4 days. For the long term, I'll
probably migrate the box to 2.6 and leave it there.

> So now we're back to the original problem, i.e. why does he get bad pmd
> that often on 2.4. It leaves us with the following possible next steps
> after the problem occurs again (if it still happens with 2.6.14 or if
> Chris is OK for a few more tests) :
> - 2.4.32 nosmp acpi=off => the easiest one
> - 2.4.32 + aic7xxx+20040522 => the more interesting one

I booted 2.4.32 with the aic7xxx patch you pointed me at last week. It's
been up for a few hours. I'll let it run for at least a week or two and
will report back positive or negative results. After that, I'll try
2.4.32 with nosmp and acpi=off.


-Chris

2006-01-05 05:44:20

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
> On Sat, 31 Dec 2005, Willy Tarreau wrote:
> >On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
> >>On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> >>>scsi0:0:0:0: Attempting to queue an ABORT message
> >>>CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> >>>scsi0:0:0:0: Command already completed
> >>>aic7xxx_abort returns 0x2002
> >>
> >>IRQ routing by the look of that trace. Make sure that if you are using
> >>2.4.x you have ACPI disabled and see if it looks any better.
> >
> >Correct, and I came to the same conclusion ; Chris told us he booted
> >with the "nosmp" option. I've checked his config, and he has
> >CONFIG_ACPI_BOOT=y. I've just tried the same here, and I confirm that my
> >machine (dual athlon) does not boot with "nosmp" unless I also add
> >"acpi=off". Mine even stops ealier, while scanning IDE devices.
>
> 2.6.14.4 has been running stable for 4 days. For the long term, I'll
> probably migrate the box to 2.6 and leave it there.
>
> >So now we're back to the original problem, i.e. why does he get bad pmd
> >that often on 2.4. It leaves us with the following possible next steps
> >after the problem occurs again (if it still happens with 2.6.14 or if
> >Chris is OK for a few more tests) :
> > - 2.4.32 nosmp acpi=off => the easiest one
> > - 2.4.32 + aic7xxx+20040522 => the more interesting one
>
> I booted 2.4.32 with the aic7xxx patch you pointed me at last week. It's
> been up for a few hours. I'll let it run for at least a week or two and
> will report back positive or negative results. After that, I'll try
> 2.4.32 with nosmp and acpi=off.

Thanks for your continued feedback, Chris. Your reports are very helpful,
they tend to prove that your hardware is OK and that there's a bug in
mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a good
piece of information.

> -Chris

Regards,
Willy

2006-01-06 21:54:54

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Thu, 5 Jan 2006, Willy Tarreau wrote:
> On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
>
>> I booted 2.4.32 with the aic7xxx patch you pointed me at last week.
>> It's been up for a few hours. I'll let it run for at least a week or
>> two and will report back positive or negative results. After that,
>> I'll try 2.4.32 with nosmp and acpi=off.
>
> Thanks for your continued feedback, Chris. Your reports are very
> helpful, they tend to prove that your hardware is OK and that there's a
> bug in mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a
> good piece of information.

After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
another bad pmd and an oops this morning at 4:23am. I'm going to boot
vanilla 2.4.32 with nosmp and acpi=off.


-Chris

ksymoops 2.4.9 on i686 2.4.32-aic79xx. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.32-aic79xx/ (default)
-m /boot/System.map-2.4.32-aic79xx (specified)

Unable to handle kernel paging request at virtual address c2deee80
c025b3d3
*pde = 02c001e3
Oops: 0002
CPU: 2
EIP: 0010:[alloc_skb+275/480] Not tainted
EFLAGS: 00010282
eax: c2deee80 ebx: e0508880 ecx: 000006bc edx: 00000680
esi: 000001f0 edi: 00000000 ebp: f6cf7df0 esp: f6cf7ddc
ds: 0018 es: 0018 ss: 0018
Process innfeed (pid: 523, stackpage=f6cf7000)
Stack: 000006bc 000001f0 f3023b80 00000000 d307e000 f6cf7e68 c027cd2b 00000680
000001f0 000005a8 00000000 f6cf7e54 00000000 00000283 cb3f3000 c025a339
c8083280 00000000 00000000 c43428a0 f6cf6000 461800d6 00009bc7 00010430
Call Trace: [tcp_sendmsg+2619/4512] [sock_wfree+73/80] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c2deee80 <_end+2a0b300/3864e4e0>
>>ebx; e0508880 <_end+20124d00/3864e4e0>
>>ebp; f6cf7df0 <_end+36914270/3864e4e0>
>>esp; f6cf7ddc <_end+3691425c/3864e4e0>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: c7 00 01 00 00 00 movl $0x1,(%eax)
Code; 00000006 Before first symbol
6: 8b 83 8c 00 00 00 mov 0x8c(%ebx),%eax
Code; 0000000c Before first symbol
c: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
Code; 00000013 Before first symbol
13: 8b 00 mov (%eax),%eax
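
As a reading aid for the dump above, the "Oops:" value is the i386 page-fault error code and the low bits of "*pde" are the standard x86 page-table flag bits. A small Python sketch that spells both out follows. The function names are made up for illustration, and it only decodes bits; it does not explain why the fault happened.

```python
# x86 page-directory/page-table entry flag bits (low 9 bits of the entry).
PDE_FLAGS = [
    (0x001, "present"),
    (0x002, "writable"),
    (0x004, "user"),
    (0x008, "write-through"),
    (0x010, "cache-disable"),
    (0x020, "accessed"),
    (0x040, "dirty"),
    (0x080, "large-page (PSE)"),
    (0x100, "global"),
]


def decode_pde_flags(entry):
    """Return the names of the flag bits set in a page-directory entry."""
    return [name for bit, name in PDE_FLAGS if entry & bit]


def decode_oops_code(code):
    """Decode the low three bits of the i386 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


print(decode_pde_flags(0x02c001e3))
print(decode_oops_code(0x0002))
```

So "Oops: 0002" is a kernel-mode write to an address reported as not present, while the printed pde value has the present and large-page bits set. That reading of the two values is mine, not a conclusion anyone in the thread has confirmed.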

2006-01-06 22:14:35

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Fri, 6 Jan 2006, Chris Stromsoe wrote:

> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
> another bad pmd and an oops this morning at 4:23am. I'm going to boot
> vanilla 2.4.32 with nosmp and acpi=off.

Booting with "nosmp acpi=off" did not help. The box hung as before, at

hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-cdrom driver.
hda: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)


I waited about 10 minutes to see if it would continue, then booted back
into 2.6.14.4.



-Chris

2006-01-06 22:17:03

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Fri, 6 Jan 2006, Chris Stromsoe wrote:
> On Fri, 6 Jan 2006, Chris Stromsoe wrote:
>
>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
>> another bad pmd and an oops this morning at 4:23am. I'm going to boot
>> vanilla 2.4.32 with nosmp and acpi=off.
>
> booting with "nosmp acpi=off" did not help. The box hung as before, at

One last datapoint; 2.6.14.4 boots fine with "nosmp acpi=off".


-Chris

2006-01-07 09:19:10

by Roberto Nibali

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
>> another bad pmd and an oops this morning at 4:23am. I'm going to boot
>> vanilla 2.4.32 with nosmp and acpi=off.

Your oops does not make much sense. Could you enable the following, please:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_FRAME_POINTER=y

> booting with "nosmp acpi=off" did not help. The box hung as before, at

Could you boot with pci=noacpi and report again? The difference is that
ACPI will still be used but not for IRQ routing. I have a few boxes out
with 2.4.x kernels and Adaptec HBAs that need this to work reliably.

> hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: attached ide-cdrom driver.
> hda: ATAPI 24X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> SCSI subsystem driver Revision: 1.00
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36

What's the SCSI BIOS version?

> <Adaptec 3960D Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
> <Adaptec 3960D Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
> <Adaptec aic7899 Ultra160 SCSI adapter>
> aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)
>
> I waited about 10 minutes to see if it would continue, then booted back
> into 2.6.14.4.

How do /proc/interrupts and lspci -v differ between those kernels
once they've finished booting?
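
One way to make that comparison mechanical is to reduce each /proc/interrupts dump to "device -> IRQ and controller type" and print only what differs. A rough Python sketch follows. The function names and the two sample dumps are invented for illustration and are not taken from this machine.

```python
def parse_interrupts(text):
    """Map device name -> set of (IRQ label, controller type) pairs."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    devices = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[ncpus:]        # skip the per-CPU counters
        if len(fields) < 2:
            continue                         # NMI/LOC/ERR rows name no device
        controller = fields[0]
        for name in " ".join(fields[1:]).split(","):
            devices.setdefault(name.strip(), set()).add(
                (label.strip(), controller))
    return devices


def routing_changes(old, new):
    """Return devices whose IRQ number or controller type differs."""
    return {dev: (old.get(dev), new.get(dev))
            for dev in sorted(set(old) | set(new))
            if old.get(dev) != new.get(dev)}


# Made-up sample data, only to exercise the parser.
SAMPLE_24 = """\
           CPU0       CPU1
  0:     100000     100000    IO-APIC-edge  timer
 16:       5000       5000   IO-APIC-level  aic7xxx
"""
SAMPLE_26 = """\
           CPU0       CPU1
  0:     200000     200000    IO-APIC-edge  timer
 18:       7000       7000   IO-APIC-level  aic7xxx
"""

print(routing_changes(parse_interrupts(SAMPLE_24),
                      parse_interrupts(SAMPLE_26)))
```

Counters are ignored on purpose, since they always differ between boots. Only a change in IRQ number or controller type (for example XT-PIC versus IO-APIC) is reported.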

If you find time, send me your BIOS settings and your .config in private
email. I didn't track this thread from the beginning, so I don't know if
you've already done this.

It might also help to carry this problem over to the linux-scsi mailing
list, since, I believe, most SCSI guys don't read lkml too frequently.

Of course, if 2.6.x works for you and you need to go into production, then
I'd switch to it if I were you.

Just my 2 cents,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

2006-01-08 09:45:22

by Willy Tarreau

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

Hi Chris,

On Fri, Jan 06, 2006 at 01:54:45PM -0800, Chris Stromsoe wrote:
> On Thu, 5 Jan 2006, Willy Tarreau wrote:
> >On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
> >
> >>I booted 2.4.32 with the aic7xxx patch you pointed me at last week.
> >>It's been up for a few hours. I'll let it run for at least a week or
> >>two and will report back positive or negative results. After that,
> >>I'll try 2.4.32 with nosmp and acpi=off.
> >
> >Thanks for your continued feedback, Chris. Your reports are very
> >helpful, they tend to prove that your hardware is OK and that there's a
> >bug in mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a
> >good piece of information.
>
> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
> another bad pmd and an oops this morning at 4:23am. I'm going to boot
> vanilla 2.4.32 with nosmp and acpi=off.

Well, I'm puzzled. On the one hand, your oopses don't all look the
same, so we could think it's a hardware problem. On the other hand,
your hardware tests did not find anything and 2.6.14 runs fine. BTW,
I also have other machines running in production with an adaptec
29160 like yours and I don't encounter this. It looks like some
memory corruption, but finding what causes it seems very hard. In
fact, it somewhat reminds me of the problems encountered by Stephan
von Krawczynski 2.5 years ago. He encountered data corruption when
saving large amounts of data to a DLT connected to an AIC7xxx, and
often had freezes, and sometimes an oops. IIRC, changing the board
for something else fixed his problem.

I've compared the driver between 2.4 and 2.6, and the core has not
changed much, but its interface to the OS has changed a lot, so it's
not easy to identify a potential fix.

Even though I don't like this, I would second Roberto's advice to
upgrade to 2.6 and stick with it. If you eventually encounter the
same problem on 2.6 after a very long time, then it would be
an indication that the problem really is in your hardware.

> -Chris

Thanks for all your investigations,
Willy

2006-01-09 18:28:09

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sat, 7 Jan 2006, Roberto Nibali wrote:

>>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
>>> another bad pmd and an oops this morning at 4:23am. I'm going to boot
>>> vanilla 2.4.32 with nosmp and acpi=off.
>
> Your oops does not make much sense, could you enable following, please:
>
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_FRAME_POINTER=y

kernel, sysrq, and frame_pointer were already enabled. I'll enable
debug_slab, as well.

>> booting with "nosmp acpi=off" did not help. The box hung as before, at
>
> Could you boot with pci=noacpi and report again? The difference is that
> ACPI will still be used but not for IRQ routing. I have a few boxes out
> with 2.4.x kernels and Adaptec HBAs that need this to work reliably.

Are you interested in results from "pci=noacpi" by itself or in
conjunction with nosmp?

> What's the SCSI BIOS version?

The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650), and
reports itself as "25309".

> How do /proc/interrupts and lspci -v differ between those kernels
> once they've finished booting?

> If you find time, send me your BIOS settings and your .config in private
> email. I didn't track this thread from the beginning, so I don't know if
> you've already done this.

<http://hashbrown.cts.ucla.edu/pub/oops-200512/> has the .config, lspci
-v, and /proc/interrupts for 2.6.14.4 and 2.4.32.


-Chris

2006-01-09 18:33:21

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, 8 Jan 2006, Willy Tarreau wrote:

> Even though I don't like this, I would second Roberto's advice to upgrade
> to 2.6 and stick with it. If you eventually encounter the same problem on 2.6
> after a very long time, then it would be an indication that the problem
> really is in your hardware.

I'll keep 2.4.32 with DEBUG_SLAB up until it oopses again and will report
that. After, I'll probably stick with 2.6. Thanks.

-Chris

2006-01-09 20:16:08

by Roberto Nibali

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

>> CONFIG_DEBUG_KERNEL=y
>> CONFIG_DEBUG_SLAB=y
>> CONFIG_MAGIC_SYSRQ=y
>> CONFIG_FRAME_POINTER=y
>
> kernel, sysrq, and frame_pointer were already enabled. I'll enable
> debug_slab, as well.

Excellent.

>>> booting with "nosmp acpi=off" did not help. The box hung as before, at
>>
>> Could you boot with pci=noacpi and report again? The difference is
>> that ACPI will still be used but not for IRQ routing. I have a few
>> boxes out with 2.4.x kernels and Adaptec HBAs that need this to work
>> reliably.
>
> Are you interested in results from "pci=noacpi" by itself or in
> conjunction with nosmp?

With SMP, please.

>> What's the SCSI BIOS version?
>
> The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650),
> and reports itself as "25309".

What I meant was the SCSI Bios revision you get to see when you cold
reset the system.

>> If you find time, send me your BIOS settings and your .config in
>> private email. I didn't track this thread from the beginning, so I
>> don't know if you've already done this.
>
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/> has the .config, lspci
> -v, and /proc/interrupts for 2.6.14.4 and 2.4.32.

Thanks, I'll skim over these and get back to you if I can correlate
anything with the issues we were having using this controller.

Regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

2006-01-09 20:22:23

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Mon, 9 Jan 2006, Roberto Nibali wrote:

>>> What's the SCSI BIOS version?
>>
>> The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650),
>> and reports itself as "25309".
>
> What I meant was the SCSI Bios revision you get to see when you cold
> reset the system.

That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
that's the onboard AIC 7899. It comes up as "BIOS Build 25309".


-Chris

2006-01-09 22:20:55

by Roberto Nibali

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

> That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
> that's the onboard AIC 7899. It comes up as "BIOS Build 25309".

Brain is engaged now, thanks ;). If you find time, could you maybe
compile a 2.4.32 kernel using the following config (slightly changed from
yours):

http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s

And put a dmidecode[1] output onto your website. Is the BMC interface
enabled in your BIOS?

[1] http://download.savannah.nongnu.org/releases/dmidecode/

Best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc

2006-01-10 01:00:00

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Mon, 9 Jan 2006, Roberto Nibali wrote:

>> That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
>> that's the onboard AIC 7899. It comes up as "BIOS Build 25309".
>
> Brain is engaged now, thanks ;). If you find time, could you maybe
> compile a 2.4.32 kernel using following config (slightly changed from
> yours):
>
> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s

If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
config modifications.

> And put a dmidecode[1] output onto your website.

http://hashbrown.cts.ucla.edu/pub/oops-200512/dmidecode.out

> Is the BMC interface enabled in your BIOS?

I haven't changed the BMC defaults and am not using it, but I believe that
it shipped as enabled so should still be.


-Chris

2006-01-15 11:29:24

by Chris Stromsoe

[permalink] [raw]
Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>
>>> That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
>>> that's the onboard AIC 7899. It comes up as "BIOS Build 25309".
>>
>> Brain is engaged now, thanks ;). If you find time, could you maybe
>> compile a 2.4.32 kernel using following config (slightly changed from
>> yours):
>>
>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>
> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
> config modifications.

I've been running stable with the proposed changes since the 10th. The
original config and the currently running config are both at
<http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:

cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115

65c65
< CONFIG_HIGHIO=y
---
> # CONFIG_HIGHIO is not set
69c69
< CONFIG_NR_CPUS=32
---
> CONFIG_NR_CPUS=4
87c87
< CONFIG_ISA=y
---
> # CONFIG_ISA is not set
109c109
< # CONFIG_ACPI is not set
---
> CONFIG_ACPI=y
110a111,127
> CONFIG_ACPI_BUS=y
> CONFIG_ACPI_INTERPRETER=y
> CONFIG_ACPI_EC=y
> CONFIG_ACPI_POWER=y
> CONFIG_ACPI_PCI=y
> CONFIG_ACPI_MMCONFIG=y
> CONFIG_ACPI_SLEEP=y
> CONFIG_ACPI_SYSTEM=y
> # CONFIG_ACPI_AC is not set
> # CONFIG_ACPI_BATTERY is not set
> # CONFIG_ACPI_BUTTON is not set
> # CONFIG_ACPI_FAN is not set
> # CONFIG_ACPI_PROCESSOR is not set
> # CONFIG_ACPI_THERMAL is not set
> # CONFIG_ACPI_ASUS is not set
> # CONFIG_ACPI_TOSHIBA is not set
> # CONFIG_ACPI_DEBUG is not set
385c402
< # CONFIG_AIC7XXX_DEBUG_ENABLE is not set
---
> CONFIG_AIC7XXX_DEBUG_ENABLE=y
387c404
< # CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
---
> CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
492,493d508
< # CONFIG_AT1700 is not set
< # CONFIG_DEPCA is not set
500d514
< # CONFIG_AC3200 is not set
585,589d598
< # Old CD-ROM drivers (not SCSI, not IDE)
< #
< # CONFIG_CD_NO_IDESCSI is not set
<
< #
864,865c873,874
< # CONFIG_DEBUG_HIGHMEM is not set
< # CONFIG_DEBUG_SLAB is not set
---
> CONFIG_DEBUG_HIGHMEM=y
> CONFIG_DEBUG_SLAB=y
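
A line-based diff like the one above shifts its line numbers as soon as a block of options is inserted (the ACPI entries here), so it can help to compare the two files option by option instead. The following is a minimal sketch and not part of the original report: the function names `parse_config` and `config_changes` are made up for illustration, and it assumes only the two .config line forms visible in the diff, `CONFIG_FOO=value` and `# CONFIG_FOO is not set`.

```python
import re

# The two line forms that appear in the diff above.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; 'is not set' options map to None."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = None
    return options


def config_changes(old_text, new_text):
    """Return sorted (option, old_value, new_value) tuples for options that differ.

    An option that does not appear in one file at all is reported as
    "<absent>" on that side, which is distinct from "is not set" (None).
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        a = old.get(name, "<absent>")
        b = new.get(name, "<absent>")
        if a != b:
            changes.append((name, a, b))
    return changes
```

To use it, read the two config files into strings and pass them to `config_changes`; each returned tuple names one option whose state differs between the two builds.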




-Chris

2006-01-15 12:13:06

by Willy Tarreau

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, Jan 15, 2006 at 03:29:15AM -0800, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> >On Mon, 9 Jan 2006, Roberto Nibali wrote:
> >
> >>>That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
> >>>that's the onboard AIC 7899. It comes up as "BIOS Build 25309".
> >>
> >>Brain is engaged now, thanks ;). If you find time, could you maybe
> >>compile a 2.4.32 kernel using following config (slightly changed from
> >>yours):
> >>
> >>http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
> >
> >If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
> >config modifications.
>
> I've been running stable with the proposed changes since the 10th. The
> original config and the currently running config are both at
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:
>
> cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115
>
> 65c65
> < CONFIG_HIGHIO=y
> ---
> ># CONFIG_HIGHIO is not set

I wonder if this change could be suspected of affecting stability. With
this unset, data will be sent from the card to low memory, then bounced
to high mem when needed. Maybe the card, northbridge or anything else
sometimes corrupts memory during direct highmem I/O from PCI ? :-/

Or perhaps it's simply too early to conclude anything.

Thanks for your report anyway.

Regards,
Willy

2006-01-15 21:18:28

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, 15 Jan 2006, Willy Tarreau wrote:
> On Sun, Jan 15, 2006 at 03:29:15AM -0800, Chris Stromsoe wrote:
>>
>> I've been running stable with the proposed changes since the 10th. The
>> original config and the currently running config are both at
>> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:
>>
>> cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115
>>
>> 65c65
>> < CONFIG_HIGHIO=y
>> ---
>> > # CONFIG_HIGHIO is not set
>
> I wonder if this change could be suspected of affecting stability. With
> this unset, data will be sent from the card to low memory, then bounced
> to high mem when needed. Maybe the card, northbridge or anything else
> sometimes corrupts memory during direct highmem I/O from PCI ? :-/

I'll let it run for another week as it is. If it would be useful
information, I can switch CONFIG_HIGHIO back to =y and let that kernel run
for a while. Otherwise, I'll probably switch permanently to 2.6.


-Chris

2006-01-15 22:47:03

by Willy Tarreau

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, Jan 15, 2006 at 02:38:51PM -0800, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> >On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> >>On Mon, 9 Jan 2006, Roberto Nibali wrote:
> >>
> >>>>That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
> >>>>that's the onboard AIC 7899. It comes up as "BIOS Build 25309".
> >>>
> >>>Brain is engaged now, thanks ;). If you find time, could you maybe
> >>>compile a 2.4.32 kernel using following config (slightly changed from
> >>>yours):
> >>>
> >>>http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
> >>
> >>If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
> >>config modifications.
> >
> >I've been running stable with the proposed changes since the 10th. The
> >original config and the currently running config are both at
> ><http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:
>
> I made a mistake.
>
> The machine was /not/ booted into that config. It is running the original
> config from http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32
> with DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
>
> The config with HIGHIO disabled and ACPI=y has not been tested.

Thanks for the precision. So logically we should expect it to break sooner
or later ?

>
> -Chris

Willy

2006-01-15 22:38:57

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
>> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>>
>>>> That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650 and
>>>> that's the onboard AIC 7899. It comes up as "BIOS Build 25309".
>>>
>>> Brain is engaged now, thanks ;). If you find time, could you maybe
>>> compile a 2.4.32 kernel using following config (slightly changed from
>>> yours):
>>>
>>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>>
>> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
>> config modifications.
>
> I've been running stable with the proposed changes since the 10th. The
> original config and the currently running config are both at
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:

I made a mistake.

The machine was /not/ booted into that config. It is running the original
config from http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32
with DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.

The config with HIGHIO disabled and ACPI=y has not been tested.


-Chris

2006-01-15 22:54:21

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, 15 Jan 2006, Willy TARREAU wrote:
> On Sun, Jan 15, 2006 at 02:38:51PM -0800, Chris Stromsoe wrote:
>> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
>>> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
>>>> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>>>>
>>>>>> That is the SCSI BIOS rev. The machine is a Dell PowerEdge 2650
>>>>>> and that's the onboard AIC 7899. It comes up as "BIOS Build
>>>>>> 25309".
>>>>>
>>>>> Brain is engaged now, thanks ;). If you find time, could you maybe
>>>>> compile a 2.4.32 kernel using following config (slightly changed
>>>>> from yours):
>>>>>
>>>>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>>>>
>>>> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the
>>>> config modifications.
>>>
>>> I've been running stable with the proposed changes since the 10th.
>>> The original config and the currently running config are both at
>>> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>. This is the diff:
>>
>> I made a mistake.
>>
>> The machine was /not/ booted into that config. It is running the
>> original config from
>> http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 with
>> DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
>>
>> The config with HIGHIO disabled and ACPI=y has not been tested.
>
> Thanks for the precision. So logically we should expect it to break
> sooner or later ?

It is the same .config as one that crashed before, except that it has
DEBUG_SLAB defined. If it does not crash, then adding pci=noacpi to the
command line fixes the problem for me.


-Chris

2006-01-16 20:52:56

by Roberto Nibali

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

>>> The machine was /not/ booted into that config. It is running the
>>> original config from
>>> http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 with
>>> DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
>>>
>>> The config with HIGHIO disabled and ACPI=y has not been tested.

CONFIG_SMP at least sets CONFIG_ACPI_BOOT. Do you still have the boot
messages somewhere (dmesg)? I'd be interested in the difference between
IOAPIC PCI routing entries between pci=noacpi and normal boot.
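
A comparison like this could be done roughly as follows, given a saved dmesg capture from each boot. This is only a sketch: the names `routing_lines` and `compare_boot_logs` are hypothetical, and the default keyword list ("APIC", "IRQ") is an assumption, since the exact wording of the routing messages depends on the kernel version. Adjust the keywords to whatever the two boot logs actually print.

```python
import difflib


def routing_lines(log_text, keywords=("APIC", "IRQ")):
    """Keep only log lines that mention one of the keywords (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [
        line.rstrip()
        for line in log_text.splitlines()
        if any(k in line.lower() for k in wanted)
    ]


def compare_boot_logs(normal_log, noacpi_log, keywords=("APIC", "IRQ")):
    """Return a unified diff (list of lines) of the filtered boot messages."""
    return list(
        difflib.unified_diff(
            routing_lines(normal_log, keywords),
            routing_lines(noacpi_log, keywords),
            fromfile="normal boot",
            tofile="pci=noacpi",
            lineterm="",
        )
    )
```

Filtering before diffing keeps unrelated boot noise out of the result, so any line that survives points at an interrupt-related message that differs between the two boots.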

>> Thanks for the precision. So logically we should expect it to break
>> sooner or later ?
>
> It is the same .config as one that crashed before, except that it has
> DEBUG_SLAB defined. If it does not crash, then adding pci=noacpi to the
> command fixes the problem for me.

Hmm, I'm not fully convinced yet, however glad that it has been a bit
more stable for you.

Sidenote: We boot our systems having built-in AIC7* SCSI on moderately
cheap motherboards with "bad" interrupt routing using pci=noacpi on
2.4.x kernels to evade instability.

I suggest that if you experience more problems using this setup _and_
would like to continue debugging the issue, we take this off-list into a
private discussion.

[Another thing which would be interesting to test regarding the HIGHIO
setting is a RedHat based 2.4.x kernel, since according to some SCSI
driver's documentation, RedHat had a different HIGHIO convention.]

Thanks for your feedback,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

2006-01-16 21:32:57

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Mon, 16 Jan 2006, Roberto Nibali wrote:

>>> Thanks for the precision. So logically we should expect it to break
>>> sooner or later ?
>>
>> It is the same .config as one that crashed before, except that it has
>> DEBUG_SLAB defined. If it does not crash, then adding pci=noacpi to
>> the command fixes the problem for me.
>
> Hmm, I'm not fully convinced yet, however glad that it has been a bit
> more stable for you.

The stability only lasted for a week. Last night I got another bad pmd
message, an oops, and a hang. I was not able to capture the oops.

> Sidenote: We boot our systems having built-in AIC7* SCSI on moderately
> cheap motherboards with "bad" interrupt routing using pci=noacpi on
> 2.4.x kernels to evade instability.
>
> I suggest that if you experience more problems using this setup _and_
> would like to continue debugging the issue, we take this off-list into a
> private discussion.

At this point, I'm going to stick with 2.6. If I get more time to debug
this later, I'll drop back down to the modified 2.4 with HIGHIO disabled.

> [Another thing which would be interesting to test regarding the HIGHIO
> setting is a RedHat based 2.4.x kernel, since according to some SCSI
> driver's documentation, RedHat had a different HIGHIO convention.]

Thanks. I'll keep that on my list of things to try if I ever get back to
this. I appreciate the pointers.


-Chris

2006-02-08 06:32:50

by Chris Stromsoe

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Willy TARREAU wrote:
>>
>> Thanks for the precision. So logically we should expect it to break
>> sooner or later ?
>
> It is the same .config as one that crashed before, except that it has
> DEBUG_SLAB defined. If it does not crash, then adding pci=noacpi to the
> command fixes the problem for me.

For what it's worth, I'm fairly certain at this point that the problem was
hardware related. After a week of uptime with 2.6 we had another pmd
error and oops. We then replaced the system board and one of the CPUs and
have not seen any problems since.


-Chris

2006-02-08 06:38:17

by Willy Tarreau

Subject: Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32

On Tue, Feb 07, 2006 at 10:32:45PM -0800, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> >On Sun, 15 Jan 2006, Willy TARREAU wrote:
> >>
> >>Thanks for the precision. So logically we should expect it to break
> >>sooner or later ?
> >
> >It is the same .config as one that crashed before, except that it has
> >DEBUG_SLAB defined. If it does not crash, then adding pci=noacpi to the
> >command fixes the problem for me.
>
> For what it's worth, I'm fairly certain at this point that the problem
> was hardware related. After a week of uptime with 2.6 we had another pmd
> error and oops. We then replaced the system board and one of the CPUs
> and have not seen any problems since.

Chris, thank you very much for this useful feedback. Now we're sure that
it's not worth investigating on the aic7xxx driver for any potential
memory corruption bug.

> -Chris

Regards,
Willy