2005-12-07 09:37:53

by Jan Oberländer

Subject: 2.4.32 Oops in scsi_dispatch_cmd

Hi,

[please Cc:, I'm not on the list!]

I've been receiving Oops repeatedly in scsi_dispatch_cmd, in different
kernels (2.4.{27,31,32}). At first I thought that a non-free module for
an ATA RAID card was responsible, but in 2.4.32 I've seen it without
using that as well (in a non-tainted kernel). I can't exactly rule out
that it's a hardware problem, but since it's very repeatable I'm really
not sure. The box doesn't recover from the Oops, but I was still able
to see the Oops message in /var/log/messages. See the attached file for
ksymoops output.

The tar process is run from a backup script that mounts an IDE drive
partition, writes a backup to it and unmounts it. It's always been the
tar process behind this crash. Some system details:

$ uname -a
Linux server 2.4.32 #1 Tue Dec 6 10:55:41 CET 2005 i686 GNU/Linux
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 7
model name : AMD Duron(tm) Processor
stepping : 1
cpu MHz : 1197.726
cache size : 64 KB
[...]
$ lspci
0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
0000:00:0d.0 SCSI storage controller: Adaptec AHA-2940U2/U2W
0000:00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
0000:00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
0000:00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
0000:00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
0000:00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
0000:00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
0000:00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
0000:00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
0000:01:00.0 VGA compatible controller: nVidia Corporation NV5M64 [RIVA TNT2 Model 64/Model 64 Pro] (rev 15)
$

There are no modules loaded. The IDE HDD is on the VIA controller,
several other disks are behind the Adaptec controller (several of them
on Software RAID).

Maybe you have an idea what's going on (or who to talk to instead)? I'm
stuck.

Thanks in advance & keep up the good work!

Jan

--

+-------------------------------------+
| Jan Oberländer <[email protected]> |
| PGP key: 0xC4D910E3 |
+-------------------------------------+



2005-12-07 21:50:20

by Willy Tarreau

Subject: Re: 2.4.32 Oops in scsi_dispatch_cmd

Hi,

On Wed, Dec 07, 2005 at 10:37:47AM +0100, Jan Oberländer wrote:
> Hi,
>
> [please Cc:, I'm not on the list!]
>
> I've been receiving Oops repeatedly in scsi_dispatch_cmd, in different
> kernels (2.4.{27,31,32}). At first I thought that a non-free module for
> an ATA RAID card was responsible, but in 2.4.32 I've seen it without
> using that as well (in a non-tainted kernel). I can't exactly rule out
> that it's a hardware problem, but since it's very repeatable I'm really
> not sure. The box doesn't recover from the Oops, but I was still able
> to see the Oops message in /var/log/messages. See the attached file for
> ksymoops output.

could you send your .config and gcc version please ? I've checked the
code and it's not easy to find what data is accessed in your oopses.

> The tar process is run from a backup script that mounts an IDE drive
> partition, writes a backup to it and unmounts it. It's always been the
> tar process behind this crash. Some system details:

could you please also tell us what partition tar reads data from ? I've
understood that you have some disks on your adaptec card and some software
RAID, so if you could roughly explain the setup, it would be great.

> $ uname -a
> Linux server 2.4.32 #1 Tue Dec 6 10:55:41 CET 2005 i686 GNU/Linux
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 6
> model : 7
> model name : AMD Duron(tm) Processor
> stepping : 1
> cpu MHz : 1197.726
> cache size : 64 KB
> [...]
> $ lspci
> 0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
> 0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
> 0000:00:0d.0 SCSI storage controller: Adaptec AHA-2940U2/U2W
> 0000:00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 0000:00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 0000:00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
> 0000:00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
> 0000:00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
> 0000:00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> 0000:00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
> 0000:00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
> 0000:01:00.0 VGA compatible controller: nVidia Corporation NV5M64 [RIVA TNT2 Model 64/Model 64 Pro] (rev 15)
> $
>
> There are no modules loaded. The IDE HDD is on the VIA controller,
> several other disks are behind the Adaptec controller (several of them
> on Software RAID).
>
> Maybe you have an idea what's going on (or who to talk to instead)? I'm
> stuck.

It may be doable with a bit more info. Please also confirm that your
System.map really matches your kernel (for the oops report).

> Thanks in advance & keep up the good work!
>
> Jan

Regards,
Willy

2005-12-07 22:53:10

by Jan Oberländer

Subject: Re: 2.4.32 Oops in scsi_dispatch_cmd

On Wed, Dec 07, 2005 at 10:50:14PM +0100, Willy Tarreau wrote:
> On Wed, Dec 07, 2005 at 10:37:47AM +0100, Jan Oberländer wrote:
> > I've been receiving Oops repeatedly
>
> could you send your .config and gcc version please ? I've checked the
> code and it's not easy to find what data is accessed in your oopses.

I attached the .config.

$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/3.3.5/specs
Configured with: ../src/configure -v
--enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info
--with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared
--enable-__cxa_atexit --with-system-zlib --enable-nls
--without-included-gettext --enable-clocale=gnu --enable-debug
--enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc
i486-linux
Thread model: posix
gcc version 3.3.5 (Debian 1:3.3.5-13)

> > The tar process is run from a backup script that mounts an IDE
> > drive partition, writes a backup to it and unmounts it. It's always
> > been the tar process behind this crash. Some system details:
>
> could you please also tell us what partition tar reads data from ?
> I've understood that you have some disks on your adaptec card and some
> software RAID, so if you could roughly explain the setup, it would be
> great.

The setup:

/dev/hda on onboard IDE
/dev/sd{a,b,c,d,e,f} on Adaptec
md0 : active raid5 sdd1[2] sdc1[1] sdb1[0]
md1 : active raid5 sdd2[2] sdc2[1] sdb2[0]
md3 : active raid5 sdd3[2] sdc3[1] sdb3[0]
md2 : active raid1 sdf1[1] sde1[0]

The backup script roughly does the following:
1. mount hda
2. backup data from the md*,sd* devices to hda
3. umount hda

As I said, the IDE drive was on an ATA RAID card at first, visible to
the system as /dev/sdg. I changed this because of the tainted ATA RAID
module, but I'm receiving the same oops either way.

> It may be doable with a bit more info. Please also confirm that your
> System.map really matches your kernel (for the oops report).

I double-checked.

Tell me if you need any further information.

Best wishes,

Jan

--

+-------------------------------------+
| Jan Oberländer <[email protected]> |
| PGP key: 0xC4D910E3 |
+-------------------------------------+



2005-12-08 20:16:28

by Willy Tarreau

Subject: Re: 2.4.32 Oops in scsi_dispatch_cmd

Hi,

On Wed, Dec 07, 2005 at 11:52:44PM +0100, Jan Oberländer wrote:
> On Wed, Dec 07, 2005 at 10:50:14PM +0100, Willy Tarreau wrote:
> > On Wed, Dec 07, 2005 at 10:37:47AM +0100, Jan Oberländer wrote:
> > > I've been receiving Oops repeatedly
> >
> > could you send your .config and gcc version please ? I've checked the
> > code and it's not easy to find what data is accessed in your oopses.
>
> I attached the .config.
>
> $ gcc -v
> Reading specs from /usr/lib/gcc-lib/i486-linux/3.3.5/specs
> Configured with: ../src/configure -v
> --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr
> --mandir=/usr/share/man --infodir=/usr/share/info
> --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared
> --enable-__cxa_atexit --with-system-zlib --enable-nls
> --without-included-gettext --enable-clocale=gnu --enable-debug
> --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc
> i486-linux
> Thread model: posix
> gcc version 3.3.5 (Debian 1:3.3.5-13)

OK, thanks, I could reproduce the same code.


> > > The tar process is run from a backup script that mounts an IDE
> > > drive partition, writes a backup to it and unmounts it. It's always
> > > been the tar process behind this crash. Some system details:
> >
> > could you please also tell us what partition tar reads data from ?
> > I've understood that you have some disks on your adaptec card and some
> > software RAID, so if you could roughly explain the setup, it would be
> > great.
>
> The setup:
>
> /dev/hda on onboard IDE
> /dev/sd{a,b,c,d,e,f} on Adaptec
> md0 : active raid5 sdd1[2] sdc1[1] sdb1[0]
> md1 : active raid5 sdd2[2] sdc2[1] sdb2[0]
> md3 : active raid5 sdd3[2] sdc3[1] sdb3[0]
> md2 : active raid1 sdf1[1] sde1[0]
>
> The backup script roughly does the following:
> 1. mount hda
> 2. backup data from the md*,sd* devices to hda
> 3. umount hda
>
> As I said, the IDE drive was on an ATA RAID card at first, visible to
> the system as /dev/sdg. I changed this because of the tainted ATA RAID
> module, but I'm receiving the same oops either way.
>
> > It may be doable with a bit more info. Please also confirm that your
> > System.map really matches your kernel (for the oops report).
>
> I double-checked.

Fine.

> Tell me if you need any further information.

I must say I'm a bit lost. I found this in drivers/scsi/scsi.c:

        if (host->hostt->use_new_eh_code) {
                scsi_add_timer(SCpnt, SCpnt->timeout_per_command, scsi_times_out);
        } else {
                scsi_add_timer(SCpnt, SCpnt->timeout_per_command,
                               scsi_old_times_out);
        }


The oops you're reporting shows that eax==0 below:

        mov    0x24(%edi),%eax
        testb  $0x4,0x67(%eax)

At this point, eax = (struct Scsi_Host_Template *)host->hostt, so
host->hostt == NULL. The problem is that hostt is assigned only once,
in drivers/scsi/hosts.c:scsi_register(), directly from the
Scsi_Host_Template *tpnt passed as the first argument. tpnt is
dereferenced many times before being assigned to host->hostt, so a
NULL tpnt should have crashed far earlier and never reached this code.
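
To illustrate the reasoning above, here is a minimal user-space sketch
of the access pattern. The mock_* structures and functions are
hypothetical stand-ins, not the real 2.4 definitions (the real
Scsi_Host and Scsi_Host_Template have many more fields and a different
layout). The assumption that the testb checks the use_new_eh_code flag
comes only from matching the disassembly against the excerpt.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the 2.4 structures.
 * Only the pointer chain host -> hostt -> flag matters here. */
struct mock_host_template {
    unsigned use_new_eh_code : 1;  /* flag tested in the excerpt */
};

struct mock_host {
    struct mock_host_template *hostt;  /* assigned once at registration */
};

/* Returns 1 or 0 for the flag, or -1 when hostt is NULL.
 * The -1 case is the state the oops implies (eax == 0 before testb).
 * The kernel code has no such guard and faults instead. */
int mock_uses_new_eh(const struct mock_host *host)
{
    if (host == NULL || host->hostt == NULL)
        return -1;
    return host->hostt->use_new_eh_code ? 1 : 0;
}

/* Mirrors `testb $0x4,0x67(%eax)`: test bit 2 (mask 0x04) of the
 * byte at offset 0x67 from base. base must cover at least 0x68 bytes. */
int mock_testb(const unsigned char *base)
{
    return (base[0x67] & 0x04) != 0;
}
```

With eax == 0, the testb reads address 0x67, which is unmapped, so the
fault happens on the flag test itself rather than on the load of hostt.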

So unless I'm missing something, I see two possibilities:
- a bug somewhere else corrupted the struct Scsi_Host and
put a NULL in hostt;

- a hardware problem is playing tricks on you. I'd personally
check this area first.

So I suggest that you run memtest on the whole system overnight
if possible (at least several hours) to check the memory. If you
cannot stop the system for that long, then you might instead swap
all memory modules with those from another system and check whether
the problem still happens.

A chipset problem or CPU overheating during this intensive backup
activity is also possible.

Good luck,
Willy

PS: please keep the whole CC list on LKML, as people generally
don't read the list all day.