2001-07-12 21:23:17

by Luc Lalonde

Subject: Adaptec SCSI driver lockups

Hello folks,

I'm having trouble identifying whether I'm having hardware or software
(OS) problems. For the past couple of months I've been having system
lockups every 10 days or so.

I suspect that it is a problem with the Adaptec 39160 SCSI controller
that is on my system (aic7899). The lockups always occur when I'm
backing up to my HP DAT40, which is connected to channel A of this SCSI
controller. The strange thing is that I back up to this tape every night,
most of the time without problems.

Here is some info on my system:

Kernel: 2.4.6pre3 (I've been getting the same problem since 2.4.2)

Here is all the SCSI controller info (the first two are on the
motherboard):

scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.1.13
<Adaptec aic7890/91 Ultra2 SCSI adapter>
aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/255 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.1.13
<Adaptec aic7880 Ultra SCSI adapter>
aic7880: Ultra Single Channel A, SCSI Id=7, 16/255 SCBs

scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.1.13
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/255 SCBs

scsi3 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.1.13
<Adaptec 3960D Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/255 SCBs

This is a PowerEdge 2400 Dual 866MHz Pentium III:

Motherboard: ServerWorks Entry ServerSet III LE64 Chipset
HardDrives: Four 36Gig Ultra160 connected to scsi0 (aic7890/91)
TapeDrive: HP DAT40 connected to scsi2

I've checked the termination on the 39160 and everything seems fine.

Would anyone have any pointers? Please CC:[email protected]

Thanks.

--
Luc Lalonde, Responsable du reseau GIREF

Telephone: (418) 656-2131 poste 6623
Courriel: [email protected]


Attachments:
llalonde.vcf (210.00 B)
Card for Luc Lalonde

2001-07-12 21:32:36

by Alan

Subject: Re: Adaptec SCSI driver lockups

> I suspect that it is a problem with the Adaptec 39160 SCSI controller
> that is on my system (aic7899). The lockups always occur when I'm
> backing up to my HP DAT40 that is connected to channel A of this SCSI
> controller. The strange thing is that I backup every night most of the
> time without problems to this tape.

I had to switch to the aic7xxx_old driver to get my aic7xxx box stable under
test loads. It worked fine until I ran test suites, then choked. aic7xxx_old
hasn't died on me yet.

2001-07-12 21:32:46

by Justin T. Gibbs

Subject: Re: Adaptec SCSI driver lockups

>Hello folks,
>
>I'm having trouble identifying whether I'm having hardware or software
>(OS) problems. For the past couple of months I've been having system
>lockups every 10 days or so.

Are there any kernel messages printed prior to the lockup?
Please attach a serial cable to another machine, enable serial console
support, run with aic7xxx=verbose, and log all console activity remotely.

--
Justin

2001-07-12 21:55:46

by Luc Lalonde

Subject: Re: Adaptec SCSI driver lockups

Hello Justin and Alan,

There was some garbage printed to the /var/log/messages before the
lockup but it is unreadable.

I'll have to read up on how to connect a serial console to this
machine. It's our main server (YPserver, mail, http, etc) so I don't
want to use it as a test system. If I use the append="aic7xxx=verbose"
in my lilo.conf will it log extra messages in /var/log/messages? If so,
will it be useful enough to figure out what the problem is?

Alan,

Wasn't aic7xxx_old the default driver up to 2.4.5? If so, I don't
know how using the old driver would help, since I had the same problems
with 2.4.[2,3,4]. Unless something else that could cause similar problems
has been fixed since then.

Cheers.


--
Luc Lalonde, Responsable du reseau GIREF

Telephone: (418) 656-2131 poste 6623
Courriel: [email protected]



2001-07-12 22:07:06

by Jussi Laako

Subject: Re: Adaptec SCSI driver lockups

Luc Lalonde wrote:
>
> I suspect that it is a problem with the Adaptec 39160 SCSI controller
> that is on my system (aic7899). The lockups always occur when I'm
> backing up to my HP DAT40 that is connected to channel A of this SCSI

Our HP DAT (24 GB) occasionally locks up. This doesn't lead to a system
lockup, but that's probably because there are no HDDs connected to the SCSI bus.
Resetting the DAT (by cycling power) doesn't help; the SCSI
controller/driver somehow gets confused. It requires a hardware reset.

This also happened with OpenBSD, although power cycling the DAT fixed the
situation there. I believe it's either buggy software in the DAT drive or the
drive is breaking down (those thingies seem to last for about two years). (Or
there is some other SCSI hardware-related issue.)

I have tested this with 2940/2930 cards. I think it could lead to a system
lockup if a SCSI HDD with swap were connected to the same controller.

- Jussi Laako

--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers

2001-07-12 22:12:06

by Luc Lalonde

Subject: Re: Adaptec SCSI driver lockups

Thanks for your note Jussi.

I don't think that this applies here. The tape drive is alone on this
controller.

Cheers, Luc.

--
Luc Lalonde, Responsable du reseau GIREF

Telephone: (418) 656-2131 poste 6623
Courriel: [email protected]



2001-07-12 22:26:58

by Justin T. Gibbs

Subject: Re: Adaptec SCSI driver lockups

>Hello Justin and Alan,
>
>There was some garbage printed to the /var/log/messages before the
>lockup but it is unreadable.

It may have been corrupted by the hang/crash. This is why using
a serial console is *always* the best bet for tracking down these
kinds of issues.

>If I use the append="aic7xxx=verbose"
>in my lilo.conf will it log extra messages in /var/log/messages?

The messages are printed to the console and, if syslogd is running,
will be recorded in /var/log/messages. However, there is always
a delay between the error being printed and syslogd getting that
text to disk. A serial console doesn't have this problem.

>If so,
>will it be useful enough to figure out what the problem is?

I'll let you know when I see the messages. ;-)

--
Justin
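
A minimal sketch of the remote end of the serial-console setup described above, for the machine doing the logging. It assumes the server is booted with something like console=ttyS0,9600 alongside aic7xxx=verbose, that the logging machine's serial port has already been set to the same speed (for example with stty), and that /dev/ttyS0 and console.log are placeholder names. The function name stamp_lines is made up for this illustration.

```python
import time


def stamp_lines(source, sink, clock=time.time):
    """Copy lines from source to sink, prefixing each with a timestamp.

    Every line is flushed as soon as it is written, so whatever the
    watched machine printed before hanging is already on disk here.
    Returns the number of lines copied.
    """
    count = 0
    for line in source:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write(f"{stamp} {line.rstrip()}\n")
        sink.flush()
        count += 1
    return count


# Hypothetical usage on the logging machine (paths are assumptions):
#
#     with open("/dev/ttyS0", "r", errors="replace") as port, \
#             open("console.log", "a") as log:
#         stamp_lines(port, log)
```

The timestamps are added on the logging side, so they stay trustworthy even if the server's own clock or syslog is affected by the hang.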

2001-07-13 08:46:45

by Chuck Hemker

Subject: Re: Adaptec SCSI driver lockups


On 12-Jul-01 Justin T. Gibbs wrote:
>>Hello Justin and Alan,
>>
>>There was some garbage printed to the /var/log/messages before the
>>lockup but it is unreadable.
>
> It may have been corrupted by the hang/crash. This is why using
> a serial console is *always* the best bet for tracking down these
> kinds of issues.
>
>>If I use the append="aic7xxx=verbose"
>>in my lilo.conf will it log extra messages in /var/log/messages?
>
> The messages are printed to the console and, if syslogd is running,
> will be recorded in /var/log/messages. However, there is always
> a delay between the error being printed and syslogd getting that
> text to disk. A serial console doesn't have this problem.
>
>>If so,
>>will it be useful enough to figure out what the problem is?
>
> I'll let you know when I see the messages. ;-)
>
> --
> Justin

If the message looks like it's getting as far as the messages file but is being
corrupted by the lockup, what about configuring syslog to send the messages over
the net to another machine? Maybe it will get the UDP packet out before the
lockup. Also, this could be done without rebooting the machine. It might be
worth a try first.
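
As a rough illustration of the receiving side of this idea, here is a small UDP listener for the second machine. It assumes the server's syslogd is told to forward messages (classic syslog.conf syntax uses a line such as "kern.* @loghost", where loghost is a placeholder name) and that they arrive on the standard syslog port, 514/udp. The function names and the log file name are made up for this sketch; a stock syslogd started in remote-reception mode would do the same job.

```python
import socket


def open_receiver(host="0.0.0.0", port=514):
    """Bind a UDP socket for incoming syslog datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender, _port) = sock.recvfrom(bufsize)
    return sender, data.decode("utf-8", errors="replace").rstrip()


def log_forever(sock, path):
    """Append every received message to path, flushing after each one."""
    with open(path, "a") as log:
        while True:
            sender, text = receive_one(sock)
            log.write(f"{sender} {text}\n")
            log.flush()


# Hypothetical usage on the log host (binding port 514 normally needs root):
#
#     log_forever(open_receiver(), "remote-messages.log")
```

UDP gives no delivery guarantee, so a message printed in the last instant before the hang can still be lost; this only improves the odds compared with waiting for the local disk write.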


2001-07-13 21:57:06

by s-jaschke

Subject: Re: Adaptec SCSI driver lockups

On Thursday 12 July 2001 23:21, Luc Lalonde wrote:
> Hello folks,
>
> I'm having trouble identifying whether I'm having hardware or software
> (OS) problems. For the past couple of months I've been having system
> lockups every 10 days or so.

I had a similar problem with an Adaptec AHA-29160 (AIC-7892) when
doing transfers from a SCSI DVD-drive to SCSI hard disk. It was actually
very annoying since
- I did an update to SuSE 7.2
- it almost destroyed my installation (several runs of e2fsck needed, some
libs broken after the crash)
- the SCSI bus hang appeared not immediately, but after several minutes
of activity.
I tried:
(1) to load aic7xxx_old (see
http://sdb.suse.de/de/sdb/html/ostoelt_aic7xxx_old.html) -> crashed also.
(2) I tried to disable tagged command queuing or set "tag_info" to
more moderate values since I noticed that the queue length was set
to an incredible length of 253. However, the new driver from Adaptec
just ignores the old kernel parameters.
(3) According to an SDB article
(http://sdb.suse.de/de/sdb/html/jsj_29160_interrupt.html),
I manually set all PCI slots to different interrupts (but I could not set
the interrupt of the graphics card). -> crashed also
(4) I finally managed to get a workable system again by booting with a 2.2.x
kernel (SuSE 7.1), copying the DVD to a hard disk and then installing from the
hard disk.
(5) I compiled a custom kernel and set the queue length parameter to 8. The
system now appears to be stable with the new (!) driver.

My guess is: the new driver stresses the hardware more than the old driver.
And it may at least partly be a hardware problem involving slow devices.
(The same configuration that consistently crashed my DVD->HD transfers did
not crash in an hour of very heavy HD->HD transfers.)

And here is a hint for the developers. Before the driver went into an endless
loop of SCSI bus resets, it said something like "locking queue length to 64
for device 0:0:0" (the HD that the data was copied to).

Hope this helps,
Stefan J.

--
Stefan R. Jaschke <[email protected]>
http://www.jaschke-net.de

2001-07-13 22:40:21

by Justin T. Gibbs

Subject: Re: Adaptec SCSI driver lockups

>(2) I tried to disable tagged command queuing or set "tag_info" to
> more moderate values since I noticed that the queue length was set
> to an incredible length of 253. However, the new driver from Adaptec
> just ignores the old kernel parameters.

Hmm. The tag_info stuff seems to work for me. The new driver certainly
is not supposed to ignore these parameters.

BTW, the tag depth is merely an upper bound. The driver will dynamically
determine the abilities of each device and throttle accordingly. Your
DVD drive probably doesn't even support tagged queuing, so your custom
kernel with a lower tag count will only affect the HD.

--
Justin
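
The "upper bound plus dynamic throttling" idea can be pictured with a toy model. This is not the aic7xxx driver's code; the class name, its methods, and the rule of clamping the limit to the number of commands the device actually accepted are assumptions made purely for illustration. The "locking queue length to 64" message Stefan recalls would be consistent with this kind of clamping, but that is a guess.

```python
class TagThrottle:
    """Toy model of per-device tagged-queue throttling (not driver code)."""

    def __init__(self, max_tags, supports_tagged_queuing=True):
        # The configured tag depth is only a ceiling. A device without
        # tagged queuing gets one command at a time regardless of it.
        self.limit = max_tags if supports_tagged_queuing else 1
        self.outstanding = 0

    def can_issue(self):
        """True if another command may be sent to the device."""
        return self.outstanding < self.limit

    def issue(self):
        """Record one more command in flight."""
        if not self.can_issue():
            raise RuntimeError("queue limit reached")
        self.outstanding += 1

    def complete(self):
        """Record that one in-flight command finished."""
        if self.outstanding == 0:
            raise RuntimeError("no commands outstanding")
        self.outstanding -= 1

    def queue_full(self):
        """The device rejected the newest command as 'queue full'.

        The rejected command is no longer in flight, and the limit is
        clamped to the number of commands the device did accept.
        """
        self.outstanding -= 1
        self.limit = max(1, self.outstanding)


# Example: a disk configured for 253 tags that rejects the 65th command.
hd = TagThrottle(253)
for _ in range(65):
    hd.issue()
hd.queue_full()
print(hd.limit)  # 64
```

In this model, lowering the compile-time tag count to 8 changes only the starting ceiling for devices that support tagged queuing, which matches the point that a DVD drive without tagged queuing would be unaffected.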

2001-07-13 23:31:14

by Luc Lalonde

Subject: Re: Adaptec SCSI driver lockups

Hello Stefan,

Thanks for your note. I've recompiled 2.4.7pre6 with the old Adaptec
driver (sorry Justin). I just need to get this system stable again.
It's at the point where I can't take a vacation this summer until I feel
the system is stable.

I didn't have the fsck problems, since I'm using XFS and ReiserFS. If this does
not fix the problem, I'll likely connect a serial console to figure out
the problem.

