2005-02-11 14:28:56

by Jonathan Knight

Subject: aacraid fails under kernel 2.6



We are having major problems with the aacraid module under Fedora Core 2 on
Dell PowerEdge 2500 servers. These use PERC3/Di controllers.

01:02.1 RAID bus controller: Dell Computer Corporation PowerEdge Expandable RAID Controller 3 (rev 01)
01:0c.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
01:0c.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)


Interrupts are shared:

           CPU0
  0:   16288777    XT-PIC  timer
  1:         12    XT-PIC  i8042
  2:          0    XT-PIC  cascade
  5:    2332026    XT-PIC  aic7xxx, aacraid, eth0
  7:       4402    XT-PIC  parport0
  8:          1    XT-PIC  rtc
  9:          0    XT-PIC  acpi
 10:          1    XT-PIC  ohci_hcd
 11:    1636985    XT-PIC  aic7xxx, eth1
 12:         66    XT-PIC  i8042
 14:        489    XT-PIC  ide0
NMI:          0
ERR:       4070
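
For anyone wanting to check the same thing on their own box, the sharing in the listing above can be picked out mechanically. This is only a minimal sketch: it assumes the 2.6-era /proc/interrupts layout shown here (per-CPU counts, a single-token controller name, then a comma-separated device list), and `shared_irqs` is a hypothetical helper name, not part of any kernel tool.

```python
# Sample text copied from a few lines of the listing above.
SAMPLE = """\
           CPU0
  0:   16288777    XT-PIC  timer
  5:    2332026    XT-PIC  aic7xxx, aacraid, eth0
 11:    1636985    XT-PIC  aic7xxx, eth1
 14:        489    XT-PIC  ide0
NMI:          0
ERR:       4070
"""


def shared_irqs(text):
    """Return {irq_number: [devices]} for IRQ lines listing more than one device."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skips the CPU header and the NMI/ERR summary lines
        fields = rest.split()
        # Skip the leading per-CPU interrupt counts.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # Assumption: fields[i] is a single-token controller name (e.g. XT-PIC);
        # everything after it is the comma-separated device list.
        devices = [d.strip() for d in " ".join(fields[i + 1:]).split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(SAMPLE).items()):
        print(irq, devices)
```

On a live system the same function could be fed the contents of /proc/interrupts instead of the sample string.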

We had no problems with Red Hat 9 running a 2.4 kernel, but since our move to
Fedora and the 2.6 kernel nothing has worked well. The latest 2.6.10 build
is the worst so far. We even unpacked 2.6.11-rc3 and pulled out its aacraid
driver, but that didn't perform any better. We think 2.6.8 was the most
usable of the 2.6 kernels so far.

The systems run fine with no users, but as soon as the disks go under load
we get the following:

Feb 10 08:20:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?
Feb 10 08:21:56 romeo kernel: aacraid: SCSI bus appears hung
Feb 10 08:21:56 romeo kernel: scsi: Device offlined - not ready after error recovery: host 2 channel 0 id 0 lun 0
Feb 10 08:21:56 romeo kernel: SCSI error : <2 0 0 0> return code = 0x6000000
Feb 10 08:21:56 romeo kernel: end_request: I/O error, dev sdc, sector 296582775
Feb 10 08:21:56 romeo kernel: scsi2 (0:0): rejecting I/O to offline device
Feb 10 08:21:56 romeo last message repeated 4 times
Feb 10 08:21:56 romeo kernel: Buffer I/O error on device sdc1, logical block 33660323
Feb 10 08:21:56 romeo kernel: scsi2 (0:0): rejecting I/O to offline device
Feb 10 08:21:56 romeo kernel: Buffer I/O error on device sdc1, logical block 33660323
....and so on.


After a careful reboot, nothing looks wrong with the RAID array and the
system starts with no problems.

Does anyone know why these controllers are broken under 2.6 and whether
there is any workaround we can implement?


We did discover that the megaraid driver claims to support the PERC3/Di, but
we couldn't get the module to recognise the card. A check of the source code
suggests that the code for the PERC3/Di isn't present.

--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.


2005-02-11 15:29:29

by Mark Haverkamp

Subject: Re: aacraid fails under kernel 2.6

On Fri, 2005-02-11 at 14:28 +0000, Jonathan Knight wrote:
>
> We are having major problems with the aacraid module under Fedora Core 2 on
> Dell PowerEdge 2500 servers. These use PERC3/Di controllers.

[ ... ]

>
> The systems run fine with no users, but as soon as the disks go under load
> we get the following:
>
> Feb 10 08:20:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?
> Feb 10 08:21:56 romeo kernel: aacraid: SCSI bus appears hung
> Feb 10 08:21:56 romeo kernel: scsi: Device offlined - not ready after error recovery: host 2 channel 0 id 0 lun 0
> Feb 10 08:21:56 romeo kernel: SCSI error : <2 0 0 0> return code = 0x6000000
> Feb 10 08:21:56 romeo kernel: end_request: I/O error, dev sdc, sector 296582775
> Feb 10 08:21:56 romeo kernel: scsi2 (0:0): rejecting I/O to offline device
> Feb 10 08:21:56 romeo last message repeated 4 times
> Feb 10 08:21:56 romeo kernel: Buffer I/O error on device sdc1, logical block 33660323
> Feb 10 08:21:56 romeo kernel: scsi2 (0:0): rejecting I/O to offline device
> Feb 10 08:21:56 romeo kernel: Buffer I/O error on device sdc1, logical block 33660323
> ....and so on.
>
>
> After a careful reboot, nothing looks wrong with the RAID array and the
> system starts with no problems.
>

[ ... ]

A number of people have seen problems like this going from 2.4 to 2.6.
Mark Salyzyn from Adaptec has suggested in those cases to make sure that
the board firmware is up to date. I've copied Mark on this mail.

Mark.

--
Mark Haverkamp <[email protected]>

2005-02-11 16:01:00

by Jonathan Knight

Subject: Re: aacraid fails under kernel 2.6

> A number of people have seen problems like this going from 2.4 to 2.6.
> Mark Salyzyn from Adaptec has suggested in those cases to make sure that
> the board firmware is up to date. I've copied Mark on this mail.


We think we're on the latest everything. The BIOS is A07 and the firmware
on the controller is 2.8.0 build 6092.



--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.

2005-02-11 16:41:39

by Mark Salyzyn

Subject: RE: aacraid fails under kernel 2.6

Then turn off both read and write cache on the card ...

You should contact Dell Technical Support, as there are many reasons for
the adapter to fail, ranging from a bad power supply to bad cables or drives.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Jonathan Knight [mailto:[email protected]]
Sent: Friday, February 11, 2005 11:01 AM
To: Mark Haverkamp
Cc: Jonathan Knight; linux-kernel; Salyzyn, Mark
Subject: Re: aacraid fails under kernel 2.6

> A number of people have seen problems like this going from 2.4 to 2.6.
> Mark Salyzyn from Adaptec has suggested in those cases to make sure that
> the board firmware is up to date. I've copied Mark on this mail.


We think we're on the latest everything. The BIOS is A07 and the firmware
on the controller is 2.8.0 build 6092.



--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.

2005-02-14 10:31:09

by Jonathan Knight

Subject: Re: aacraid fails under kernel 2.6

> Then turn off both read and write cache on the card ...


We've tried with no cache and we had multiple failures over the weekend.
We are running 2.4.20 on some of these boxes and it is stable. We're
only having problems with the 2.6 kernel.

These systems stayed up for a few hours, then started dying every few
minutes, and then were stable again for a few hours. They are in an
air-conditioned machine room with UPS power supplies, and we have 11
identical 2500 systems. Only the ones running 2.6 have issues, and those
issues start the moment 2.6 is installed, so we're convinced it's software.
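
The stable-then-flapping pattern described above can be quantified from syslog by measuring the gaps between adapter reset requests. The following is a rough sketch only: the first two sample lines are taken from the log earlier in the thread, while the later ones are invented purely to illustrate the arithmetic. The function names are hypothetical, and the year is an assumption because classic syslog timestamps do not carry one.

```python
from datetime import datetime

RESET_MARKER = "aacraid: Host adapter reset request"

# First two lines are from the log in this thread; the last two are
# INVENTED timestamps, included only to show how the gaps are computed.
SAMPLE_LOG = """\
Feb 10 08:20:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?
Feb 10 08:21:56 romeo kernel: aacraid: SCSI bus appears hung
Feb 10 08:25:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?
Feb 10 11:25:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?
"""


def reset_times(lines, year=2005):
    """Collect the timestamp of every adapter reset request.

    Assumes the classic 'Mon DD HH:MM:SS host ...' syslog prefix and an
    English/C locale for the month abbreviation.
    """
    times = []
    for line in lines:
        if RESET_MARKER in line:
            stamp = " ".join(line.split()[:3])  # e.g. "Feb 10 08:20:56"
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_minutes(times):
    """Minutes elapsed between each pair of consecutive reset requests."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_minutes(reset_times(SAMPLE_LOG.splitlines())))
```

Short gaps clustered together, separated by gaps of several hours, would match the behaviour reported here.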

What's puzzling me is that the aacraid driver isn't so different between
2.4.20 and 2.6. Is there a debug trace or something similar that I can
capture to help diagnose the problem?



--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.

2005-02-14 12:36:18

by Mark Salyzyn

Subject: RE: aacraid fails under kernel 2.6

A significant portion of the operations resides in the Adapter Firmware;
the driver report is essentially telling us the Adapter is locking up.

In the Adaptec branch of the driver there is an AAC_DETAILED_STATUS_INFO
manifest constant in aacraid.h that enables firmware prints. The Firmware
that exists in the ROMB occasionally reports conditions via that
channel. The in-box driver has this debug channel turned on by default.
Sometimes this can be used to narrow down the failure condition.

The logs supplied started with the driver report that the Adapter was
hung; any messages prior to that would contain the firmware prints.
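
One way to pull out the messages that precede the hang report is to keep a rolling window of log lines and dump it when the reset request appears. This is just a sketch: `lines_before_hang` and its parameters are hypothetical names, the marker string is taken from the log posted in this thread, and the default window size of 20 lines is an arbitrary assumption.

```python
from collections import deque


def lines_before_hang(lines, marker="aacraid: Host adapter reset request", context=20):
    """Return up to `context` lines that immediately precede the first line
    containing `marker`, or an empty list if the marker never appears."""
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    for line in lines:
        if marker in line:
            return list(recent)
        recent.append(line)
    return []


if __name__ == "__main__":
    # Placeholder lines standing in for a real /var/log/messages file.
    demo = [
        "line one",
        "line two",
        "line three",
        "Feb 10 08:20:56 romeo kernel: aacraid: Host adapter reset request. SCSI hang ?",
        "line after",
    ]
    for line in lines_before_hang(demo, context=2):
        print(line)
```

Passing an open file object for the system log as `lines` works as well, since the function only iterates once.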

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Jonathan Knight [mailto:[email protected]]
Sent: Monday, February 14, 2005 5:31 AM
To: Salyzyn, Mark
Cc: Jonathan Knight; Mark Haverkamp; linux-kernel
Subject: Re: aacraid fails under kernel 2.6

> Then turn off both read and write cache on the card ...


We've tried with no cache and we had multiple failures over the weekend.
We are running 2.4.20 on some of these boxes and it is stable. We're
only having problems with the 2.6 kernel.

These systems stayed up for a few hours, then started dying every few
minutes, and then were stable again for a few hours. They are in an
air-conditioned machine room with UPS power supplies, and we have 11
identical 2500 systems. Only the ones running 2.6 have issues, and those
issues start the moment 2.6 is installed, so we're convinced it's software.

What's puzzling me is that the aacraid driver isn't so different between
2.4.20 and 2.6. Is there a debug trace or something similar that I can
capture to help diagnose the problem?



--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.

2005-02-15 02:16:40

by Alan

Subject: Re: aacraid fails under kernel 2.6

On Fri, 2005-02-11 at 14:28, Jonathan Knight wrote:
> Fedora and the 2.6 kernel nothing has worked well. The latest 2.6.10 build
> is the worst so far. We even unpacked 2.6.11-rc3 and pulled out its aacraid
> driver, but that didn't perform any better. We think 2.6.8 was the most
> usable of the 2.6 kernels so far.

Fedora 2.6.10 or the base 2.6.10? The base 2.6.10 is missing at least
one aacraid fix.

2005-02-15 11:17:31

by Jonathan Knight

Subject: Re: aacraid fails under kernel 2.6

> On Fri, 2005-02-11 at 14:28, Jonathan Knight wrote:
> Fedora 2.6.10 or the base 2.6.10? The base 2.6.10 is missing at least
> one aacraid fix.


Fedora. We checked that it included a fix that you'd posted about on this
list.


--
______ [email protected] Jonathan Knight,
/ Department of Computer Science
/ _ __ Telephone: +44 1782 583437 University of Keele, Keele,
(_/ (_) / / Fax : +44 1782 713082 Staffordshire. ST5 5BG. U.K.