2006-09-01 16:05:17

by Ethan

Subject: File corruption with 2940U2 SCSI card and aic7xxx driver.

I recently installed an Adaptec 2940U2 controller and two disks in my
Debian Sarge system, kernel version 2.6.8. Prior to this installation,
it had been a rock-solid, IDE-only system. The card and drives are correctly
detected and identified by the kernel at boot. Unfortunately, I am
experiencing consistent corruption on large files written to the SCSI
drives. For example, if I copy a file from the old, stable IDE drive
to one of the SCSI disks using dd:

>dd if=alphabet of=/dev/sda1
205200+0 records in
205200+0 records out
105062400 bytes transferred in 5.480344 seconds (19170767 bytes/sec)

Then copy the file back:

>dd if=/dev/sda1 of=alphabet_ver2 count=205200
205200+0 records in
205200+0 records out
105062400 bytes transferred in 5.840856 seconds (17987500 bytes/sec)

The md5sums are different:

>md5sum alphabet alphabet_ver2
5a96c70a890ff479568f75c54abb82a8 alphabet
e507a5b662b5f528bb6aa3a489a0e04e alphabet_ver2

The original file, "alphabet", contains the line
"abcdefghijklmnopqrstuvwxyz" repeated many times; however the file
read from the SCSI drive, "alphabet_ver2", contains a number of lines
like "abcdefghijklmnopqrstubcdefghijklmnopqrstuvwxyz" and
"abcdopqrstuvwxyz" --- all the correct characters, just out of order.
Curiously, all of the corruption appears to occur when writing the
file to the disk, as reading the data from the disk a second time
yields the same corrupt data:

>dd if=/dev/sda1 of=alphabet_ver3 count=205200
205200+0 records in
205200+0 records out
105062400 bytes transferred in 5.840856 seconds (17987500 bytes/sec)
>md5sum alphabet alphabet_ver2 alphabet_ver3
5a96c70a890ff479568f75c54abb82a8 alphabet
e507a5b662b5f528bb6aa3a489a0e04e alphabet_ver2
e507a5b662b5f528bb6aa3a489a0e04e alphabet_ver3
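
The checksums only say that the files differ, not where. The sketch
below is an illustration, not something that was run on the affected
system. It lists the byte ranges where two files disagree and shows how
each range sits relative to 512-byte sectors (the block size in the dd
transcripts) and 4096-byte pages. The function names, the 1 MiB chunk
size, and the 4096-byte page size are assumptions made for the example.

```python
import sys

SECTOR = 512   # block size dd used in the transcripts above
PAGE = 4096    # typical x86 page size (an assumption about this machine)


def diff_runs(path_a, path_b, chunk=1 << 20):
    """Return [(start, end), ...] byte ranges (end exclusive) where the
    two files differ. Only their common length is compared."""
    runs = []
    start = None      # start offset of the run currently open, if any
    offset = 0        # file offset of the first byte of the current chunk
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk), fb.read(chunk)
            n = min(len(a), len(b))
            if n == 0:
                break
            if a[:n] == b[:n]:
                # Whole chunk identical: close a run left open earlier.
                if start is not None:
                    runs.append((start, offset))
                    start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        runs.append((start, offset + i))
                        start = None
            offset += n
    if start is not None:
        runs.append((start, offset))
    return runs


def describe(runs):
    """For each run: (start, length, start % SECTOR, start % PAGE)."""
    return [(s, e - s, s % SECTOR, s % PAGE) for s, e in runs]


if __name__ == "__main__" and len(sys.argv) == 3:
    for row in describe(diff_runs(sys.argv[1], sys.argv[2])):
        print("start=%d length=%d start%%512=%d start%%4096=%d" % row)
```

If the damaged ranges start and end on sector or page boundaries, that
would suggest whole blocks being misplaced rather than individual bytes
being altered.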

The corruption on write appears to be different each time:

>dd if=alphabet of=/dev/sda1;\
dd if=/dev/sda1 of=alphabet_ver4 count=205200;md5sum alphabet*
205200+0 records in
205200+0 records out
105062400 bytes transferred in 5.488071 seconds (19143775 bytes/sec)
205200+0 records in
205200+0 records out
105062400 bytes transferred in 5.776168 seconds (18188944 bytes/sec)
5a96c70a890ff479568f75c54abb82a8 alphabet
e507a5b662b5f528bb6aa3a489a0e04e alphabet_ver2
e507a5b662b5f528bb6aa3a489a0e04e alphabet_ver3
40a369cb78d68f9b6d293dfd5012c87f alphabet_ver4
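
The write, read-back, and checksum steps above can be wrapped into one
repeatable check. The following is a minimal sketch, assuming Python 3
is available and that the target path (for example /dev/sda1) may be
overwritten. The names md5_of and write_and_verify are invented here.

```python
import hashlib
import os
import sys


def md5_of(path, length=None, chunk=1 << 20):
    """MD5 of the first `length` bytes of `path` (whole file if None)."""
    h = hashlib.md5()
    remaining = length
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk if remaining is None else min(chunk, remaining)
            data = f.read(want)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
    return h.hexdigest()


def write_and_verify(src, target, chunk=1 << 20):
    """Copy `src` onto the start of `target`, sync, then hash both.

    WARNING: this overwrites the beginning of `target`.
    Returns (digest_of_src, digest_of_what_was_read_back).
    """
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(target, "r+b") as fout:
        while True:
            data = fin.read(chunk)
            if not data:
                break
            fout.write(data)
        fout.flush()
        os.fsync(fout.fileno())   # ask the kernel to push the data out
    return md5_of(src), md5_of(target, size)


if __name__ == "__main__" and len(sys.argv) == 3:
    want, got = write_and_verify(sys.argv[1], sys.argv[2])
    print(want, sys.argv[1])
    print(got, sys.argv[2], "OK" if want == got else "MISMATCH")
```

Note that a read issued right after the write may be served from the
kernel's cache rather than from the disk itself. This sketch makes no
attempt to bypass the cache.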

You'll note that I've given up trying to create a filesystem on the
SCSI disk since the filesystem was always corrupted quickly and
fatally. I have exhausted my ideas for troubleshooting this problem.
I would greatly appreciate any ideas for further troubleshooting.
Here is a brief list of what I have tried:

- Copying data to and from the other SCSI disk, sdb.
- Changing PCI slots and SCSI cables.
- Verifying that the 2940 card does not share an interrupt with any other card.
- Trying the aic7xxx_old driver.
- Trying the new version of the aic7xxx driver with a 2.6.16 kernel.
- Disabling write caching on the drives.
- Enabling the debug information in the aic7xxx driver module (see
below for transcript). There is no indication of problems from the
debug output.

In all cases, I get the same results. This set of card, cable, and
drives worked flawlessly in the computer it was removed from
(which ran Windows and SUSE Linux).

A few relevant system details:

- Debian version 3.1 (Sarge)
- kernel-image-2.6.8-3-686 ver. 2.6.8-16sarge4 (primarily) and
linux-image-2.6.16-1-686 ver. 2.6.16-11bpo1 from backports.org
- Pentium III 500 MHz with 640 MB memory, VIA Apollo Pro 133 chipset

The relevant kernel messages during boot with full aic7xxx debug:

PCI: Found IRQ 11 for device 0000:00:0c.0
aic7xxx: PCI Device 0:12:0 failed memory mapped test. Using PIO.
ahc_pci:0:12:0: Reading SEEPROM...done.
ahc_pci:0:12:0: BIOS eeprom is present
ahc_pci:0:12:0: Secondary High byte termination Enabled
ahc_pci:0:12:0: Secondary Low byte termination Enabled
ahc_pci:0:12:0: Primary Low Byte termination Enabled
ahc_pci:0:12:0: Primary High Byte termination Enabled
ahc_pci:0:12:0: Downloading Sequencer Program... 423 instructions downloaded
ahc_pci:0:12:0: Features 0x56f6, Bugs 0x6, Flags 0x20485440
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
<Adaptec 2940 Ultra2 SCSI adapter>
aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi0: Slave Alloc 0
(scsi0:A:0:0): Sending WDTR 1
(scsi0:A:0:0): Received WDTR 1 filtered to 1
(scsi0:A:0): 1.960MB/s transfers (0.980MHz , offset 255, 16bit)
scsi0: target 0 using 16bit transfers
(scsi0:A:0:0): Sending SDTR period a, offset 7f
(scsi0:A:0:0): Received SDTR period a, offset 7f
Filtered to period a, offset 7f
(scsi0:A:0): 80.000MB/s transfers (40.000MHz, offset 127, 16bit)
scsi0: target 0 synchronous at 40.0MHz, offset = 0x7f
Vendor: QUANTUM Model: ATLAS10K2-TY367L Rev: DA40
Type: Direct-Access ANSI SCSI revision: 03
scsi0: Slave Configure 0
(scsi0:A:0): 80.000MB/s transfers (40.000MHz, offset 127, 16bit)
scsi0:A:0:0: Tagged Queuing enabled. Depth 8
scsi0: Slave Alloc 1
scsi0: Slave Destroy 1
scsi0: Slave Alloc 2
scsi0: Slave Destroy 2
scsi0: Slave Alloc 3
scsi0: Slave Destroy 3
scsi0: Slave Alloc 4
scsi0: Slave Destroy 4
scsi0: Slave Alloc 5
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
scsi0: Slave Destroy 5
scsi0: Slave Alloc 6
(scsi0:A:6:0): Sending WDTR 1
(scsi0:A:6:0): Received WDTR 1 filtered to 1
(scsi0:A:6): 1.960MB/s transfers (0.980MHz, offset 255, 16bit)
scsi0: target 6 using 16bit transfers
(scsi0:A:6:0): Sending SDTR period a, offset 7f
(scsi0:A:6:0): Received SDTR period a, offset 3f
Filtered to period a, offset 3f
(scsi0:A:6): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
scsi0: target 6 synchronous at 40.0MHz, offset = 0x3f
Vendor: IBM Model: DDYS-T36950N Rev: S96H
Type: Direct-Access ANSI SCSI revision: 03
scsi0: Slave Configure 6
(scsi0:A:6): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
scsi0:A:6:0: Tagged Queuing enabled. Depth 8
SCSI device sda: drive cache: write through
/dev/scsi/host0/bus0/target0/lun0: p1 p2 p3
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sdb: 71687340 512-byte hdwr sectors (36704 MB)
SCSI device sdb: drive cache: write back
/dev/scsi/host0/bus0/target6/lun0: p1 p2
Attached scsi disk sdb at scsi0, channel 0, id 6, lun 0
scsi0: Slave Alloc 8
scsi0: Slave Destroy 8
scsi0: Slave Alloc 9
scsi0: Slave Destroy 9
scsi0: Slave Alloc 10
scsi0: Slave Destroy 10
scsi0: Slave Alloc 11
scsi0: Slave Destroy 11
scsi0: Slave Alloc 12
scsi0: Slave Destroy 12
scsi0: Slave Alloc 13
scsi0: Slave Destroy 13
scsi0: Slave Alloc 14
scsi0: Slave Destroy 14
scsi0: Slave Alloc 15
scsi0: Slave Destroy 15

The aic7xxx driver does not emit any further kernel messages. The
aic7xxx module is loaded with the following flags:
verbose,debug:0xffff,pci_parity
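
As a quick sanity check on the negotiation lines in the log above, the
reported throughput should equal the clock rate multiplied by the bus
width in bytes (40 MHz x 2 bytes = 80 MB/s on a 16-bit bus). The sketch
below parses lines of that shape and checks the arithmetic. The regular
expression is written against the lines shown here and is an assumption
about their format, not something taken from the driver source.

```python
import re
import sys

# Matches e.g. "(scsi0:A:6): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)"
RATE_RE = re.compile(
    r"\(scsi\d+:[A-Z]:(?P<target>\d+)\):\s+"
    r"(?P<mbs>[\d.]+)MB/s transfers "
    r"\((?P<mhz>[\d.]+)MHz\s*, offset (?P<offset>\d+), (?P<bits>\d+)bit\)"
)


def negotiated_rates(log_text):
    """Return [(target, mhz, bits, mb_per_s, consistent), ...]."""
    rows = []
    for m in RATE_RE.finditer(log_text):
        mhz = float(m.group("mhz"))
        bits = int(m.group("bits"))
        mbs = float(m.group("mbs"))
        # One transfer per clock, bits/8 bytes per transfer.
        consistent = abs(mbs - mhz * bits / 8) < 0.01
        rows.append((int(m.group("target")), mhz, bits, mbs, consistent))
    return rows


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as f:
        for row in negotiated_rates(f.read()):
            print("target %d: %.3f MHz, %d bit, %.3f MB/s, consistent=%s" % row)
```

Both drives negotiate 40 MHz wide transfers in the log, so a check like
this only confirms that the reported figures agree with each other. It
says nothing about whether the data on the bus is intact.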

Please CC me directly with any comments or ideas you have. Thanks for
your time.

--Ethan


2006-09-01 22:43:07

by Alan

Subject: Re: File corruption with 2940U2 SCSI card and aic7xxx driver.

On Fri, 2006-09-01 at 09:05 -0700, Ethan wrote:
> detected and identified by the kernel at boot. Unfortunately, I am
> experiencing consistent corruption on large files written to the SCSI
> drives. For example, if I copy a file from the old, stable IDE drive
> to one of the SCSI disks using dd:

Does this still occur with a more recent upstream kernel?


There are also known AHA2940 incompatibilities with a few boards. People
always had problems with CUV4X* boards, for one. It's a bit early to assume
it's the board; however, it might be worth making sure the card is well
seated and the cabling looks good. That said, I'd expect parity errors.




2006-09-02 00:44:03

by Ethan

Subject: Re: File corruption with 2940U2 SCSI card and aic7xxx driver.

> Does this still occur with a more recent upstream kernel?

I've tried kernel version 2.6.16 (with version 7 of the aic7xxx
driver). Same problem.

>
>
> There are also known AHA2940 incompatibilities with a few boards. People
> always had problems with CUV4X* boards, for one. It's a bit early to assume
> it's the board; however, it might be worth making sure the card is well
> seated and the cabling looks good. That said, I'd expect parity errors.
>

I've tried two different PCI slots and two different SCSI cables.
Same problem. I've enabled PCI parity checking via the pci_parity
option to the aic7xxx driver, but I don't see any parity errors in the
kernel messages.

I'm having trouble believing that this could be a hardware problem
because I can consistently read data from the SCSI disks; the
corruption only seems to happen during writes.

Thanks for your suggestions.


2006-09-02 01:35:04

by Ray Lee

Subject: Re: File corruption with 2940U2 SCSI card and aic7xxx driver.

On 9/1/06, Ethan wrote:
> I recently installed an Adaptec 2940U2 controller and two disks in my
> Debian Sarge system, kernel version 2.6.8.
[...]
> The original file, "alphabet", contains the line
> "abcdefghijklmnopqrstuvwxyz" repeated many times; however the file
> read from the SCSI drive, "alphabet_ver2", contains a number of lines
> like "abcdefghijklmnopqrstubcdefghijklmnopqrstuvwxyz" and
> "abcdopqrstuvwxyz" --- all the correct characters, just out of order.

Well, they're probably not out of order per se; more likely some data at
page granularity was dropped, duplicated, or something similar. If you
have some coding skills, I'd suggest writing a bunch of 32-bit ints to a
file, in increasing order, and using that as a test case. That way each
32-bit word is unique, and you might be able to spot more of a pattern
as to what's going on (is it duplicated? Is it out of order?).
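
A test file of this kind could be produced and checked with something
like the following sketch. It assumes Python 3 and little-endian
unsigned 32-bit words; the function names are made up for illustration.
The checker reports each point where the displacement (value found minus
value expected at that position, in words) changes.

```python
import struct
import sys

WORD = struct.Struct("<I")   # little-endian unsigned 32-bit integer


def write_counter_file(path, n_words, chunk_words=1 << 16):
    """Write the words 0, 1, 2, ..., n_words - 1 to `path`."""
    with open(path, "wb") as f:
        for base in range(0, n_words, chunk_words):
            count = min(chunk_words, n_words - base)
            f.write(struct.pack("<%dI" % count, *range(base, base + count)))


def displacement_changes(path, chunk_words=1 << 16):
    """Return [(byte_offset, displacement), ...] for every position where
    (value found - value expected) changes. An intact file gives []."""
    changes = []
    prev = 0      # displacement of the previous word; 0 means "in place"
    index = 0     # position of the current word, which is also its
                  # expected value
    with open(path, "rb") as f:
        while True:
            data = f.read(4 * chunk_words)
            usable = len(data) - len(data) % 4
            if usable == 0:
                break
            for (found,) in WORD.iter_unpack(data[:usable]):
                delta = found - index
                if delta != prev:
                    changes.append((index * 4, delta))
                    prev = delta
                index += 1
    return changes


if __name__ == "__main__" and len(sys.argv) == 2:
    for offset, delta in displacement_changes(sys.argv[1]):
        print("byte offset %d: displacement %+d words" % (offset, delta))
```

A negative displacement means the word found at that position belongs
earlier in the file (stale or duplicated data); a positive one means it
belongs later. For a file the same size as the one in the original
report, n_words would be 105062400 / 4 = 26265600.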

This might give hints to those with bigger brains than mine.

Ray


2006-09-02 16:14:13

by Alan

Subject: Re: File corruption with 2940U2 SCSI card and aic7xxx driver.

On Fri, 2006-09-01 at 17:44 -0700, Ethan wrote:
> I'm having trouble believing that this could be a hardware problem
> because I can consistently read data from the SCSI disks; the
> corruption only seems to happen during writes.

I would actually say that is very consistent with hardware
incompatibility between the card and motherboard. I'd still like to know
what the current kernels do.

