2001-07-08 13:37:31

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Thu, 21 Jun 2001, Geert Uytterhoeven wrote:
> On Tue, 8 May 2001, Geert Uytterhoeven wrote:
> > In the meantime I down/upgraded to 2.2.17 on my PPC box (CHRP LongTrail,
> > Sym53c875, HP C5136A DDS1) and I can confirm that the problem does not happen
> > under 2.2.17 either.
> >
> > My experiences:
> > - reading works fine, writing doesn't
> > - 2.2.x works fine, 2.4.x doesn't (at least since 2.4.0-test1-ac10)
> > - hardware compression doesn't matter
> > - I have a sym53c875, Lorenzo has an Adaptec, so most likely it's not a
> > SCSI hardware driver bug
> > - I have a PPC, Lorenzo doesn't, so it's not CPU-specific
> > - corruption is always a block of 32 bytes being replaced by 32 bytes from
> > the previous tape block (depending on block size!) (approx. 6 errors per
> > 256 MB)
> >
> > Lorenzo, can you please investigate the exact nature of the corruption on your
> > system?
> > - How many successive bytes are corrupted?
> > - Where do the corrupted data come from?
>
> Yesterday I noticed the same corruption under 2.2.19 (yes, I run amverify after
> backing up my system now, so it detects corruption through the gzip CRCs).
>
> I'll do some more tests (when I find time) to get a higher statistical
> certainty that it really doesn't happen under earlier 2.2.x kernels.

New findings:
- The problem doesn't happen with kernels <= 2.2.17. It does happen with all
kernels starting with 2.2.18-pre1.
- The only related stuff that changed in 2.2.18-pre1 seems to be the
Sym53c8xx driver itself. I'll do some more tests soon to isolate the
problem.
- The changes to the Sym53c8xx driver in 2.2.18-pre1 are _huge_. Are the
individual changes between sym53c8xx-1.3g and sym53c8xx-1.7.0 available
somewhere?

BTW, I wrote a small test program which tries to analyze error bursts. You can
find it at http://home.tvd.be/cr26864/Download/genpseudorandom.c

Sample test using 200000000 bytes of data:

genpseudorandom -o -l 200000000 > /dev/tape
genpseudorandom -i < /dev/tape

So far I always saw problems when writing even only 10 MB to tape: ca. 3-5
bursts of 32 or 12 incorrect bytes, which are always a copy of the
corresponding bytes in the previous block. Of course I used a much larger test
stream to verify 2.2.17.
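
For readers who cannot fetch genpseudorandom.c, the kind of analysis described
above can be sketched as follows. This is only an illustration under stated
assumptions, not the actual program: the generator (a plain 32-bit LCG), the
names fill_expected(), find_bursts() and burst_is_copy(), and the layout of
struct burst are all made up for this example. It regenerates the expected
stream, finds runs of mismatching bytes, and checks whether a run equals the
expected data a fixed distance earlier in the stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generator: a 32-bit LCG, one output byte per step.
 * The real genpseudorandom.c may use something entirely different. */
void fill_expected(uint8_t *buf, size_t len, uint32_t seed)
{
    uint32_t state = seed;

    for (size_t i = 0; i < len; i++) {
        state = state * 1664525u + 1013904223u;
        buf[i] = (uint8_t)(state >> 24);    /* high byte is the best mixed */
    }
}

/* One run of consecutive bytes that differ from the expected stream. */
struct burst {
    size_t offset;
    size_t length;
};

/* Scan both buffers and record runs of mismatching bytes.
 * Returns the number of bursts found; only the first `max` are stored. */
size_t find_bursts(const uint8_t *expected, const uint8_t *actual, size_t len,
                   struct burst *out, size_t max)
{
    size_t n = 0;
    size_t i = 0;

    while (i < len) {
        if (expected[i] == actual[i]) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < len && expected[i] != actual[i])
            i++;
        if (n < max) {
            out[n].offset = start;
            out[n].length = i - start;
        }
        n++;
    }
    return n;
}

/* Returns 1 if the corrupted bytes equal the expected data `shift` bytes
 * earlier in the stream (e.g. the same position in the previous block). */
int burst_is_copy(const uint8_t *expected, const uint8_t *actual,
                  const struct burst *b, size_t shift)
{
    assert(shift > 0);
    if (b->offset < shift)
        return 0;
    return memcmp(actual + b->offset,
                  expected + b->offset - shift, b->length) == 0;
}
```

Calling burst_is_copy() with the tape block size as the shift would test the
"copy of the previous block" hypothesis for each burst that find_bursts()
reports.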

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


2001-07-08 19:03:46

by Gérard Roudier

Subject: Re: SCSI Tape corruption - update



On Sun, 8 Jul 2001, Geert Uytterhoeven wrote:

> On Thu, 21 Jun 2001, Geert Uytterhoeven wrote:
> > On Tue, 8 May 2001, Geert Uytterhoeven wrote:
> > > In the meantime I down/upgraded to 2.2.17 on my PPC box (CHRP LongTrail,
> > > Sym53c875, HP C5136A DDS1) and I can confirm that the problem does not happen
> > > under 2.2.17 either.
> > >
> > > My experiences:
> > > - reading works fine, writing doesn't
> > > - 2.2.x works fine, 2.4.x doesn't (at least since 2.4.0-test1-ac10)
> > > - hardware compression doesn't matter
> > > - I have a sym53c875, Lorenzo has an Adaptec, so most likely it's not a
> > > SCSI hardware driver bug
> > > - I have a PPC, Lorenzo doesn't, so it's not CPU-specific
> > > - corruption is always a block of 32 bytes being replaced by 32 bytes from
> > > the previous tape block (depending on block size!) (approx. 6 errors per
> > > 256 MB)
> > >
> > > Lorenzo, can you please investigate the exact nature of the corruption on your
> > > system?
> > > - How many successive bytes are corrupted?
> > > - Where do the corrupted data come from?
> >
> > Yesterday I noticed the same corruption under 2.2.19 (yes, I run amverify after
> > backing up my system now, so it detects corruption through the gzip CRCs).
> >
> > I'll do some more tests (when I find time) to get a higher statistical
> > certainty that it really doesn't happen under earlier 2.2.x kernels.
>
> New findings:
> - The problem doesn't happen with kernels <= 2.2.17. It does happen with all
> kernels starting with 2.2.18-pre1.
> - The only related stuff that changed in 2.2.18-pre1 seems to be the
> Sym53c8xx driver itself. I'll do some more tests soon to isolate the
> problem.
> - The changes to the Sym53c8xx driver in 2.2.18-pre1 are _huge_. Are the
> individual changes between sym53c8xx-1.3g and sym53c8xx-1.7.0 available
> somewhere?

No. But you can move the sym/ncr driver bundle from 2.2.18-pre1 to 2.2.17
and vice versa. The bundle consists of these files:
    sym53c8xx.h, sym53c8xx_defs.h, sym53c8xx.c,
    sym53c8xx_comm.h, ncr53c8xx.h, ncr53c8xx.c

You also can download either sym-1.7.3c-ncr-3.4.3b, or sym-2.1.11, or just
both and play with all that stuff under 2.2.17 and later 2.2 kernels.

ftp://ftp.tux.org/pub/roudier/README-drivers-linux

Btw, I am interested in results using sym-1.7.3c and sym-2.1.11 under
kernel 2.2.17 and possibly 2.2.18.

> BTW, I wrote a small test program which tries to analyze error bursts. You can
> find it at http://home.tvd.be/cr26864/Download/genpseudorandom.c
>
> Sample test using 200000000 bytes of data:
>
> genpseudorandom -o -l 200000000 > /dev/tape
> genpseudorandom -i < /dev/tape

Unfortunately, I haven't any tape device.

> So far I always saw problems when writing even only 10 MB to tape: ca. 3-5
> bursts of 32 or 12 incorrect bytes, which are always a copy of the
> corresponding bytes in the previous block. Of course I used a much larger test
> stream to verify 2.2.17.
>
> Thanks!
>
> Gr{oetje,eeting}s,
>
> Geert

Thanks for your testing,
Gérard.

2001-07-20 17:12:08

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Sun, 8 Jul 2001, Geert Uytterhoeven wrote:
> New findings:
> - The problem doesn't happen with kernels <= 2.2.17. It does happen with all
> kernels starting with 2.2.18-pre1.
> - The only related stuff that changed in 2.2.18-pre1 seems to be the
> Sym53c8xx driver itself. I'll do some more tests soon to isolate the
> problem.
> - The changes to the Sym53c8xx driver in 2.2.18-pre1 are _huge_. Are the
> individual changes between sym53c8xx-1.3g and sym53c8xx-1.7.0 available
> somewhere?

The problem is indeed introduced by the changes to the Sym53c8xx driver in 2.2.18-pre1.
I managed to find some intermediate versions in the 2.3.x series, and here are the
results:
- sym53c8xx-1.3g (from BK linuxppc_2_2): OK
- sym53c8xx-1.5e: crash in SCSI interrupt during driver init
- sym53c8xx-1.5f: lock up during driver init
- sym53c8xx-1.5g: random 32-byte error bursts when writing to tape

Perhaps I can get 1.5e and 1.5g to work using some PPC-specific fixes from the
1.3g driver in the linuxppc_2_2 tree (it differed a bit from the 1.3g in
Alan's 2.2.17). But even then the changes in 1.5f and 1.5g are rather small,
compared to the changes between 1.3g and 1.5f.

So I'd be very happy if I could get my hands on more intermediate versions.
Thanks for your help! I _really_ want to nail this one down!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-07-20 21:04:52

by Gérard Roudier

Subject: Re: SCSI Tape corruption - update



On Fri, 20 Jul 2001, Geert Uytterhoeven wrote:

> On Sun, 8 Jul 2001, Geert Uytterhoeven wrote:
> > New findings:
> > - The problem doesn't happen with kernels <= 2.2.17. It does happen with all
> > kernels starting with 2.2.18-pre1.
> > - The only related stuff that changed in 2.2.18-pre1 seems to be the
> > Sym53c8xx driver itself. I'll do some more tests soon to isolate the
> > problem.
> > - The changes to the Sym53c8xx driver in 2.2.18-pre1 are _huge_. Are the
> > individual changes between sym53c8xx-1.3g and sym53c8xx-1.7.0 available
> > somewhere?

Not completely. The reason is that I used manual diffing/patching against
various kernel versions and it would be a PITA to resurrect all
intermediate driver versions using these patches. If we consider patches
that went directly to kernel main stream without changing the driver
version, a double PITA it would be. Btw, for sym-2.1.x series, I now use a
CVS tree and each driver release is tagged independently. For those ones,
it will be much easier to isolate broken changes.

> The problem is indeed introduced by the changes to the Sym53c8xx in 2.2.18-pre1.
> I managed to find some intermediate versions in the 2.3.x series, and here are the
> results:
> - sym53c8xx-1.3g (from BK linuxppc_2_2): OK
> - sym53c8xx-1.5e: crash in SCSI interrupt during driver init
> - sym53c8xx-1.5f: lock up during driver init
> - sym53c8xx-1.5g: random 32-byte error bursts when writing to tape

That's an interesting result. But 1.5g - 1.3g diffs are probably very
large. Patches available from ftp.tux.org should allow you to resurrect
driver versions 1.4, 1.5, 1.5a, 1.5b, 1.5c, 1.5d.

ftp://ftp.tux.org/pub/roudier/drivers/linux/sym53c8xx/README

You may, for example, apply incremental patches that address kernel 2.2.5
to a fresh kernel 2.2.5 tree and extract driver files accordingly.

> Perhaps I can get 1.5e and 1.5g to work using some PPC-specific fixes from the
> 1.3g driver in the linuxppc_2_2 tree (it differed a bit from the 1.3g in
> Alan's 2.2.17). But even then the changes in 1.5f and 1.5g are rather small,
> compared to the changes between 1.3g and 1.5f.

Some PPC specific changes are very probably not present in my driver
sources. I am unable to help on that point.

> So I'd be very happy if I could get my hand on more intermediate versions.
> Thanks for your help! I _really_ want to nail this one down!
>
> Gr{oetje,eeting}s,

Regards,
Gérard.



2001-07-27 07:53:42

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Fri, 20 Jul 2001, Gérard Roudier wrote:
> On Fri, 20 Jul 2001, Geert Uytterhoeven wrote:
> > The problem is indeed introduced by the changes to the Sym53c8xx in 2.2.18-pre1.
> > I managed to find some intermediate versions in the 2.3.x series, and here are the
> > results:
> > - sym53c8xx-1.3g (from BK linuxppc_2_2): OK
> > - sym53c8xx-1.5e: crash in SCSI interrupt during driver init
> > - sym53c8xx-1.5f: lock up during driver init
> > - sym53c8xx-1.5g: random 32-byte error bursts when writing to tape
>
> That's an interesting result. But 1.5g - 1.3g diffs are probably very
> large. Patches available from ftp.tux.org should allow to resurrect
> driver versions 1.4, 1.5, 1.5a, 1.5b, 1.5c, 1.5d.
>
> ftp://ftp.tux.org/pub/roudier/drivers/linux/sym53c8xx/README
>
> You may, for example, apply incremental patches that address kernel 2.2.5
> to a fresh kernel 2.2.5 tree and extract driver files accordingly.

Thanks!

With some small modifications, I got 1.5a to work fine. No error bursts. So the
problem is introduced between 1.5a and 1.5g.

Unfortunately my DDS-1 drive seems to have died for real after this test :-(
I don't know yet whether I will replace it with a new tape drive or with a
CD-RW. Which means I may never find out which change caused the problem...

I assume other people suffer from the same error burst problem, but they never
notice until they really want to restore data. I myself only noticed it by
accident, too.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-07-27 20:44:16

by Gérard Roudier

Subject: Re: SCSI Tape corruption - update



On Fri, 27 Jul 2001, Geert Uytterhoeven wrote:

> On Fri, 20 Jul 2001, Gérard Roudier wrote:
> > On Fri, 20 Jul 2001, Geert Uytterhoeven wrote:
> > > The problem is indeed introduced by the changes to the Sym53c8xx in 2.2.18-pre1.
> > > I managed to find some intermediate versions in the 2.3.x series, and here are the
> > > results:
> > > - sym53c8xx-1.3g (from BK linuxppc_2_2): OK
> > > - sym53c8xx-1.5e: crash in SCSI interrupt during driver init
> > > - sym53c8xx-1.5f: lock up during driver init
> > > - sym53c8xx-1.5g: random 32-byte error bursts when writing to tape
> >
> > That's an interesting result. But 1.5g - 1.3g diffs are probably very
> > large. Patches available from ftp.tux.org should allow to resurrect
> > driver versions 1.4, 1.5, 1.5a, 1.5b, 1.5c, 1.5d.
> >
> > ftp://ftp.tux.org/pub/roudier/drivers/linux/sym53c8xx/README
> >
> > You may, for example, apply incremental patches that address kernel 2.2.5
> > to a fresh kernel 2.2.5 tree and extract driver files accordingly.
>
> Thanks!
>
> With some small modifications, I made 1.5a to work fine. No error burst. So the
> problem is introduced between 1.5a and 1.5g.

Fine! But diffs between 1.5a and 1.5g are still large. :(
Results with 1.5c would have divided the diffs by about 2. :(

> Unfortunately my DDS-1 drive seems to have died for real after this test :-(
> I don't know yet whether I will replace it with a new tape drive or with a
> CD-RW. Which means I may never find out which change caused the problem...

I expect the problem to pong again to me. For now, I plan to look into the
1.5g-1.5a source diffs and inspect each change. But as I will be on
vacation for the next two weeks, I will not be able to work on this
problem immediately.

> I assume other people suffer from the same error burst problem, but they never
> notice until they really want to restore data. I myself only noticed it by
> accident, too.

Thanks for your testing and results.

Regards,
Gérard.

2001-07-28 10:02:18

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Fri, 27 Jul 2001, Gérard Roudier wrote:
> On Fri, 27 Jul 2001, Geert Uytterhoeven wrote:
> > On Fri, 20 Jul 2001, Gérard Roudier wrote:
> > > On Fri, 20 Jul 2001, Geert Uytterhoeven wrote:
> > > > The problem is indeed introduced by the changes to the Sym53c8xx in 2.2.18-pre1.
> > > > I managed to find some intermediate versions in the 2.3.x series, and here are the
> > > > results:
> > > > - sym53c8xx-1.3g (from BK linuxppc_2_2): OK
> > > > - sym53c8xx-1.5e: crash in SCSI interrupt during driver init
> > > > - sym53c8xx-1.5f: lock up during driver init
> > > > - sym53c8xx-1.5g: random 32-byte error bursts when writing to tape
> > >
> > > That's an interesting result. But 1.5g - 1.3g diffs are probably very
> > > large. Patches available from ftp.tux.org should allow to resurrect
> > > driver versions 1.4, 1.5, 1.5a, 1.5b, 1.5c, 1.5d.
> > >
> > > ftp://ftp.tux.org/pub/roudier/drivers/linux/sym53c8xx/README
> > >
> > > You may, for example, apply incremental patches that address kernel 2.2.5
> > > to a fresh kernel 2.2.5 tree and extract driver files accordingly.
> >
> > Thanks!
> >
> > With some small modifications, I made 1.5a to work fine. No error burst. So the
> > problem is introduced between 1.5a and 1.5g.
>
> Fine! But diffs between 1.5a and 1.5g are still large. :(
> Results with 1.5c would have divided the diffs by about 2. :(
>
> > Unfortunately my DDS-1 drive seems to have died for real after this test :-(
> > I don't know yet whether I will replace it with a new tape drive or with a
> > CD-RW. Which means I may never find out which change caused the problem...
>
> I expect the problem to pong again to me. For now, I plan to look into the
> 1.5g-1.5a source diffs and inspect each change. But as I will be in
> vacation for the next two weeks, I will not be able to work on this
> problem immediately.

Just in case the fix is in the changes between the official 1.5a and the 1.5a
in my tree, here are the diffs. But I doubt it.

Good luck!

diff -ur sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/Config.in longtrail-2.2.18-pre1/drivers/scsi/Config.in
--- sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/Config.in Thu Jul 26 20:14:50 2001
+++ longtrail-2.2.18-pre1/drivers/scsi/Config.in Thu Jul 26 20:07:03 2001
@@ -18,6 +18,9 @@
mainmenu_option next_comment
comment 'SCSI low-level drivers'

+if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
+ dep_tristate '3ware Hardware ATA-RAID support (EXPERIMENTAL)' CONFIG_BLK_DEV_3W_XXXX_RAID $CONFIG_SCSI
+fi
dep_tristate '7000FASST SCSI support' CONFIG_SCSI_7000FASST $CONFIG_SCSI
dep_tristate 'ACARD SCSI support' CONFIG_SCSI_ACARD $CONFIG_SCSI
dep_tristate 'Adaptec AHA152X/2825 support' CONFIG_SCSI_AHA152X $CONFIG_SCSI
@@ -25,13 +28,12 @@
dep_tristate 'Adaptec AHA1740 support' CONFIG_SCSI_AHA1740 $CONFIG_SCSI
dep_tristate 'Adaptec AIC7xxx support' CONFIG_SCSI_AIC7XXX $CONFIG_SCSI
if [ "$CONFIG_SCSI_AIC7XXX" != "n" ]; then
- bool ' Override driver defaults for commands per LUN' CONFIG_OVERRIDE_CMDS N
- if [ "$CONFIG_OVERRIDE_CMDS" != "n" ]; then
- int ' Maximum number of commands per LUN' CONFIG_AIC7XXX_CMDS_PER_LUN 24
- fi
+ bool ' Enable Tagged Command Queueing (TCQ) by default' CONFIG_AIC7XXX_TCQ_ON_BY_DEFAULT
+ int ' Maximum number of TCQ commands per device' CONFIG_AIC7XXX_CMDS_PER_DEVICE 8
bool ' Collect statistics to report in /proc' CONFIG_AIC7XXX_PROC_STATS N
int ' Delay in seconds after SCSI bus reset' CONFIG_AIC7XXX_RESET_DELAY 5
fi
+dep_tristate 'IBM ServeRAID support' CONFIG_SCSI_IPS $CONFIG_SCSI
dep_tristate 'AdvanSys SCSI support' CONFIG_SCSI_ADVANSYS $CONFIG_SCSI
dep_tristate 'Always IN2000 SCSI support' CONFIG_SCSI_IN2000 $CONFIG_SCSI
dep_tristate 'AM53/79C974 PCI SCSI support' CONFIG_SCSI_AM53C974 $CONFIG_SCSI
@@ -52,9 +54,7 @@
dep_tristate 'EATA-PIO (old DPT PM2001, PM2012A) support' CONFIG_SCSI_EATA_PIO $CONFIG_SCSI
dep_tristate 'Future Domain 16xx SCSI/AHA-2920A support' CONFIG_SCSI_FUTURE_DOMAIN $CONFIG_SCSI
if [ "$CONFIG_MCA" = "y" ]; then
- if [ "$CONFIG_SCSI" = "y" ]; then
- bool 'Future Domain MCS-600/700 SCSI support' CONFIG_SCSI_FD_MCS
- fi
+ dep_tristate 'Future Domain MCS-600/700 SCSI support' CONFIG_SCSI_FD_MCS $CONFIG_SCSI
fi
dep_tristate 'GDT SCSI Disk Array Controller support' CONFIG_SCSI_GDTH $CONFIG_SCSI
dep_tristate 'Generic NCR5380/53c400 SCSI support' CONFIG_SCSI_GENERIC_NCR5380 $CONFIG_SCSI
@@ -78,6 +78,7 @@
fi
dep_tristate 'NCR53c406a SCSI support' CONFIG_SCSI_NCR53C406A $CONFIG_SCSI
dep_tristate 'symbios 53c416 SCSI support' CONFIG_SCSI_SYM53C416 $CONFIG_SCSI
+dep_tristate 'Simple 53c710 SCSI support (Compaq, NCR machines)' CONFIG_SCSI_SIM710 $CONFIG_SCSI
if [ "$CONFIG_PCI" = "y" ]; then
dep_tristate 'NCR53c7,8xx SCSI support' CONFIG_SCSI_NCR53C7xx $CONFIG_SCSI
if [ "$CONFIG_SCSI_NCR53C7xx" != "n" ]; then
@@ -95,8 +96,9 @@
int ' synchronous transfers frequency in MHz' CONFIG_SCSI_NCR53C8XX_SYNC 20
bool ' enable profiling' CONFIG_SCSI_NCR53C8XX_PROFILE
bool ' use normal IO' CONFIG_SCSI_NCR53C8XX_IOMAPPED
- bool ' include support for the NCR PQS/PDS SCSI card' CONFIG_SCSI_NCR53C8XX_PQS_PDS
- bool ' enable immediate arbitration' CONFIG_SCSI_NCR53C8XX_IARB
+ if [ "$CONFIG_SCSI_SYM53C8XX" != "n" ]; then
+ bool ' include support for the NCR PQS/PDS SCSI card' CONFIG_SCSI_NCR53C8XX_PQS_PDS
+ fi
if [ "$CONFIG_SCSI_NCR53C8XX_DEFAULT_TAGS" = "0" ]; then
bool ' not allow targets to disconnect' CONFIG_SCSI_NCR53C8XX_NO_DISCONNECT
fi
diff -ur sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/sym53c8xx.c longtrail-2.2.18-pre1/drivers/scsi/sym53c8xx.c
--- sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/sym53c8xx.c Thu Jul 26 20:14:05 2001
+++ longtrail-2.2.18-pre1/drivers/scsi/sym53c8xx.c Thu Jul 26 22:43:12 2001
@@ -73,6 +73,7 @@
** 53C895 (Wide, Fast 40, on-board rom BIOS)
** 53C895A (Wide, Fast 40, on-board rom BIOS)
** 53C896 (Wide, Fast 40 Dual, on-board rom BIOS)
+** 53C1510D (Wide, Fast 40 Dual, on-board rom BIOS)
**
** Other features:
** Memory mapped IO
@@ -558,10 +559,11 @@
#endif

#ifdef __sparc__
+# include <asm/irq.h>
# define remap_pci_mem(base, size) ((u_long) __va(base))
# define unmap_pci_mem(vaddr, size)
# define pcivtobus(p) ((p) & pci_dvma_mask)
-# define memcpy_to_pci(a, b, c) memcpy_toio((u_long) (a), (b), (c))
+# define memcpy_to_pci(a, b, c) memcpy_toio((void *) (a), (b), (c))
#elif defined(__alpha__)
# define pcivtobus(p) ((p) & 0xfffffffful)
# define memcpy_to_pci(a, b, c) memcpy_toio((a), (b), (c))
@@ -1808,6 +1810,8 @@
*/
u_short device_id; /* PCI device id */
u_char revision_id; /* PCI device revision id */
+ u_char pci_bus; /* PCI bus number */
+ u_char pci_devfn; /* PCI device and function */
u_int features; /* Chip features map */
u_char myaddr; /* SCSI id of the adapter */
u_char maxburst; /* log base 2 of dwords burst */
@@ -4582,7 +4586,7 @@
** 64 bit (53C895A or 53C896) ?
*/
if (np->features & FE_64BIT)
-#if BITS_PER_LONG > 32
+#ifdef SCSI_NCR_USE_64BIT_DAC
np->rv_ccntl1 |= (XTIMOD | EXTIBMV);
#else
np->rv_ccntl1 |= (DDAC);
@@ -4955,6 +4959,8 @@
sprintf(np->inst_name, NAME53C "%s-%d", np->chip_name, np->unit);
np->device_id = device->chip.device_id;
np->revision_id = device->chip.revision_id;
+ np->pci_bus = device->slot.bus;
+ np->pci_devfn = device->slot.device_fn;
np->features = device->chip.features;
np->clock_divn = device->chip.nr_divisor;
np->maxoffs = device->chip.offset_max;
@@ -5088,7 +5094,7 @@
** the clock doubler.
*/
i = (int) ncr_getpciclock(np);
- if (i > 37000) {
+ if (0 && i > 37000) {
printk(KERN_ERR "%s: PCI clock seems too high (%u KHz).\n",
ncr_name(np), i);
goto attach_error;
@@ -10091,7 +10097,7 @@
** code will get more complex later).
*/

-#if BITS_PER_LONG > 32
+#ifdef SCSI_NCR_USE_64BIT_DAC
#define SCATTER_ONE(data, badd, len) \
(data)->addr = cpu_to_scr(badd); \
(data)->size = cpu_to_scr((((badd) >> 8) & 0xff000000) + len);
@@ -10531,6 +10537,8 @@
u_int f1, f2;
int gen = 11;

+ OUTB(nc_istat, SRST); UDELAY (5); OUTB(nc_istat, 0);
+
(void) ncrgetfreq (np, gen); /* throw away first result */
f1 = ncrgetfreq (np, gen);
f2 = ncrgetfreq (np, gen);
@@ -11290,7 +11298,7 @@
** Ignore Symbios chips controlled by SISL RAID controller.
** This controller sets value 0x52414944 at RAM end - 16.
*/
-#ifndef SCSI_NCR_PCI_MEM_NOT_SUPPORTED
+#if defined(__i386__) && !defined(SCSI_NCR_PCI_MEM_NOT_SUPPORTED)
if (chip && (base_2 & PCI_BASE_ADDRESS_MEM_MASK)) {
unsigned int ram_size, ram_val;
u_long ram_ptr;
@@ -11379,6 +11387,8 @@
if (!cache_line_size)
suggested_cache_line_size = 16;

+ driver_setup.pci_fix_up |= 0x7;
+
#endif /* __sparc__ */

#if defined(__i386__) && !defined(MODULE)
@@ -11692,7 +11702,15 @@
*/
const char *sym53c8xx_info (struct Scsi_Host *host)
{
+#ifdef __sparc__
+ /* Ok to do this on all archs? */
+ static char buffer[80];
+ ncb_p np = ((struct host_data *) host->hostdata)->ncb;
+ sprintf (buffer, "%s\nPCI bus %02x device %02x", SCSI_NCR_DRIVER_NAME, np->pci_bus, np->pci_devfn);
+ return buffer;
+#else
return SCSI_NCR_DRIVER_NAME;
+#endif
}

/*
@@ -12274,7 +12292,13 @@
copy_info(&info, "revision id 0x%x\n", np->revision_id);

copy_info(&info, " IO port address 0x%lx, ", (u_long) np->base_io);
+#ifdef __sparc__
+ copy_info(&info, "IRQ number %s\n", __irq_itoa(np->irq));
+ /* Ok to do this on all archs? */
+ copy_info(&info, "PCI bus %02x device %02x\n", np->pci_bus, np->pci_devfn);
+#else
copy_info(&info, "IRQ number %d\n", (int) np->irq);
+#endif

#ifndef NCR_IOMAPPED
if (np->reg)
diff -ur sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/sym53c8xx_defs.h longtrail-2.2.18-pre1/drivers/scsi/sym53c8xx_defs.h
--- sym53c8xx-s01-d07-2.2.5-1.5a/drivers/scsi/sym53c8xx_defs.h Thu Jul 26 20:14:05 2001
+++ longtrail-2.2.18-pre1/drivers/scsi/sym53c8xx_defs.h Thu Jul 26 22:42:44 2001
@@ -66,8 +66,9 @@
#endif
#include <linux/config.h>

+#ifndef LinuxVersionCode
#define LinuxVersionCode(v, p, s) (((v)<<16)+((p)<<8)+(s))
-
+#endif
/*
* NCR PQS/PDS special device support.
*/
@@ -182,6 +183,13 @@
#endif

/*
+ * Should we enable DAC cycles on this platform?
+ * Until further investigation we do not enable it
+ * anywhere at the moment.
+ */
+#undef SCSI_NCR_USE_64BIT_DAC
+
+/*
* Sync transfer frequency at startup.
* Allow from 5Mhz to 40Mhz default 20 Mhz.
*/
@@ -395,6 +403,10 @@
#define PCI_DEVICE_ID_NCR_53C895A 0x12
#endif

+#ifndef PCI_DEVICE_ID_NCR_53C1510D
+#define PCI_DEVICE_ID_NCR_53C1510D 0xa
+#endif
+
/*
** NCR53C8XX devices features table.
*/
@@ -482,6 +494,9 @@
{PCI_DEVICE_ID_NCR_53C875, 0x2f, "875E", 6, 16, 5, \
FE_WIDE|FE_ULTRA|FE_DBLR|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
, \
+ {PCI_DEVICE_ID_NCR_53C875, 0xff, "876", 6, 16, 5, \
+ FE_WIDE|FE_ULTRA|FE_DBLR|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
+ , \
{PCI_DEVICE_ID_NCR_53C875J,0xff, "875J", 6, 16, 5, \
FE_WIDE|FE_ULTRA|FE_DBLR|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
, \
@@ -498,6 +513,10 @@
{PCI_DEVICE_ID_NCR_53C895A, 0xff, "895a", 6, 31, 7, \
FE_WIDE|FE_ULTRA2|FE_QUAD|FE_CACHE_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM|\
FE_RAM8K|FE_64BIT|FE_IO256|FE_NOPM|FE_LEDC}\
+ , \
+ {PCI_DEVICE_ID_NCR_53C1510D, 0xff, "1510D", 7, 31, 7, \
+ FE_WIDE|FE_ULTRA2|FE_QUAD|FE_CACHE_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM|\
+ FE_IO256}\
}

/*
@@ -515,7 +534,8 @@
PCI_DEVICE_ID_NCR_53C885, \
PCI_DEVICE_ID_NCR_53C895, \
PCI_DEVICE_ID_NCR_53C896, \
- PCI_DEVICE_ID_NCR_53C895A \
+ PCI_DEVICE_ID_NCR_53C895A, \
+ PCI_DEVICE_ID_NCR_53C1510D \
}

/*


Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-11-01 19:16:53

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update


[ About SCSI tape corruption with sym53c8xx, some months ago ]

On Fri, 27 Jul 2001, Gérard Roudier wrote:
> On Fri, 27 Jul 2001, Geert Uytterhoeven wrote:
> > With some small modifications, I made 1.5a to work fine. No error burst. So the
> > problem is introduced between 1.5a and 1.5g.
>
> Fine! But diffs between 1.5a and 1.5g are still large. :(
> Results with 1.5c would have divided the diffs by about 2. :(
>
> > Unfortunately my DDS-1 drive seems to have died for real after this test :-(
> > I don't know yet whether I will replace it with a new tape drive or with a
> > CD-RW. Which means I may never find out which change caused the problem...
>
> I expect the problem to pong again to me. For now, I plan to look into the
> 1.5g-1.5a source diffs and inspect each change. But as I will be in
> vacation for the next two weeks, I will not be able to work on this
> problem immediately.

Any progress on this?

> > I assume other people suffer from the same error burst problem, but they never
> > notice until they really want to restore data. I myself only noticed it by
> > accident, too.
>
> Thanks for your testings and results.

I have good news and bad news. First the bad news...

In the meantime I replaced my broken HP C5136A with a Plextor PX-W1210S CD
rewriter. So far I only wrote CDRs and CDRWs with 2.2.18pre1 and a `known
good' (for tape) version of the Sym53c8xx driver.

I just tried writing a CDRW with test data (512000000 bytes, created with
http://home.tvd.be/cr26864/Download/genpseudorandom.c) using version 1.5g
of the Sym53c8xx driver, and guess what happened!

| Error burst of 18 bytes detected at offset 0x361be0
| Error data is a copy of the data at offset 0x3523e0 (shift = 63488)
| Error burst of 13 bytes detected at offset 0x361bf3
| Error data is a copy of the data at offset 0x3523f3 (shift = 63488)
| Error burst of 18 bytes detected at offset 0x6a8ba0
| Error data is a copy of the data at offset 0x6993a0 (shift = 63488)
| Error burst of 13 bytes detected at offset 0x6a8bb3
| Error data is a copy of the data at offset 0x6993b3 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x2b96980
| Error data is a copy of the data at offset 0x2b87180 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x3f351a0
| Error data is a copy of the data at offset 0x3f259a0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x46d4de0
| Error data is a copy of the data at offset 0x46c55e0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x8d2e7e0
| Error data is a copy of the data at offset 0x8d1efe0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x8f5dea0
| Error data is a copy of the data at offset 0x8f4e6a0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0xce732a0
| Error data is a copy of the data at offset 0xce63aa0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0xd0455a0
| Error data is a copy of the data at offset 0xd035da0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x114746a0
| Error data is a copy of the data at offset 0x11464ea0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1229e0a0
| Error data is a copy of the data at offset 0x1228e8a0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1246e620
| Error data is a copy of the data at offset 0x1245ee20 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x162cd2e0
| Error data is a copy of the data at offset 0x162bdae0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x16f82de0
| Error data is a copy of the data at offset 0x16f735e0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x16f84420
| Error data is a copy of the data at offset 0x16f74c20 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1703e6e0
| Error data is a copy of the data at offset 0x1702eee0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1760eaa0
| Error data is a copy of the data at offset 0x175ff2a0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x177ddbc0
| Error data is a copy of the data at offset 0x177ce3c0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1bcc5560
| Error data is a copy of the data at offset 0x1bcb5d60 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1bcc55e0
| Error data is a copy of the data at offset 0x1bcb5de0 (shift = 63488)
| Error burst of 32 bytes detected at offset 0x1d0fd260
| Error data is a copy of the data at offset 0x1d0eda60 (shift = 63488)
| Found a total of 670 errors (max 32 consecutive) in 23 error bursts

Thus the problem exists not only for tape drives, but also for CD writers! The
main difference is the shift of 63488 (= 31*2048) bytes for CDs, compared to
10240 (the block size of tar) for tapes.
So far I haven't found any evidence of a similar problem for hard disks.
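
The figures in the log above can be cross-checked with a few lines of
arithmetic. The helper names below are hypothetical and exist only for this
illustration; the offsets are copied from the log. One reading of the log,
offered as an observation rather than a confirmed diagnosis: each 18-byte
burst is followed one byte later by a 13-byte burst, so the pair spans exactly
32 bytes. That is what one would expect if a whole 32-byte block was replaced
and a single stale byte happened to equal the expected pseudorandom byte.

```c
#include <assert.h>

/* Distance between a corrupted offset and the offset its data came from. */
unsigned long burst_shift(unsigned long err_off, unsigned long src_off)
{
    assert(err_off >= src_off);
    return err_off - src_off;
}

/* Number of correct bytes between the end of one burst and the next one. */
unsigned long gap_between(unsigned long off, unsigned long len,
                          unsigned long next_off)
{
    assert(next_off >= off + len);
    return next_off - (off + len);
}

/* Total span from the start of the first burst to the end of the second. */
unsigned long pair_span(unsigned long off, unsigned long next_off,
                        unsigned long next_len)
{
    assert(next_off >= off);
    return next_off + next_len - off;
}

/* Returns 1 if `off` is a multiple of `align`, 0 otherwise. */
int is_aligned(unsigned long off, unsigned long align)
{
    assert(align > 0);
    return off % align == 0;
}
```

For the first two log entries, burst_shift(0x361be0, 0x3523e0) gives the
reported shift, gap_between(0x361be0, 18, 0x361bf3) gives the single matching
byte, and pair_span(0x361be0, 0x361bf3, 13) gives the 32-byte span. The total
also adds up: 2 * (18 + 13) + 19 * 32 = 670 errors in 23 bursts.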

Now the good news: I have hardware to try finding the source of this problem
again ;-) If I find some time, I'll try driver version 1.5c.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


2001-11-02 09:50:18

by Gérard Roudier

Subject: Re: SCSI Tape corruption - update



On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:

>
> [ About SCSI tape corruption with sym53c8xx, some months ago ]
>
> On Fri, 27 Jul 2001, Gérard Roudier wrote:
> > On Fri, 27 Jul 2001, Geert Uytterhoeven wrote:
> > > With some small modifications, I made 1.5a to work fine. No error burst. So the
> > > problem is introduced between 1.5a and 1.5g.
> >
> > Fine! But diffs between 1.5a and 1.5g are still large. :(
> > Results with 1.5c would have divided the diffs by about 2. :(
> >
> > > Unfortunately my DDS-1 drive seems to have died for real after this test :-(
> > > I don't know yet whether I will replace it with a new tape drive or with a
> > > CD-RW. Which means I may never find out which change caused the problem...
> >
> > I expect the problem to pong again to me. For now, I plan to look into the
> > 1.5g-1.5a source diffs and inspect each change. But as I will be in
> > vacation for the next two weeks, I will not be able to work on this
> > problem immediately.
>
> Any progress on this?

I looked into the diffs, but stuff was too large to get any clue from. :(

> > > I assume other people suffer from the same error burst problem, but they never
> > > notice until they really want to restore data. I myself only noticed it by
> > > accident, too.
> >
> > Thanks for your testings and results.
>
> I have good news and bad news. First the bad news...

/* Let's rearrange your report */
goto good_news; /* :-) */

> In the mean time I replaced my broken HP C5136A with a Plextor PX-W1210S CD
> rewriter. So far I only wrote CDRs and CDRWs with 2.2.18pre1 and a `known
> good' (for tape) version of the Sym53c8xx driver.
>
> I just tried writing a CDRW with test data (512000000 bytes, created with
> http://home.tvd.be/cr26864/Download/genpseudorandom.c) using version 1.5g
> of the Sym53c8xx driver, and guess what happened!
>
> | Error burst of 18 bytes detected at offset 0x361be0
> | Error data is a copy of the data at offset 0x3523e0 (shift = 63488)
> | Error burst of 13 bytes detected at offset 0x361bf3
> | Error data is a copy of the data at offset 0x3523f3 (shift = 63488)
> | Error burst of 18 bytes detected at offset 0x6a8ba0
> | Error data is a copy of the data at offset 0x6993a0 (shift = 63488)
> | Error burst of 13 bytes detected at offset 0x6a8bb3
> | Error data is a copy of the data at offset 0x6993b3 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x2b96980
> | Error data is a copy of the data at offset 0x2b87180 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x3f351a0
> | Error data is a copy of the data at offset 0x3f259a0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x46d4de0
> | Error data is a copy of the data at offset 0x46c55e0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x8d2e7e0
> | Error data is a copy of the data at offset 0x8d1efe0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x8f5dea0
> | Error data is a copy of the data at offset 0x8f4e6a0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0xce732a0
> | Error data is a copy of the data at offset 0xce63aa0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0xd0455a0
> | Error data is a copy of the data at offset 0xd035da0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x114746a0
> | Error data is a copy of the data at offset 0x11464ea0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1229e0a0
> | Error data is a copy of the data at offset 0x1228e8a0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1246e620
> | Error data is a copy of the data at offset 0x1245ee20 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x162cd2e0
> | Error data is a copy of the data at offset 0x162bdae0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x16f82de0
> | Error data is a copy of the data at offset 0x16f735e0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x16f84420
> | Error data is a copy of the data at offset 0x16f74c20 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1703e6e0
> | Error data is a copy of the data at offset 0x1702eee0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1760eaa0
> | Error data is a copy of the data at offset 0x175ff2a0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x177ddbc0
> | Error data is a copy of the data at offset 0x177ce3c0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1bcc5560
> | Error data is a copy of the data at offset 0x1bcb5d60 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1bcc55e0
> | Error data is a copy of the data at offset 0x1bcb5de0 (shift = 63488)
> | Error burst of 32 bytes detected at offset 0x1d0fd260
> | Error data is a copy of the data at offset 0x1d0eda60 (shift = 63488)
> | Found a total of 670 errors (max 32 consecutive) in 23 error bursts
>
> Thus the problem exists not only for tape drives, but also for CD writers! Main
> difference is the shift of 63488 (= 31*2048) for CDs, compared to 10240 (block
> size of tar) for tapes.

good_news: /* :) */

> So far I haven't found any evidence of a similar problem for hard disks.
>
> Now the good news: I have hardware to try finding the source of this problem
> again ;-) If I'll find some time, I'll try driver version 1.5c.

I have CD/RW. I can run your test process on my machines and see how it
behaves (2xPIII/LE chipset and 1xAthlon/KT266 chipset).

As the sym-2 driver is planned to replace sym53c8xx in the future, it would be
interesting to give it a try on your hardware. Some sources are
available from ftp.tux.org, but I can provide you with a flat patch
against the stock kernel version you want. Just let me know.

Regards,
Gérard.

2001-11-07 15:26:58

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Fri, 2 Nov 2001, Gérard Roudier wrote:
> On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> > [ About SCSI tape corruption with sym53c8xx, some months ago ]
> >
> > On Fri, 27 Jul 2001, Gérard Roudier wrote:
> > > On Fri, 27 Jul 2001, Geert Uytterhoeven wrote:
> > > > With some small modifications, I made 1.5a to work fine. No error burst. So the
> > > > problem is introduced between 1.5a and 1.5g.
> > >
> > > Fine! But diffs between 1.5a and 1.5g are still large. :(
> > > Results with 1.5c would have divided the diffs by about 2. :(

1.5c seems to be fine!

Still have to try 1.5d, 1.5g1, 1.5g2 and 1.5g3.
Are 1.5e and 1.5f available anywhere?

> As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> interesting to give it a try on your hardware. There are some source
> available from ftp.tux.org, but I can provide you with a flat patch
> against the stock kernel version you want. You may let me know.

I just saw the sym-2 driver enter the ac-series. As soon as I have a recent
2.4.x kernel on this box, I can give it a try...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-05 15:33:41

by Geert Uytterhoeven

Subject: Re: SCSI Tape corruption - update

On Fri, 2 Nov 2001, Gérard Roudier wrote:
> On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> interesting to give it a try on your hardware. There are some source
> available from ftp.tux.org, but I can provide you with a flat patch
> against the stock kernel version you want. You may let me know.

I tried sym-2 (2.4.17-pre2) and it didn't show the problem, which is good!

More news from the old driver:

    1.5c         OK
    1.5d         OK
    1.5e         page fault in interrupt handler 0xa53c0c68
    1.5f         lock up
    1.5pre-g1    lock up
    1.5pre-g2    lock up
    1.5pre-g3    corruption
    1.5g         corruption

So it happened somewhere in between 1.5d and 1.5pre-g3. I'll see whether I can
get any of the intermediates to run...
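
The narrowing step above can be sketched as follows: with an ordered list of test results, the suspect range runs from the last version known good to the first version showing corruption, and versions that crash or lock up stay inside that range until they can be made to run. This is only an illustration; the enum, the function name and the assumption that the bug never disappears again once introduced are not from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of one test run. */
enum result { RESULT_OK, RESULT_CORRUPTION, RESULT_UNTESTABLE };

/* Scans results ordered from oldest to newest version. On success stores the
 * index of the last good version before the first corrupting one and returns
 * 1. Returns 0 if no good version precedes the first corruption. */
int suspect_range(const enum result *results, size_t count,
                  size_t *last_good, size_t *first_bad)
{
    size_t i;
    int have_good = 0;

    assert(results != NULL && last_good != NULL && first_bad != NULL);
    for (i = 0; i < count; i++) {
        if (results[i] == RESULT_OK) {
            *last_good = i;
            have_good = 1;
        } else if (results[i] == RESULT_CORRUPTION) {
            *first_bad = i;
            return have_good;
        }
        /* RESULT_UNTESTABLE: neither proves nor clears the version. */
    }
    return 0;
}
```

For the eight results listed above this yields index 1 (1.5d) and index 6 (1.5pre-g3), which is the range stated in the text.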

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-28 20:36:40

by Geert Uytterhoeven

Subject: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)

On Wed, 5 Dec 2001, Geert Uytterhoeven wrote:
> On Fri, 2 Nov 2001, Gérard Roudier wrote:
> > On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> > As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> > interesting to give it a try on your hardware. There are some source
> > available from ftp.tux.org, but I can provide you with a flat patch
> > against the stock kernel version you want. You may let me know.
>
> I tried sym-2 (2.4.17-pre2) and it didn't show up the problem, which is good!
>
> More news from the old driver:
>
> 1.5c OK
> 1.5d OK
> 1.5e page fault in interrupt handler 0xa53c0c68
> 1.5f lock up
> 1.5pre-g1 lock up
> 1.5pre-g2 lock up
> 1.5pre-g3 corruption
> 1.5g corruption
>
> So it happened somewhere in between 1.5d and 1.5pre-g3. I'll see whether I can
> get any of the intermediates to run...

I got all intermediate versions to work.

The problem is introduced in 1.5pre-g2 by the following change:

diff -urN callisto-1.5g-pre2a/sym53c8xx.c callisto-1.5g-pre2+/sym53c8xx.c
--- callisto-1.5g-pre2a/sym53c8xx.c Fri Dec 28 21:12:30 2001
+++ callisto-1.5g-pre2+/sym53c8xx.c Fri Dec 28 20:11:10 2001
@@ -11981,7 +11981,7 @@
     **    (latency timer >= burst length + 6, we add 10 to be quite sure)
     */

-    if ((pci_fix_up & 4) && chip->burst_max) {
+    if (chip->burst_max && (latency_timer == 0 || (pci_fix_up & 4))) {
         uchar lt = (1 << chip->burst_max) + 6 + 10;
         if (latency_timer < lt) {
             printk(NAME53C8XX

This change causes the PCI latency timer to be changed from 0 to 80.
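
The value 80 is consistent with burst_max = 6, since (1 << 6) + 6 + 10 = 80. The sketch below models the test before and after the change. It assumes that the truncated body of the inner `if` raises the latency timer to `lt`, that burst_max is 6, and that bit 4 of pci_fix_up is not set on this machine; none of these is shown in the hunk, and the function names are made up for the example.

```c
#include <assert.h>

/* Minimum latency timer the driver asks for: burst length + 6, plus 10. */
unsigned min_latency(unsigned burst_max)
{
    assert(burst_max <= 7);   /* keeps the result within a uchar */
    return (1u << burst_max) + 6 + 10;
}

/* Returns the latency timer value that ends up programmed.
 * new_rule == 0 models the old condition, new_rule != 0 the 1.5pre-g2 one. */
unsigned fixed_up_latency(unsigned latency_timer, unsigned burst_max,
                          unsigned pci_fix_up, int new_rule)
{
    int apply;

    if (new_rule)
        apply = burst_max && (latency_timer == 0 || (pci_fix_up & 4));
    else
        apply = (pci_fix_up & 4) && burst_max;

    if (apply) {
        unsigned lt = min_latency(burst_max);
        if (latency_timer < lt)
            return lt;   /* assumed: the driver writes lt back */
    }
    return latency_timer;
}
```

Under these assumptions a latency timer that the firmware left at 0 stays at 0 with the old test and becomes 80 with the new one, while a non-zero value is left alone in both cases.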

The sym-2 driver has a define for modifying the PCI latency timer
(SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.

Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
driver (wrong burst_max)?

To recapitulate, the bug causes error bursts that are (almost always) 32 bytes long.
The incorrect bytes are always a copy of previous data, at a fixed offset (10
kiB on my (now dead) DDS-1 tape drive, 32 kiB on my Plexwriter).

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-28 23:55:43

by Gérard Roudier

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)



On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:

> On Wed, 5 Dec 2001, Geert Uytterhoeven wrote:
> > On Fri, 2 Nov 2001, Gérard Roudier wrote:
> > > On Thu, 1 Nov 2001, Geert Uytterhoeven wrote:
> > > As driver sym-2 is planned to replace sym53c8xx in the future, it would be
> > > interesting to give it a try on your hardware. There are some source
> > > available from ftp.tux.org, but I can provide you with a flat patch
> > > against the stock kernel version you want. You may let me know.
> >
> > I tried sym-2 (2.4.17-pre2) and it didn't show up the problem, which is good!
> >
> > More news from the old driver:
> >
> > 1.5c OK
> > 1.5d OK
> > 1.5e page fault in interrupt handler 0xa53c0c68
> > 1.5f lock up
> > 1.5pre-g1 lock up
> > 1.5pre-g2 lock up
> > 1.5pre-g3 corruption
> > 1.5g corruption
> >
> > So it happened somewhere in between 1.5d and 1.5pre-g3. I'll see whether I can
> > get any of the intermediates to run...
>
> I made all intermediate versions to work.
>
> The problem is introduced in 1.5pre-g2 by the following change:
>
> diff -urN callisto-1.5g-pre2a/sym53c8xx.c callisto-1.5g-pre2+/sym53c8xx.c
> --- callisto-1.5g-pre2a/sym53c8xx.c Fri Dec 28 21:12:30 2001
> +++ callisto-1.5g-pre2+/sym53c8xx.c Fri Dec 28 20:11:10 2001
> @@ -11981,7 +11981,7 @@
> ** (latency timer >= burst length + 6, we add 10 to be quite sure)
> */
>
> - if ((pci_fix_up & 4) && chip->burst_max) {
> + if (chip->burst_max && (latency_timer == 0 || (pci_fix_up & 4))) {
> uchar lt = (1 << chip->burst_max) + 6 + 10;
> if (latency_timer < lt) {
> printk(NAME53C8XX
>
> This change causes the PCI latency timer to be changed from 0 to 80.
>
> The sym-2 driver has a define for modifying the PCI latency timer
> (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.

By default sym-2 uses value 3 for pci_fix_up (cache line size + memory
write and invalidate). The latency timer fix-up has been removed, since it
is rather up to the generic PCI driver to tune latency timers.

> Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> driver (wrong burst_max)?

Great bug hunting!

It is almost certainly not a software bug in the driver. No latency timer
value should cause any trouble if the hardware were flawless. Only the PCI
performance could be affected.

Anyway, value 0 looks way stupid for devices capable of bursting more than
1 data phase, thus the improvement above. :)

> To recapitulate, the bug causes error bursts of (almost always) 32 bytes long.
> The incorrect bytes are always a copy of previous data, at a fixed offset (10
> kiB on my (now dead) DDS-1 tape drive, 32 kiB on my Plexwriter).

Unfortunately, I don't have the errata listing for the 53c875 rev 4. I have
the DEL for 875 rev. 3 and for 876 rev. 5.

If we assume that rev 4 hasn't more bugs than rev 3, then you may try to
disable MEMORY WRITE and INVALIDATE (and not tell the driver to fix this
up) but allow the driver to fix the bogus zero latency timer. The 875 rev
3 may, under certain conditions, execute unaligned PCI MEMORY WRITE and
INVALIDATE transactions. Note that this may explain data corruptions
occurring for SCSI READ commands not WRITE commands. No other items can
explain, on paper, data corruptions of the form you describe due to 875
chip misbehaviour.

Btw, latency timer zero should not change the likelihood of this item.
This leads me to think that the host bridge is likely to be the culprit.

> Gr{oetje,eeting}s,

Gr{oudier,eat bug hunting, indeed}. :)

2001-12-29 10:49:49

by Geert Uytterhoeven

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)

On Sat, 29 Dec 2001, Gérard Roudier wrote:
> On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > The problem is introduced in 1.5pre-g2 by the following change:

[...]

> > This change causes the PCI latency timer to be changed from 0 to 80.
> >
> > The sym-2 driver has a define for modifying the PCI latency timer
> > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
>
> By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> write and invalidate). The latency timer fix-up has been removed, since it
> is rather up to the generic PCI driver to tune latency timers.
>
> > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > driver (wrong burst_max)?
>
> Great bug hunting!
>
> It is about certainly not a software bug in the driver. Any latency timer
> value should not give any trouble if hardware was flawless. Just the PCI
> performances could be affected.
>
> Anyway, value 0 looks way stupid for devices capable of bursting more than
> 1 data phase, thus the improvement above. :)

OK.

> > To recapitulate, the bug causes error bursts of (almost always) 32 bytes long.
> > The incorrect bytes are always a copy of previous data, at a fixed offset (10
> > kiB on my (now dead) DDS-1 tape drive, 32 kiB on my Plexwriter).
>
> Unfortunately, I haven't the errata listing for teh 53c875 rev 4. I have
> the DEL for 875 rev. 3 and for 876 rev. 5.

And I'm afraid I won't be able to get errata for the VLSI VAS96011/12 :-( Of
course I can always give it a try...

> If we assume that rev 4 hasn't more bugs than rev 3, then you may try to
> disable MEMORY WRITE and INVALIDATE (and not tell the driver to fix this
> up) but allow the driver to fix the bogus zero latency timer. The 875 rev
> 3 may, under certain conditions, execute unaligned PCI MEMORY WRITE and
> INVALIDATE transactions. Note that this may explain data corruptions
> occurring for SCSI READ commands not WRITE commands. No other items can
> explain, on paper, data corruptions of the form you describe due to 875
> chip misbehaviour.

I'll give that a try...

> Btw, latency timer zero should not change the likelyhood of this item.
> This let me think that the host bridge is likely to be the culprit.

Hmmm... I'm still wondering why I see the problem when writing to tape or
CD-R(W), while I can't provoke it when writing to disk (Quantum Viking II U2W).

What's so special about tape and CD-R?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-29 13:23:44

by Geert Uytterhoeven

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)

On Sat, 29 Dec 2001, Gérard Roudier wrote:
> On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > The sym-2 driver has a define for modifying the PCI latency timer
> > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
>
> By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> write and invalidate). The latency timer fix-up has been removed, since it
> is rather up to the generic PCI driver to tune latency timers.
>
> > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > driver (wrong burst_max)?
>
> Great bug hunting!
>
> It is about certainly not a software bug in the driver. Any latency timer
> value should not give any trouble if hardware was flawless. Just the PCI
> performances could be affected.

I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
latency timer value is smaller than 0x16 (yes, at first I thought it was
decimal, but setpci parameters are in hex).
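
Since setpci takes its values in hexadecimal, a typed "16" means 22 decimal. The following sketch makes the conversion explicit and encodes the threshold observed on this one machine. The function names and the macro are made up for the example, and 0x16 is an observation from this report, not a general PCI limit.

```c
#include <assert.h>
#include <stdlib.h>

/* First latency timer value at which corruption was seen in this report. */
#define OBSERVED_BAD_LATENCY 0x16u

/* Parses a value the way setpci reads it: always base 16, with or without
 * a leading "0x". */
unsigned parse_setpci_value(const char *text)
{
    assert(text != NULL);
    return (unsigned)strtoul(text, NULL, 16);
}

/* Returns 1 if the value was below the threshold observed on this machine. */
int latency_was_safe(unsigned latency_timer)
{
    return latency_timer < OBSERVED_BAD_LATENCY;
}
```

So "15" (21 decimal) was still fine here, while "16" (22 decimal) and the driver's 80 (0x50) were not.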

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-29 17:37:38

by Gérard Roudier

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)


On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:

> On Sat, 29 Dec 2001, Gérard Roudier wrote:
> > On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > > The sym-2 driver has a define for modifying the PCI latency timer
> > > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> >
> > By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> > write and invalidate). The latency timer fix-up has been removed, since it
> > is rather up to the generic PCI driver to tune latency timers.
> >
> > > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > > driver (wrong burst_max)?
> >
> > Great bug hunting!
> >
> > It is about certainly not a software bug in the driver. Any latency timer
> > value should not give any trouble if hardware was flawless. Just the PCI
> > performances could be affected.
>
> I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> latency timer value is smaller than 0x16 (yes, at first I thought it was
> decimal, but setpci parameters are in hex).

Interesting result, even if it doesn't trigger any of my guessing
capabilities, for now. :-)

It just means that the 875 must release the PCI bus if its GNT# signal is
deasserted by PCI arbiter and current transaction lasted 22 PCI cycles or
more since the assertion of FRAME#.
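
The rule just stated can be written down as a small model. This is only a toy sketch of the latency timer rule: it counts whole PCI clocks since FRAME#, ignores wait states and target-initiated termination, and every name in it is invented for the example.

```c
#include <assert.h>

/* Returns 1 when the master must end its current transaction: the arbiter
 * has removed GNT# and the latency timer, counted in PCI clocks since the
 * assertion of FRAME#, has expired. */
int must_release_bus(unsigned latency_timer, unsigned clocks_since_frame,
                     int gnt_asserted)
{
    return !gnt_asserted && clocks_since_frame >= latency_timer;
}

/* Length in clocks of a transaction for which the master wants
 * wanted_clocks clocks, if the arbiter removes GNT# at clock gnt_removed_at. */
unsigned transaction_length(unsigned latency_timer, unsigned wanted_clocks,
                            unsigned gnt_removed_at)
{
    unsigned clock;

    assert(wanted_clocks > 0);
    for (clock = 1; clock < wanted_clocks; clock++) {
        int gnt_asserted = clock < gnt_removed_at;

        if (must_release_bus(latency_timer, clock, gnt_asserted))
            break;
    }
    return clock;
}
```

In this model a latency timer of 0 ends the transaction as soon as GNT# goes away, a value of 0x16 lets it run until clock 22, and a value of 80 never cuts a 40-clock transaction short.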

If I remember correctly, the problem occurred when data is written to the
device. Is that right?

If so, the MWI problem I pointed out in my previous posting is unlikely to
apply. But, for user data DMA write, the 875 may execute Memory Read Line
or Memory Read Multiple transactions. It would be interesting to
know whether disabling those capabilities makes a difference.

Setting the PCI cache line register in the PCI configuration space to zero
forces the chip not to use any of the cache line based PCI
transactions. It is brute force but should work.

To disable those features separately, some IO register bits must
be set to zero. The fastest way is to hack the driver (sym_hipd.c) at some
place, for example (entered by hand just for you):

    /*
     *  Select all supported special features.
     *  If we are using on-board RAM for scripts, prefetch (PFEN)
     *  does not help, but burst op fetch (BOF) does.
     *  Disabling PFEN makes sure BOF will be used.
     */
    if (np->features & FE_ERL)
        np->rv_dmode |= ERL;    /* Enable Read Line */
    if (np->features & FE_BOF)
        np->rv_dmode |= BOF;    /* Burst Opcode Fetch */
    if (np->features & FE_ERMP)
        np->rv_dmode |= ERMP;   /* Enable Read Multiple */
#if 1
    if ((np->features & FE_PFEN) && !np->ram_ba)
#else
    if (np->features & FE_PFEN)
#endif
        np->rv_dcntl |= PFEN;   /* Prefetch Enable */
    if (np->features & FE_CLSE)
        np->rv_dcntl |= CLSE;   /* Cache Line Size Enable */
    if (np->features & FE_WRIE)
        np->rv_ctest3 |= WRIE;  /* Write and Invalidate */
    if (np->features & FE_DFS)
        np->rv_ctest5 |= DFS;   /* Dma Fifo Size */

+ #if 0 /* Disable all cache line based features */
+     np->rv_dcntl &= ~CLSE;
+ #endif
+ #if 1 /* Disable Read Line */
+     np->rv_dmode &= ~ERL;
+ #endif
+ #if 1 /* Disable Read Multiple */
+     np->rv_dmode &= ~ERMP;
+ #endif
+ #if 0 /* Disable Write and Invalidate */
+     np->rv_ctest3 &= ~WRIE;
+ #endif

This example disables Read Line and Memory Read Multiple. I just added
provisions (#if'ed zero) for other bits that also apply to cache line
based transactions.

Gérard.

2001-12-29 21:28:39

by Geert Uytterhoeven

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)

On Sat, 29 Dec 2001, Gérard Roudier wrote:
> On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:
> > On Sat, 29 Dec 2001, Gérard Roudier wrote:
> > > On Fri, 28 Dec 2001, Geert Uytterhoeven wrote:
> > > > The sym-2 driver has a define for modifying the PCI latency timer
> > > > (SYM_SETUP_PCI_FIX_UP), but it is never used, so I see no corruption.
> > >
> > > By default sym-2 use value 3 for the pci_fix_up (cache line size + memory
> > > write and invalidate). The latency timer fix-up has been removed, since it
> > > is rather up to the generic PCI driver to tune latency timers.
> > >
> > > > Is this a hardware bug in my SCSI host adapter (53c875 rev 04) or my host
> > > > bridge (VLSI VAS96011/12 Golden Gate II for PPC), or a software bug in the
> > > > driver (wrong burst_max)?
> > >
> > > Great bug hunting!
> > >
> > > It is about certainly not a software bug in the driver. Any latency timer
> > > value should not give any trouble if hardware was flawless. Just the PCI
> > > performances could be affected.
> >
> > I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> > latency timer value is smaller than 0x16 (yes, at first I thought it was
> > decimal, but setpci parameters are in hex).
>
> Interesting result, even if it doesn't trigger any of my guessing
> capabilities, for now. :-)
>
> Just it means that the 875 must release the PCI BUS if its GNT# signal is
> deasserted by PCI arbiter and current transaction lasted 22 PCI cycles or
> more since the assertion of FRAME#.

Exactly my thoughts.

> If I remember correctly, the problem occurred when data is written to the
> device. Is it ok?

Yes.

> If so, the MWI problem I pointed out in my previous posting is unlikely to
> apply. But, for user data DMA write, the 875 may execute Memory Read Line
> or Memory Read Multiple Lines transactions. It would be interesting to
> know if it makes difference disabling those capabilities.
>
> Setting to zero the PCI cache line register in the PCI configuration space
> does force the chip not to use any of the cache line based PCI
> transactions. It is brute force but should work.

Note that on my system the PCI cache line register in the PCI configuration
space of the '875 is already set to zero.

> In order to disable separately those features, some IO register bits must
> be set to zero. The faster way is to hack the driver (sym_hipd.c) at some
> place, for example (entered by hand just for you):

So I don't think it would help to test this, since PCI_CACHE_LINE_SIZE is set
to 0?

Anyway, thanks for your time and suggestions!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-12-29 22:59:27

by Gérard Roudier

Subject: Re: Sym53c8xx tape corruption squashed! (was: Re: SCSI Tape corruption - update)



On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:

> On Sat, 29 Dec 2001, Gérard Roudier wrote:
> > On Sat, 29 Dec 2001, Geert Uytterhoeven wrote:
[...]
> > > I played a bit with sym-2 and setpci. Everything goes fine as long as the PCI
> > > latency timer value is smaller than 0x16 (yes, at first I thought it was
> > > decimal, but setpci parameters are in hex).
> >
> > Interesting result, even if it doesn't trigger any of my guessing
> > capabilities, for now. :-)
> >
> > Just it means that the 875 must release the PCI BUS if its GNT# signal is
> > deasserted by PCI arbiter and current transaction lasted 22 PCI cycles or
> > more since the assertion of FRAME#.
>
> Exactly my thoughts.

Note that this looks a bit less than 8 DWORDs. If your beast uses such a
cache line size, this may be related.

> > If I remember correctly, the problem occurred when data is written to the
> > device. Is it ok?
>
> Yes.
>
> > If so, the MWI problem I pointed out in my previous posting is unlikely to
> > apply. But, for user data DMA write, the 875 may execute Memory Read Line
> > or Memory Read Multiple Lines transactions. It would be interesting to
> > know if it makes difference disabling those capabilities.
> >
> > Setting to zero the PCI cache line register in the PCI configuration space
> > does force the chip not to use any of the cache line based PCI
> > transactions. It is brute force but should work.
>
> Note that on my system the PCI cache line register in the PCI configuration
> space of the '875 is already set to zero.

Then, the 875 never used cache line based PCI transactions.

> > In order to disable separately those features, some IO register bits must
> > be set to zero. The faster way is to hack the driver (sym_hipd.c) at some
> > place, for example (entered by hand just for you):
>
> So I don't think it would help to test this, since PCI_CACHE_LINE_SIZE is set
> to 0?

Indeed. At least your system hasn't been bitten by PCI cache line related
bugs.

I do not know how the 875 behaves when supplied with a zero latency timer.
Normally it should consider the timeout to happen immediately, but it must
and is allowed to perform at least one data phase. Under this hypothesis, and
given that a latency timer of 22 PCI clocks or more causes problems, I may
risk the following:

Your hardware (probably the host bridge) is only able as a PCI target to
provide a limited amount of data for the current PCI read transaction in
some circumstances. It deasserts GNT#. If the master wants more data it
has to force transaction termination, otherwise it is the master that will
terminate the transaction.

Then the cause of the problem could be something like:

- The host bridge is unable to terminate a read transaction as a target
and just feeds the master with stale data if it cannot get good ones.
(Unlikely, but why not?)
- The host bridge just does not terminate the transaction in time in
some circumstances and provides stale data to the master until the
transaction terminates.

This is an example, probably just wrong.

Anyway, given that short latency timers hide (fix?) the problem, your
system seems to cope much better when PCI transactions are preferably
(always?) terminated by the master.

Gérard.