While doing some tests with my DDS drive, I noticed some file corruption. The
data was written to the tape incorrectly, since reading always gives the same
result. The drive didn't notice any write error.
My test consisted of tarring up some kernel sources and splitting the tarball
into files of 16 MB. Then I archived the files on the tape using tar.
Some of the files are corrupted. Each corruption consists of 32 consecutive
bytes being changed. The corrupted bytes are not random, but seem to contain
parts of the kernel sources. This indicates that some data got copied from
somewhere in the buffer cache.
The disk with the test files is connected to a MESH SCSI interface.
The tape drive is a HP C1536 DDS and is connected to a Sym53c875 SCSI card.
The machine is a CHRP LongTrail running the 2.4.3-pre4 kernel for PPC.
Has anyone seen something similar?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Mon, 19 Mar 2001, Jeff Garzik wrote:
> Is the corruption reproducible? If so, does the corruption go away if
Yes, it is reproducible. In all my tests, I tarred 16 files of 16 MB each to
tape (I used a new one).
- test 1: 4 files with failed md5sum (no further investigation on type of
corruption)
- test 2: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
corrupted, all starting at an offset of the form 32*x+1.
- test 3: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
corrupted, all starting at an offset of the form 32*x+1.
The files seem to be corrupted during writing only, as reading always gives the
exact same (corrupted) data back.
Copying files from the disk on the MESH to a disk on the Sym53c875 (which also
has the tape drive) shows no corruption.
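The comparison behind the test results above can be sketched as follows. This
is only a minimal illustration, assuming both the original file and the data
read back from tape are already in memory; the names diff_run and
find_diff_runs are hypothetical and not taken from any tool used in the tests.

```c
#include <stddef.h>

/* One run of consecutive bytes that differ between two buffers. */
struct diff_run {
    size_t offset;  /* 0-based offset of the first differing byte */
    size_t length;  /* number of consecutive differing bytes */
};

/*
 * Compare orig[] and readback[] (both n bytes long) and store up to
 * max_runs runs of differing bytes in runs[]. Returns the total number
 * of runs found, which may be larger than max_runs.
 */
size_t find_diff_runs(const unsigned char *orig, const unsigned char *readback,
                      size_t n, struct diff_run *runs, size_t max_runs)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        if (orig[i] == readback[i]) {
            i++;
            continue;
        }
        /* Start of a differing run: extend it as far as it goes. */
        size_t start = i;
        while (i < n && orig[i] != readback[i])
            i++;
        if (count < max_runs) {
            runs[count].offset = start;
            runs[count].length = i - start;
        }
        count++;
    }
    return count;
}
```

Offsets are counted from 0 here. A run can come out shorter than 32 bytes, or
split in two, when some of the wrong bytes happen to equal the correct ones, so
short runs inside one 32-byte window should be read as a single corrupted block.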
> you rip out the scsi_error patch in 2.4.3-preXX?
After reverting that patch, the problem got worse:
- test 4: 15 files with failed md5sum, a total of 40 blocks of 32 consecutive
bytes were corrupted, all starting at an offset of the form 32*x+1.
So it seems to be related to scsi_error.c.
If you have some suggestions, I'm willing to try them. I'd like to trust
whatever Amanda writes to my backup tapes :-)
Gr{oetje,eeting}s,
Geert
On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> On Mon, 19 Mar 2001, Jeff Garzik wrote:
> > Is the corruption reproducible? If so, does the corruption go away if
>
> Yes, it is reproducible. In all my tests, I tarred 16 files of 16 MB each to
> tape (I used a new one).
> - test 1: 4 files with failed md5sum (no further investigation on type of
> corruption)
> - test 2: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> corrupted, all starting at an offset of the form 32*x+1.
> - test 3: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> corrupted, all starting at an offset of the form 32*x+1.
>
> The files seem to be corrupted during writing only, as reading always gives the
> exact same (corrupted) data back.
>
> Copying files from the disk on the MESH to a disk on the Sym53c875 (which also
> has the tape drive) shows no corruption.
I did some more tests:
- The problem also occurs when tarring up files from a disk on the Sym53c875.
- The corrupted data always occurs at offset 32*x (the `+1' above was caused
by hexdump, starting counting at 1).
- The 32 bytes of corrupted data at offset 32*x are always a copy of the data
at offset 32*x-10240.
- Since 10240 is the default blocksize of tar (bug in tar?), I made a tarball
on disk instead of on tape, but no corruption.
- 32 is the size of a cacheline on PPC. Is there a missing cacheflush
somewhere in the Sym53c875 driver? But then it should happen on disk as
well?
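The "copy of the data at offset 32*x-10240" observation can be checked
mechanically. The sketch below is a minimal example, assuming the original and
the read-back data are both in memory; is_stale_copy and the two macros are
hypothetical names. Note that 10240 = 20 * 512 (tar's default record size)
and 10240 = 320 * 32, so a whole number of cache lines fits in one record.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32u     /* PPC cache line size mentioned above */
#define TAR_RECORD 10240u  /* tar default record: 20 blocks of 512 bytes */

/*
 * Returns 1 if the CACHE_LINE bytes at offset `off` in the data read back
 * from tape are a copy of the original data exactly one tar record earlier,
 * and 0 otherwise. `off` must be a multiple of CACHE_LINE, at least
 * TAR_RECORD, and the block must fit inside the n-byte buffers.
 */
int is_stale_copy(const unsigned char *orig, const unsigned char *readback,
                  size_t n, size_t off)
{
    if (off % CACHE_LINE != 0 || off < TAR_RECORD || off + CACHE_LINE > n)
        return 0;
    return memcmp(readback + off, orig + off - TAR_RECORD, CACHE_LINE) == 0;
}
```

If every corrupted block passes this check, the data on tape would be whatever
sat at the same position in the previous 10240-byte record.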
Gr{oetje,eeting}s,
Geert
Hello everybody,
This is a message on behalf of a friend who is not subscribed to the list:
It's about an ASUS board that has this ncr53-1010 dual 160 SCSI
controller (sym53c1010).
On both latest kernels (2.2.18ac19 AND 2.4.2ac18) the log and console
are filled with this:
sym53c1010-33-0: unable to abort current chip operation.
sym53c1010-33-0: Downloading SCSI SCRIPTS.
sym53c8xx_reset: pid=0 reset_flags=2 ...
and the controller suddenly blocks and the system has to be restarted.
Does anyone know the meaning of these messages and what the matter is? Do you
need more details or anything else?
Regards,
Mircea C.
On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > On Mon, 19 Mar 2001, Jeff Garzik wrote:
> > > Is the corruption reproducible? If so, does the corruption go away if
> >
> > Yes, it is reproducible. In all my tests, I tarred 16 files of 16 MB each to
> > tape (I used a new one).
> > - test 1: 4 files with failed md5sum (no further investigation on type of
> > corruption)
> > - test 2: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> > corrupted, all starting at an offset of the form 32*x+1.
> > - test 3: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> > corrupted, all starting at an offset of the form 32*x+1.
> >
> > The files seem to be corrupted during writing only, as reading always gives the
> > exact same (corrupted) data back.
> >
> > Copying files from the disk on the MESH to a disk on the Sym53c875 (which also
> > has the tape drive) shows no corruption.
>
> I did some more tests:
> - The problem also occurs when tarring up files from a disk on the Sym53c875.
> - The corrupted data always occurs at offset 32*x (the `+1' above was caused
> by hexdump, starting counting at 1).
> - The 32 bytes of corrupted data at offset 32*x are always a copy of the data
> at offset 32*x-10240.
> - Since 10240 is the default blocksize of tar (bug in tar?), I made a tarball
> on disk instead of on tape, but no corruption.
> - 32 is the size of a cacheline on PPC. Is there a missing cacheflush
> somewhere in the Sym53c875 driver? But then it should happen on disk as
> well?
The only PCI transaction that requires the cache line size to be correctly
configured is PCI WRITE and INVALIDATE. This transaction may be used by
the 875 only for data read from a SCSI device and DMAed to memory.
Note that the controller may use optimized PCI transactions only if the
cache line size is configured in its PCI device configuration space.
Otherwise only normal PCI memory read and PCI memory write transactions
will be used.
Could you check if the cache line size is configured for your 875?
Let me assume it is so. Btw, I may be wasting my time if it is not ...
Then the 875 may also use PCI read multiple transactions and/or PCI read
line transactions when reading data from memory. If the corruption is due
to the use of these transactions, then the PCI-HOST bridges may well be the
culprit, in my opinion.
Anyway, since the sym53c8xx driver does not try to change the configured
cache line size on PPC, I would suggest running the same tests again with
the cache line size set to zero for the 875. If needed, you may hack the
driver code or the PPC PCI code so that the value zero is written to the
proper place in the PCI configuration space of the 875.
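For reference, the check asked for above can be sketched in user space against
a raw copy of the device's PCI configuration header. This is only an
illustration: the function names are hypothetical, and obtaining the 64 header
bytes is left out. The register offsets are the standard ones (command
register at 0x04, cache line size at 0x0c, counted in 32-bit words).

```c
#include <stdint.h>

#define PCI_COMMAND            0x04  /* 16-bit command register, little-endian */
#define PCI_COMMAND_INVALIDATE 0x10  /* Memory Write and Invalidate enable bit */
#define PCI_CACHE_LINE_SIZE    0x0c  /* 8-bit register, unit is 32-bit words */

/* Configured cache line size in bytes; 0 means "not configured". */
unsigned int cache_line_bytes(const uint8_t *cfg)
{
    return cfg[PCI_CACHE_LINE_SIZE] * 4u;
}

/*
 * Returns 1 if Memory Write and Invalidate could be used at all: the
 * enable bit is set in the command register AND a cache line size is
 * configured. Returns 0 otherwise.
 */
int mwi_possible(const uint8_t *cfg)
{
    uint16_t cmd = (uint16_t)(cfg[PCI_COMMAND] | (cfg[PCI_COMMAND + 1] << 8));

    return (cmd & PCI_COMMAND_INVALIDATE) != 0 && cache_line_bytes(cfg) != 0;
}
```

A register value of 8 corresponds to the 32-byte PPC cache line. Inside a
driver, forcing the register to zero would presumably be done with
pci_write_config_byte() on PCI_CACHE_LINE_SIZE; where exactly to place that
call is not something this sketch tries to answer.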
Gérard.
On Tue, 20 Mar 2001, Gérard Roudier wrote:
> On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > > On Mon, 19 Mar 2001, Jeff Garzik wrote:
> > > > Is the corruption reproducible? If so, does the corruption go away if
> > >
> > > Yes, it is reproducible. In all my tests, I tarred 16 files of 16 MB each to
> > > tape (I used a new one).
> > > - test 1: 4 files with failed md5sum (no further investigation on type of
> > > corruption)
> > > - test 2: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> > > corrupted, all starting at an offset of the form 32*x+1.
> > > - test 3: 7 files with failed md5sum, 7 blocks of 32 consecutive bytes were
> > > corrupted, all starting at an offset of the form 32*x+1.
> > >
> > > The files seem to be corrupted during writing only, as reading always gives the
> > > exact same (corrupted) data back.
> > >
> > > Copying files from the disk on the MESH to a disk on the Sym53c875 (which also
> > > has the tape drive) shows no corruption.
> >
> > I did some more tests:
> > - The problem also occurs when tarring up files from a disk on the Sym53c875.
> > - The corrupted data always occurs at offset 32*x (the `+1' above was caused
> > by hexdump, starting counting at 1).
> > - The 32 bytes of corrupted data at offset 32*x are always a copy of the data
> > at offset 32*x-10240.
> > - Since 10240 is the default blocksize of tar (bug in tar?), I made a tarball
> > on disk instead of on tape, but no corruption.
> > - 32 is the size of a cacheline on PPC. Is there a missing cacheflush
> > somewhere in the Sym53c875 driver? But then it should happen on disk as
> > well?
>
> The only PCI transaction that requires the cache line size to be correctly
> configured is PCI WRITE and INVALIDATE. This transaction may be used by
> the 875 only for data read from a SCSI device and DMAed to memory.
So if this were the problem, I should see the corruption when reading files
from disks too? But my tests indicate it happens when writing to tape only, not
when reading from tape, nor when copying between disks.
> Note that the controller may use optimized PCI transactions only if the
> cache line size is configured in its PCI device configuration space.
> Otherwise only normal PCI memory read and PCI memory write transactions
> will be used.
>
> Could you check if the cache line size is configured for your 875?
>
> Let me imagine it is so. Btw, I may be wasting my time if it is not ...
> Then the 875 may also use PCI read multiple transactions and/or PCI read
> line transactions when reading data from memory. If the corruption is due
> to the use of these transactions, then the PCI-HOST bridges may well be the
> culprit, in my opinion.
>
> Anyway, since the sym53c8xx driver does not try to change the configured
> cache line size on PPC, I would suggest to try again the same tests with
> the cache line size set to zero for the 875. You may hack the driver code
> or the PPC pci code if needed, for example, for value zero to be written
> in the proper place in the PCI configuration space of the 875.
I'll try that.
Gr{oetje,eeting}s,
Geert
On Tue, 20 Mar 2001, Gérard Roudier wrote:
> On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > > On Mon, 19 Mar 2001, Jeff Garzik wrote:
> > I did some more tests:
> > - The problem also occurs when tarring up files from a disk on the Sym53c875.
> > - The corrupted data always occurs at offset 32*x (the `+1' above was caused
> > by hexdump, starting counting at 1).
> > - The 32 bytes of corrupted data at offset 32*x are always a copy of the data
> > at offset 32*x-10240.
> > - Since 10240 is the default blocksize of tar (bug in tar?), I made a tarball
> > on disk instead of on tape, but no corruption.
> > - 32 is the size of a cacheline on PPC. Is there a missing cacheflush
> > somewhere in the Sym53c875 driver? But then it should happen on disk as
> > well?
>
> The only PCI transaction that requires the cache line size to be correctly
> configured is PCI WRITE and INVALIDATE. This transaction may be used by
> the 875 only for data read from a SCSI device and DMAed to memory.
>
> Note that the controller may use optimized PCI transactions only if the
> cache line size is configured in its PCI device configuration space.
> Otherwise only normal PCI memory read and PCI memory write transactions
> will be used.
>
> Could you check if the cache line size is configured for your 875?
It's set to 0 :-(
> Let me imagine it is so. Btw, I may be wasting my time if it is not ...
> Then the 875 may also use PCI read multiple transactions and/or PCI read
> line transactions when reading data from memory. If the corruption is due
> to the use of these transactions, then the PCI-HOST bridges may well be the
> culprit, in my opinion.
>
> Anyway, since the sym53c8xx driver does not try to change the configured
> cache line size on PPC, I would suggest to try again the same tests with
> the cache line size set to zero for the 875. You may hack the driver code
> or the PPC pci code if needed, for example, for value zero to be written
> in the proper place in the PCI configuration space of the 875.
So this is not the case.
Any more clues? I want to try different tape drives as well, but so far the
first batch of old DDS drives I found at work seem to be no longer functional.
Let's fetch some other drives tomorrow :-)
BTW, I tried my good old 2.4.0-test1-ac10 kernel from June 2000, and it also
suffered from the same problem. Also note that I did read/write tests on the
tape drive when I just bought it and when I installed the Sym53c875 later, and
I never noticed the problem. So I'm still willing to believe it's a software
bug in recent(?) kernels...
Gr{oetje,eeting}s,
Geert
On Thu, 22 Mar 2001, Geert Uytterhoeven wrote:
> On Tue, 20 Mar 2001, Gérard Roudier wrote:
> > On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > > On Tue, 20 Mar 2001, Geert Uytterhoeven wrote:
> > > > On Mon, 19 Mar 2001, Jeff Garzik wrote:
> > > I did some more tests:
> > > - The problem also occurs when tarring up files from a disk on the Sym53c875.
> > > - The corrupted data always occurs at offset 32*x (the `+1' above was caused
> > > by hexdump, starting counting at 1).
> > > - The 32 bytes of corrupted data at offset 32*x are always a copy of the data
> > > at offset 32*x-10240.
> > > - Since 10240 is the default blocksize of tar (bug in tar?), I made a tarball
> > > on disk instead of on tape, but no corruption.
> > > - 32 is the size of a cacheline on PPC. Is there a missing cacheflush
> > > somewhere in the Sym53c875 driver? But then it should happen on disk as
> > > well?
>
> BTW, I tried my good old 2.4.0-test1-ac10 kernel from June 2000, and it also
> suffered from the same problem. Also note that I did read/write tests on the
> tape drive when I just bought it and when I installed the Sym53c875 later, and
> I never noticed the problem. So I'm still willing to believe it's a software
> bug in recent(?) kernels...
Status update:
- When I connect my DDS1 to the MESH, I see no corruption (as long as I get
no `lost arbitration' messages from the MESH driver. I never get those with
the disk BTW. Anyone who knows what needs to be done to make the MESH
driver recover from lost arbitration errors?). So the tape drive seems to
be fine.
- I wanted to try different tape drives, but all retired DDS drives I found
at work seem to be in a non-functional state. I tried 3 of them, without
any luck.
- I wanted to try a 2.2.x kernel, but linuxppc_2_2 (2.2.19-pre3) just says
`illegal instruction' and returns me to the OF prompt.
- Adding more eieio/syncs to the sym53c8xx driver doesn't help. In fact there
are already memory barriers where I'd expect them (as could be expected, of
course :-)
[...]
OK, I managed to compile an old 2.2.13 kernel from the PPC bk repository that
boots more or less (no video) on my box.
Surprise! So far no corruption!! Time to let Amanda make some dumps tonight :-)
So something broke the st/sym53c8xx combination on my box between 2.2.13 and
2.4.0-test1-ac10...
I'm still waiting for other reports of st/sym53c8xx on PPC under 2.4.x. BTW,
does it work on other big-endian platforms, like sparc?
Gr{oetje,eeting}s,
Geert
BTW, my 2.4.3-pre8 kernel just said
| sym53c875-0:0: ERROR (81:0) (3-21-0) (10/9d) @ (script 8a8:0b000000).
| sym53c875-0: script cmd = 11000000
| sym53c875-0: regdump: da 10 80 9d 47 10 00 0d 00 03 80 21 80 01 09 09 00 30 4e 00 08 ff ff ff.
| sym53c875-0-<0,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
during the boot process, and continued without problems. What does this mean?
Gr{oetje,eeting}s,
Geert
> I'm still waiting for other reports of st/sym53c8xx on PPC under
> 2.4.x. BTW,
> does it work on other big-endian platforms, like sparc?
I don't know if it is the same problem, but ...
I have a Motorola MVME5100 (PowerPC 750 based CPU) with a PCI mezzanine
based on the sym53c875 chip. I'm using the 2_5 kernel from fmslabs. The
first time I downloaded the kernel everything worked fine, but in a
subsequent update the sym53c8xx driver was changed and my board doesn't work
anymore. The driver hangs while downloading the SCSI scripts.
I'm not a SCSI driver expert, so I worked around the problem by installing
the old version of the driver.
Tom Rini told me that it happened when he merged some updates from
the 2_4 tree, so I think my problem is related to the latest updates to the
driver.
I hope this helps you.
Bye.
Stefano.
On Fri, 6 Apr 2001, Stefano Coluccini wrote:
> > I'm still waiting for other reports of st/sym53c8xx on PPC under
> > 2.4.x. BTW,
> > does it work on other big-endian platforms, like sparc?
>
> I don't know if it is the same problem, but ...
> I have a Motorola MVME5100 (PowerPC 750 based CPU) with a PCI mezzanine
> based on the sym53c875 chip. I'm using the 2_5 kernel from fmslabs. The
> first time I downloaded the kernel everything worked fine, but in a
> subsequent update the sym53c8xx driver was changed and my board doesn't work
> anymore. The driver hangs while downloading the SCSI scripts.
> I'm not a SCSI driver expert, so I worked around the problem by installing
> the old version of the driver.
> Tom Rini told me that it happened when he merged some updates from
> the 2_4 tree, so I think my problem is related to the latest updates to the
> driver.
This is a different problem. You have to do the equivalent of what
process_bridge_ranges()/pci_process_OF_bridge_ranges() (the function got
renamed recently) does for your machine. Else PCI memory space won't work.
Gr{oetje,eeting}s,
Geert
On Fri, 6 Apr 2001, Stefano Coluccini wrote:
> > I'm still waiting for other reports of st/sym53c8xx on PPC under
> > 2.4.x. BTW,
> > does it work on other big-endian platforms, like sparc?
>
> I don't know if it is the same problem, but ...
> I have a Motorola MVME5100 (PowerPC 750 based CPU) with a PCI mezzanine
> based on the sym53c875 chip. I'm using the 2_5 kernel from fmslabs. The
> first time I downloaded the kernel everything worked fine, but in a
> subsequent update the sym53c8xx driver was changed and my board doesn't work
> anymore. The driver hangs while downloading the SCSI scripts.
> I'm not a SCSI driver expert, so I worked around the problem by installing
> the old version of the driver.
> Tom Rini told me that it happened when he merged some updates from
> the 2_4 tree, so I think my problem is related to the latest updates to the
> driver.
> I hope this helps you.
> Bye.
> Stefano.
IMO, it might well be the Linux/PPC PCI interface that doesn't return
expected values.
1) The [sym|ncr]53c8xx need to know about BAR addresses as physical
address values as seen from the BUS. These values are used by the
SCSI SCRIPTS and _NOT_ by the CPU.
2) The pcidev structure returns cookies instead, that commonly are
BARs physical addresses as seen from CPU.
The recent change in the Symbios driver about point (1) is that the
driver now reads the BARs using the pci_read_config*() interface. If these
functions do not return the actual BAR values usable from the BUS for some
obscure reason, this may explain your problem.
The cookies contained in the pcidev structure are completely useless for
the driver and probably for any driver. They just have to be used to remap
memory BARs to CPU virtual addresses. Then the driver forgets about them.
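To make the distinction concrete, here is a minimal sketch of how a raw BAR
value, as read from configuration space, splits into flag bits and a bus
address. The struct and function names are hypothetical; the bit layout is the
standard PCI one. This is the bus-side view from point (1), not the cookie
from point (2).

```c
#include <stdint.h>

/* Decoded view of one 32-bit Base Address Register. */
struct bar_info {
    int is_io;          /* 1: I/O space, 0: memory space */
    int is_64bit;       /* memory BARs only: 64-bit decoder */
    int prefetchable;   /* memory BARs only */
    uint32_t bus_addr;  /* low 32 address bits as seen from the PCI bus */
};

struct bar_info decode_bar(uint32_t raw)
{
    struct bar_info b = {0, 0, 0, 0};

    if (raw & 0x1u) {
        /* I/O BAR: bit 0 set, bit 1 reserved, address in bits 31-2. */
        b.is_io = 1;
        b.bus_addr = raw & ~(uint32_t)0x3;
    } else {
        /* Memory BAR: bits 2-1 type, bit 3 prefetchable, address in 31-4. */
        b.is_64bit = ((raw >> 1) & 0x3u) == 0x2u;
        b.prefetchable = (int)((raw >> 3) & 0x1u);
        b.bus_addr = raw & ~(uint32_t)0xf;
    }
    return b;
}
```

The bus_addr field is the kind of value the SCSI SCRIPTS need. On a machine
where the host bridge translates addresses, it can differ from the physical
address the CPU uses for the same region.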
There are still some PPC PCI specific hacks in the sym53c8xx driver and it
has been reported to me that they can be removed. If the PPC PCI interface
is correct, then they should be removed without problems, IMO.
Here is a patch that removes the offending PPC PCI hacky area from the
driver (sym53c8xx_defs.h):
--- sym53c8xx_defs.h Fri Apr 6 16:23:48 2001
+++ sym53c8xx_defs.h.orig Sun Mar 4 13:54:11 2001
@@ -175,6 +175,9 @@
#define SCSI_NCR_IOMAPPED
#elif defined(__alpha__)
#define SCSI_NCR_IOMAPPED
+#elif defined(__powerpc__)
+#define SCSI_NCR_IOMAPPED
+#define SCSI_NCR_PCI_MEM_NOT_SUPPORTED
#elif defined(__sparc__)
#undef SCSI_NCR_IOMAPPED
#endif
-------------------- Cut Here ------------------
Regards,
Gérard.
On Thu, 5 Apr 2001, Geert Uytterhoeven wrote:
>
> BTW, my 2.4.3-pre8 kernel just said
>
> | sym53c875-0:0: ERROR (81:0) (3-21-0) (10/9d) @ (script 8a8:0b000000).
Illegal instruction detected.
> | sym53c875-0: script cmd = 11000000
> | sym53c875-0: regdump: da 10 80 9d 47 10 00 0d 00 03 80 21 80 01 09 09 00 30 4e 00 08 ff ff ff.
> | sym53c875-0-<0,*>: FAST-20 WIDE SCSI 40.0 MB/s (50.0 ns, offset 16)
>
> during the boot process, and continued without problems. What does this mean?
Looks extremely serious to me.
The SCRIPTS processor should be fetching CHMOV DSA relative when DATA_IN
instructions. This corresponds to opcode 0x11000000.
However, it seems to have fetched instruction 0x0b000000 which is a
MOVE ABSOLUTE WHEN STATUS PHASE.
In (3-21-0) we can see that the chip is expecting STATUS PHASE (3), but
the target is driving DATA IN phase (21 - the 1 indicates DATA IN phase).
In other words, the SCRIPTS processor seems to have fetched a bogus
instruction. The signaled 'illegal instruction detected' may be due to the
count of bytes to transfer being zero.
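The two opcodes discussed above can be taken apart with a small decoder. This
is a sketch only: the field layout is an assumption about the 53C8xx block
move format (phase in bits 26-24, byte count in bits 23-0), chosen because it
matches the two values quoted here, and the names are hypothetical.

```c
#include <stdint.h>

/* Fields of the first 32-bit word of a block move instruction. */
struct block_move {
    int type;            /* bits 31-30: 0 for a block move */
    int indirect;        /* bit 29 */
    int table_indirect;  /* bit 28: DSA relative addressing */
    int opcode_bit;      /* bit 27: set in 0x0b000000, clear in 0x11000000 */
    int phase;           /* bits 26-24: expected SCSI phase */
    uint32_t count;      /* bits 23-0: number of bytes to transfer */
};

struct block_move decode_block_move(uint32_t insn)
{
    struct block_move m;

    m.type           = (int)((insn >> 30) & 0x3u);
    m.indirect       = (int)((insn >> 29) & 0x1u);
    m.table_indirect = (int)((insn >> 28) & 0x1u);
    m.opcode_bit     = (int)((insn >> 27) & 0x1u);
    m.phase          = (int)((insn >> 24) & 0x7u);
    m.count          = insn & 0x00ffffffu;
    return m;
}

/* Phase names for the 3-bit phase field; 4 and 5 are not valid phases. */
const char *phase_name(int phase)
{
    static const char *const names[8] = {
        "DATA_OUT", "DATA_IN", "COMMAND", "STATUS",
        "ILLEGAL_4", "ILLEGAL_5", "MSG_OUT", "MSG_IN"
    };
    return names[phase & 0x7];
}
```

With this layout, 0x11000000 decodes to a DSA relative move expecting DATA_IN
(phase 1), and 0x0b000000 to an absolute move expecting STATUS (phase 3) with
a byte count of zero, which agrees with the (3-21-0) reading above.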
> Gr{oetje,eeting}s,
Gérard.
On Fri, 6 Apr 2001, Gérard Roudier wrote:
> Here is a patch that removes the offending PPC PCI hacky area from the
> driver (sym53c8xx_defs.h):
>
> --- sym53c8xx_defs.h Fri Apr 6 16:23:48 2001
> +++ sym53c8xx_defs.h.orig Sun Mar 4 13:54:11 2001
> @@ -175,6 +175,9 @@
> #define SCSI_NCR_IOMAPPED
> #elif defined(__alpha__)
> #define SCSI_NCR_IOMAPPED
> +#elif defined(__powerpc__)
> +#define SCSI_NCR_IOMAPPED
> +#define SCSI_NCR_PCI_MEM_NOT_SUPPORTED
> #elif defined(__sparc__)
> #undef SCSI_NCR_IOMAPPED
> #endif
> -------------------- Cut Here ------------------
The patch is obviously reversed. You just have to remove the 3 lines that
apply to powerpc using your preferred editor.
Btw, using the one you dislike the most will also do. :-)
Gérard.