2003-01-06 11:23:16

by Paul

Subject: Fwd: File system corruption

Hi,

I sent the following email regarding a suspected bug to the IDE maintainer
mentioned in the DOCs but haven't got a response.

Can anyone point me in the right direction here?

Thanks, Paul.

---------- Forwarded Message ----------
Subject: File system corruption
Date: Tue, 17 Dec 2002 22:27:26 +1000
From: Paul <[email protected]>
To: [email protected]


Hi Andre,

I hope I am emailing the right person :)

I am currently having a problem with a compact flash card corrupting its
FAT16 filesystem. The compact flash card is used in an IDE to compact flash
adaptor board. We manufacture the board and have used it without problems
in over 100 units so far. The board works perfectly under Windows 98/2000
and when used with other flash cards.

I have been able to determine that the corruption only appears with Sandisk
cards of a certain size that were manufactured after a certain date. I have
others on order from other manufacturers to test this further. I have tested
this in Linux 2.4.10, 2.4.18 and 2.4.20, all with the same result. I have
used Sandisk 32MB and 64MB cards in the past with no problems. The latest
batch of 32MB cards we received worked correctly; the 64MB batch did not. I
have listed the hdparm output for both 64MB cards:

Good Sandisk:

/dev/hdd:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 0 (off)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 490/8/32, sectors = 125152, start = 0

# /sbin/hdparm -i /dev/hdd

/dev/hdd:

Model=SanDisk SDCFB-64, FwRev=Vdg 1.23, SerialNo=06210224227
Config={ HardSect NotMFM Removeable DTR>10Mbs nonMagnetic }
RawCHS=490/8/32, TrkSize=0, SectSize=576, ECCbytes=4
BuffType=DualPort, BuffSize=1kB, MaxMultSect=1, MultSect=off
CurCHS=490/8/32, CurSects=-369098751, LBA=yes, LBAsects=125440
IORDY=no
PIO modes: pio0 pio1
DMA modes:

============================================
Bad Sandisk

# /sbin/hdparm /dev/hdd

/dev/hdd:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 0 (off)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 978/4/32, sectors = 125184, start = 0

# /sbin/hdparm -i /dev/hdd

/dev/hdd:

Model=SanDisk SDCFB-64, FwRev=Rev 3.03, SerialNo=X0409 20020924041900
Config={ HardSect NotMFM Removeable DTR>10Mbs nonMagnetic }
RawCHS=978/4/32, TrkSize=0, SectSize=512, ECCbytes=4
BuffType=DualPort, BuffSize=1kB, MaxMultSect=1, MultSect=off
CurCHS=978/4/32, CurSects=-385875967, LBA=yes, LBAsects=125184
IORDY=no
PIO modes: pio0 pio1
DMA modes:
===================================

You can see that the cards that cause corruption have a newer firmware
version (Rev 3.03) and that the reported CHS translation has changed. I
have checked that the BIOS and the kernel both report the same values as
hdparm, namely 978/4/32.

I have tried 12 different IDE adaptors with the same error. I have tried 2
different VIA Eden mainboards. I tried my home system with a different
Northbridge/Southbridge (VT82C693A/VT82C686A and VT8601/VT8231) all with the
same results.

To reproduce this problem I only need to mount any compact flash card from
the new batch (we have tried over 10 different cards), write a file of any
size (even less than 200 bytes), then unmount the drive. Calling "sync"
before unmounting makes no difference. When the drive is mounted again the
filesystem will be corrupted 100% of the time. I use md5sum to check the
file when the drive is mounted again (it fails every time).

I have an output from ./ver_linux below if that helps:

[root@paul scripts]# ./ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux paul.home.com.au 2.4.20 #1 Fri Dec 13 23:10:14 EST 2002 i686 unknown

Gnu C 2.96
Gnu make 3.79.1
binutils 2.10.91.0.2
util-linux 2.10s
mount 2.10r
modutils 2.4.2
e2fsprogs 1.19
PPP 2.4.0
isdn4k-utils 3.1pre1
Linux C Library 2.2.2
Dynamic linker (ldd) 2.2.2
Procps 2.0.7
Net-tools 1.57
Console-tools 0.3.3
Sh-utils 2.0
Modules Loaded ppp_async ppp_generic slhc mga agpgart autofs 8139too
mii ipchains usb-storage scsi_mod usb-uhci usbcore

The funny thing is that I can take this same flash card (or any card from
that batch) and the same IDE adaptor and it works perfectly under Windows
98/2000 on the same machine so I assume it is a kernel problem or filesystem
problem of some sort. Perhaps it's a bug in the compact flash that Windows
doesn't trigger?

Would you have any suggestions for what I can try next? Or could you point
me to the appropriate person to email (perhaps the vfat filesystem maintainer)?

Kind Regards,

Paul Krushka

-------------------------------------------------------


2003-01-06 14:13:16

by Alan

Subject: Re: Fwd: File system corruption

On Mon, 2003-01-06 at 11:38, Paul wrote:
> Hi,
>
> I sent the following email regarding a suspected bug to the IDE maintainer
> mentioned in the DOCs but haven't got a response.
>
> Can anyone point me in the right direction here?

Sandisk I think. Looking at the corruption pattern and actual disk
behaviour might be informative. It's possible the vendor has done
something silly like teach the firmware 'tricks' about FAT file
systems that depend on exact Windows behaviour I guess.

Might be interesting to see what it does given a totally not FAT
environment (eg fill the disk start to end with each sector filled
with its sector number repeatedly) and see what comes out the other
end.

Alan

2003-01-06 14:23:08

by Arjan van de Ven

Subject: Re: Fwd: File system corruption

On Mon, 2003-01-06 at 16:06, Alan Cox wrote:
> On Mon, 2003-01-06 at 11:38, Paul wrote:
> > Hi,
> >
> > I sent the following email regarding a suspected bug to the IDE maintainer
> > mentioned in the DOCs but haven't got a response.
> >
> > Can anyone point me in the right direction here?
>
> Sandisk I think.

for sandisk you want dma OFF. hard. always.




2003-01-07 12:00:22

by Rogier Wolff

Subject: Re: Fwd: File system corruption

On Mon, Jan 06, 2003 at 03:06:20PM +0000, Alan Cox wrote:
> Might be interesting to see what it does given a totally not FAT
> environment (eg fill the disk start to end with each sector filled
> with its sector number repeatedly) and see what comes out the other
> end.

How about the following program to do this.

Roger.


/* Written By [email protected]
 *
 * This program is distributed under GPL. */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

/* Fill stdout with "sectors" that each repeat their own sector number.
 * -a: ASCII numbers, -b: binary long longs (default), -s N: sector size. */
int main (int argc, char **argv)
{
    int i;
    int ascii = 0;
    int size = 512;
    long long secno;
    char *buf;
    int s;

    for (i=1;i<argc;i++) {
        if (strcmp (argv[i], "-a") == 0) {
            ascii = 1;
        }
        if (strcmp (argv[i], "-b") == 0) {
            ascii = 0;
        }

        if (strncmp (argv[i], "-s", 2) == 0) {
            if (strlen (argv[i]) > 2)
                size = atoi (argv[i]+2);
            else if (i+1 < argc)
                size = atoi (argv[++i]);
            else {
                /* -s given as the last argument, with no value. */
                fprintf (stderr, "-s needs a size.\n");
                exit (1);
            }
        }
    }

    if (size <= 0) {
        fprintf (stderr, "Bad sector size.\n");
        exit (1);
    }

    /* Slack: the last sprintf or long long store may run past size. */
    buf = malloc (size + 32);

    if (!buf) {
        fprintf (stderr, "Can't allocate buffer.\n");
        exit (1);
    }

    secno = 0;
    while (1) {
        if (ascii) {
            sprintf (buf, "%lld\n", secno);
            s = strlen (buf);
            for (i=s;i<size;i+=s)
                sprintf (buf+i, "%lld\n", secno);
        } else {
            for (i=0;i<size;i+=sizeof (long long))
                *(long long *)(buf+i) = secno;
        }
        /* Stop on error, or when nothing more can be written. */
        if (write (1, buf, size) <= 0)
            break;
        secno++;
    }
    exit (0);
}

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2003-01-08 11:20:17

by Paul

Subject: Re: Fwd: File system corruption

Roger,

Thanks for the programme Roger!

I'm not sure if I did this right but.....

I ran the programme as follows: (I called it sandisktest :)

# ./sandisktest -b | dd of=/dev/hdc

then I dd'd the image of the flash disk to my local disk and viewed the file
through midnight commander.

What I have found is that just after the start of a sector, usually 43 to 45
bytes, 6 bytes are skipped and the sequence starts again. This continues
until the next sector starts, where the sequence corrects. This appears to
happen every 65536 bytes or some multiple of 65536. It may skip three blocks
of 65536 and then corrupt on the next block of 65536 bytes.

I would greatly appreciate some other ideas to try, I'm not game to start
hacking the kernel code quite yet :)

Paul.

BTW: I also tried an ext2 filesystem and it corrupted files just as the
FAT16 filesystem had.


2003-01-08 13:23:07

by Paul

Subject: Re: Fwd: File system corruption

I have put the gzipped image here if anyone wants to take a peek :)
It is 348k in size and unzips to ~64MB.

http://home.iprimus.com.au/krushka/img.gz


2003-01-08 13:21:10

by Alan

Subject: Re: Fwd: File system corruption

On Wed, 2003-01-08 at 11:35, Paul wrote:
> What I have found is that just after the start of a sector, usually 43 to 45
> bytes, 6 bytes are skipped and the sequence starts again. This continues
> until the next sector starts, where the sequence corrects. This appears to
> happen every 65536 bytes or some multiple of 65536. It may skip three blocks
> of 65536 and then corrupt on the next block of 65536 bytes.

Ok that I'm afraid bears no resemblance to anything the software side
does (we write in chunks but we do single PIO block transfers of each
sector).

> I would greatly appreciate some other ideas to try, I'm not game to start
> hacking the kernel code quite yet :)

Two things

1. Tweak the code to write 1K, fsync, write 1K, fsync
2. Repeat the above in 512 byte chunks.

That tests the way the device responds to writes. You can then try different
bigger sizes. If the 512 byte one corrupts and the 1K one doesn't, that is
the only thing I can think of that would fit the pattern.

2003-01-08 13:46:51

by Rogier Wolff

Subject: Re: Fwd: File system corruption

On Wed, Jan 08, 2003 at 02:15:06PM +0000, Alan Cox wrote:
> On Wed, 2003-01-08 at 11:35, Paul wrote:
> > What I have found is that just after the start of a sector, usually 43 to 45
> > bytes, 6 bytes are skipped and the sequence starts again. This continues
> > until the next sector starts, where the sequence corrects. This appears to
> > happen every 65536 bytes or some multiple of 65536. It may skip three blocks
> > of 65536 and then corrupt on the next block of 65536 bytes.
>
> Ok that I'm afraid bears no resemblance to anything the software side
> does (we write in chunks but we do single PIO block transfers of each
> sector).

After examining the resulting image, Paul has a "clock" line to his
flash device that is a bit noisy. This occasionally causes one
16-bit entity to be clocked into the device twice.

To detect this going wrong, we could (but only as a configurable
option), write 255 16-bit words to the device (remember this is PIO!),
check that DRQ is still active and only then write the last word.
(at which point DRQ should go inactive).

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2003-01-31 11:54:30

by Paul

Subject: Re: Fwd: File system corruption

I have finally received a new batch of CF cards (Mittoni brand) and these do
not show the data corruption that the Sandisk cards did. So this fault is
specific to one batch of Sandisk cards (so far).

Rogier has written below what might be causing this and I would like to try
fixing it...gulp! Can anyone spare a minute to point out the function I need
to change to test the suggested "fix"? I've looked around a bit and the
closest I got was finding an "outsl" function, but I can't find where the
actual data is written to the device...

On Wed, 8 Jan 2003 11:55 pm, Rogier Wolff wrote:
> On Wed, Jan 08, 2003 at 02:15:06PM +0000, Alan Cox wrote:
> > On Wed, 2003-01-08 at 11:35, Paul wrote:
> > > What I have found is that just after the start of a sector, usually 43
> > > to 45 bytes, 6 bytes are skipped and the sequence starts again. This
> > > continues until the next sector starts, where the sequence corrects.
> > > This appears to happen every 65536 bytes or some multiple of 65536. It
> > > may skip three blocks of 65536 and then corrupt on the next block of
> > > 65536 bytes.
> >
> > Ok that I'm afraid bears no resemblance to anything the software side
> > does (we write in chunks but we do single PIO block transfers of each
> > sector).
>
> After examining the resulting image, Paul has a "clock" line to his
> flash device that is a bit noisy. This occasionally causes one
> 16-bit entity to be clocked into the device twice.
>
> To detect this going wrong, we could (but only as a configurable
> option), write 255 16-bit words to the device (remember this is PIO!),
> check that DRQ is still active and only then write the last word.
> (at which point DRQ should go inactive).
>
> Roger.