2008-12-25 18:04:21

by Justin Piszcz

Subject: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw raid[156]

I have many backups of data from the late 1990s onward, each of which has
been combined into a tar file and then encrypted with gpg; each is about 4
gigabytes so I can store them on DVD. For quicker access, I also kept
them on software RAID. In the past, I have always kept them on either SW
RAID5 or SW RAID6 using XFS. None of the machines in question that
have/had the data ever exhibited any bad memory, and all pass memtest OK.
What is worrisome, however, is that on occasion the mismatch_cnt would be
some crazy number on different machines, 2000 or 5000. I checked all of
the disks with SMART and they appear to be OK, and when I run a check and
repair against the RAID the count will usually go back down to zero.
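
The check and repair here are just the md sysfs knobs, roughly like this
(md3 is only an example device; run as root):

# start a background consistency pass, then read the result when it completes
echo check > /sys/block/md3/md/sync_action
cat /sys/block/md3/md/mismatch_cnt
# rewrite the mismatched stripes; note md cannot tell which copy was the good one
echo repair > /sys/block/md3/md/sync_action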

However, in reality, are these "mismatches" really an indication of silent
data corruption? In my case, I have 199 of these backup files, and when I
tried to decrypt the data, I got these errors on several volumes:

========================================================================

gpg: block_filter 0x1c8e200: read error (size=10266,a->size=10266)
gpg: mdc_packet with invalid encoding
gpg: decryption failed: invalid packet
gpg: block_filter: pending bytes!
gpg: fatal: zlib inflate problem: invalid stored block lengths
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid block type
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid code lengths set
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid block type
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid block type
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: [don't know]: invalid packet (ctb=1f)
gpg: [don't know]: invalid packet (ctb=2d)
gpg: mdc_packet with invalid encoding
gpg: decryption failed: invalid packet
gpg: [don't know]: invalid packet (ctb=30)
gpg: fatal: zlib inflate problem: invalid stored block lengths
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid stored block lengths
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid code lengths set
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid stored block lengths
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid distance code
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid distance code
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid code lengths set
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768
gpg: fatal: zlib inflate problem: invalid block type
secmem usage: 2368/2496 bytes in 6/7 blocks of pool 3104/32768

========================================================================

This resulted in partial-sized, corrupted tarballs. I am decompressing each
of the 199 backups now to isolate which are affected, so I can come up with
a better count of how many files were hit.
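
(The pass over the 199 archives is just a loop roughly like the following;
the file names and exact commands here are illustrative:)

for f in backup*.tar.gpg; do
    echo "$(date): Decompressing ${f%.gpg}"
    # decrypt, then list the tar contents; a failure in either step marks the archive bad
    gpg --quiet --decrypt "$f" > "${f%.gpg}" && tar tf "${f%.gpg}" > /dev/null || echo "BAD: $f"
done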

The total amount of data is:
832G in 199 files (4.18gig/file average)

========================================================================

Bad files (out of the 199):

Thu Dec 25 05:44:16 EST 2008: Decompressing backup038.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 08:18:00 EST 2008: Decompressing backup103.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 08:34:39 EST 2008: Decompressing backup111.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 08:43:26 EST 2008: Decompressing backup115.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 08:54:39 EST 2008: Decompressing backup120.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 09:36:02 EST 2008: Decompressing backup137.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 09:36:39 EST 2008: Decompressing backup138.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 09:52:06 EST 2008: Decompressing backup145.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 10:10:14 EST 2008: Decompressing backup153.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 10:10:32 EST 2008: Decompressing backup154.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 10:36:50 EST 2008: Decompressing backup166.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 10:40:19 EST 2008: Decompressing backup168.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 11:20:00 EST 2008: Decompressing backup181.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 11:39:11 EST 2008: Decompressing backup187.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Thu Dec 25 12:05:27 EST 2008: Decompressing backup194.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now



========================================================================

Example of one file:
-rw-r--r-- 1 root root 40M 2008-12-24 20:08 backup038.tar
-rw-r--r-- 1 root root 4.3G 2008-12-23 17:45 backup038.tar.gpg

========================================================================

I have restored the data from DVD and confirmed that the copy which had
resided on disk must have suffered a bit-flip/corruption:

$ md5sum backup038.tar.gpg
9958813aa22e4307f2101f87b8820bff backup038.tar.gpg (from dvd)

$ md5sum backup038.tar.gpg
9437138a01fc2429ca2131f6a10295b5 backup038.tar.gpg (from disk)

========================================================================

By splitting these files up, we can isolate how many corruptions occurred in
the file:

dvd$ split -b 100M backup038.tar.gpg
hdd$ split -b 100M backup038.tar.gpg

$ diff dvd.md5 ../hdd/hdd.md5
1c1
< 9dfcab90fd6590705ba4293994f10a85 xaa
---
> cce57047ac869ef3e603d4d6bd3579e9 xaa
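
(The dvd.md5/hdd.md5 lists being diffed above are just md5sums of the split
chunks, i.e. roughly:)

dvd$ md5sum x?? > dvd.md5
hdd$ md5sum x?? > hdd.md5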

========================================================================

Dig further:
dvd$ mv xaa dvd.xaa
hdd$ mv xaa hdd.xaa

$ split -b 1M dvd.xaa
$ split -b 1M hdd.xaa

$ diff dvd.md5 ../../hdd/xaa_split/hdd.md5
40c40
< 439e4291ea67240c0b083abe5f041319 xbn
---
> 95e548284fa78f750b50f3e6a8e1b847 xbn

========================================================================

Further..
$ split -a 100 -b 1k dvd.xbn
$ split -a 100 -b 1k hdd.xbn

$ diff dvd.md5 ../../../hdd/xaa_split/xbn_split/hdd.md5 |wc
132 392 17960

========================================================================

Quite a big 'chunk' of the file suffered corruption.

$ diff dvd.md5 ../../../hdd/xaa_split/xbn_split/hdd.md5 | awk '{print $3}' | xargs -n1 ls | xargs du -ach
520K total

========================================================================

dvd$ md5sum one_part_with_corruption
c131c2f7dea3c316cbcd2c26360e4b03 one_part_with_corruption
hdd$ md5sum one_part_with_corruption
35968f907c5e6f986d9c5a171dd0a7ac one_part_with_corruption

Taking an octal dump and checking the ASCII characters:

dvd:
< 0000000 034 ? ? 201 f ? ? ? e D 221 ? ? \b ? $
< 0000020 > p M | 212 236 034 ? \b ? z ? ? \a ? i
< 0000040 212 \f ? ] 200 l ? 234 Y 224 ? ? ? ? ?
< 0000060 233 } G ? W : 236 b ? } 215 ) 224 R R ?
< 0000100 237 ? ? 227 ? 1 ? O ? ? 003 177 $ e 032 ?
< 0000120 ! ^ j ? 235 8 5 a J ? ? < ? H 223 ?

hdd:
> 0000000 \r ? ? 033 034 ? ? 201 f ? ? ? e D 221 ?
> 0000020 ? \b ? $ > p M | 212 236 034 ? \b ? z ?
> 0000040 ? \a ? i 212 \f ? ] 200 l ? 234 Y 224 ? ?
> 0000060 ? ? ? 233 } G ? W : 236 b ? } 215 )
> 0000100 224 R R ? 237 ? ? 227 ? 1 ? O ? ? 003 177
> 0000120 $ e 032 ? ! ^ j ? 235 8 5 a J ? ? <

The hdd image seems 'shifted'; see the following sequence (201..f..R..c) on
the first line.

========================================================================

So the question then becomes: where did the shift begin? Above we saw that
the initial error began in xaa (100M) -> xbn (1M).

dvd$ md5sum xbm
2524788769acf2b0f98346f488a58441 xbm
hdd$ md5sum xbm
2524788769acf2b0f98346f488a58441 xbm

So then we want to take an od of xbn from each copy and diff the two dumps.
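
Roughly (the .od file names are only for illustration):

dvd$ od -c xbn > dvd.od
hdd$ od -c xbn > hdd.od
$ diff dvd.od hdd.od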

It starts at the following segment:

dvd:
< 2006000 ? r 222 235 ? z 8 d 035 ? 230 # 207 ? ? u

hdd:
> 2006000 ? r 222 235 ? r 222 235 ? z 8 d 035 ? 230 #

========================================================================

So let's look at it a few lines before as well:

dvd:
2005720 ' 0 ? 001 ? ? [ ( 037 024 . 211 M 023 001 ?
2005740 V 021 \ ? Y ? ? 207 206 ? n ? 234 z ? 034
2005760 ? 206 002 + w ? 200 ? 3 ? ? K $ ? e n
2006000 ? r 222 235 ? z 8 d 035 ? 230 # 207 ? ? u
2006020 R ? ? 025 ? ? 022 ? 024 ? 200 Q 217 ? ? ?
2006040 203 ? ? c 3 ? 207 # ] ? ? H ( ? 5 207
2006060 ? C C a @ \r 200 ? ? ? \r 204 230 U ? O

hdd:
2005720 ' 0 ? 001 ? ? [ ( 037 024 . 211 M 023 001 ?
2005740 V 021 \ ? Y ? ? 207 206 ? n ? 234 z ? 034
2005760 ? 206 002 + w ? 200 ? 3 ? ? K $ ? e n
2006000 ? r 222 235 ? r 222 235 ? z 8 d 035 ? 230 #
2006020 207 ? ? u R ? ? 025 ? ? 022 ? 024 ? 200 Q
2006040 217 ? ? ? 203 ? ? c 3 ? 207 # ] ? ? H
2006060 ( ? 5 207 ? C C a @ \r 200 ? ? ? \r 204

The same sequence still seems to exist after the shift, starting at the 5th
character on line 2006000: the '? r 222 235' group appears twice in the hdd
copy, shifting everything after it.

========================================================================

On the linux-raid mailing list, this issue is often discussed:
http://www.mail-archive.com/[email protected]/msg07541.html

The usual advice seems to be to repair and keep on trucking along, which is
probably fine for a filesystem with hundreds/thousands of files; most
likely you would never notice the corruption. With gpg, however, the files
obviously cannot tolerate any corruption. It was only when I tried to
extract the data that I noticed this problem.

========================================================================

== server 1 (965 chipset) ==
Root filesystem (Raid-1):
Sat Nov 15 11:39:03 EST 2008: cat /sys/block/md2/md/mismatch_cnt
Sat Nov 15 11:39:03 EST 2008: 128

Storage array (Raid-6):
Wed Nov 26 10:14:21 EST 2008: cat /sys/block/md3/md/mismatch_cnt
Wed Nov 26 10:14:21 EST 2008: 1208

== server 2 (p35-chipset) ==
Swap (Raid-1): I understand this is 'normal' for swap:
Fri Jan 18 22:51:47 EST 2008: cat /sys/block/md0/md/mismatch_cnt
Fri Jan 18 22:51:47 EST 2008: 4096
Fri May 30 21:00:07 EDT 2008: cat /sys/block/md0/md/mismatch_cnt
Fri May 30 21:00:07 EDT 2008: 896
Fri Nov 14 23:30:10 EST 2008: cat /sys/block/md0/md/mismatch_cnt
Fri Nov 14 23:30:10 EST 2008: 384

== server 3 (also 965 chipset) ==
Root filesystem (Raid-1):
Fri May 18 20:50:05 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Fri May 18 20:50:05 EDT 2007: 128
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 128

Swap (Raid-1): Again, normal but a very high mismatch_cnt:
Fri Oct 5 20:50:05 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Fri Oct 5 20:50:05 EDT 2007: 27904

Storage array (RAID-5)
Fri Apr 4 22:00:09 EDT 2008: cat /sys/block/md3/md/mismatch_cnt
Fri Apr 4 22:00:09 EDT 2008: 512
Fri May 23 22:00:10 EDT 2008: cat /sys/block/md3/md/mismatch_cnt
Fri May 23 22:00:10 EDT 2008: 256

== server 4 (intel 875 chipset, raid-5) ==
# grep mismatches\ found *
daemon.log:Jan 3 11:27:27 p26 mdadm: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11896
daemon.log:Nov 14 12:25:59 p26 mdadm[1956]: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11664
daemon.log:Nov 14 18:19:01 p26 mdadm[1956]: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11664
syslog:Jan 3 11:27:27 p26 mdadm: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11896
syslog:Nov 14 12:25:59 p26 mdadm[1956]: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11664
syslog:Nov 14 18:19:01 p26 mdadm[1956]: RebuildFinished event detected on md device /dev/md0, component device mismatches found: 11664

========================================================================

Options/alternatives for maintaining data integrity?

1. Obviously DVD and/or LTO tape (multiple copies of data at rest)
http://epoka.dk/media/Linear_Tape_Open_(LTO)_Ultrium_Data_Cartridges.pdf
-> For LTO-2/LTO-3:
-> Uncorrected bit error rate: 1x10^-17 (both)
-> Undetected bit error rate: 1x10^-27 (both)

2. ZFS, but only runs in user-space in Linux.

3. Keep an md5sum for each file on the filesystem and run daily checks?
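
For #3, something roughly like this from cron would do (the paths here are
only an example):

# build/refresh the manifest after each legitimate change to the files
find /backups -type f -name '*.gpg' -print0 | xargs -0 md5sum > /backups/MANIFEST.md5

# daily cron job: --quiet prints only the files whose checksum no longer matches
md5sum -c --quiet /backups/MANIFEST.md5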

========================================================================

Other options?

How do others maintain data integrity? Just not worry about it until you have
to and rely on backups... or?

Justin.


2008-12-26 02:43:54

by David Lethe

Subject: RE: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw raid[156]

> -----Original Message-----
> From: [email protected] [mailto:linux-raid-
> [email protected]] On Behalf Of Justin Piszcz
> Sent: Thursday, December 25, 2008 12:04 PM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; Alan
> Piszcz
> Subject: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw
> raid[156]
>
> I have many backups of data from the late 1990s onward, each of which has
> been combined into a tar file and then encrypted with gpg; each is about 4
> gigabytes so I can store them on DVD. For quicker access, I also kept
> them on software RAID. In the past, I have always kept them on either SW
> RAID5 or SW RAID6 using XFS. None of the machines in question that
> have/had the data ever exhibited any bad memory, and all pass memtest OK.
> What is worrisome, however, is that on occasion the mismatch_cnt would be
> some crazy number on different machines, 2000 or 5000. I checked all of
> the disks with SMART and they appear to be OK, and when I run a check and
> repair against the RAID the count will usually go back down to zero.
>
> [...]
>
> Other options?
>
> How do others maintain data integrity? Just not worry about it until you
> have to and rely on backups... or?
>
> Justin.

4GB files using gpg and tar in the '90s?
I know gpg had 2GB file-related bugs as late as 2005 that caused corruption, and
there were a heck of a lot of 2GB-related bugs in the 2.2 and 2.4 kernels which you
must have been running back then. You are also using later versions of these
programs on the new systems, and I'd be willing to bet they compound the problem by
assuming there was no corruption to begin with.

I use:
- gzipped tar archives, but I gzip the individual files rather than the tarball. That way
any compression-related bugs are limited to a single file (rough sketch below). I copy them to DVDs.

- For online/nearline, I now use ZFS, but on a native Solaris system that functions as my primary
NFS/CIFS/iSCSI server with a ZFS software-RAID-based file system. I am profoundly impressed with
it, and when they release the deduplication enhancement for ZFS, I'll adopt it and won't have to buy
any more DVDs, except for offsite archiving purposes.
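
A rough sketch of that first layout (paths are illustrative; the point is
that each file carries its own gzip CRC, so a flipped bit only loses that
one member):

# compress every file individually, then tar up the already-compressed tree
find ./backup_set -type f ! -name '*.gz' -exec gzip -9 {} +
tar -cf backup_set.tar ./backup_set    # the tarball itself stays uncompressed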

David

2008-12-26 09:07:41

by Justin Piszcz

Subject: RE: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw raid[156]



On Thu, 25 Dec 2008, David Lethe wrote:

>> -----Original Message-----
>> From: [email protected] [mailto:linux-raid-
>>
>> Other options?
>>
>> How do others maintain data integrity? Just not worry about it until you
>> have to and rely on backups... or?
>>
>> Justin.
>
> 4GB files using gpg and tar in the '90s?
Sorry, to clarify: they were ~650-700MiB tars (from the late 90s, when CDs
were cheap and the 2GiB limit applied), which I later combined into ~4GiB
files. To consolidate, I moved them to 4GiB DVDs, so I tarred them together
and then ran gpg on top of that.

> I know gpg had 2GB file-related bugs as late as 2005 that caused corruption, and
> there was a heck of a lot of 2GB-related bugs in 2.2 and 2.4 kernels which you
> must have been running back then. You are also using later versions of these
> programs on the new systems, and I'd be willing to bet they compound the problem by
> assuming there was no corruption to begin with.
Once I restored the data from the DVDs, I was able to recover *all* of it
successfully.

>
> I use:
> - gzipped tar archives, but I gzip the individual files, rather than the tarball. That way
> any compression-related bugs are limited to a single file. I copy them to DVDs.
That works as well, but are they just your regular files gzipped, with no
encryption?

>
> - For online/nearline, I now use ZFS, but on a native Solaris system that functions as my primary
> NFS/CIFS/iSCSI server with a ZFS software-RAID-based file system. I am profoundly impressed with
> it, and when they release the deduplication enhancement for ZFS, I'll adopt it and won't have to buy
> any more DVDs, except for offsite archiving purposes.
Wow, I did not know ZFS had plans for de-dupe!! I will have to look into this,
thanks for the info.

Justin.

2008-12-30 14:20:43

by Matt Garman

Subject: Re: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw raid[156]

On Thu, Dec 25, 2008 at 01:04:07PM -0500, Justin Piszcz wrote:
> Options/alternatives for maintaining data integrity?
>
> 1. Obviously DVD and/or LTO tape (multiple copies of data at rest)
> http://epoka.dk/media/Linear_Tape_Open_(LTO)_Ultrium_Data_Cartridges.pdf
> -> For LTO-2/LTO-3:
> -> Uncorrected bit error rate: 1x10^-17 (both)
> -> Undetected bit error rate: 1x10^-27 (both)
>
> 2. ZFS, but only runs in user-space in Linux.
>
> 3. Keep an md5sum for each file on the filesystem and run daily
> checks?

I've recently started using par2[1][2] on backed-up files.

I just started doing this, so I don't know how it will work in the
long run. It can't hurt, anyway, other than taking up more space.
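
Roughly, with par2cmdline (-r10 asks for about 10% recovery data; the file
name is only an example):

par2 create -r10 backup038.tar.gpg     # writes backup038.tar.gpg.par2 plus recovery volumes
par2 verify backup038.tar.gpg.par2     # later: detect corruption
par2 repair backup038.tar.gpg.par2     # and rebuild damaged blocks from the recovery data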

[1] http://www.par2.net/
[2] http://parchive.sourceforge.net/

2008-12-31 11:39:40

by Mario 'BitKoenig' Holbe

Subject: Re: mismatch_cnt, random bitflips, silent corruption(?), mdadm/sw raid[156]

Justin Piszcz <[email protected]> wrote:
> Options/alternatives for maintaining data integrity?
> 3. Keep an md5sum for each file on the filesystem and run daily checks?

I've been doing this for about 6 years now with all of my files, and I have
never experienced any silent corruption.
I also regularly see MD mismatch_cnt effects on RAID1 with ext2/3
filesystems under heavy file removal/creation load (even before MD
implemented its self-check features - back then I cmp-ed the mirrors and
corrected them manually), and they all seem to have been related to
metadata, i.e. inodes.


regards
Mario
--
[mod_nessus for iauth]
<delta> "scanning your system...found depreciated OS...found
hole...installing new OS...please reboot and reconnect now"