2011-03-29 02:59:12

by Daniel Taylor

[permalink] [raw]
Subject: breaking ext4 to test recovery

I would like to be able to break our ext4 file system
(specifically corrupt the journal) to be sure that we
can automatically notice the problem and attempt an
autonomous fix.

dumpe2fs tells me the inode, but not, that I can see, the
blocks where the journal exists (for "dd"ing junk to it).

Is there any debug tool that would let me deliberately
break the file system (at least, trash the journal)?

If not, is there a hint for figuring out the block(s) of
the journal so I can stomp it?

The kernel is in an embedded machine, so it's a little old
2.6.32.11 and e2fsprogs/libs 1.41.12-2 (Lenny)

Dan Taylor
Sr. Staff Engineer
WD Branded Products
949.672.7761


2011-03-29 03:10:30

by Tao Ma

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 03/29/2011 10:45 AM, Daniel Taylor wrote:
> I would like to be able to break our ext4 file system
> (specifically corrupt the journal) to be sure that we
> can automatically notice the problem and attempt an
> autonomous fix.
>
> dumpe2fs tells me the inode, but not, that I can see, the
> blocks where the journal exists (for "dd"ing junk to it).
Yeah, AFAICS, you can corrupt it with dd.
As for the journal, the journal file normally uses inode number 8.
So use
debugfs -R 'stat <8>' /dev/sdx
and you will get the on-disk layout of your journal.
On my box, it looks like this:

Inode: 8 Type: regular Mode: 0600 Flags: 0x80000
Generation: 0 Version: 0x00000000:00000000
User: 0 Group: 0 Size: 33554432
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 65536
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4d86f9ad:00000000 -- Mon Mar 21 15:09:33 2011
atime: 0x4d86f9ad:00000000 -- Mon Mar 21 15:09:33 2011
mtime: 0x4d86f9ad:00000000 -- Mon Mar 21 15:09:33 2011
crtime: 0x4d86f9ad:00000000 -- Mon Mar 21 15:09:33 2011
Size of extra inode fields: 28
EXTENTS:
(0-8191): 131072-139263

So there you have the journal's physical block range:
131072-139263.
Now corrupt those blocks as you wish with dd. ;)
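
For example, a minimal sketch (the device name, block numbers, and the
4KB block size are taken from the layout above; substitute your own
values from debugfs):

  # re-check the journal layout (the journal is inode 8)
  debugfs -R 'stat <8>' /dev/sdx

  # overwrite a few fs blocks inside that extent with random junk
  # (fs block 135000 falls inside 131072-139263; bs matches the 4KB block size)
  dd if=/dev/urandom of=/dev/sdx bs=4096 seek=135000 count=4 conv=notrunc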

Regards,
Tao

2011-03-29 13:50:20

by Eric Sandeen

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 3/28/11 9:45 PM, Daniel Taylor wrote:
> I would like to be able to break our ext4 file system
> (specifically corrupt the journal) to be sure that we
> can automatically notice the problem and attempt an
> autonomous fix.
>
> dumpe2fs tells me the inode, but not, that I can see, the
> blocks where the journal exists (for "dd"ing junk to it).
>
> Is there any debug tool that would let me deliberately
> break the file system (at least, trash the journal)?
>
> If not, is there a hint for figuring out the block(s) of
> the journal so I can stomp it?
>
> The kernel is in an embedded machine, so it's a little old
> 2.6.32.11 and e2fsprogs/libs 1.41.12-2 (Lenny)

As Tao Ma said, you can stat <8> in debugfs to see the journal
blocks.

Another tool which can be useful for this sort of thing is
fsfuzzer. It writes garbage; using dd to write zeros actually
might be "nice" corruption.

But are you trying to test in-kernel recovery, or e2fsck, after
you corrupt the journal? Or both?

I assume you'd start with a filesystem with a dirty log,
corrupt that log, and then what, fsck it, or try to mount it?

How are you generating your fs w/ dirty log?

(xfs has an ioctl to abruptly "stop" the fs as if it had crashed;
that would be very useful in extN as well).

Another thing which could use lots more testing in the wild is
simple journal recovery: nothing is corrupted, but the drive got
unplugged or the system lost power while the fs was under load.
Then see whether a "mount; umount; fsck" sequence and/or a
"fsck; mount; umount; fsck" sequence finds any errors.

(the former will test in-kernel log recovery, the latter will test
log recovery in e2fsck).
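
Roughly, as a sketch (the device and mount point are placeholders, and
the dirty log comes from cutting power or pulling the drive while it is
under load):

  # (a) kernel replays the log at mount time, e2fsck just verifies afterwards
  mount /dev/sdx /mnt && umount /mnt
  e2fsck -fn /dev/sdx      # read-only check; should report a clean fs

  # (b) e2fsck replays the log itself, then mount/umount and re-check
  e2fsck -fp /dev/sdx
  mount /dev/sdx /mnt && umount /mnt
  e2fsck -fn /dev/sdx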

-Eric

> Dan Taylor
> Sr. Staff Engineer
> WD Branded Products
> 949.672.7761


2011-03-29 14:33:13

by Rogier Wolff

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
> Another tool which can be useful for this sort of thing is
> fsfuzzer. It writes garbage; using dd to write zeros actually
> might be "nice" corruption.

Besides writing blocks of "random data", you could write blocks with a
small percentage of bits (or bytes) set to non-zero, or just toggle a
configurable number of bits (or bytes). This is slightly more devious
than plain "random data".

If you try to verify the integrity of a block full of random data, you
can quickly determine that it is completely bogus (I don't think e2fsck
exploits this yet, as I've seen it get this wrong).

If you have an indirect block, and it contains:

00000 72 6f 6f 74 3a 78 3a 30 3a 30 3a 72 6f 6f 74 3a root:x:0:0:root:
00010 2f 72 6f 6f 74 3a 2f 62 69 6e 2f 62 61 73 68 0a /root:/bin/bash.
00020 64 61 65 6d 6f 6e 3a 78 3a 31 3a 31 3a 64 61 65 daemon:x:1:1:dae
00030 6d 6f 6e 3a 2f 75 73 72 2f 73 62 69 6e 3a 2f 62 mon:/usr/sbin:/b
00040 69 6e 2f 73 68 0a 62 69 6e 3a 78 3a 32 3a 32 3a in/sh.bin:x:2:2:
00050 62 69 6e 3a 2f 62 69 6e 3a 2f 62 69 6e 2f 73 68 bin:/bin:/bin/sh
00060 0a 73 79 73 3a 78 3a 33 3a 33 3a 73 79 73 3a 2f .sys:x:3:3:sys:/
00070 64 65 76 3a 2f 62 69 6e 2f 73 68 0a 73 79 6e 63 dev:/bin/sh.sync
00080 3a 78 3a 34 3a 36 35 35 33 34 3a 73 79 6e 63 3a :x:4:65534:sync:
00090 2f 62 69 6e 3a 2f 62 69 6e 2f 73 79 6e 63 0a 67 /bin:/bin/sync.g

You can see that the block numbers that are represented here are all
bad. In this case, one of the options should be to discard the whole
indirect block. If you happen to find a few "valid" block numbers
here, they are likely to be bogus. It is counterproductive to check
those for duplicate allocation, or to mark them as used if they happen
to be free.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-03-29 17:34:11

by Greg Freemyer

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On Tue, Mar 29, 2011 at 10:33 AM, Rogier Wolff <[email protected]> wrote:
> On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
>> Another tool which can be useful for this sort of thing is
>> fsfuzzer. It writes garbage; using dd to write zeros actually
>> might be "nice" corruption.
>
> Besides writing blocks of "random data", you could write blocks with a
> small percentage of bits (byte) set to non-zero, or just toggle a
> configurable number of bits (bytes). This is slightly more devious than just
> "random data".

I don't know what exactly is being tested, but "hdparm
--make-bad-sector" can be used to create a media error on a specific
sector.

That lets you simulate a sector failing in the middle of the journal.

I assume that is a relevant test.

fyi: --repair-sector undoes the damage. You may need to follow that
with a normal write to put legit data there.

If you try a normal data write without first repairing, the drive
should mark the sector permanently bad and remap that sector to a
spare sector.

I have only used these tools with raw drives, no partitions, etc. So
I've never had to worry about data loss, etc.
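
Something along these lines, as a sketch (the sector number and device
are placeholders, and recent hdparm versions also want a
--yes-i-know-what-i-am-doing flag for these commands):

  # inject a media error at one sector, then read it back (should fail)
  hdparm --make-bad-sector 1000000 /dev/sdx
  dd if=/dev/sdx bs=512 skip=1000000 count=1 of=/dev/null

  # undo the damage, then rewrite the sector (zeros here, just as a placeholder)
  hdparm --repair-sector 1000000 /dev/sdx
  dd if=/dev/zero of=/dev/sdx bs=512 seek=1000000 count=1 conv=notrunc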

Greg

2011-03-29 22:26:55

by Daniel Taylor

[permalink] [raw]
Subject: RE: breaking ext4 to test recovery



> -----Original Message-----
> From: Greg Freemyer [mailto:[email protected]]
> Sent: Tuesday, March 29, 2011 10:34 AM
> To: Rogier Wolff
> Cc: Eric Sandeen; Daniel Taylor; [email protected]
> Subject: Re: breaking ext4 to test recovery
>
> On Tue, Mar 29, 2011 at 10:33 AM, Rogier Wolff
> <[email protected]> wrote:
> > On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
> >> Another tool which can be useful for this sort of thing is
> >> fsfuzzer. It writes garbage; using dd to write zeros actually
> >> might be "nice" corruption.
> >
> > Besides writing blocks of "random data", you could write
> blocks with a
> > small percentage of bits (byte) set to non-zero, or just toggle a
> > configurable number of bits (bytes). This is slightly more
> devious than just
> > "random data".
>
> I don't know what exactly is being tested, but "hdparm
> --make-bad-sector" can be used to create a media error on a specific
> sector.
>
> Thus allowing you to simulate a sector failing in the middle
> of the journal.
>
> I assume that is a relevant test.
>
> fyi: --repair-sector undoes the damage. You may need to follow that
> with a normal write to put legit data there.
>
> If you try a normal data write without first repairing, the drive
> should mark the sector permanently bad and remap that sector to a
> spare sector.
>
> I have only used these tools with raw drives, no partitions, etc. So
> I've never had to worry about data loss, etc.
>
> Greg
>

Thanks for the suggestions. Tao Ma's got me started, but doing some
of the more "devious" tests is on my list, too.

The original issue was that during component stress testing, we were
seeing instances of the ext4 file system becoming "read-only" (showing
in /proc/mounts, but not "mount"). Looking back through the logs, we
saw that at mount time, there was a complaint about a corrupted journal.

Some writing had occurred before the change to read-only, however.

The original mount script didn't check for any "mount" return value, so
we theorized that ext4 just got to a point where it couldn't sensibly
handle any more changes.

It seemed that the right answer was to check the return value from mount
and, if non-0, umount the file system, fix it, and try again. To test
the return value from mount, I need to be able to corrupt, but not
destroy the journal, since the component tests were taking days to show
the failure.
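
In shell terms the plan is roughly this (a sketch only; the device,
mount point, and e2fsck options are placeholders, not our final script):

  if ! mount -t ext4 /dev/sda1 /mnt/data; then
      logger "ext4 mount failed, attempting repair"
      umount /mnt/data 2>/dev/null    # in case it came up partially/read-only
      e2fsck -y /dev/sda1
      mount -t ext4 /dev/sda1 /mnt/data || logger "ext4 mount still failing"
  fi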

Running an "fsck -f" every time on a 3TB file system with an embedded
PPC was just taking too much time to impose on a consumer-level customer.

2011-03-29 22:33:13

by Eric Sandeen

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 3/29/11 5:26 PM, Daniel Taylor wrote:
> Thanks for the suggestions. Tao Ma's got me started, but doing some
> of the more "devious" tests is on my list, too.
>
> The original issue was that during component stress testing, we were
> seeing instances of the ext4 file system becoming "read-only" (showing
> in /proc/mounts, but not "mount"). Looking back through the logs, we
> saw that at mount time, there was a complaint about a corrupted journal.

So, did it go "read-only" right at mount time due to a journal replay
failure? Or ...

> Some writing had occurred before the change to read-only, however.

That makes it sound like it did get mounted ok... and then something
went wrong? What did the logs say?

> The original mount script didn't check for any "mount" return value, so
> we theorized that ext4 just got to a point where it couldn't sensibly
> handle any more changes.

I'm not sure what that means, TBH :)

Just want to make sure you're barking up the right tree, here ...

-Eric

> It seemed that the right answer was to check the return value from mount
> and, if non-0, umount the file system, fix it, and try again. To test
> the return value from mount, I need to be able to corrupt, but not
> destroy the journal, since the component tests were taking days to show
> the failure.
>
> Running an "fsck -f" every time on a 3TB file system with an embedded
> PPC was just taking too much time to impose on a consumer-level customer.


2011-03-31 22:11:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

There is a patch we have in the Lustre version of e2fsprogs called "ibadness" that has e2fsck track the number of errors hit for each inode, and if an inode exceeds a threshold of errors then e2fsck will offer to clear the inode instead of making small fixes to turn a garbage inode into something that looks half correct.

One of the tests that we developed for this feature was to write both random garbage into the filesystem, as well as copying data blocks from one part of the filesystem to another. This is much more difficult to fix because there may be random inode blocks that have what looks like valid inodes in them, but they are in the wrong location.

When/if we get proper block checksums this kind of corruption would be easily detected, because the checksum (which hopefully includes the block number) would be wrong even if the data looks sane.

Cheers, Andreas

On 2011-03-29, at 4:33 AM, Rogier Wolff <[email protected]> wrote:

> On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
>> Another tool which can be useful for this sort of thing is
>> fsfuzzer. It writes garbage; using dd to write zeros actually
>> might be "nice" corruption.
>
> Besides writing blocks of "random data", you could write blocks with a
> small percentage of bits (byte) set to non-zero, or just toggle a
> configurable number of bits (bytes). This is slightly more devious than just
> "random data".
>
> If you try to verify the integrity of a block full of random data, you
> can quickly determine that it is completely bogus (I don't think that
> e2fsck already exploits this as I've seen it get this wrong).
>
> If you have an indirect block, and it contains:
>
> 00000 72 6f 6f 74 3a 78 3a 30 3a 30 3a 72 6f 6f 74 3a root:x:0:0:root:
> 00010 2f 72 6f 6f 74 3a 2f 62 69 6e 2f 62 61 73 68 0a /root:/bin/bash.
> 00020 64 61 65 6d 6f 6e 3a 78 3a 31 3a 31 3a 64 61 65 daemon:x:1:1:dae
> 00030 6d 6f 6e 3a 2f 75 73 72 2f 73 62 69 6e 3a 2f 62 mon:/usr/sbin:/b
> 00040 69 6e 2f 73 68 0a 62 69 6e 3a 78 3a 32 3a 32 3a in/sh.bin:x:2:2:
> 00050 62 69 6e 3a 2f 62 69 6e 3a 2f 62 69 6e 2f 73 68 bin:/bin:/bin/sh
> 00060 0a 73 79 73 3a 78 3a 33 3a 33 3a 73 79 73 3a 2f .sys:x:3:3:sys:/
> 00070 64 65 76 3a 2f 62 69 6e 2f 73 68 0a 73 79 6e 63 dev:/bin/sh.sync
> 00080 3a 78 3a 34 3a 36 35 35 33 34 3a 73 79 6e 63 3a :x:4:65534:sync:
> 00090 2f 62 69 6e 3a 2f 62 69 6e 2f 73 79 6e 63 0a 67 /bin:/bin/sync.g
>
> You can see that the block numbers that are represented here are all
> bad. In this case, one of the options should be to discard the whole
> indirect block. If you happen to find a few "valid" block numbers
> here, they are likely to be bogus. It is counterproductive to check
> those for duplicate allocation, or to mark them as used if they happen
> to be free.
>
> Roger.
>
> --
> ** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ


2011-03-31 22:21:52

by Andreas Dilger

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 2011-03-29, at 3:50 AM, Eric Sandeen wrote:
> On 3/28/11 9:45 PM, Daniel Taylor wrote:
>> I would like to be able to break our ext4 file system
>> (specifically corrupt the journal) to be sure that we
>> can automatically notice the problem and attempt an
>> autonomous fix.
>>
>> dumpe2fs tells me the inode, but not, that I can see, the
>> blocks where the journal exists (for "dd"ing junk to it).
>>
>> Is there any debug tool that would let me deliberately
>> break the file system (at least, trash the journal)?
>>
>> If not, is there a hint for figuring out the block(s) of
>> the journal so I can stomp it?
>>
>> The kernel is in an embedded machine, so it's a little old
>> 2.6.32.11 and e2fsprogs/libs 1.41.12-2 (Lenny)
>
> But are you trying to test in-kernel recovery, or e2fsck, after
> you corrupt the journal? Or both?
>
> I assume you'd start with a filesystem with a dirty log,
> corrupt that log, and then what, fsck it, or try to mount it?
>
> How are you generating your fs w/ dirty log?
>
> (xfs has an ioctl to abruptly "stop" the fs as if it had crashed,
> that would be very useful in extN as well).

We have a kernel patch "dev_read_only" that we use with Lustre to disable writes to the block device while the device is in use. This allows simulating crashes at arbitrary points in the code or test scripts. It was based on Andrew Morton's test harness that he used for ext3 recovery testing back when it was being ported to the 2.4 kernel.

http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD

The best part of this patch is that it works with any block device, can simulate power failure w/o any need for automated power control, and once the block device is unused (all buffers and references dropped) it can be re-activated safely.

> Another thing which could use lots more testing in the wild is
> simple journal recovery; nothing is corrupted, but the drive got
> unplugged or the system lost power while the fs was under load;
> see if a mount; umount; fsck and/or if a fsck; mount; umount; fsck finds
> errors.
>
> (the former will test in-kernel log recovery, the latter will test
> log recovery in e2fsck).

Cheers, Andreas






2011-03-31 22:22:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 2011-03-31, at 12:11 PM, Andreas Dilger wrote:
> There is a patch we have in the Lustre version of e2fsprogs called "ibadness" that has e2fsck track the number of errors hit for each inode, and if an inode exceeds a threshold of errors then e2fsck will offer to clear the inode instead of making small fixes to turn a garbage inode into something that looks half correct.

http://git.whamcloud.com/?p=tools/e2fsprogs.git;a=blob_plain;f=patches/e2fsprogs-ibadness-counter.patch;hb=8dd11ed9bdf0914d57d78d0c387bd21f747c1d29

> One of the tests that we developed for this feature was to write both random garbage into the filesystem, as well as copying data blocks from one part of the filesystem to another. This is much more difficult to fix because there may be random inode blocks that have what looks like valid inodes in them, but they are in the wrong location.

This one is already in upstream e2fsprogs as the "f_random_corruption" test, though it could stand some improvement.

> When/if we get proper block checksums this kind of corruption would be easily detected, because the checksum (which hopefully includes the block number) would be wrong even if the data looks sane.
>
> Cheers, Andreas
>
> On 2011-03-29, at 4:33 AM, Rogier Wolff <[email protected]> wrote:
>
>> On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
>>> Another tool which can be useful for this sort of thing is
>>> fsfuzzer. It writes garbage; using dd to write zeros actually
>>> might be "nice" corruption.
>>
>> Besides writing blocks of "random data", you could write blocks with a
>> small percentage of bits (byte) set to non-zero, or just toggle a
>> configurable number of bits (bytes). This is slightly more devious than just
>> "random data".
>>
>> If you try to verify the integrity of a block full of random data, you
>> can quickly determine that it is completely bogus (I don't think that
>> e2fsck already exploits this as I've seen it get this wrong).
>>
>> If you have an indirect block, and it contains:
>>
>> 00000 72 6f 6f 74 3a 78 3a 30 3a 30 3a 72 6f 6f 74 3a root:x:0:0:root:
>> 00010 2f 72 6f 6f 74 3a 2f 62 69 6e 2f 62 61 73 68 0a /root:/bin/bash.
>> 00020 64 61 65 6d 6f 6e 3a 78 3a 31 3a 31 3a 64 61 65 daemon:x:1:1:dae
>> 00030 6d 6f 6e 3a 2f 75 73 72 2f 73 62 69 6e 3a 2f 62 mon:/usr/sbin:/b
>> 00040 69 6e 2f 73 68 0a 62 69 6e 3a 78 3a 32 3a 32 3a in/sh.bin:x:2:2:
>> 00050 62 69 6e 3a 2f 62 69 6e 3a 2f 62 69 6e 2f 73 68 bin:/bin:/bin/sh
>> 00060 0a 73 79 73 3a 78 3a 33 3a 33 3a 73 79 73 3a 2f .sys:x:3:3:sys:/
>> 00070 64 65 76 3a 2f 62 69 6e 2f 73 68 0a 73 79 6e 63 dev:/bin/sh.sync
>> 00080 3a 78 3a 34 3a 36 35 35 33 34 3a 73 79 6e 63 3a :x:4:65534:sync:
>> 00090 2f 62 69 6e 3a 2f 62 69 6e 2f 73 79 6e 63 0a 67 /bin:/bin/sync.g
>>
>> You can see that the block numbers that are represented here are all
>> bad. In this case, one of the options should be to discard the whole
>> indirect block. If you happen to find a few "valid" block numbers
>> here, they are likely to be bogus. It is counterproductive to check
>> those for duplicate allocation, or to mark them as used if they happen
>> to be free.
>>
>> Roger.
>>
>> --
>> ** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
>> ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
>> *-- BitWizard writes Linux device drivers for any device you may have! --*
>> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
>> Does it sit on the couch all day? Is it unemployed? Please be specific!
>> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ


Cheers, Andreas






2011-03-31 22:44:21

by Eric Sandeen

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 3/31/11 5:21 PM, Andreas Dilger wrote:

> We have a kernel patch "dev_read_only" that we use with Lustre to
> disable writes to the block device while the device is in use. This
> allows simulating crashes at arbitrary points in the code or test
> scripts. It was based on Andrew Morton's test harness that he used
> for ext3 recovery testing back when it was being ported to the 2.4
> kernel.
>
> http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
>
> The best part of this patch is that it works with any block device,
> can simulate power failure w/o any need for automated power control,
> and once the block device is unused (all buffers and references
> dropped) it can be re-activated safely.

It won't simulate a lost write cache though, will it?

-Eric

2011-04-01 15:26:23

by Lukas Czerner

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On Thu, 31 Mar 2011, Eric Sandeen wrote:

> On 3/31/11 5:21 PM, Andreas Dilger wrote:
>
> > We have a kernel patch "dev_read_only" that we use with Lustre to
> > disable writes to the block device while the device is in use. This
> > allows simulating crashes at arbitrary points in the code or test
> > scripts. It was based on Andrew Morton's test harness that he used
> > for ext3 recovery testing back when it was being ported to the 2.4
> > kernel.
> >
> > http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
> >
> > The best part of this patch is that it works with any block device,
> > can simulate power failure w/o any need for automated power control,
> > and once the block device is unused (all buffers and references
> > dropped) it can be re-activated safely.
>
> It won't simulate a lost write cache though, will it?

That's a very good question. I would like to know if there is any way at
all to force the device to drop its write cache; that would really help
with power-failure testing of filesystems.

-Lukas

>
> -Eric

2011-04-01 15:52:17

by Ric Wheeler

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 04/01/2011 11:26 AM, Lukas Czerner wrote:
> On Thu, 31 Mar 2011, Eric Sandeen wrote:
>
>> On 3/31/11 5:21 PM, Andreas Dilger wrote:
>>
>>> We have a kernel patch "dev_read_only" that we use with Lustre to
>>> disable writes to the block device while the device is in use. This
>>> allows simulating crashes at arbitrary points in the code or test
>>> scripts. It was based on Andrew Morton's test harness that he used
>>> for ext3 recovery testing back when it was being ported to the 2.4
>>> kernel.
>>>
>>> http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
>>>
>>> The best part of this patch is that it works with any block device,
>>> can simulate power failure w/o any need for automated power control,
>>> and once the block device is unused (all buffers and references
>>> dropped) it can be re-activated safely.
>> It won't simulate a lost write cache though, will it?
> That's a very good question, I would like to know if there is any way at
> all to force the device to drop the write cache. That would really help
> the power failure testing filesystems.
>
> -Lukas
>

Write cache behavior can be really mysterious. Small writes (say, single 4K
blocks) might stay in cache and not get written for a very long time, while
large, streaming writes might bypass the write cache entirely.

It would be neat to be able to simulate these odd things for failure testing :)

Ric


2011-04-02 02:15:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 2011-03-31, at 12:44 PM, Eric Sandeen wrote:
> On 3/31/11 5:21 PM, Andreas Dilger wrote:
>> We have a kernel patch "dev_read_only" that we use with Lustre to
>> disable writes to the block device while the device is in use. This
>> allows simulating crashes at arbitrary points in the code or test
>> scripts. It was based on Andrew Morton's test harness that he used
>> for ext3 recovery testing back when it was being ported to the 2.4
>> kernel.
>>
>> http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
>>
>> The best part of this patch is that it works with any block device,
>> can simulate power failure w/o any need for automated power control,
>> and once the block device is unused (all buffers and references
>> dropped) it can be re-activated safely.
>
> It won't simulate a lost write cache though, will it?

I'm not sure what you mean. Since the patch works at the block device layer (in __generic_make_request()) it will drop the write at the time it is submitted to the device, not when it is put into the cache.

That said, I notice in the linux git repo a line in the same place as our patch, "if (should_fail_request(bio))", which looks like it might have similar functionality when CONFIG_FAIL_MAKE_REQUEST is enabled. I'm not sure what kernel version it was added in. It looks like it is possible to fail IOs some fraction of the time, or permanently, by writing something into /sys/block/{dev}/fail.
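
If that is the mechanism, the knobs would look roughly like this (a
sketch based on the kernel's fault-injection documentation; the sysfs
attribute seems to be called "make-it-fail" in the kernels I looked at,
and debugfs has to be mounted):

  # flag the whole device as an injection target
  echo 1 > /sys/block/sdx/make-it-fail

  # fail every submitted bio from now on
  echo 100 > /sys/kernel/debug/fail_make_request/probability
  echo -1 > /sys/kernel/debug/fail_make_request/times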

Cheers, Andreas






2011-04-02 12:38:23

by Ric Wheeler

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 04/01/2011 10:15 PM, Andreas Dilger wrote:
> On 2011-03-31, at 12:44 PM, Eric Sandeen wrote:
>> On 3/31/11 5:21 PM, Andreas Dilger wrote:
>>> We have a kernel patch "dev_read_only" that we use with Lustre to
>>> disable writes to the block device while the device is in use. This
>>> allows simulating crashes at arbitrary points in the code or test
>>> scripts. It was based on Andrew Morton's test harness that he used
>>> for ext3 recovery testing back when it was being ported to the 2.4
>>> kernel.
>>>
>>> http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
>>>
>>> The best part of this patch is that it works with any block device,
>>> can simulate power failure w/o any need for automated power control,
>>> and once the block device is unused (all buffers and references
>>> dropped) it can be re-activated safely.
>> It won't simulate a lost write cache though, will it?
> I'm not sure what you mean. Since the patch works at the block device layer (in __generic_make_request()) it will drop the write at the time it is submitted to the device, not when it is put into the cache.
>
> That said, I notice in the linux git repo a line that is in the same place as our patch "if (should_fail_request(bio))" which looks like it might have similar functionality when CONFIG_FAIL_MAKE_REQUEST is enabled. I'm not sure what kernel version it was added in. It looks like it is possible to fail the IOs some fraction of the time, or permanently, by writing something into /sys/block/{dev}/fail.
>
> Cheers, Andreas

The device mapper developers are looking at having a device mapper target that
can be used as a hot block cache - say given a S-ATA disk and a PCI-e SSD, you
would store the hot blocks on the PCI-e card.

What might be a great simulation would be to have a way to destroy that cache,
assuming we could get a cache policy that simulates some reasonable, disk like
caching policy :)

Ric



2011-04-03 00:46:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

On 2011-04-02, at 2:38 AM, Ric Wheeler <[email protected]> wrote:
> The device mapper developers are looking at having a device mapper target that can be used as a hot block cache - say given a S-ATA disk and a PCI-e SSD, you would store the hot blocks on the PCI-e card.

There was a patch posted around December called bcache which did this same thing. I don't recall if it was a DM target or not.

> What might be a great simulation would be to have a way to destroy that cache, assuming we could get a cache policy that simulates some reasonable, disk like caching policy :)

The one difficulty with DM targets is that they cannot be used with non-DM devices. That was one of the advantages of EVMS (if anyone remembers that) - it could work with any existing block device.

Cheers, Andreas

2011-04-03 02:37:57

by Tao Ma

[permalink] [raw]
Subject: Re: breaking ext4 to test recovery

Hi Ric,
On 04/02/2011 08:38 PM, Ric Wheeler wrote:
> On 04/01/2011 10:15 PM, Andreas Dilger wrote:
>> On 2011-03-31, at 12:44 PM, Eric Sandeen wrote:
>>> On 3/31/11 5:21 PM, Andreas Dilger wrote:
>>>> We have a kernel patch "dev_read_only" that we use with Lustre to
>>>> disable writes to the block device while the device is in use. This
>>>> allows simulating crashes at arbitrary points in the code or test
>>>> scripts. It was based on Andrew Morton's test harness that he used
>>>> for ext3 recovery testing back when it was being ported to the 2.4
>>>> kernel.
>>>>
>>>> http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/dev_read_only-2.6.32-rhel6.patch;hb=HEAD
>>>>
>>>>
>>>> The best part of this patch is that it works with any block device,
>>>> can simulate power failure w/o any need for automated power control,
>>>> and once the block device is unused (all buffers and references
>>>> dropped) it can be re-activated safely.
>>> It won't simulate a lost write cache though, will it?
>> I'm not sure what you mean. Since the patch works at the block device
>> layer (in __generic_make_request()) it will drop the write at the time
>> it is submitted to the device, not when it is put into the cache.
>>
>> That said, I notice in the linux git repo a line that is in the same
>> place as our patch "if (should_fail_request(bio))" which looks like it
>> might have similar functionality when CONFIG_FAIL_MAKE_REQUEST is
>> enabled. I'm not sure what kernel version it was added in. It looks
>> like it is possible to fail the IOs some fraction of the time, or
>> permanently, by writing something into /sys/block/{dev}/fail.
>>
>> Cheers, Andreas
>
> The device mapper developers are looking at having a device mapper
> target that can be used as a hot block cache - say given a S-ATA disk
> and a PCI-e SSD, you would store the hot blocks on the PCI-e card.
My topic at this year's LSF is "ssd and flashcache". I will talk about
how we use SSDs and flashcache in our production system, and one of my
proposals is to rewrite flashcache and get it merged upstream. Do you
know whether anyone is working on this at the moment? I am glad that
other people have the same idea, and we would be happy to cooperate
with them to get it upstreamed ASAP.

Regards,
Tao