LinuxLists.cc - ext3-fs transient corruption with devmapper over md raid, kernel 2.6.16.14

2006-05-23 18:26:39

Subject: ext3-fs transient corruption with devmapper over md raid, kernel 2.6.16.14

I just recently upgraded a machine to use devmapper for an encrypted
filesystem on top of a software raid5 array. System is running a
stock 2.6.16.14 kernel with no additional patches.

Under periods of high disk load on that array, I get various errors like the
following:

May 23 06:26:48 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #108298277: rec_len %% 4 != 0 - offset=0, inode=857743392, rec_len=12853, name_len=52
May 23 06:27:01 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #109215774: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
May 23 06:27:23 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #109232146: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
May 23 06:27:27 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #109282688: rec_len is smaller than minimal - offset=0, inode=1048832, rec_len=0, name_len=0
May 23 06:27:27 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #109297749: rec_len %% 4 != 0 - offset=0, inode=1309226288, rec_len=8303, name_len=67
May 23 06:28:07 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #109871722: rec_len %% 4 != 0 - offset=0, inode=6586752, rec_len=1581, name_len=0
May 23 06:28:21 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #110412086: rec_len is smaller than minimal - offset=0, inode=1048832, rec_len=0, name_len=0
May 23 06:28:23 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #110428194: rec_len is smaller than minimal - offset=0, inode=0, rec_len=1, name_len=0
May 23 06:28:24 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #110428301: rec_len %% 4 != 0 - offset=0, inode=1767526191, rec_len=13679, name_len=77
May 23 06:28:25 localhost kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #110444639: directory entry across blocks - offset=0, inode=538976288, rec_len=24892, name_len=32
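
For anyone reading the log lines above, the sanity checks behind these messages can be roughly sketched in Python. This is a paraphrase for illustration, not the kernel code: the names `min_rec_len`, `classify_dirent` and `classify_log_line` are made up here, the 4096-byte block size is an assumption (the thread does not state it), and the order of the checks is assumed as well.

```python
import re

BLOCK_SIZE = 4096  # assumption: the filesystem block size is not stated in the thread


def min_rec_len(name_len):
    """Smallest valid record length: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def classify_dirent(offset, rec_len, name_len, block_size=BLOCK_SIZE):
    """Return the first sanity check a directory entry fails, or None if it passes."""
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        # Not seen in the log above; the wording here is approximate.
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None


# Matches the field list at the end of each "bad entry in directory" line.
FIELDS = re.compile(r"offset=(\d+), inode=(\d+), rec_len=(\d+), name_len=(\d+)")


def classify_log_line(line):
    """Extract the logged fields from one EXT3-fs error line and re-check them."""
    match = FIELDS.search(line)
    if match is None:
        return None
    offset, _inode, rec_len, name_len = map(int, match.groups())
    return classify_dirent(offset, rec_len, name_len)
```

Fed the field values from the log, the sketch flags the same three conditions that appear there, each at offset=0, i.e. the very first entry of the directory block already fails the checks.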

After the errors occur, if I then unmount and fsck the devmapper device, it
finds no errors. If the disk load isn't heavy, the errors never seem to
crop up. There are no messages concerning the underlying md device or any
of the member disks of the md device.

This is not completely reproducible, but appears to be commonly triggered by
the nightly updatedb/find cronjob running concurrently with a hefty rsync
process on a filesystem with about 3 million files.

Other md devices on the same machine that are NOT used via devmapper are not
showing these problems.

The system runs the stable AMD64 distribution of Debian, using devmapper
with the dm_crypt and aes_x86_64 kernel modules.

I found a couple of similar reports via Google, but most were pretty
old, and none seemed to have applicable resolutions.

Happy to provide any further info that may be useful.

Evan

2006-05-23 22:16:53

by Alasdair G Kergon

Subject: Re: ext3-fs transient corruption with devmapper over md raid, kernel 2.6.16.14

On Tue, May 23, 2006 at 01:26:32PM -0500, Evan Harris wrote:
> I just recently upgraded a machine to use devmapper for an encrypted
> filesystem on top of a software raid5 array. System is running a
> stock 2.6.16.14 kernel with no additional patches.

> Happy to provide any further info that may be useful.

This might not be practical for you, but what we're looking for
is people who can reproduce this on a test system where they can
try varying things one-at-a-time. For example, replace dm-crypt
with dm-linear (e.g. a standard unencrypted LVM2 logical volume);
replace raid5 with (md) linear. Also test with the latest
development kernels to see if recent md patches fixed the problem.
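
As a rough illustration of the one-at-a-time approach, the test runs could be enumerated like this. The sketch is only a planning aid: the baseline values come from the original report, the alternatives are the ones listed above, and the dictionary layout and the name `one_at_a_time` are hypothetical.

```python
# Baseline taken from the original report; the alternatives are the ones
# suggested in this message. The dict layout is purely illustrative.
BASELINE = {"dm_target": "dm-crypt", "md_level": "raid5", "kernel": "2.6.16.14"}
ALTERNATIVES = {
    "dm_target": ["dm-linear"],
    "md_level": ["linear"],
    "kernel": ["latest development"],
}


def one_at_a_time(baseline, alternatives):
    """Yield (changed_factor, config) pairs that differ from baseline in one factor."""
    for factor, values in alternatives.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline itself is never modified
            config[factor] = value
            yield factor, config


if __name__ == "__main__":
    for factor, config in one_at_a_time(BASELINE, ALTERNATIVES):
        print(f"vary {factor}: {config}")
```

Each run changes exactly one factor, so a run that no longer shows the corruption points at the component that was swapped out.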

Alasdair

2006-05-26 19:36:04

by Evan Harris

Subject: Re: ext3-fs transient corruption with devmapper over md raid, kernel 2.6.16.14

Luckily the filesystem had no important data on it yet, so I have been
testing various changes.

First I changed the mount from ext3 to plain ext2, and that eventually
produced a series of errors like this:

May 24 03:15:18 localhost kernel: init_special_inode: bogus i_mode (2640)
May 24 03:15:29 localhost kernel: init_special_inode: bogus i_mode (175015)
May 24 03:15:29 localhost kernel: init_special_inode: bogus i_mode (11)
May 24 03:15:29 localhost kernel: init_special_inode: bogus i_mode (161265)
May 24 03:15:29 localhost kernel: init_special_inode: bogus i_mode (0)
May 24 03:15:29 localhost kernel: attempt to access beyond end of device
May 24 03:15:29 localhost kernel: dm-0: rw=0, want=8341843240, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
May 24 03:15:30 localhost kernel: dm-0: rw=0, want=9187258480, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
May 24 03:15:30 localhost kernel: dm-0: rw=0, want=9184366040, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
May 24 03:15:30 localhost kernel: dm-0: rw=0, want=9182932112, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
May 24 03:15:30 localhost kernel: dm-0: rw=0, want=8994978168, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
May 24 03:15:30 localhost kernel: dm-0: rw=0, want=9187258480, limit=2930302464
May 24 03:15:30 localhost kernel: attempt to access beyond end of device
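
To put the want/limit numbers above in perspective, here is a small Python sketch that converts them to bytes. It relies on the block layer reporting both values in 512-byte sectors; the function name `describe_overrun` is made up for this example.

```python
import re

SECTOR_BYTES = 512  # the block layer reports want/limit in 512-byte sectors

WANT_LIMIT = re.compile(r"want=(\d+), limit=(\d+)")


def describe_overrun(line):
    """Return (want_bytes, limit_bytes, ratio) for a 'want=..., limit=...' line."""
    match = WANT_LIMIT.search(line)
    if match is None:
        return None
    want, limit = (int(group) for group in match.groups())
    return want * SECTOR_BYTES, limit * SECTOR_BYTES, want / limit


if __name__ == "__main__":
    sample = "dm-0: rw=0, want=8341843240, limit=2930302464"
    want_bytes, limit_bytes, ratio = describe_overrun(sample)
    print(f"device size: {limit_bytes / 1e12:.2f} TB")
    print(f"requested offset is {ratio:.1f}x the device size")
```

The limit works out to roughly 1.5 TB, and the requested offsets lie around three times past the end of the device. That would be consistent with block numbers being read from corrupted metadata, though the log alone does not prove it.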

Then I switched from dm-crypt to dm-linear and have not been able to
reproduce the problem using dm-linear. Unfortunately, the test conditions
are not exactly the same: dm-crypt completely pegs the CPU, while
dm-linear is disk-bound rather than CPU-bound, so CPU utilization is
MUCH lower.

However, this leads me to suspect that the problem is either in dm-crypt, or
a data corruption problem resulting from a decrypt error from the aes_x86_64
module that dm-crypt is using.

One thing I forgot to mention before is that this is a dual-core box. In
case the problem is related to SMP, I'm planning to build a non-SMP
kernel, go back to using dm-crypt, and see if I can still produce the
error that way.

If anyone has variations that would be more useful, I can try to test those
first.

Evan

On Tue, 23 May 2006, Alasdair G Kergon wrote:

> On Tue, May 23, 2006 at 01:26:32PM -0500, Evan Harris wrote:
>> I just recently upgraded a machine to use devmapper for an encrypted
>> filesystem on top of a software raid5 array. System is running a
>> stock 2.6.16.14 kernel with no additional patches.
>
>> Happy to provide any further info that may be useful.
>
> This might not be practical for you, but what we're looking for
> is people who can reproduce this on a test system where they can
> try varying things one-at-a-time. For example, replace dm-crypt
> with dm-linear (e.g. a standard unencrypted LVM2 logical volume);
> replace raid5 with (md) linear. Also test with the latest
> development kernels to see if recent md patches fixed the problem.
>
> Alasdair

2006-05-26 20:30:12

by Alasdair G Kergon

Subject: Re: ext3-fs transient corruption with devmapper over md raid, kernel 2.6.16.14

On Fri, May 26, 2006 at 02:35:48PM -0500, Evan Harris wrote:
> However, this leads me to suspect that the problem is either in dm-crypt,
> or a data corruption problem resulting from a decrypt error from the
> aes_x86_64 module that dm-crypt is using.

Perhaps try a different crypt module too?

Alasdair