Hi Linus,
This changeset contains Daniel Phillip's indexed directory changes,
ported to 2.5 by Christopher Li and Andrew Morton, and then extensively
cleaned up by me. I also implemented the enhanced dx_readdir code which
returns files would be returned in hash order. This was necessary so
that concurrent tree splits would not result in filenames to be
erroneously returned twice or not at all when the b-tree splits
reorganized the directory out from underneath readdir().
This patch significantly increases the speed of using large directories.
Creating 100,000 files in a single directory took 38 minutes without
directory indexing... and 11 seconds with the directory indexing turned on.
I've given this code a good bit of testing, under both 2.4 and 2.5
kernels, and believe that it is ready for prime-time. Please pull it
from:
bk://extfs.bkbits.net/for-linus-htree-2.5
In order to use the new directory indexing feature, please update your
e2fsprogs to 1.29. Existing filesystem can be updated to use directory
indexing using the command "tune2fs -O dir_index /dev/hdXXX". This can
be done while the filesystem is mounted, and subsequent new directories
or directories fit within a single block will be use the new (backwards
compatible) dirctory indexing format when they grow beyond a single
block.
Existing large directories on the filesystem can be converted to use the
new indexed directory format by running the following command on an
unmounted filesystem "e2fsck -fD /dev/hdXXX".
- Ted
fs/ext3/Makefile | 2
fs/ext3/dir.c | 298 ++++++++++
fs/ext3/file.c | 3
fs/ext3/hash.c | 215 +++++++
fs/ext3/namei.c | 1305 ++++++++++++++++++++++++++++++++++++++++-----
fs/ext3/super.c | 6
include/linux/ext3_fs.h | 86 ++
include/linux/ext3_fs_sb.h | 2
include/linux/ext3_jbd.h | 2
include/linux/rbtree.h | 1
lib/rbtree.c | 16
11 files changed, 1797 insertions(+), 139 deletions(-)
(The changes to rbtree.c were to add a new function rb_first(), which
returns the first node in the rbtree.)
On Sep 25, 2002 16:03 -0400, [email protected] wrote:
> This patch significantly increases the speed of using large directories.
> Creating 100,000 files in a single directory took 38 minutes without
> directory indexing... and 11 seconds with the directory indexing turned on.
Not mentioned, but very important to note is that the dir indexing code
is 100% compatible with older kernels (even back to 1.2 days) for both
read and write.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
On Wed, Sep 25, 2002 at 04:03:44PM -0400, [email protected] wrote:
> This patch significantly increases the speed of using large directories.
> Creating 100,000 files in a single directory took 38 minutes without
> directory indexing... and 11 seconds with the directory indexing turned on.
Just curious.. what measurable overhead (if any) is there of indexing
dirs with smaller numbers of files vs non-indexed ?
If so, where would be the break-even point ?
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
On Sep 25, 2002 21:41 +0100, Dave Jones wrote:
> On Wed, Sep 25, 2002 at 04:03:44PM -0400, [email protected] wrote:
>
> > This patch significantly increases the speed of using large directories.
> > Creating 100,000 files in a single directory took 38 minutes without
> > directory indexing... and 11 seconds with the directory indexing turned on.
>
> Just curious.. what measurable overhead (if any) is there of indexing
> dirs with smaller numbers of files vs non-indexed ?
> If so, where would be the break-even point ?
No overhead at all for directories 1 block in size. The htree code uses
the existing "search leaf block" code for such a directory directly.
For directories > 1 block in size, you have the index (1 block
overhead), but also the benefit that you are only searching 1/N of the
blocks for an entry (the leaf block searching code remains the same,
just the "which block to search" code is activated.
So, in summary, htree is never slower than an un-indexed directory, so
there is never really a time when you wouldn't want to use it.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
On Wed, Sep 25, 2002 at 09:41:01PM +0100, Dave Jones wrote:
> Just curious.. what measurable overhead (if any) is there of indexing
> dirs with smaller numbers of files vs non-indexed ?
> If so, where would be the break-even point ?
It should be negligible to the point of being non-measurable. If the
number of files fit within a single block, the format doesn't change
at all. Once the directory is grows so there are two block's worth of
directory entries, then we move to an indexed format, and every file
will be found within two disk block reads, and on average we will need
to do comparisons on half a block worth of directory names. Without
directory indexing, on average a lookup will succeed in 1.5 disk reads
(50% will require one disk block read, and 50% will require two disk
block reads), and on average we will need to do comparisons on a full
block's worth of directory entries.
So when the directory is two blocks' worth of data, it's a slight lose
if you are looking at things from the point of view of disk reads, but
we're winning already from the point of view of CPU time. Not that
this matters; it won't be measureable because the block read
statistics only apply the first time you search the directory. After
that, the directory blocks will be in cache, and the only thing that
matters is the CPU time. And the amount of CPU time that it takes to
search directory entries, whether it we need to search on average 1
block or half a block worth of directory entries, is small enough to
be not an issue.
Not that it matters, but for the record though, we break-even on disk
reads once the directory is 3 blocks long. At that point, linear
search will require on average reading 2 blocks (1*1/3 + 2*1/3 + 3*1/3
== 2) and the indexed directory will still require 2 disk reads. On
the CPU front, the linear search will require comparisons with 1.5
blocks worth of directory entries, while the indexed directory will
still require only 0.5 blocks worth of directory comparisons. (I'm
ignoring here the CPU time it takes to calculate the hash, but it's
the time to search the directory blocks that matters, since that's
where you'll have all of the memory cache misses that will slow down
the search.)
Oh, and finally, for a small directory, it won't be measurable because
after the first time around, all of the directory entries will fit in
the dcache. :-)
- Ted
[email protected] wrote:
> I've given this code a good bit of testing, under both 2.4 and 2.5
> kernels, and believe that it is ready for prime-time. Please pull it
> from:
>
> bk://extfs.bkbits.net/for-linus-htree-2.5
Can you post a GNU patch too, for public lookover and independent
integration?
Jeff
On Wed, Sep 25, 2002 at 06:54:00PM -0400, Jeff Garzik wrote:
>
> Can you post a GNU patch too, for public lookover and independent
> integration?
>
Sure. The patch is a bit big for e-mail, but you can find it at:
http://thunk.org/tytso/linux/ext3-dxdir/patch-ext3-dxdir-2.5.38
There is also a 2.4.19 patch available as well:
http://thunk.org/tytso/linux/ext3-dxdir/patch-ext3-dxdir-2.4.19-2
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 25, 2002 16:29, Theodore Ts'o wrote:
> There is also a 2.4.19 patch available as well:
>
> http://thunk.org/tytso/linux/ext3-dxdir/patch-ext3-dxdir-2.4.19-2
>
I got some pretty nasty results with that patch. After enabling the dir_index
superblock flag and running e2fsck -fD, the filesystem would spontaneously
remount itself read-only (I have errors=remount-ro set) after a few minutes
of use. Once I disabled dir_index, e2fsck picked up many duplicate blocks.
There doesn't appear to be any severe filesystem damage, however.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9kkqkLGMzRzbJfbQRAiiSAJwLFZSyHkG9WSGw6HY23g6i+N+z0QCggQEF
ixJJJJLCe56O5lL51sTDgaE=
=sywJ
-----END PGP SIGNATURE-----
Am Mit, 2002-09-25 um 22.03 schrieb [email protected]:
> I've given this code a good bit of testing, under both 2.4 and 2.5
> kernels, and believe that it is ready for prime-time. Please pull it
> from:
> bk://extfs.bkbits.net/for-linus-htree-2.5
Where can one fetch the 2.4 code?
--
Servus,
Daniel
On 26 Sep 2002, Daniel Egger wrote:
| Am Mit, 2002-09-25 um 22.03 schrieb [email protected]:
|
| > I've given this code a good bit of testing, under both 2.4 and 2.5
| > kernels, and believe that it is ready for prime-time. Please pull it
| > from:
|
| > bk://extfs.bkbits.net/for-linus-htree-2.5
|
| Where can one fetch the 2.4 code?
As Ted already said:
> Can you post a GNU patch too, for public lookover and independent
> integration?
Sure. The patch is a bit big for e-mail, but you can find it at:
http://thunk.org/tytso/linux/ext3-dxdir/patch-ext3-dxdir-2.5.38
There is also a 2.4.19 patch available as well:
http://thunk.org/tytso/linux/ext3-dxdir/patch-ext3-dxdir-2.4.19-2
--
~Randy
On Wed, Sep 25, 2002 at 04:03:44PM -0400, [email protected] wrote:
> In order to use the new directory indexing feature, please update your
> e2fsprogs to 1.29. Existing filesystem can be updated to use directory
> indexing using the command "tune2fs -O dir_index /dev/hdXXX".
Do new filesystems created with e2fsprogs 1.29 use this feature by
default?
On Wed, Sep 25, 2002 at 05:50:20PM -0700, Aaron Lehmann wrote:
> On Wed, Sep 25, 2002 at 04:03:44PM -0400, [email protected] wrote:
> > In order to use the new directory indexing feature, please update your
> > e2fsprogs to 1.29. Existing filesystem can be updated to use directory
> > indexing using the command "tune2fs -O dir_index /dev/hdXXX".
>
> Do new filesystems created with e2fsprogs 1.29 use this feature by
> default?
Yes, they do. The "tune2fs -O dir_index" command is only needed for
filesystems created before 1.29.
- Ted
On Wed, Sep 25, 2002 at 04:45:34PM -0700, Ryan Cumming wrote:
> I got some pretty nasty results with that patch. After enabling the
> dir_index superblock flag and running e2fsck -fD, the filesystem
> would spontaneously remount itself read-only (I have
> errors=remount-ro set) after a few minutes of use. Once I disabled
> dir_index, e2fsck picked up many duplicate blocks. There doesn't
> appear to be any severe filesystem damage, however.
Well, I'm currently running with 2.4.19 with the ext3 patch, and I'm
not having any problems. In fact, my mail directory is an indexed
directory, and I'm replying out of it at this very moment. I've also
done a number of fairly intensive bitkeeper operations (the
deleted/SCCS directory gets is pretty big for the linux kernel), and
it seems to be working just fine for me.
Are you sure you upgraded to the latest version of e2fsprogs (version
1.29, released yesterday?). I specified version 1.29 explicitly in my
last e-mail for a reason --- e2fsprogs 1.28 had a bug where e2fsck -fD
had a 1 in 512 chance of corrupting each directory block which it
tries to optimize. Unfortunately because the fence post error had
only a 1 in 512 chance of triggering, I didn't notice it in my
regression tests.
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 25, 2002 20:27, Theodore Ts'o wrote:
> Are you sure you upgraded to the latest version of e2fsprogs (version
> 1.29, released yesterday?).
Yes, I compiled from source explicitly for this purpose.
I made a second attempt at getting dir_index to work, with make clean in linux
and e2fsprogs.
1) I ran 'tune2fs -O dir_index /dev/hda2". So good so far.
2) Upon rebooting in to single user mode, I ran 'man fsck' to figure out how
to get those nifty progressbars ;)
3) While starting man(1), EXT3 began spewing messages in the form:
"EXT3-fs error (device (ide0(3,2)): ext3_readdir: directory #4243459 contains
a hole at offset xxxxxx"
The directory number stayed constant, but the offset was variable. fsck -fD
had -not- been run at this point.
4) On reboot, fsck reported:
"Directory inode has unallocated block #xx"
multiple times. fsck seemed to fully recover the filesystem. I rebooted again
for good measure.
5) Not set back, I tried 'man fsck' again. It worked as expected this time.
6) Next, I executed 'e2fsck -CfD /dev/hda2". When it completed, I rebooted.
7) While KDE was trying to start, EXT3 dumped the following to the console:
"EXT3-fs error (device ide(3,2)) in start_transaction: Journal has aborted"
8) I rebooted, and fsck said:
"Directory inode 131073,block3,offset 528: Directory corrupted"
I wasn't so lucky this time, and a good portion of my home directory got
eaten.
9) I tried to start KDE again, and the filesystem went read-only halfway
through composing an email. I didn't catch the error on the console that
time.
10) fsck reported more 'Directory corrupted' errors, and more of my home
directory ended up in /lost+found
11) I ran 'tune2fs -O ^dir_index', rebooted, and fsck cleaned up all the index
blocks
It seems to be running stable now. Linux 2.4.19, UP Athlon, GCC 3.2.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9kpnBLGMzRzbJfbQRAhlfAKCPiz7J6EWNyk/hlFsMbXjeeT7D7QCeI10u
k0yuUW2maTuHhhDMsKRjO5c=
=jOkd
-----END PGP SIGNATURE-----
On Wed, Sep 25, 2002 at 10:23:11PM -0700, Ryan Cumming wrote:
>
> It seems to be running stable now. Linux 2.4.19, UP Athlon, GCC 3.2.
Just to humor me, can you try it with gcc 2.95.4? I just want to
eliminate one variable....
> 3) While starting man(1), EXT3 began spewing messages in the form:
> "EXT3-fs error (device (ide0(3,2)): ext3_readdir: directory #4243459 contains
> a hole at offset xxxxxx"
what directory was 4243459? You can use the debugfs's ncheck command
to get back a pathname from an inode number?
Are you sure the filesystem was consistent before you started this
whole procedure?
It sounds like you hadn't started modifying directories at this point
in the procedure. Yet this error ("directory #XXXX contains a hole")
is printed by the non-indexed-directory version of readdir. So that
would imply that the directory with the initial error reported on it
was not an indexed directory..... very strange!
> The directory number stayed constant, but the offset was variable. fsck -fD
> had -not- been run at this point.
> 4) On reboot, fsck reported:
> "Directory inode has unallocated block #xx"
> multiple times. fsck seemed to fully recover the filesystem. I rebooted again
> for good measure.
What were the directory inode numbers, and what pathname did they map
to?
> 7) While KDE was trying to start, EXT3 dumped the following to the console:
> "EXT3-fs error (device ide(3,2)) in start_transaction: Journal has aborted"
This message will appear if previously some other part of ext2
reported a filesystem inconsistency. So it's a symptom, and not the
root cause of the problem.
> 8) I rebooted, and fsck said:
> "Directory inode 131073,block3,offset 528: Directory corrupted"
> I wasn't so lucky this time, and a good portion of my home directory got
> eaten.
Against, what was the pathnmae to the inode #131073?
This is strange, since I'm not seeing any of the problems that you're
seeing. I'm going to need a lot more information if I'm going to have
a prayer of a chance of digging into it.
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 25, 2002 22:57, Theodore Ts'o wrote:
> Just to humor me, can you try it with gcc 2.95.4? I just want to
> eliminate one variable....
I'll make a brief run with 2.95.4 shortly. I hope you understand my reluctance
to do any more in-depth testing after the data loss I experienced
> > 3) While starting man(1), EXT3 began spewing messages in the form:
> > "EXT3-fs error (device (ide0(3,2)): ext3_readdir: directory #4243459
> > contains a hole at offset xxxxxx"
>
> what directory was 4243459? You can use the debugfs's ncheck command
> to get back a pathname from an inode number?
/usr/share/man/man1
> Are you sure the filesystem was consistent before you started this
> whole procedure?
Fairly sure, I ran fsck to completion after my first run-in with dir_index.
> > The directory number stayed constant, but the offset was variable. fsck
> > -fD had -not- been run at this point.
> > 4) On reboot, fsck reported:
> > "Directory inode has unallocated block #xx"
> > multiple times. fsck seemed to fully recover the filesystem. I rebooted
> > again for good measure.
>
> What were the directory inode numbers, and what pathname did they map
> to?
Sorry, I forgot to mention that the directory inode was 4243459
(/usr/share/man/man1), the same one EXT3 complained about above.
> > 7) While KDE was trying to start, EXT3 dumped the following to the
> > console: "EXT3-fs error (device ide(3,2)) in start_transaction: Journal
> > has aborted"
>
> This message will appear if previously some other part of ext2
> reported a filesystem inconsistency. So it's a symptom, and not the
> root cause of the problem.
>
> > 8) I rebooted, and fsck said:
> > "Directory inode 131073,block3,offset 528: Directory corrupted"
> > I wasn't so lucky this time, and a good portion of my home directory got
> > eaten.
>
> Against, what was the pathnmae to the inode #131073?
/home/ryan (ugh)
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9kqe0LGMzRzbJfbQRAnUxAJ9NWR7sdunDTFo2brbAV4qGCdHhkgCfZ7yt
TiR5s6fv6E+CDDc4KayqFgU=
=ycMb
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 25, 2002 22:57, Theodore Ts'o wrote:
> On Wed, Sep 25, 2002 at 10:23:11PM -0700, Ryan Cumming wrote:
> > It seems to be running stable now. Linux 2.4.19, UP Athlon, GCC 3.2.
>
> Just to humor me, can you try it with gcc 2.95.4? I just want to
> eliminate one variable....
Before I go again, any suggestions on how to reliably capture these error
messages? Because the filesystem goes read-only immediately,
/var/log/messages is hardly useful.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9kqhELGMzRzbJfbQRAkENAJ42ihIhFYwW8+0ssBjgY9OJK4u3nQCeKAz5
RUgZltxYOScsTMG9HwAsdso=
=au0V
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 25, 2002 22:57, Theodore Ts'o wrote:
> On Wed, Sep 25, 2002 at 10:23:11PM -0700, Ryan Cumming wrote:
> > It seems to be running stable now. Linux 2.4.19, UP Athlon, GCC 3.2.
>
> Just to humor me, can you try it with gcc 2.95.4? I just want to
> eliminate one variable....
Using GCC 2.95.4 seems to stabilize dir_index nicely, both before and after
the hdparm -fD run. Only the kernel was recompiled with 2.95.4, I reused the
original GCC 3.2 compiled e2fsprogs.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9krpHLGMzRzbJfbQRArCGAJ0V8NuAtrUQt/HcOVgbVRtXJzhDwQCeOwtS
Lkkp52o/ku9W4pXoFl8nARc=
=axyt
-----END PGP SIGNATURE-----
On Thu, Sep 26, 2002 at 12:41:54AM -0700, Ryan Cumming wrote:
>
> Using GCC 2.95.4 seems to stabilize dir_index nicely, both before and after
> the hdparm -fD run. Only the kernel was recompiled with 2.95.4, I reused the
> original GCC 3.2 compiled e2fsprogs.
>
Thanks for willing to be a Guienea Pig; now I know what to look for.
I didn't have a chance to answer your question last night about how to
test it without risking your data, but the answer is that that patch
doesn't change how ext3 works unless the dir_index feature flag is
set. So it would have been safe to boot the patched, kernel, and then
create a test filesystem in a plain file, and the mount it using the
loop device:
dd if=/dev/zero of=test.img bs=1024k count=32
mke2fs -f -j test.img
mount -o loop test.img /mnt
So I'll take a look at things and try to figure out what GCC bug we
might be triggering. It's strange; it's not like the code using
inline instructions, or anything else that might be triggering a GCC
bug or discrepancy between how GCC 2.95.4 and GCC 3.2 works. In fact,
it's all relatively straightforward C code.
Anyway, thanks.
- Ted
Am Don, 2002-09-26 um 08.25 schrieb Ryan Cumming:
> Before I go again, any suggestions on how to reliably capture these error
> messages? Because the filesystem goes read-only immediately,
> /var/log/messages is hardly useful.
Indeed, but I'd wonder why someone whould start syslogd on a r/o
filesystem, so probably /var/log/messages is not used at this time
anyway. You could set up syslogd to do remote logging to another machine
and fire it up before the troubles to capture anything. But of course
this requires
a) another machine
b) syslog set up on both sides for remote logging
c) a working network
An alternative would be to mount another disk with a different
filesystem (reiserfs/FAT) and log to there.
--
Servus,
Daniel
On Wed, Sep 25, 2002 at 11:22:41PM -0700, Ryan Cumming wrote:
> > > 8) I rebooted, and fsck said:
> > > "Directory inode 131073,block3,offset 528: Directory corrupted"
> > > I wasn't so lucky this time, and a good portion of my home directory got
> > > eaten.
> >
> > Against, what was the pathnmae to the inode #131073?
> /home/ryan (ugh)
One more thought. The files in your directory weren't gone; they were
just moved to the lost+found directory, and you just lost their
name->inode numbers mapping. This is recoverable, however, if you had
saved that information first.
If you must test on filesystems with precious data (as opposed to
using scratch disks or loopback mounts), something which is really
useful to do in advance is to run the command:
e2image -r /dev/hdXX /var/e2image/XXX.e2img
This will create a dump of all of the critical filesystem metadata,
including the directory blocks. I sometimes ask people to send me
compressed e2image dump files when I need to debug a filesystem, since
it allows me to see all of the metadata, which is what I care about,
without getting need to send all of the data in the filesystem, which
(a) might be a privacy problem, and (b) makes the image file
significantly bigger to send over the network.
It's also useful for recovery purposes. For example, you can run
debugfs on the e2image file, and use the ls command to see the
original directory, which would have allowed you to relatively easily
reconstruct the filenames of the files left in the lost+found
directory.
Actually, the easiest way to do it would be to write a quick perl or
python script to gether up all of the inode numbers of the files in
lost+found, and pass it to the debugfs's "ncheck" command, which will
map inode numbers to pathnames. You can specify all of the inodes
ncheck at once:
debugfs -R "ncheck ino1 ino2 ino3 ..." /var/e2image/XXX.e2img
Then just parse the output, and move the files from lost+found back to
their original location in your home directory.
In fact, what e2image can be so helpful that a wise distribution might
want to consider automatically running it at boot time, to save
critical filesystem metadata. This could be used to reconstruct
filesystems and recover critical data in the case of various sorts of
hardware failure and/or kernel bugs.
- Ted
On Thu, Sep 26, 2002 at 12:41:54AM -0700, Ryan Cumming wrote:
> On September 25, 2002 22:57, Theodore Ts'o wrote:
> > On Wed, Sep 25, 2002 at 10:23:11PM -0700, Ryan Cumming wrote:
> > > It seems to be running stable now. Linux 2.4.19, UP Athlon, GCC 3.2.
> >
> > Just to humor me, can you try it with gcc 2.95.4? I just want to
> > eliminate one variable....
>
> Using GCC 2.95.4 seems to stabilize dir_index nicely, both before and after
> the hdparm -fD run. Only the kernel was recompiled with 2.95.4, I reused the
> original GCC 3.2 compiled e2fsprogs.
Hmm... I just tried biult 2.4.19 with the ext3 patch on my UP P3
machine, using GCC 3.2, and I wasn't able to replicate your problem.
(This was using Debian's gcc 3.2.1-0pre2 release from testing.)
What was the precise GCC 3.2 version you were using, and for what
architecture were you compiling for? As I recall P3 kernels should
run on Athlons, if memory serves me correctly. (I don't have any
Athlons at home, so if it's an Athlon-specific code generation bug,
I'll have problems replicating it.) Could you try compiling a kernel
for the P3 architecture using GCC 3.2, and see whether or not the
problem is there? This is just a shot in the dark, but....
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 08:42, Theodore Ts'o wrote:
> Hmm... I just tried biult 2.4.19 with the ext3 patch on my UP P3
> machine, using GCC 3.2, and I wasn't able to replicate your problem.
> (This was using Debian's gcc 3.2.1-0pre2 release from testing.)
The whole GCC 3.2 thing was a red herring. Although it ran stable for a few
hours last night (cvs up, compiled a kernel, etc), the filesystem was once
again read-only when I came to check my mail this morning.
The interesting fsck errors this time were:
245782 was part of the orphaned inode list FIXED
245792 was part of the orphaned inode list FIXED
245797...
245782,245792 don't exist according to ncheck.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9k1tKLGMzRzbJfbQRAj/AAJ9/yPnuSPF3kUTlwFt2wF2JFSJlJwCfc39B
ZKxRtt+hpDHQ7XuiyAYkpNM=
=MHW8
-----END PGP SIGNATURE-----
Ryan Cumming <[email protected]> said:
> On September 26, 2002 08:42, Theodore Ts'o wrote:
> > Hmm... I just tried biult 2.4.19 with the ext3 patch on my UP P3
> > machine, using GCC 3.2, and I wasn't able to replicate your problem.
> > (This was using Debian's gcc 3.2.1-0pre2 release from testing.)
> The whole GCC 3.2 thing was a red herring. Although it ran stable for a few
> hours last night (cvs up, compiled a kernel, etc), the filesystem was once
> again read-only when I came to check my mail this morning.
>
> The interesting fsck errors this time were:
> 245782 was part of the orphaned inode list FIXED
> 245792 was part of the orphaned inode list FIXED
> 245797...
>
> 245782,245792 don't exist according to ncheck.
Old(ish) WD disk, perhaps PIIX IDE? At home I get filesystem corruption if
DMA is enabled...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 12:51, Horst von Brand wrote:
> > The interesting fsck errors this time were:
> > 245782 was part of the orphaned inode list FIXED
> > 245792 was part of the orphaned inode list FIXED
> > 245797...
> >
> > 245782,245792 don't exist according to ncheck.
>
> Old(ish) WD disk, perhaps PIIX IDE? At home I get filesystem corruption if
> DMA is enabled...
No, 40GB Maxtor, VT133A chipset. This box has experienced -zero- filesystem
corruption since it was purchased, up until I tried the directory index
patch. And even then, it only shows up with the dir_index flag set.
OTOH, much of this corruption could be because journal replay doesn't seem to
be happening after a filesystem is remounted read-only due to errors. That
would certainly leave the filesystem in an inconsistant state.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9k2c0LGMzRzbJfbQRAt5lAJ4mSPUgX3EPDsvzLnS0f768Rc51nwCfUpdk
ZSt7Q/WjmA1kyPF/C9DS4Nc=
=6uMQ
-----END PGP SIGNATURE-----
On Thu, Sep 26, 2002 at 12:08:54PM -0700, Ryan Cumming wrote:
> On September 26, 2002 08:42, Theodore Ts'o wrote:
> > Hmm... I just tried biult 2.4.19 with the ext3 patch on my UP P3
> > machine, using GCC 3.2, and I wasn't able to replicate your problem.
> > (This was using Debian's gcc 3.2.1-0pre2 release from testing.)
> The whole GCC 3.2 thing was a red herring. Although it ran stable for a few
> hours last night (cvs up, compiled a kernel, etc), the filesystem was once
> again read-only when I came to check my mail this morning.
Was there anything in the logs at all? There should be, if the
filesystem was remounted read-only.
> The interesting fsck errors this time were:
> 245782 was part of the orphaned inode list FIXED
> 245792 was part of the orphaned inode list FIXED
> 245797...
>
> 245782,245792 don't exist according to ncheck.
That's not surprising. What this means is that those inodes were
deleted, but since some process still had a file descriptor open for
that inode, it was placed on the orphaned inode list. But the
directory entry would have already been removed, which is why ncheck
couldn't find an associated pathname. The e2fsck error message simply
states that these inodes had a dtime which was small enough that it
was probably the next entry on the orphaned inode linked list, these
inodes weren't actually on the list. At a guess, this probably
happened when an error was noted in the filesystem, and the filesystem
was forcibly put into the read-only state. That probably arrested
some transactions which were not fully completed, and would explain
these sorts of fsck errors.
The real question is what was the original error that caused the ext3
filesystme to decide it needed to remount the filesystem read-only.
That should be in your logs, since calls to ext3_error should always
cause printk's explaining what the error was to be sent to the logs.
The filesystem wouldn't happen to be running close to full either on
the number of blocks or the number of inodes, would it? There's a bug
in ext3 (for which Stephen has already posted bug fixes to be applied
to the 2.4.20-preX kernels) where an running out of blocks or inodes
is being erroneously flagged as a filesystem corruption error, which
would mount the filesystem read-only.
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 15:04, Theodore Ts'o wrote:
> Was there anything in the logs at all? There should be, if the
> filesystem was remounted read-only.
There is nothing. grep -i "ext3" /var/log/messages returned a whole lot of the
usual:
Sep 25 15:51:00 completely kernel: EXT3-fs: mounted filesystem with ordered
data mode.
Sep 25 15:51:00 completely kernel: EXT3 FS 2.4-0.9.18, 14 May 2002 on
ide0(3,2), internal journal
The one mildly interesting thing was:
Sep 26 11:49:06 (none) kernel: EXT3-fs: INFO: recovery required on readonly
filesystem.
Sep 26 11:49:06 (none) kernel: EXT3-fs: write access will be enabled during
recovery.
Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
ext3_clear_journal_err: Filesystem error recorded from previous mount: IO
failure
Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
ext3_clear_journal_err: Marking fs in need of filesystem check.
Sep 26 11:49:06 (none) kernel: EXT3-fs: ide0(3,2): orphan cleanup on readonly
fs
Sep 26 11:49:06 (none) kernel: EXT3-fs: recovery complete.
Sep 26 11:49:06 (none) kernel: EXT3-fs: mounted filesystem with ordered data
mode.
Sep 26 11:49:06 (none) kernel: EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,2),
internal journal
> The real question is what was the original error that caused the ext3
> filesystme to decide it needed to remount the filesystem read-only.
> That should be in your logs, since calls to ext3_error should always
> cause printk's explaining what the error was to be sent to the logs.
I've seen the printk's (although I didn't write down this one). However, they
are never hitting the disk. I could try to log the messages to my FreeBSD
mail server...
> The filesystem wouldn't happen to be running close to full either on
> the number of blocks or the number of inodes, would it?
Inode count: 4767744
Block count: 9522528
Reserved block count: 476126
Free blocks: 2362117
Free inodes: 4490071
Seems to be fine.....
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9k4/TLGMzRzbJfbQRAgp6AJ9Xp7WviFP9ByQaUJ8Ak9sYVf1C4ACffemS
tS4gB+W4WE7VrPKa+tan/68=
=MhXA
-----END PGP SIGNATURE-----
On Thu, Sep 26, 2002 at 03:53:02PM -0700, Ryan Cumming wrote:
> The one mildly interesting thing was:
> Sep 26 11:49:06 (none) kernel: EXT3-fs: INFO: recovery required on readonly
> filesystem.
> Sep 26 11:49:06 (none) kernel: EXT3-fs: write access will be enabled during
> recovery.
> Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
> ext3_clear_journal_err: Filesystem error recorded from previous mount: IO
> failure
> Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
> ext3_clear_journal_err: Marking fs in need of filesystem check.
> Sep 26 11:49:06 (none) kernel: EXT3-fs: ide0(3,2): orphan cleanup on readonly
> fs
> Sep 26 11:49:06 (none) kernel: EXT3-fs: recovery complete.
> Sep 26 11:49:06 (none) kernel: EXT3-fs: mounted filesystem with ordered data
> mode.
> Sep 26 11:49:06 (none) kernel: EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,2),
> internal journal
Wait a second. These messages would occur only if you had done a
read-only mount at 11:49:06. Did you do a manual mount at that time?
Do you have one or more filesystems in your /etc/fstab (in particular
/dev/hda2) that are set to be mounted read-only? That's the only
thing that would explain the "write access enabled during recovery of
readonly filesystem" warning message. That message means that
/dev/hda2 was readonly because the mount command *requested* that it
be mounted read-only, not because of some error.
The other strange thing from these syslog messages is that, (a) they
indicate that the filesystem wasn't checked by e2fsck before they were
mounted --- which is what I generally recommend: let e2fsck run the
journal take take care of the recovery, and (b) the reason why the
filesystem was marked as being "in error" is because the previous time
the filesystem was mounted, something reported an IO error. That
could be a hardware problem, but it's also used as a generic error by
much of the exte filesystem code, so we really need the log entry,
which should have been the previous time this filesystem had been
mounted.
How is your system configured vis-a-vis the /etc/fstab entry for
/dev/hda2?
- Ted
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 16:57, Theodore Ts'o wrote:
> Wait a second. These messages would occur only if you had done a
> read-only mount at 11:49:06. Did you do a manual mount at that time?
> Do you have one or more filesystems in your /etc/fstab (in particular
> /dev/hda2) that are set to be mounted read-only?
Only /dev/cdrom...
> That's the only
> thing that would explain the "write access enabled during recovery of
> readonly filesystem" warning message. That message means that
> /dev/hda2 was readonly because the mount command *requested* that it
> be mounted read-only, not because of some error.
Would init remounting the filesystem read-only before a reboot explain that?
11:49 is around the time I came to check my mail.
> How is your system configured vis-a-vis the /etc/fstab entry for
> /dev/hda2?
/dev/hda2 / ext3 defaults,errors=remount-ro
0 1
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9k62rLGMzRzbJfbQRAlcSAKCMNMTRNN0D/T3GA7eM3HnzM9MVMgCfT1w6
V6JU4fqtHDN8pzsLSdo0mzw=
=kt7X
-----END PGP SIGNATURE-----
On Thu, Sep 26, 2002 at 06:00:23PM -0700, Ryan Cumming wrote:
>
> > That's the only
> > thing that would explain the "write access enabled during recovery of
> > readonly filesystem" warning message. That message means that
> > /dev/hda2 was readonly because the mount command *requested* that it
> > be mounted read-only, not because of some error.
> Would init remounting the filesystem read-only before a reboot explain that?
> 11:49 is around the time I came to check my mail.
No, it doesn't. The message "Filesystem recorded error from previous
mount" can only happen on an initial mount (which from the other
messages must have been a read-only mount), or on remount of a
read-only mount to a read-write mount. Remounting read-only wouldn't
explain these syslogs.
- Ted
On Sep 26, 2002 19:57 -0400, Theodore Ts'o wrote:
> On Thu, Sep 26, 2002 at 03:53:02PM -0700, Ryan Cumming wrote:
> > The one mildly interesting thing was:
> > Sep 26 11:49:06 (none) kernel: EXT3-fs: INFO: recovery required on readonly
> > filesystem.
> > Sep 26 11:49:06 (none) kernel: EXT3-fs: write access will be enabled during
> > recovery.
> > Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
> > ext3_clear_journal_err: Filesystem error recorded from previous mount: IO
> > failure
> > Sep 26 11:49:06 (none) kernel: EXT3-fs warning (device ide0(3,2)):
> > ext3_clear_journal_err: Marking fs in need of filesystem check.
> > Sep 26 11:49:06 (none) kernel: EXT3-fs: ide0(3,2): orphan cleanup on readonly
> > fs
> > Sep 26 11:49:06 (none) kernel: EXT3-fs: recovery complete.
> > Sep 26 11:49:06 (none) kernel: EXT3-fs: mounted filesystem with ordered data
> > mode.
> > Sep 26 11:49:06 (none) kernel: EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,2),
> > internal journal
>
> Wait a second. These messages would occur only if you had done a
> read-only mount at 11:49:06. Did you do a manual mount at that time?
> Do you have one or more filesystems in your /etc/fstab (in particular
> /dev/hda2) that are set to be mounted read-only? That's the only
> thing that would explain the "write access enabled during recovery of
> readonly filesystem" warning message. That message means that
> /dev/hda2 was readonly because the mount command *requested* that it
> be mounted read-only, not because of some error.
Ryan Cumming replied:
> /dev/hda2 / ext3 defaults,errors=remount-ro
Ted, this is Ryan's root filesystem, so it _has_ to be mounted read-only
and have the kernel do journal recovery on it. That said, there must
have been an error from the previous boot which caused it to have an
error recorded in the journal.
Rather than Ryan running dir-indexing on his root filesystem, I would
suggest that he instead test with a loopback filesystem to see if he
can reproduce the errors without destroying his data. Ryan - you
should run "tune2fs -O ^dir_index /dev/hda2" to turn off indexing on
your root filesystem, and run a full e2fsck on it (e2fsck -f /dev/hda2).
After that, we'd be happy if you could test with a loopback filesystem:
touch /tmp/foo
mke2fs -j -F /tmp/foo 100000
mkdir /mnt/tmp
mount -o loop /tmp/foo /mnt/tmp
and run tests on /mnt/tmp instead of your root filesystem.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 21:12, Andreas Dilger wrote:
> After that, we'd be happy if you could test with a loopback filesystem:
>
> touch /tmp/foo
> mke2fs -j -F /tmp/foo 100000
> mkdir /mnt/tmp
> mount -o loop /tmp/foo /mnt/tmp
>
> and run tests on /mnt/tmp instead of your root filesystem.
Okay, I followed those steps, and gave /mnt a fairly good fsstress'ing. No FS
errors were encountered during the run. Upon umount and fsck, I got the
following error:
"Problem in HTREE directory inode 2 (/): bad block number 3617593."
fsck then "optimized" inode 2 in pass 3A. Will beat up the filesystem some
more ;)
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9lA8ILGMzRzbJfbQRAtSnAJwNp9um7iTwK2cEpQo4OlGOGjTp4ACgj5lo
8JPvFW0jS18DOPFN5bBccUg=
=XzaF
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 26, 2002 21:12, Andreas Dilger wrote:
> After that, we'd be happy if you could test with a loopback filesystem:
Okay, got another one:
"EXT3-fs error (device loop(7,0)): ext3_add_entry: bad entry in directory #2:
rec_len is smaller than minimal - offset=0, inode=0, rec_len=8, name_len=0"
fsck then reported:
"Directory inode 2, block 166, offset 0: directory corrupted"
This is while deleteing an old fsstress directory (a full fsck had been
performed since the last time the fsstress directory had been touched) while
running a few instances of the attached program.
You guys have any idea what's going on yet?
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9lQPpLGMzRzbJfbQRAlzdAJ9mbiu8lRBbcFZXE/tD94Ad9DoAowCdFzab
JkelXV/8C4fsqEqg9TfBvb0=
=VA58
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 27, 2002 18:20, Ryan Cumming wrote:
> Okay, got another one:
> "EXT3-fs error (device loop(7,0)): ext3_add_entry: bad entry in directory
> #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=8,
> name_len=0" fsck then reported:
> "Directory inode 2, block 166, offset 0: directory corrupted"
>
> This is while deleteing an old fsstress directory (a full fsck had been
> performed since the last time the fsstress directory had been touched)
> while running a few instances of the attached program.
You can get a 6.1MiB bzip2'ed image of the broken filesystem (fsck hasn't
touched this copy) at:
http://completely.kicks-ass.org/image-broken.ext3.bz2
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9lQnrLGMzRzbJfbQRAqhBAJ9EuJ6OhH13W1B4LWjnj8IhyIMX6QCgobUt
gphiPMuRMSdewLp+67phv4g=
=10yO
-----END PGP SIGNATURE-----
On Fri, Sep 27, 2002 at 06:20:27PM -0700, Ryan Cumming wrote:
>
> This is while deleteing an old fsstress directory (a full fsck had been
> performed since the last time the fsstress directory had been touched) while
> running a few instances of the attached program.
>
> You guys have any idea what's going on yet?
Some ideas yes, although I don't have a complete solution yet.
I've been able to replicate it now fairly reliably, with the attached
shell script and 2.4.19 with the 2.4.19-2 dxdir patch. It appears to
be somewhat timing dependent, as where the directory corruption occurs
is not consistent, but I believe it is in the split code. Since
e2fsck -fD packs all of the directories completely, it means that any
attempt to add a file to directory will guarantee at least one split,
and possibly two levels of tree splits. Since the -D option to e2fsck
has only been relatively recently been available, I believe this is
why it hasn't been noticed up until now in the testing; directories
which are indexed "naturally" as they grow don't appear to trigger the
problem, or are very, very unlikely to trigger the problem. (One
potential avenue for exploration is that -D option perfectly sorts all
of the directory entries in hash order, which doesn't normally occur
for naturally grown directories, and this may be triggering a
fencepost error in the split code.)
The other thing which I've developed is a patch to e2fsprogs 1.29
(also attached) which fixes the directory corruption without causing
files to end up in lost+found. I'm using the dxdir patch in
production, and I was first able to replicate your problem after I ran
e2fsck -fD on my /usr partition. At that point, /usr/bin and
/usr/share/man/man8 got corrupted. So I modified e2fsck to be able to
correct the problem without throwing a directory block's worth of
dirents into lost+found.
The nature of the corruption is that a directory entry of size 8
(which is enough room for a zero-length name) is left in the
directory. This is harmless, but it should never happen normally, and
so the ext3 sanity-checking code flags it as an error. With this
patch, e2fsck is much smarter about salvaging corrupt directories, and
so it can do so without causing any directory entries to be lost.
(This corrupted, too-small directory entry appears at the beginning of
the directory block, which is another reason why I strongly suspect
the dx_split code.)
BTW, Andreas, I've tried your stack-usage reduction patch (modified to
sanely deal with out of memory conditions instead of panicking), but
that doesn't seem to fix the problem. So whatever it is, it's
something else, although your patch is still a good one and I've
commited it to the 2.4 ext3-dxdir BK tree.
- Ted
Oh, one other thing. I'm pretty sure the rest of the errors you saw
were a result of the fact that you had your filesystem set to remount
the filesystem read-only after running into errors. When the error
was detected, all existing updates to the filesystem are aborted (to
minimize damage to the filesystem), and that can leave the filesystem
in a somewhat inconsistent state, although nothing which e2fsck
shouldn't be able to fix.
When running with the filesystem set to continue-with-errors, the only
problem which e2fsck picks up is the too-small-to-be-sane directory
entry problem.
- Ted
On Sep 28, 2002 10:13 -0400, Theodore Ts'o wrote:
> The nature of the corruption is that a directory entry of size 8
> (which is enough room for a zero-length name) is left in the
> directory. This is harmless, but it should never happen normally, and
> so the ext3 sanity-checking code flags it as an error. With this
> patch, e2fsck is much smarter about salvaging corrupt directories, and
> so it can do so without causing any directory entries to be lost.
> (This corrupted, too-small directory entry appears at the beginning of
> the directory block, which is another reason why I strongly suspect
> the dx_split code.)
One idea I just had but don't have time to investigate (babysitting
both kids today) is if the do_split() code is creating a hash entry
for unused dir entries (i.e. inode == 0 or name_len == 0). If that
is the case, then it could explain the presence of this short entry.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
I am looking at this also.
It is very possible the case. I am doing experiment now. This happened
very randomly. Same script randomly produce the bug. Each time the problem
block happen at a different place. I just can't consistently reproduce
the bug.
I am doing some experiment now.
Chris
On Sat, Sep 28, 2002 at 11:27:49AM -0600, Andreas Dilger wrote:
> On Sep 28, 2002 10:13 -0400, Theodore Ts'o wrote:
> > The nature of the corruption is that a directory entry of size 8
> > (which is enough room for a zero-length name) is left in the
> > directory. This is harmless, but it should never happen normally, and
> > so the ext3 sanity-checking code flags it as an error. With this
> > patch, e2fsck is much smarter about salvaging corrupt directories, and
> > so it can do so without causing any directory entries to be lost.
> > (This corrupted, too-small directory entry appears at the beginning of
> > the directory block, which is another reason why I strongly suspect
> > the dx_split code.)
>
> One idea I just had but don't have time to investigate (babysitting
> both kids today) is if the do_split() code is creating a hash entry
> for unused dir entries (i.e. inode == 0 or name_len == 0). If that
> is the case, then it could explain the presence of this short entry.
>
> Cheers, Andreas
> --
> Andreas Dilger
> http://www-mddsp.enel.ucalgary.ca/People/adilger/
> http://sourceforge.net/projects/ext2resize/
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Ext2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ext2-devel
That is exactly the case. Now I am looking what where is those entry
come from. Remove entry should merge with the previous one.
Add a check in dx_make_map will help.
Chris
On Sat, Sep 28, 2002 at 11:27:49AM -0600, Andreas Dilger wrote:
> On Sep 28, 2002 10:13 -0400, Theodore Ts'o wrote:
> > The nature of the corruption is that a directory entry of size 8
> > (which is enough room for a zero-length name) is left in the
> > directory. This is harmless, but it should never happen normally, and
> > so the ext3 sanity-checking code flags it as an error. With this
> > patch, e2fsck is much smarter about salvaging corrupt directories, and
> > so it can do so without causing any directory entries to be lost.
> > (This corrupted, too-small directory entry appears at the beginning of
> > the directory block, which is another reason why I strongly suspect
> > the dx_split code.)
>
> One idea I just had but don't have time to investigate (babysitting
> both kids today) is if the do_split() code is creating a hash entry
> for unused dir entries (i.e. inode == 0 or name_len == 0). If that
> is the case, then it could explain the presence of this short entry.
>
> Cheers, Andreas
> --
> Andreas Dilger
> http://www-mddsp.enel.ucalgary.ca/People/adilger/
> http://sourceforge.net/projects/ext2resize/
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Ext2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ext2-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 28, 2002 07:13, Theodore Ts'o wrote:
> I've been able to replicate it now fairly reliably, with the attached
> shell script and 2.4.19 with the 2.4.19-2 dxdir patch.
Yes, that was what I'm using
> It appears to
> be somewhat timing dependent, as where the directory corruption occurs
> is not consistent, but I believe it is in the split code. Since
> e2fsck -fD packs all of the directories completely, it means that any
> attempt to add a file to directory will guarantee at least one split,
> and possibly two levels of tree splits. Since the -D option to e2fsck
> has only been relatively recently been available, I believe this is
> why it hasn't been noticed up until now in the testing; directories
> which are indexed "naturally" as they grow don't appear to trigger the
> problem, or are very, very unlikely to trigger the problem. (One
> potential avenue for exploration is that -D option perfectly sorts all
> of the directory entries in hash order, which doesn't normally occur
> for naturally grown directories, and this may be triggering a
> fencepost error in the split code.)
Yes, running fsck -D seemed to make my loopback filesystem more brittle. I
have triggered corruption without fsck, but it's much more difficult. That
seems to match your description of the problem.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9li2QLGMzRzbJfbQRAlSfAJ93gPP65tXjzsyGiOjvoDlGwk5WrACeIILY
QKk+DpMb9kpq7rsiaSSuJHs=
=LoWq
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 28, 2002 07:18, Theodore Ts'o wrote:
> Oh, one other thing. I'm pretty sure the rest of the errors you saw
> were a result of the fact that you had your filesystem set to remount
> the filesystem read-only after running into errors. When the error
> was detected, all existing updates to the filesystem are aborted (to
> minimize damage to the filesystem), and that can leave the filesystem
> in a somewhat inconsistent state, although nothing which e2fsck
> shouldn't be able to fix.
The loopback filesystem was set to continue on errors. The only things fsck
noticed were the short directory entries and unattached inodes. It actually
tried to connect lost+found to lost+found, BTW ;)
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9li6yLGMzRzbJfbQRAtn2AKCUoXTYKuVMkfB17wd4RWua0YtTmgCfV+Uh
Gg//dy/RviPA5LWvpfDYQAw=
=lno1
-----END PGP SIGNATURE-----
OK, I believe this two patch will fix the problem.
When e2fsck rehash the directory index. it will left one empty
dir entry at the end of block if next dir entry can not fit in.
I think it is OK to do so but I also make a patch to make the empty
space merge with previous entry to consistent with the kernel.
dx_make_map need to skip the empty entry. So here is the patch for
review before I push it back to htree BK repository.
I already do the initial test and it fix the problem in kernel and
e2fsck.
Chris
PS,
Ted, is e2fsck endian free now? I notice some code in
copy_dir_entries() do not have cpu_to_le32 for rec_len etc.
===== namei.c 1.8 vs edited =====
--- 1.8/fs/ext3/namei.c Fri Sep 20 00:40:12 2002
+++ edited/namei.c Sat Sep 28 23:43:56 2002
@@ -601,11 +601,13 @@
while ((char *) de < base + size)
{
- ext3fs_dirhash(de->name, de->name_len, &h);
- map[count].hash = h.hash;
- map[count].offs = (u32) ((char *) de - base);
+ if (de->name_len && de->inode) {
+ ext3fs_dirhash(de->name, de->name_len, &h);
+ map[count].hash = h.hash;
+ map[count].offs = (u32) ((char *) de - base);
+ count++;
+ }
de = (ext3_dirent *) ((char *) de + le16_to_cpu(de->rec_len));
- count++;
}
return count;
}
===== rehash.c 1.3 vs edited =====
--- 1.3/e2fsck/rehash.c Fri Sep 6 07:14:11 2002
+++ edited/rehash.c Sat Sep 28 23:19:16 2002
@@ -278,22 +278,16 @@
for (i=0; i < fd->num_array; i++) {
ent = fd->harray + i;
rec_len = EXT2_DIR_REC_LEN(ent->dir->name_len & 0xFF);
- dirent = (struct ext2_dir_entry *) (block_start + offset);
- left = fs->blocksize - offset;
if (rec_len > left) {
- if (left) {
- dirent->rec_len = left;
- dirent->inode = 0;
- dirent->name_len = 0;
- offset += left;
- left = 0;
- }
+ if (left)
+ dirent->rec_len += left;
if ((retval = get_next_block(fs, outdir,
&block_start)))
return retval;
- offset = 0; left = fs->blocksize;
- dirent = (struct ext2_dir_entry *) block_start;
+ offset = 0;
}
+ left = fs->blocksize - offset;
+ dirent = (struct ext2_dir_entry *) (block_start + offset);
if (offset == 0) {
if (ent->hash == prev_hash)
outdir->hashes[outdir->num-1] = ent->hash | 1;
On Sat, Sep 28, 2002 at 10:13:30AM -0400, Theodore Ts'o wrote:
> On Fri, Sep 27, 2002 at 06:20:27PM -0700, Ryan Cumming wrote:
> >
> > This is while deleteing an old fsstress directory (a full fsck had been
> > performed since the last time the fsstress directory had been touched) while
> > running a few instances of the attached program.
> >
> > You guys have any idea what's going on yet?
>
> Some ideas yes, although I don't have a complete solution yet.
>
> I've been able to replicate it now fairly reliably, with the attached
> shell script and 2.4.19 with the 2.4.19-2 dxdir patch. It appears to
> be somewhat timing dependent, as where the directory corruption occurs
> is not consistent, but I believe it is in the split code. Since
> e2fsck -fD packs all of the directories completely, it means that any
> attempt to add a file to directory will guarantee at least one split,
> and possibly two levels of tree splits. Since the -D option to e2fsck
> has only been relatively recently been available, I believe this is
> why it hasn't been noticed up until now in the testing; directories
> which are indexed "naturally" as they grow don't appear to trigger the
> problem, or are very, very unlikely to trigger the problem. (One
> potential avenue for exploration is that -D option perfectly sorts all
> of the directory entries in hash order, which doesn't normally occur
> for naturally grown directories, and this may be triggering a
> fencepost error in the split code.)
>
> The other thing which I've developed is a patch to e2fsprogs 1.29
> (also attached) which fixes the directory corruption without causing
> files to end up in lost+found. I'm using the dxdir patch in
> production, and I was first able to replicate your problem after I ran
> e2fsck -fD on my /usr partition. At that point, /usr/bin and
> /usr/share/man/man8 got corrupted. So I modified e2fsck to be able to
> correct the problem without throwing a directory block's worth of
> dirents into lost+found.
>
> The nature of the corruption is that a directory entry of size 8
> (which is enough room for a zero-length name) is left in the
> directory. This is harmless, but it should never happen normally, and
> so the ext3 sanity-checking code flags it as an error. With this
> patch, e2fsck is much smarter about salvaging corrupt directories, and
> so it can do so without causing any directory entries to be lost.
> (This corrupted, too-small directory entry appears at the beginning of
> the directory block, which is another reason why I strongly suspect
> the dx_split code.)
>
> BTW, Andreas, I've tried your stack-usage reduction patch (modified to
> sanely deal with out of memory conditions instead of panicking), but
> that doesn't seem to fix the problem. So whatever it is, it's
> something else, although your patch is still a good one and I've
> commited it to the 2.4 ext3-dxdir BK tree.
>
> - Ted
>
Content-Description: gen-test-cdir
> #!/bin/sh
> #dd if=/dev/zero of=test.img bs=8k count=2000
> mke2fs -j -F -b 1024 -N 24000 test.img
> mount -t ext3 -o loop test.img /mnt
> pushd /mnt
> mkdir test
> cd test
> time seq -f "gabby-qf%f" 1 10000 | xargs touch
> popd
> umount /mnt
> e2fsck -fD test.img
> mount -t ext3 -o loop test.img /mnt
> pushd /mnt/test
> time seq -f "%f" 1 62 | xargs touch
> popd
> umount /mnt
> e2fsck -f test.img
>
Content-Description: e2fsprogs-patch
> # This is a BitKeeper generated patch for the following project:
> # Project Name: Ext2 filesystem utilities
> # This patch format is intended for GNU patch command version 2.5 or higher.
> # This patch includes the following deltas:
> # ChangeSet 1.1020 -> 1.1021
> # e2fsck/pass2.c 1.45 -> 1.46
> # e2fsck/ChangeLog 1.281 -> 1.282
> # e2fsck/unix.c 1.66 -> 1.67
> # e2fsck/rehash.c 1.3 -> 1.4
> #
> # The following is the BitKeeper ChangeSet Log
> # --------------------------------------------
> # 02/09/28 [email protected] 1.1021
> # Add a more sophisticated algorithm to e2fsck to salvage corrupted
> # directories.
> #
> # Speed up e2fsck slightly by only updating the master superblock;
> # there is no point to update the backup superblocks.
> #
> # Fix a small bug in the rehashing code which could leave the indexed
> # flag set even after the directory was compressed instead of indexed.
> # (Not fatal, since the kernel will deal with this, but technically
> # it filesystem isn't consistent, and the filesystem will be marked
> # as being in error when the kernel comes across the directory. It
> # should also never happen in real life, since directories that small
> # will never be indexed, but better safe than sorry.)
> #
> # Also change the threshold of when directories are indexed, so that
> # directories of size 2 blocks will be indexed. Otherwise they will
> # never be indexed by the kernel when they grow.
> # --------------------------------------------
> #
> diff -Nru a/e2fsck/ChangeLog b/e2fsck/ChangeLog
> --- a/e2fsck/ChangeLog Sat Sep 28 09:50:43 2002
> +++ b/e2fsck/ChangeLog Sat Sep 28 09:50:43 2002
> @@ -1,3 +1,20 @@
> +2002-09-28 Theodore Ts'o <[email protected]>
> +
> + * rehash.c (write_directory): Clear the index flag if by
> + reoptimizing the directory, we bring it back into a
> + non-indexed state.
> + (e2fsck_rehash_dir): Allow directories that contain two
> + blocks to be indexed. Otherwise when they grow, they
> + never will be indexed by the kernel.
> +
> + * unix.c (main): Only update the master superblock; there's no
> + point updating the backup superblocks, and it speeds up
> + fsck slightly.
> +
> + * pass2.c (salvage_directory): New function called by
> + check_dir_block() which is much more sophisticated about
> + how it salvages corrupted filesystems.
> +
> 2001-09-24 Theodore Tso <[email protected]>
>
> * Release of E2fsprogs 1.29
> diff -Nru a/e2fsck/pass2.c b/e2fsck/pass2.c
> --- a/e2fsck/pass2.c Sat Sep 28 09:50:43 2002
> +++ b/e2fsck/pass2.c Sat Sep 28 09:50:43 2002
> @@ -574,6 +574,55 @@
> }
> #endif /* ENABLE_HTREE */
>
> +/*
> + * Given a busted directory, try to salvage it somehow.
> + *
> + */
> +static int salvage_directory(ext2_filsys fs,
> + struct ext2_dir_entry *dirent,
> + struct ext2_dir_entry *prev,
> + int offset)
> +{
> + char *cp = (char *) dirent;
> + int left = fs->blocksize - offset - dirent->rec_len;
> + int prev_offset = offset - ((char *) dirent - (char *) prev);
> +
> + /*
> + * Special case of directory entry of size 8: copy what's left
> + * of the directory block up to cover up the invalid hole.
> + */
> + if ((left >= 12) && (dirent->rec_len == 8)) {
> + memmove(cp, cp+8, left);
> + memset(cp + left, 0, 8);
> + return offset;
> + }
> + /*
> + * If the directory entry is a multiple of four, so it is
> + * valid, let the previous directory entry absorb the invalid
> + * one.
> + */
> + if (prev && dirent->rec_len && (dirent->rec_len % 4) == 0) {
> + prev->rec_len += dirent->rec_len;
> + return prev_offset;
> + }
> + /*
> + * Default salvage method --- kill all of the directory
> + * entries for the rest of the block. We will either try to
> + * absorb it into the previous directory entry, or create a
> + * new empty directory entry the rest of the directory block.
> + */
> + if (prev) {
> + prev->rec_len += fs->blocksize - offset;
> + return prev_offset;
> + } else {
> + dirent->rec_len = fs->blocksize - offset;
> + dirent->name_len = 0;
> + dirent->inode = 0;
> + return offset;
> + }
> +
> +}
> +
> static int check_dir_block(ext2_filsys fs,
> struct ext2_db_entry *db,
> void *priv_data)
> @@ -583,7 +632,7 @@
> #ifdef ENABLE_HTREE
> struct dx_dirblock_info *dx_db = 0;
> #endif /* ENABLE_HTREE */
> - struct ext2_dir_entry *dirent;
> + struct ext2_dir_entry *dirent, *prev;
> ext2_dirhash_t hash;
> int offset = 0;
> int dir_modified = 0;
> @@ -680,8 +729,8 @@
> }
> #endif /* ENABLE_HTREE */
>
> + prev = 0;
> do {
> - dot_state++;
> problem = 0;
> dirent = (struct ext2_dir_entry *) (buf + offset);
> cd->pctx.dirent = dirent;
> @@ -691,10 +740,10 @@
> ((dirent->rec_len % 4) != 0) ||
> (((dirent->name_len & 0xFF)+8) > dirent->rec_len)) {
> if (fix_problem(ctx, PR_2_DIR_CORRUPTED, &cd->pctx)) {
> - dirent->rec_len = fs->blocksize - offset;
> - dirent->name_len = 0;
> - dirent->inode = 0;
> + offset = salvage_directory(fs, dirent,
> + prev, offset);
> dir_modified++;
> + continue;
> } else
> return DIRENT_ABORT;
> }
> @@ -705,10 +754,10 @@
> }
> }
>
> - if (dot_state == 1) {
> + if (dot_state == 0) {
> if (check_dot(ctx, dirent, ino, &cd->pctx))
> dir_modified++;
> - } else if (dot_state == 2) {
> + } else if (dot_state == 1) {
> dir = e2fsck_get_dir_info(ctx, ino);
> if (!dir) {
> fix_problem(ctx, PR_2_NO_DIRINFO, &cd->pctx);
> @@ -749,7 +798,7 @@
> * clear it.
> */
> problem = PR_2_BB_INODE;
> - } else if ((dot_state > 2) &&
> + } else if ((dot_state > 1) &&
> ((dirent->name_len & 0xFF) == 1) &&
> (dirent->name[0] == '.')) {
> /*
> @@ -758,7 +807,7 @@
> * duplicate entry that should be removed.
> */
> problem = PR_2_DUP_DOT;
> - } else if ((dot_state > 2) &&
> + } else if ((dot_state > 1) &&
> ((dirent->name_len & 0xFF) == 2) &&
> (dirent->name[0] == '.') &&
> (dirent->name[1] == '.')) {
> @@ -768,7 +817,7 @@
> * duplicate entry that should be removed.
> */
> problem = PR_2_DUP_DOT_DOT;
> - } else if ((dot_state > 2) &&
> + } else if ((dot_state > 1) &&
> (dirent->inode == EXT2_ROOT_INO)) {
> /*
> * Don't allow links to the root directory.
> @@ -777,7 +826,7 @@
> * directory hasn't been created yet.
> */
> problem = PR_2_LINK_ROOT;
> - } else if ((dot_state > 2) &&
> + } else if ((dot_state > 1) &&
> (dirent->name_len & 0xFF) == 0) {
> /*
> * Don't allow zero-length directory names.
> @@ -842,7 +891,7 @@
> * hard link. We assume the first link is correct,
> * and ask the user if he/she wants to clear this one.
> */
> - if ((dot_state > 2) &&
> + if ((dot_state > 1) &&
> (ext2fs_test_inode_bitmap(ctx->inode_dir_map,
> dirent->inode))) {
> subdir = e2fsck_get_dir_info(ctx, dirent->inode);
> @@ -871,7 +920,9 @@
> ctx->fs_links_count++;
> ctx->fs_total_count++;
> next:
> + prev = dirent;
> offset += dirent->rec_len;
> + dot_state++;
> } while (offset < fs->blocksize);
> #if 0
> printf("\n");
> diff -Nru a/e2fsck/rehash.c b/e2fsck/rehash.c
> --- a/e2fsck/rehash.c Sat Sep 28 09:50:43 2002
> +++ b/e2fsck/rehash.c Sat Sep 28 09:50:43 2002
> @@ -522,7 +522,9 @@
> return wd.err;
>
> e2fsck_read_inode(ctx, ino, &inode, "rehash_dir");
> - if (!compress)
> + if (compress)
> + inode.i_flags &= ~EXT2_INDEX_FL;
> + else
> inode.i_flags |= EXT2_INDEX_FL;
> inode.i_size = outdir->num * fs->blocksize;
> inode.i_blocks -= (fs->blocksize / 512) * wd.cleared;
> @@ -561,7 +563,7 @@
> fd.dir_size = 0;
> fd.compress = 0;
> if (!(fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX) ||
> - (inode.i_size / fs->blocksize) < 3)
> + (inode.i_size / fs->blocksize) < 2)
> fd.compress = 1;
> fd.parent = 0;
>
> diff -Nru a/e2fsck/unix.c b/e2fsck/unix.c
> --- a/e2fsck/unix.c Sat Sep 28 09:50:43 2002
> +++ b/e2fsck/unix.c Sat Sep 28 09:50:43 2002
> @@ -974,8 +974,11 @@
> ext2fs_mark_super_dirty(fs);
>
> /*
> - * Don't overwrite the backup superblock and block
> - * descriptors, until we're sure the filesystem is OK....
> + * We only update the master superblock because (a) paranoia;
> + * we don't want to corrupt the backup superblocks, and (b) we
> + * don't need to update the mount count and last checked
> + * fields in the backup superblock (the kernel doesn't
> + * update the backup superblocks anyway).
> */
> fs->flags |= EXT2_FLAG_MASTER_SB_ONLY;
>
> @@ -1058,9 +1061,7 @@
> exit_value |= FSCK_REBOOT;
> }
> }
> - if (ext2fs_test_valid(fs))
> - fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;
> - else {
> + if (!ext2fs_test_valid(fs)) {
> printf(_("\n%s: ********** WARNING: Filesystem still has "
> "errors **********\n\n"), ctx->device_name);
> exit_value |= FSCK_UNCORRECTED;
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 29, 2002 00:03, [email protected] wrote:
> I already do the initial test and it fix the problem in kernel and
> e2fsck.
Still broken here. The short directory inodes are cleared up, but I'm still
getting various errors from fsck
Case 1:
"Problem in HTREE directory inode 2 (/): bad block 3223649"
Case 2:
"Inode 2, i_blocks is 3718, should be 2280
Directory inode 2 has an unallocated block #377
Directory inode 2 has an unallocated block #378
Directory inode 2 has an unallocated block #379"
etc
This is a completely fresh loopback EXT3 filesystem, untouched by fsck -D, and
normally unmounted.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9lrb9LGMzRzbJfbQRAhpSAKCbkbyiwM8PnpAbN2FvU6tRHM1urwCdEzFK
WSwjN6jC+0QI0NnJzKc0rX8=
=HHtV
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 29, 2002 01:16, Ryan Cumming wrote:
> This is a completely fresh loopback EXT3 filesystem, untouched by fsck -D,
> and normally unmounted.
Oh, and I've attached the current version of my test program if anyone is
interested.
It spawns 8 child processes which repeatedly create up to 1,000,000 files
each, stat up to 1,000,000 (probably) non-existant files, and then unlink the
files they created.
It also spawns another 4 processes which iterate over all of the directory
entries using readdir(). For each file it encounters, it has an equal
probability of renaming it, unlinking it, or truncating it to a random
length.
It can corrupt my loopback test filesystems in under 5 minutes. Note that it
will completely destroy any data in its working directory, however.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9lrulLGMzRzbJfbQRAgxMAJ46Y/L4FA8nuwe76MyGCyhG+mSE3QCgjb5a
TBJbqp55p5yOU2BY0AnW/TA=
=cKn9
-----END PGP SIGNATURE-----
On Sun, Sep 29, 2002 at 12:03:15AM -0700, [email protected] wrote:
> When e2fsck rehash the directory index. it will left one empty
> dir entry at the end of block if next dir entry can not fit in.
> I think it is OK to do so but I also make a patch to make the empty
> space merge with previous entry to consistent with the kernel.
Nice catch! Yup, e2fsck will create an empty directory entry, instead
of letting the previous entry absorb the extra space. Both are legal
ways of doing things, but yes, the kernel does the latter in all cases
except when deleting the last directory entry in a block. (But that's
OK, because the split code would never need to be invoked on an empty
directory block.)
> dx_make_map need to skip the empty entry. So here is the patch for
> review before I push it back to htree BK repository.
That patch looks fine. Push away!
> Ted, is e2fsck endian free now? I notice some code in
> copy_dir_entries() do not have cpu_to_le32 for rec_len etc.
Yes, e2fsprogs is endian-free. The byte swapping happens in the
library routine ext2fs_read_dir_block() and ext2fs_write_dir_block().
Note that you have to tell these routines whether you're planning on
using the v2 version of dirent struct or not, because that affects
whether or not the name_len field needs to be byte-swapped or not.
- Ted
Hi Ryan,
>> This is a completely fresh loopback EXT3 filesystem, untouched by fsck -D,
>> and normally unmounted.
> Oh, and I've attached the current version of my test program if anyone is
> interested.
> ...
> It can corrupt my loopback test filesystems in under 5 minutes. Note that it
> will completely destroy any data in its working directory, however.
I am running your program now over an hour without any corruption on the
loopback mounted ext3 filesystem.
This is with the latest patch (1) from Theodore + latest small fix from
Christoper for the kernel and for latest e2fsprogs.
I made some debian packages available which includes the latest fixes, can be
found here (2).
Anyway, works great for me! :)
1) http://thunk.org/tytso/linux/ext3-dxdir/
2) http://wolk.sf.net/e2fsprogs/
--
Kind regards
Marc-Christian Petersen
http://sourceforge.net/projects/wolk
PGP/GnuPG Key: 1024D/569DE2E3DB441A16
Fingerprint: 3469 0CF8 CA7E 0042 7824 080A 569D E2E3 DB44 1A16
Key available at http://www.keyserver.net. Encrypted e-mail preferred.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On September 29, 2002 01:16, Ryan Cumming wrote:
> Case 1:
> "Problem in HTREE directory inode 2 (/): bad block 3223649"
>
> Case 2:
> "Inode 2, i_blocks is 3718, should be 2280
> Directory inode 2 has an unallocated block #377
> Directory inode 2 has an unallocated block #378
> Directory inode 2 has an unallocated block #379"
> etc
Okay, a few more datapoints:
1) Case #1 happens much more frequently than case #2
2) I can trigger filesystem corruption with nothing but non-concurrent
readdir()s and unlink()s
3) Although my initial tests were reaching the filesystem's inode limit, I've
been able to reproduce case #1 with 153,129 free inodes and 217,294 free
blocks.
4) Case #1 persists with Chris' stack overflow fix, modified to use memmove
instead of memcpy
5) I have not been able to reproduce either error without the dir_index flag
set. This may be due to the fact that my test program runs -much- more slowly
without directory indexing.
- -Ryan
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9l7slLGMzRzbJfbQRAtbnAKCo5c6NzuGY59Xy4MvCRFqwuw5/+gCgpQkr
FBXfCLl2R2ii/PiJ3+0ZQR8=
=droh
-----END PGP SIGNATURE-----