Hi,
I was running Debian stable on my backup server (the server that
does backups, not the "just-in-case" server).
Debian apparently recently pointed that to the new release squeeze, so
I got upgraded. I went from kernel 2.6.26 to 2.6.32. After about a day
my system rebooted without my consent. So now it's running 2.6.32.
Since then I'm getting kernel-oops-lookalikes that start with:
[71664.306573] swapper: page allocation failure. order:5, mode:0x4020
Lots of them actually.
(on the other hand, none of these happened before my filesystem got
trashed...)
Anyway, upon boot into the new kernel ext3 printed a bunch of these:
[ 5.212119] ext3_orphan_cleanup: deleting unreferenced inode 1335743
A few hours later, my storage partition was marked read-only and the
backups started failing.
kern.log.1.gz:Feb 18 05:39:53 driepoot kernel: [10328.424778]
EXT3-fs error (device md3): ext3_lookup: deleted inode referenced: 277447722
So to correct the situation I started an fsck.
After about 24 hours I decided that the fsck was taking too long, and
upgraded e2fsck. The new one has now been running for an hour and a
half. Now I don't mind fsck taking an hour or two, but I expect fsck
to be disk bound.
However iostat shows me it's doing next to nothing for seconds
at a time:
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md3 0.00 0.00 0.00 0 0
md3 0.00 0.00 0.00 0 0
md3 0.00 0.00 0.00 0 0
md3 733.33 2933.33 0.00 2904 0
md3 0.00 0.00 0.00 0 0
md3 63.37 253.47 0.00 256 0
md3 0.00 0.00 0.00 0 0
md3 0.00 0.00 0.00 0 0
md3 5.88 23.53 0.00 24 0
and it turns out that fsck is completely CPU bound:
top - 09:26:29 up 2 days, 6:38, 10 users, load average: 1.06, 1.07, 1.27
Tasks: 136 total, 2 running, 134 sleeping, 0 stopped, 0 zombie
Cpu(s): 79.1%us, 4.9%sy, 0.0%ni, 0.0%id, 0.4%wa, 1.5%hi, 14.1%si, 0.0%st
Mem: 969400k total, 956624k used, 12776k free, 226828k buffers
Swap: 1975976k total, 252220k used, 1723756k free, 67768k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10274 root 20 0 839m 631m 52m R 97.7 66.7 50:07.09 e2fsck
and when I trace fsck I get:
fcntl64(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
fcntl64(6, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=164, len=1}) = 0
fcntl64(6, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=264, len=1}) = 0
fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=572, len=1}) = 0
fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
fcntl64(5, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=164, len=1}) = 0
fcntl64(5, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=572, len=1}) = 0
fcntl64(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=572, len=1}) = 0
fcntl64(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
So, my question is: Are these fcntl calls necessary?
As far as I know locking is necessary if another process might be
handling the same data. Here it is doing this with the cache
files:
lrwx------ 1 root root 64 Feb 20 09:28 5 ->
/var/cache/e2fsck/123a1cfe-2455-4646-aa32-87492ed1ac97-icount-ayxVou
lrwx------ 1 root root 64 Feb 20 09:28 6 ->
/var/cache/e2fsck/123a1cfe-2455-4646-aa32-87492ed1ac97-dirinfo-rBBTtb
where, using these swap files makes sense as some machines don't have
the memory and/or addressing space to handle a big fsck, but in my
case I have 1G RAM, and these two files are 56M total:
-rw------- 1 root root 21M Feb 20 09:30 ...97-dirinfo-rBBTtb
-rw------- 1 root root 35M Feb 20 09:30 ...97-icount-ayxVou
# strace -p 10274 | & head -100000 | sort | uniq -c | sort -n
shows me that out of 100k system calls
10876 fcntl64(6, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=164, len=1}) = 0
10877 fcntl64(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
13339 fcntl64(5, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=164, len=1}) = 0
13339 fcntl64(5, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
and between 60 and 139 lock calls each for various other offsets.
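(For anyone who hasn't stared at these before: each of those strace lines is
a one-byte advisory record lock being taken and then dropped again with
fcntl(). A minimal C sketch of one such lock/unlock round trip, with a
made-up fd and offset, looks like this:

#include <sys/types.h>
#include <fcntl.h>

/* Take and then release a one-byte write lock at 'offset' on 'fd',
 * blocking until the lock is granted (F_SETLKW), which is exactly
 * the pattern the strace output above shows. */
static int lock_unlock_byte(int fd, off_t offset)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = offset,
        .l_len    = 1,
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0)
        return -1;

    fl.l_type = F_UNLCK;          /* drop it again */
    return fcntl(fd, F_SETLKW, &fl);
}

Each of these is cheap on its own; the question below is whether millions of
them are needed at all.)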
Oh... and fsck is now at the stage:
Pass 1: Checking inodes, blocks, and sizes
The filesystem is 3T:
md3 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
2868686592 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
I'm studying the e2fsck source code a bit, but I don't yet see where the
fcntl calls are coming from.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Sun, Feb 20, 2011 at 10:06:56AM +0100, Rogier Wolff wrote:
> Debian apparently recently pointed that to the new release squeeze, so
> I got upgraded. I went from kernel 2.6.26 to 2.6.32. After about a day
> my system rebooted without my consent. So now it's running 2.6.32.
>
> Since then I'm getting kernel-oops-lookalikes that start with:
> [71664.306573] swapper: page allocation failure. order:5, mode:0x4020
That's a warning which has been suppressed in newer kernels. Ext4
falls back to vmalloc() if kmalloc() fails, and sometime in the early
2.6.3x kernels the kernel started warning on huge kmalloc failures,
without having a way of suppressing those errors. That's a cosmetic
issue which has been fixed. (It should only be happening when an
ext4 file system is being mounted, right?)
> Anyway, upon boot into the new kernel ext3 printed abunch of these:
> [ 5.212119] ext3_orphan_cleanup: deleting unreferenced inode 1335743
That's normal if you didn't cleanly umount the file system before
rebooting, and there were files that were still open, but deleted at
the time of the crash.
> A few hours later, my storage partition was marked read-only and the
> backups started failing.
>
> kern.log.1.gz:Feb 18 05:39:53 driepoot kernel: [10328.424778]
> EXT3-fs error (device md3): ext3_lookup: deleted inode referenced: 277447722
That's not normal. :-)
> fcntl64(6, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=164, len=1}) = 0
> fcntl64(6, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=164, len=1}) = 0
>
> So, my question is: Are these fcntl calls neccesary?
> As far as I know locking is neccesary if another process might be
> handling the same data. Here is is doing this with the cache
> files:
>
> lrwx------ 1 root root 64 Feb 20 09:28 5 ->
> /var/cache/e2fsck/123a1cfe-2455-4646-aa32-87492ed1ac97-icount-ayxVou
> lrwx------ 1 root root 64 Feb 20 09:28 6 ->
> /var/cache/e2fsck/123a1cfe-2455-4646-aa32-87492ed1ac97-dirinfo-rBBTtb
Ah, you're using tdb. Tdb can be really slow. It's been on my todo
list to replace tdb with something else, but I haven't gotten around
to it.
No, it shouldn't be necessary given that e2fsck is the only user of
the tdb files. I'll need to look at trying to remove them, but I'm
not sure that would really improve the speed.
- Ted
On Sun, Feb 20, 2011 at 12:09:31PM -0500, Ted Ts'o wrote:
>
> Ah, you're using tdb. Tdb can be really slow. It's been on my todo
> list to replace tdb with something else, but I haven't gotten around
> to it.
Hmm... after taking a quick look at the TDB sources, why don't you try
this. In lib/ext2fs/icount.c and e2fsck/dirinfo.c, try replacing the
flag TDB_CLEAR_IF_FIRST with TDB_NOLOCK | TDB_NOSYNC. i.e., try
replacing:
icount->tdb = tdb_open(fn, 0, TDB_CLEAR_IF_FIRST,
O_RDWR | O_CREAT | O_TRUNC, 0600);
with:
icount->tdb = tdb_open(fn, 0, TDB_NOLOCK | TDB_NOSYNC,
O_RDWR | O_CREAT | O_TRUNC, 0600);
Could you let me know what this does to the performance of e2fsck with
scratch files enabled?
Oh, and BTW, it would be useful if you tried configuring
tests/test_config so that it sets E2FSCK_CONFIG with a test
e2fsck.conf that enables the scratch files somewhere in tmp, and then
run the regression test suite with these changes.
If they work, and it solves the performance problem, let me know and
send me patches. If we can figure out some way of improving the
performance without needing to replace tdb, that would be great...
- Ted
Hi Ted,
Thanks for looking into this...
On Sun, Feb 20, 2011 at 02:34:06PM -0500, Ted Ts'o wrote:
> On Sun, Feb 20, 2011 at 12:09:31PM -0500, Ted Ts'o wrote:
> >
> > Ah, you're using tdb. Tdb can be really slow. It's been on my todo
> > list to replace tdb with something else, but I haven't gotten around
> > to it.
>
> Hmm... after taking a quick look at the TDB sources, why don't you try
> this. In lib/ext2fs/icount.c and e2fsck/dirinfo.c, try replacing the
> flag TDB_CLEAR_IF_FIRST with TDB_NOLOCK | TDB_NOSYNC. i.e., try
> replacing:
>
> icount->tdb = tdb_open(fn, 0, TDB_CLEAR_IF_FIRST,
> O_RDWR | O_CREAT | O_TRUNC, 0600);
>
> with:
>
> icount->tdb = tdb_open(fn, 0, TDB_NOLOCK | TDB_NOSYNC,
> O_RDWR | O_CREAT | O_TRUNC, 0600);
I looked into this myself as well. Suspecting the locking calls I put
a "return 0" in the first line of the tdb locking function. This makes
all locking requests a noop. Doing it the proper way as you suggest
may be nicer, but this was a method that existed within my
abilities...
Anyway, this removed all the fcntl calls to lock and unlock the
database.... It didn't solve the performance issue though....
Here is an strace...
0.000379 .525531 munmap(0x8d03e000, 108937216) = 0
0.008008 .533540 ftruncate(5, 108941312) = 0
0.000207 .533748 pwrite64(5, "BBBBBBBBBB"..., 1024, 108937216) = 1024
0.000235 .533983 pwrite64(5, "BBBBBBBBBB"..., 1024, 108938240) = 1024
0.000108 .534092 pwrite64(5, "BBBBBBBBBB"..., 1024, 108939264) = 1024
0.000138 .534230 pwrite64(5, "BBBBBBBBBB"..., 1024, 108940288) = 1024
0.000106 .534336 mmap2(NULL, 108941312, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = 0x8d03d000
1.994850 2.529190 fstat64(6, {st_mode=S_IFREG|0600, st_size=92045312, ...}) = 0
The first column is the difference of the timestamp on THIS line
compared to the previous one. Consider that mostly CPU time.
The system calls all take between 17 and 127 microseconds. i.e. fast.
The exception is the munmap call, which takes 7
milliseconds. Acceptable.
The performance killer is the almost two seconds of CPU time spent
before the fstat of file descriptor 5 or 6.
It seems wasteful to mmap and munmap the whole 100M of those two
files all the time.
The "BBBBB" strings in the pwrite calls are the padding. 0x42, get it?
I checked... The full 4x1024 bytes are just padding. Nothing else.
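(To make the pattern in that trace concrete: here is roughly what one expand
step looks like for a database that keeps its whole file mmap()ed. This is a
sketch of the observed munmap/ftruncate/pwrite/mmap sequence, not actual tdb
code, and the sizes are whatever the caller passes in:

#define _XOPEN_SOURCE 500       /* for pwrite() */
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Grow an mmap()ed file by 'extra' bytes the way the trace above
 * shows: unmap the whole mapping, ftruncate() the file to its new
 * size, pwrite() some padding at the tail, and map the whole file
 * again. */
static void *grow_mapping(int fd, void *old_map, size_t old_size,
                          size_t extra)
{
    char pad[1024];
    size_t off, new_size = old_size + extra;

    munmap(old_map, old_size);
    if (ftruncate(fd, new_size) < 0)
        return MAP_FAILED;

    memset(pad, 0x42, sizeof(pad));         /* the "BBBB..." padding */
    for (off = old_size; off < new_size; off += sizeof(pad))
        pwrite(fd, pad, sizeof(pad), off);

    return mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

Remapping a ~100M file for every small expand is not free, but as the timings
above show, the bigger cost here is the CPU time burned between the system
calls, not the calls themselves.)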
> Could you let me know what this does to the performance of e2fsck
> with scratch files enabled?
I apparently have scratch files enabled, right? I just typed
./configure ; ./make ; scp e2fsck/e2fsck othermachine:e2fsck.test
so I didn't mess with the configuration.
I just straced
1298236533.396622 _llseek(3, 522912374784, [], SEEK_SET) = 0 <0.000038>
1298236540.311416 _llseek(3, 522912407552, [], SEEK_SET) = 0 <0.000035>
1298236547.288401 _llseek(3, 522912440320, [], SEEK_SET) = 0 <0.000035>
and I see it seeking to somewhere in the 486Gb range. Does this mean
it has 6x more to go? I don't really see the numbers increasing
significantly. Although out-of-order numbers appear in the llseek
outputs, the most common numbers are slowly increasing.
I had first estimated the ETA around the end of this century, but that
seems to be a bit overly pessimistic. I probably missed a factor of
1000 somewhere. I now get about 9 days. That means I'm likely to live
long enough to see the end of this..... :-)
Whenever the remaining run time looks longer than the time it would take
to optimize it a bit and restart, I'll restart. But in this case, if I keep
estimating the "normal fsck time" as 8 hours, and "a bit of coding" as
2 hours, I'm afraid it will never finish.
To estimate the time-to-run, would it be safe to suspend the running
fsck, and start an fsck -n ? I've invested 10 CPU hours in this fsck
instance already, I would like it to finish eventually... 9 days seems
doable...
out-of-order example:
1298236950.540958 _llseek(3, 523986247680, [], SEEK_SET) = 0 <0.000035>
1298236950.646999 _llseek(3, 523986280448, [], SEEK_SET) = 0 <0.000038>
1298236952.813587 _llseek(3, 630728769536, [], SEEK_SET) = 0 <0.000036>
1298236953.947109 _llseek(3, 523986313216, [], SEEK_SET) = 0 <0.000035>
1298236953.948982 _llseek(3, 523986345984, [], SEEK_SET) = 0 <0.000015>
(I've deleted the number in the brackets, it's the same as the number
before.)
> Oh, and BTW, it would be useful if you tried configuring
> tests/test_config so that it sets E2FSCK_CONFIG with a test
> e2fsck.conf that enables the scratch files somewhere in tmp, and then
> run the regression test suite with these changes.
I'm not sure I understand correctly. You're saying that e2fsck honors
an (undocumented) environment variable, E2FSCK_CONFIG, that allows me
to specify a config file other than /etc/e2fsck.conf.
I've created a e2fsck.conf file in the tests directory and changed it
to:
[options]
buggy_init_scripts = 1
[scratch_files]
directory=/tmp
I've then pointed E2FSCK_CONFIG to this file (absolute pathname). I
then chickened out and edited my system /etc/e2fsck.conf to be the
same.
Next I typed "make" and got:
102 tests succeeded 0 tests failed
> If they work, and it solves the performance problem, let me know and
> send me patches. If we can figure out some way of improving the
> performance without needing to replace tdb, that would be great...
The system where the large filesystem is running already has an
e2fsck.conf that holds:
[scratch_files]
directory = /var/cache/e2fsck
With "send me patches" you mean with the NOSYNC option enabled?
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Sun, Feb 20, 2011 at 10:55:31PM +0100, Rogier Wolff wrote:
> I looked into this myself as well. Suspecting the locking calls I put
> a "return 0" in the first line of the tdb locking function. This makes
> all locking requests a noop. Doing it the proper way as you suggest
> may be nicer, but this was a method that existed within my
> abilities...
Well, my change also enables the TDB_NOSYNC flag, which eliminates the
sync calls. Based on your straces, I'm not convinced that will make a
huge difference, but it might be worth a try.
> > Could you let me know what this does to the performance of e2fsck
> > with scratch files enabled?
>
> I apparently have scratch files enabled, right?
Well, given that you are accessing the tdb files, I assume you have an
e2fsck.conf file that has the "[scratch_files]" configuration section
in it....
> I just straced
>
> 1298236533.396622 _llseek(3, 522912374784, [], SEEK_SET) = 0 <0.000038>
> 1298236540.311416 _llseek(3, 522912407552, [], SEEK_SET) = 0 <0.000035>
> 1298236547.288401 _llseek(3, 522912440320, [], SEEK_SET) = 0 <0.000035>
>
> and I see it seeking to somewhere in the 486Gb range. Does this mean
> it has 6x more to go?
Well, I assume at the moment you're still in pass 1. After you finish
the scan of the inode table, you'll need to scan directory blocks,
which will also involve touching the tdb dirinfo file (but mostly not
the icount file). So it might be closer to two weeks, but yeah, we're
talking about 1-2 weeks, not months or years. :-)
> To estimate the time-to-run, would it be safe to suspend the running
> fsck, and start an fsck -n ? I've invested 10 CPU hours in this fsck
> instance already, I would like it to finish eventually... 9 days seems
> doable...
Yes, that should be safe.
> out-of-order example:
>
> 1298236950.540958 _llseek(3, 523986247680, [], SEEK_SET) = 0 <0.000035>
> 1298236950.646999 _llseek(3, 523986280448, [], SEEK_SET) = 0 <0.000038>
> 1298236952.813587 _llseek(3, 630728769536, [], SEEK_SET) = 0 <0.000036>
> 1298236953.947109 _llseek(3, 523986313216, [], SEEK_SET) = 0 <0.000035>
> 1298236953.948982 _llseek(3, 523986345984, [], SEEK_SET) = 0 <0.000015>
>
> (I've deleted the number in the brackets, it's the same as the number
> before.)
The out of order scan was probably reading an extent tree block.
>
> > Oh, and BTW, it would be useful if you tried configuring
> > tests/test_config so that it sets E2FSCK_CONFIG with a test
> > e2fsck.conf that enables the scratch files somewhere in tmp, and then
> > run the regression test suite with these changes.
>
> I'm not sure I understand correctly. Although undocumented you're
> saying that e2fsck honors an environment variable E2FSCK_CONFIG, that
> allows me to specify a different config file from /etc/e2fsck.conf.
Correct.
> I've created a e2fsck.conf file in the tests directory and changed it
> to:
> [options]
> buggy_init_scripts = 1
> [scratch_files]
> directory=/tmp
Well, it won't use the e2fsck.conf file unless you also modify the
test_config.in file, since it generates the test_config file, which
explicitly sets E2FSCK_CONF to be /dev/null (this prevents a locally
installed /etc/e2fsck.conf file from affecting the test results).
> With "send me patches" you mean with the NOSYNC option enabled?
Well, with the TDB_NOSYNC and TDB_NOLOCK flags set. Although it looks
like it might not be sufficient.
BTW, my backup plan was to replace tdb with something else. One of
the candidates I was looking at was sqlite, but rumors of its speed
deficiencies are making me worry that it won't be a good fit. I don't
want to use berk_db because it has a habit of changing API's
regularly, and you can never be sure which version of berk_db
different distributions might be using. One package which I thought
held promise was Kyoto Cabinet, but unfortunately, it's released
under GPLv3, which makes it incompatible with the license used by
e2fsprogs (which has to be GPLv2, since there are a few files which
are shared with the Linux kernel).
Here's another possibility if you are willing to replace the kernel
--- can you upgrade to a 64-bit kernel, even if you are mostly using
32-bit binaries, and then use a statically linked 64-bit e2fsck? Then
all you need to do is configure a nice big swap space, and then
disable the scratch_files section in e2fsck.conf....
- Ted
On Sun, Feb 20, 2011 at 05:20:13PM -0500, Ted Ts'o wrote:
> On Sun, Feb 20, 2011 at 10:55:31PM +0100, Rogier Wolff wrote:
> > I looked into this myself as well. Suspecting the locking calls I put
> > a "return 0" in the first line of the tdb locking function. This makes
> > all locking requests a noop. Doing it the proper way as you suggest
> > may be nicer, but this was a method that existed within my
> > abilities...
>
> Well, my change also enables the TDB_NOSYNC flag, which eliminates the
> sync calls. Based on your straces, I'm not convinced that will make a
> huge difference, but it might be worth a try.
In my straces it is not calling sync. So the performance hit
of the "sync calls" is unmeasurable....
> > > Could you let me know what this does to the performance of e2fsck
> > > with scratch files enabled?
> >
> > I apparently have scratch files enabled, right?
>
> Well, given that you are accessing the tdb files, I assume you have an
> e2fsck.conf file that has the "[scratch_files]" configuration section
> in it....
Yeah. Found that near the end of writing my message. I'm starting to
remember something about e2fsck crashing outright because the
scratch files were missing....
> > I just straced
> >
> > 1298236533.396622 _llseek(3, 522912374784, [], SEEK_SET) = 0 <0.000038>
> > 1298236540.311416 _llseek(3, 522912407552, [], SEEK_SET) = 0 <0.000035>
> > 1298236547.288401 _llseek(3, 522912440320, [], SEEK_SET) = 0 <0.000035>
> >
> > and I see it seeking to somewhere in the 486Gb range. Does this mean
> > it has 6x more to go?
>
> Well, I assume at the moment you're still in pass 1. After you finish
> the scan of the inode table, you'll need to scan directory blocks,
> which will also involve touching the tdb dirinfo file (but mostly not
> the icount file). So it might be closer to two weeks, but yeah, we're
> talking about 1-2 weeks, not months or years. :-)
Oh.... On the other hand, it seems it starts out at a sprint, reading
more like 10Mb per second at the beginning, and then slows down, as if
it is linearly searching a list or something like that. Thus when it
has progressed 2x further than where it is now, it'll be 2x slower.
That might mean we need 2 weeks * 25.... :-(
> > To estimate the time-to-run, would it be safe to suspend the running
> > fsck, and start an fsck -n ? I've invested 10 CPU hours in this fsck
> > instance already, I would like it to finish eventually... 9 days seems
> > doable...
>
> Yes, that should be safe.
>
> > out-of-order example:
> >
> > 1298236950.540958 _llseek(3, 523986247680, [], SEEK_SET) = 0 <0.000035>
> > 1298236950.646999 _llseek(3, 523986280448, [], SEEK_SET) = 0 <0.000038>
> > 1298236952.813587 _llseek(3, 630728769536, [], SEEK_SET) = 0 <0.000036>
> > 1298236953.947109 _llseek(3, 523986313216, [], SEEK_SET) = 0 <0.000035>
> > 1298236953.948982 _llseek(3, 523986345984, [], SEEK_SET) = 0 <0.000015>
> >
> > (I've deleted the number in the brackets, it's the same as the number
> > before.)
>
> The out of order scan was probably reading an extent tree block.
>
> >
> > > Oh, and BTW, it would be useful if you tried configuring
> > > tests/test_config so that it sets E2FSCK_CONFIG with a test
> > > e2fsck.conf that enables the scratch files somewhere in tmp, and then
> > > run the regression test suite with these changes.
> >
> > I'm not sure I understand correctly. Although undocumented you're
> > saying that e2fsck honors an environment variable E2FSCK_CONFIG, that
> > allows me to specify a different config file from /etc/e2fsck.conf.
>
> Correct.
>
>
> > I've created a e2fsck.conf file in the tests directory and changed it
> > to:
> > [options]
> > buggy_init_scripts = 1
> > [scratch_files]
> > directory=/tmp
>
> Well, it won't use the e2fsck.conf file unless you also modify the
> test_config.in file, since it generates the test_config file, which
> explicitly sets E2FSCK_CONF to be /dev/null (this prevents a locally
> installed /etc/e2fsck.conf file from affecting the test results).
Ah! Back to the drawing board. :-) I'll redo the tests.
102 tests succeeded 0 tests failed
> > With "send me patches" you mean with the NOSYNC option enabled?
>
> Well, with the TDB_NOSYNC and TDB_NOLOCK flags set. Although it looks
> like it might not be sufficient.
No. I would like to find out where it's spending its CPU time. When
the kernel suspends a process, it has to store the current userspace
program counter somewhere.
[....] It's called
kstkeip
in /proc/<pid>/stat; it is the 30th field.
Now figure out a way to map that back to the function it's in.
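(A minimal sketch of that sampling step: read /proc/<pid>/stat, skip past the
comm field, pick out field 30 (kstkeip) and print it in hex, so it can be
looked up in /proc/<pid>/maps and in the nm output. Run it in a loop against
the e2fsck pid and see where the samples land; this is just an illustration,
not a polished profiler:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the kstkeip field (field 30 of /proc/<pid>/stat) in decimal
 * and hex.  Field 2, comm, can contain spaces, so skip to the last
 * ')' first; the fields after it start counting at 3. */
int main(int argc, char **argv)
{
    char path[64], buf[4096], *p;
    unsigned long eip;
    int field;
    FILE *f;

    if (argc != 2)
        return 1;
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f))
        return 1;
    fclose(f);

    p = strrchr(buf, ')');
    if (!p)
        return 1;
    p = strtok(p + 1, " ");              /* field 3: state */
    for (field = 3; p && field < 30; field++)
        p = strtok(NULL, " ");           /* walk up to field 30 */
    if (!p)
        return 1;

    eip = strtoul(p, NULL, 10);
    printf("kstkeip = %lu (0x%lx)\n", eip, eip);
    return 0;
}

)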
Hmm. My eip is: 3076930326 which is hex 0xB7663B16.
According to /proc/<pid>/maps this is:
b75ee000-b75ef000 rw-p 00000000 00:00 0
b75ef000-b772f000 r-xp 00000000 09:02 103630 /lib/i686/cmov/libc-2.11.2.so
b772f000-b7731000 r--p 0013f000 09:02 103630 /lib/i686/cmov/libc-2.11.2.so
b7731000-b7732000 rw-p 00141000 09:02 103630 /lib/i686/cmov/libc-2.11.2.so
in the executable part of libc ???
Every once in a while... it ends up somewhere else... Ah. Success!
08077340 t tdb_rec_read
08077349
08077356
080773d2
080773f2
080773fa
08077c50 t tdb_oob
08077c51
08077c6a
08077cbd
08077cc3
080787a0 t tdb_read
080787a1
080787a1
080787a9
080787a9
080787be
080787e1
080787e9
080787f3
080787f8
080787f8
080787fb
08078809
0807880f
08078bb0 t tdb_find
08078bfa
08078c11
08078c11
08078c1c
I've managed to catch it outside of "libc" some 30 times in the last 5
minutes. I'll leave it running the next few hours, to build up a
somewhat better profile.
Now we have a couple of functions where fsck spends its time outside
of libc, and one of them is the likely candidate for calling a
time-consuming libc function.
> BTW, my backup plan was to replace tdb with something else. One of
> the candidates I was looking at was sqlite, but rumors of its speed
> deficiencies are making me worry that it won't be a good fit. I don't
> want to use berk_db because it has a habit of changing API's
> regularly, and you can never be sure which version of berk_db
> different distributions might be using. One package which I thought
> held promise was Koyoto Cabinet, but unfortunately, it's released
> under GPLv3, which makes it incompatible with the license used by
> e2fsprogs (which has to be GPLv2, since there are a few files which
> are shared with the Linux kernel).
Hmm. I'll take a look.
> Here's another possibility if you are willing to replace the kernel
> --- can you upgrade to a 64-bit kernel, even if you are mostly using
> 32-bit binaries, and then use a statically linked 64-bit e2fsck? Then
> all you need to do is configure a nice big swap space, and then
> disable the scratch_files section in e2fsck.conf....
Ohhhhh shit. It's been a long time since I've done that.... I have a page on my
internal wiki on how to do this..... Problem is....
driepoot:/home/wolff# grep lm /proc/cpuinfo
driepoot:/home/wolff#
.... it doesn't have a 64-bit CPU.... :-(
I thought when I bought those that buying AMD chips would give me
64-bit because AMD had brought that feature down to the lower-end
chips (at least much lower-end than Intel), but apparently not to
the desktop CPUs that I was buying at the time. I didn't want to
run 64-bit OSes on those machines until years later...
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Mon, Feb 21, 2011 at 12:15:14AM +0100, Rogier Wolff wrote:
> > BTW, my backup plan was to replace tdb with something else. One of
> > the candidates I was looking at was sqlite, but rumors of its speed
> > deficiencies are making me worry that it won't be a good fit. I don't
> > want to use berk_db because it has a habit of changing API's
> > regularly, and you can never be sure which version of berk_db
> > different distributions might be using. One package which I thought
> > held promise was Koyoto Cabinet, but unfortunately, it's released
> > under GPLv3, which makes it incompatible with the license used by
> > e2fsprogs (which has to be GPLv2, since there are a few files which
> > are shared with the Linux kernel).
>
> Hmm. I'll take a look.
If you could put a bit of time into this, that would be really great.
I have a lot of things that I need to do at the moment, and trying to
improve [scratch_files] performance is something I've known about for
a while, but I just haven't had time to get to it.
The fact that the problem can be solved by using 64-bit capable CPU's
and a large swap space has kept this from rising to the top of the
priority heap, but it is an important use case, since we do have NAS
boxes that use cheap-sh*t 32-bit processors, and I'd like to be able
to support them. But I just don't have the time ATM, so if I can
delegate this out to someone else, that would be really helpful.
- Ted
On Mon, Feb 21, 2011 at 1:41 AM, Ted Ts'o <[email protected]> wrote:
> On Mon, Feb 21, 2011 at 12:15:14AM +0100, Rogier Wolff wrote:
>> > BTW, my backup plan was to replace tdb with something else. One of
>> > the candidates I was looking at was sqlite, but rumors of its speed
>> > deficiencies are making me worry that it won't be a good fit. I don't
>> > want to use berk_db because it has a habit of changing API's
>> > regularly, and you can never be sure which version of berk_db
>> > different distributions might be using. One package which I thought
>> > held promise was Koyoto Cabinet, but unfortunately, it's released
>> > under GPLv3, which makes it incompatible with the license used by
>> > e2fsprogs (which has to be GPLv2, since there are a few files which
>> > are shared with the Linux kernel).
>>
What about Tokyo Cabinet? It seems to be released under LGPL 2.1.
Isn't it sufficient (or even better suited) for scratch files?
>> Hmm. I'll take a look.
>
> If you could put a bit of time into this, that would be really great.
> I have a lot of things that I need to do at the moment, and trying to
> improve [scratch_files] performance is something I've known about for
> a while, but I just haven't had time to get to it.
>
> The fact that the problem can be solved by using 64-bit capable CPU's
> and a large swap space has kept this from rising to the top of the
> priority heap, but it is an important use case, since we do have NAS
> boxes that use cheap-sh*t 32-bit processors, and I'd like to be able
> to support them. But I just don't have the time ATM, so if I can
> delegate this out to someone else, that would be really helpful.
>
CTERA uses *affordable* 32-bit processors for its low-end NAS products,
which still need to be able to complete an fsck of an 8TB fs in a reasonable
time (less than the time it takes the customer to lose their patience and
look for alternative products).
We would be willing to invest resources for this cause,
should we know there is a promising path to follow.
One thing I am not sure I understand (excuse my ignorance) is why the
swap space solution is good only for 64-bit processors.
Is it common knowledge that fsck can require more than 3GB of memory?
If it is common knowledge, do you know of an upper limit (depending on fs size,
no. of inodes, etc.)?
Thanks,
Amir.
2011/2/21 Amir Goldstein <[email protected]>:
> One thing I am not sure I understand is (excuse my ignorance) why is the
> swap space solution good only for 64bit processors?
It's an address space limit on 32-bit processors. Even with PAE the
user space process still won't have access to more than 2^32 bytes,
that is 4 GiB of memory. Due to practical limitations (e.g. the kernel
needing some address space) usually a process won't have access to
more than 3 GiB.
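(If you want to see that ceiling on a particular box, a crude probe like the
one below does the job: keep asking malloc() for 256 MiB chunks until it
refuses. On a typical 32-bit userland it stops a bit short of 3 GiB no matter
how much RAM or swap is configured; on a 64-bit userland it goes on far
longer, which is exactly the point. This is just an illustration, not
anything from e2fsprogs:

#include <stdio.h>
#include <stdlib.h>

/* Probe the process address-space ceiling by allocating (and
 * deliberately leaking) 256 MiB chunks until malloc() fails. */
int main(void)
{
    const size_t chunk = 256UL * 1024 * 1024;
    size_t total = 0;

    while (malloc(chunk) != NULL)
        total += chunk;

    printf("allocated about %lu MiB before malloc() failed\n",
           (unsigned long)(total / (1024 * 1024)));
    return 0;
}

)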
> Is it a common knowledge that fsck can require more than 3GB of memory?
It is, for a given value of common. Disk sizes have exploded and there have
been reports of people running out of memory on 32-bit boxes with
terabyte-sized filesystems for several years now. I did a bit of
googling and found descriptions of such cases from 2008.
> If it is common knowledge, do you know of an upper limit (depending on fs size,
> no. of inodes, etc)?
>
I vaguely remember some estimation of memory requirements of fsck
being given somewhere, but I'm not able to find the posts now :(.
On 2011-02-21, at 9:04 AM, Paweł Brodacki wrote:
> 2011/2/21 Amir Goldstein <[email protected]>:
>> One thing I am not sure I understand is (excuse my ignorance) why is the
>> swap space solution good only for 64bit processors?
>
> It's an address space limit on 32 bit processors. Even with PAE the
> user space process still won't have access to more than 2^32 bits,
> that is 4 GiB of memory. Due to practical limitations (e.g. kernel
> needing some address space) usually a process won't have access to
> more than 3 GiB.
Roger,
are you using the icount allocation reduction patches previously posted? They won't help if you need more than 3GB of address space, but they definitely reduce the size of allocations and allow the icount data to be swapped. See the thread "[PATCH]: icount: Replace the icount list by a two-level tree".
>> If it is common knowledge, do you know of an upper limit (depending on fs size,
>> no. of inodes, etc)?
>>
>
> I vaguely remember some estimation of memory requirements of fsck
> being given somewhere, but I'm not able to find the posts now :(.
My rule of thumb is about 1 byte of memory per block in the filesystem, for "normal" filesystems (i.e. mostly regular files, and a small fraction of directories). For a 3TB filesystem this would mean ~768MB of RAM. One problem is that the current icount implementation allocates almost 2x the peak usage when it is resizing the array, hence the patch mentioned above for filesystems with lots of directories and hard links.
Cheers, Andreas
On Mon, Feb 21, 2011 at 11:00:02AM -0700, Andreas Dilger wrote:
> On 2011-02-21, at 9:04 AM, Paweł Brodacki wrote:
> > 2011/2/21 Amir Goldstein <[email protected]>:
> >> One thing I am not sure I understand is (excuse my ignorance) why is the
> >> swap space solution good only for 64bit processors?
> >
> > It's an address space limit on 32 bit processors. Even with PAE the
> > user space process still won't have access to more than 2^32 bits,
> > that is 4 GiB of memory. Due to practical limitations (e.g. kernel
> > needing some address space) usually a process won't have access to
> > more than 3 GiB.
>
> Roger,
> are you using the icount allocation reduction patches previously
> posted? They won't help if you need more than 3GB of address space,
> but they definitely reduce the size of allocations and allow the
> icount data to be swapped. See the thread "[PATCH]: icount: Replace
> the icount list by a two-level tree".
No I don't think I'm using those patches. (Unless they are in the git head).
I wouldn't be surprised if I'd need more than 3G of RAM. When I
extrapolated "more than a few days" it was at under 20% of the
filesystem and had already allocated on the order of 800Mb of
memory. Now I'm not entirely sure that this is fair: memory use seems
to go up quickly in the beginning, and then stabilize: as if it has
decided that 800M of memory use is "acceptable" and somehow uses a
different strategy once it hits that limit.
On the other hand, things are going reasonably fast until it starts
hitting the CPU-bottleneck so I might have seen the memory usage
flatten out because it wasn't making any significant progress anymore.
Anyway, I've increased the hash size to 1M, up from 131. The TDB guys
suggested 10k: their TDB code is being used for MUCH larger cases than
they expected....
Success! It's been running for 2 hours, and it's past the half-way point
of pass 1 (i.e. 2.5 times further than it had got previously after 24
hours). It currently has 1400Mb of memory allocated. Hope the 3G limit
doesn't hit me before it finishes....
> >> If it is common knowledge, do you know of an upper limit (depending on fs size,
> >> no. of inodes, etc)?
> >>
> >
> > I vaguely remember some estimation of memory requirements of fsck
> > being given somewhere, but I'm not able to find the posts now :(.
> My rule of thumb is about 1 byte of memory per block in the
> filesystem, for "normal" filesystems (i.e. mostly regular files, and
> a small fraction of directories). For a 3TB filesystem this would
> mean ~768MB of RAM. One problem is that the current icount
> implementation allocates almost 2x the peak usage when it is
> resizing the array, hence the patch mentioned above for filesystems
> with lots of directories and hard links.
My filesystem is a bit weird: I make an "rsync" copy of all my data
onto it. Then I run a cp -lr to copy the current copy to a copy with
the date in it. Next I run a program that will make a second copy
should the number of links exceed 8000....
In short I have a HUMONGOUS number of directory entries, and lots and
lots of files, though still in the millions of inodes. (Some of the
filesystems backed up contain that many inodes themselves.)
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Tue, Feb 22, 2011 at 11:20:56AM +0100, Rogier Wolff wrote:
> I wouldn't be surprised if I'd need more than 3G of RAM. When I
> extrapolated "more than a few days" it was at under 20% of the
> filesystem and had already allocated on the order of 800Gb of
> memory. Now I'm not entirely sure that this is fair: memory use seems
> to go up quickly in the beginning, and then stabilize: as if it has
> decided that 800M of memory use is "acceptable" and somehow uses a
> different strategy once it hits that limit.
OK. Good news. It's finished pass 1. It is currently using about 2100Mb
of RAM (ehh. mostly swap, I have only 1G in there). Here is the patch.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Tue, Feb 22, 2011 at 02:36:52PM +0100, Rogier Wolff wrote:
> On Tue, Feb 22, 2011 at 11:20:56AM +0100, Rogier Wolff wrote:
> > I wouldn't be surprised if I'd need more than 3G of RAM. When I
> > extrapolated "more than a few days" it was at under 20% of the
> > filesystem and had already allocated on the order of 800Gb of
> > memory. Now I'm not entirely sure that this is fair: memory use seems
> > to go up quickly in the beginning, and then stabilize: as if it has
> > decided that 800M of memory use is "acceptable" and somehow uses a
> > different strategy once it hits that limit.
>
> OK. Good news. It's finished pass1. It is currently using about 2100Mb
> of RAM (ehh. mostly swap, I have only 1G in there). Here is the patch.
Forgot the patch.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
Roger,
Any idea what the hash size does to memory usage? I wonder if we can scale this based on the directory count, or, if the memory usage is minimal (it is only needed in the tdb case), just make it the default. It definitely appears to have been a major performance boost.
Another possible optimization is to use the in-memory icount list (preferably with the patch to reduce realloc size) until the allocations fail, and only then dump the list into tdb. That would allow people to run with a swapfile configured by default, but only pay the cost of on-disk operations if really needed.
Cheers, Andreas
On 2011-02-22, at 6:54, Rogier Wolff <[email protected]> wrote:
> On Tue, Feb 22, 2011 at 02:36:52PM +0100, Rogier Wolff wrote:
>> On Tue, Feb 22, 2011 at 11:20:56AM +0100, Rogier Wolff wrote:
>>> I wouldn't be surprised if I'd need more than 3G of RAM. When I
>>> extrapolated "more than a few days" it was at under 20% of the
>>> filesystem and had already allocated on the order of 800Gb of
>>> memory. Now I'm not entirely sure that this is fair: memory use seems
>>> to go up quickly in the beginning, and then stabilize: as if it has
>>> decided that 800M of memory use is "acceptable" and somehow uses a
>>> different strategy once it hits that limit.
>>
>> OK. Good news. It's finished pass1. It is currently using about 2100Mb
>> of RAM (ehh. mostly swap, I have only 1G in there). Here is the patch.
>
> Forgot the patch.
>
> Roger.
>
> --
> ** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
> <cputimefix.patch>
On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
>
> Any idea what the hash size does to memory usage? I wonder if we
> can scale this based on the directory count, or if the memory usage
> is minimal (only needed in case of tdb) then just make it the
> default. It definitely appears to have been a major performance
> boost.
Yeah, that was my question. Your patch adds a magic number which
probably works well on your machine (and I'm not really worried if
someone has less than 1G --- here's a quarter kid, buy yourself a
real computer :-). But I wonder if we should be using a hash size
which is sized automatically depending on available memory or file
system size.
- Ted
On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
> Roger,
> Any idea what the hash size does to memory usage? I wonder if we
> can scale this based on the directory count, or if the memory usage
> is minimal (only needed in case of tdb) then just make it the
> default. It definitely appears to have been a major performance
> boost.
First, that hash size is passed to the tdb module, so yes it only
matters when tdb is actually used.
Second, I expect tdb's memory use to be significantly impacted by the
hash size. However.... tdb's memory use is dwarfed by e2fsck's own
memory use.... I have not noticed any difference in memory use of
e2fsck. (by watching "top" output. I haven't done any scientific
measurements.)
> Another possible optuzatiom is to use the in-memory icount list
> (preferably with the patch to reduce realloc size) until the
> allocations fail and only then dump the list into tdb? That would
> allow people to run with a swapfile configured by default, but only
> pay the cost of on-disk operations if really needed.
I don't think this is a good idea. When you expect the "big"
allocations to eventually fail (i.e. icount), you'll eventually end up
with an allocation failing somewhere else, where you don't have
anything prepared for it. A program like e2fsck will be handling
larger and different filesystems "in the field" than what you expected
at the outset. It should be robust.
My fsck is currently walking the ridge....
It grew from about 1000M to over 2500M after pass 1. I was expecting it
to hit the 3G limit before the end. But luckily somehow some memory
got released, and now it seems stable at 2001Mb.
It is currently again in a CPU-bound task. I think it's doing
lots of tdb lookups.
It has asked me:
First entry 'DSCN11194.JPG' (inode=279188586) in directory inode 277579348 (...) should be '.'
Fix<y>? yes
Which is clearly wrong. IF we can find directory entries in that
directory (i.e. it actually IS a directory), then it is likely that
the file DSCN11194.JPG still exists, and that it has inode
279188586. If it should've been '.', it would've been inode
277579348. So instead of overwriting this "first entry" of the
directory, the fix should've been:
Directory "." is missing in directory inode 277579348. Add?
If necessary, room should be made for it inside the directory.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Tue, Feb 22, 2011 at 05:13:04PM -0500, Ted Ts'o wrote:
> On Tue, Feb 22, 2011 at 09:32:28AM -0700, Andreas Dilger wrote:
> >
> > Any idea what the hash size does to memory usage? I wonder if we
> > can scale this based on the directory count, or if the memory usage
> > is minimal (only needed in case of tdb) then just make it the
> > default. It definitely appears to have been a major performance
> > boost.
>
> Yeah, that was my question. Your patch adds a magic number which
> probably works well on your machine (and I'm not really worried if
> someone has less than 1G --- here's a quarter kid, buy your self a
> real computer :-). But I wonder if we should be using a hash size
> which is sized automatically depending on available memory or file
> system size.
I fully agree that having "magic numbers" in the code is a bad thing.
A warning sign.
I don't agree with your argument that 1G RAM should be considered
minimal. There are storage boxes (single-disk-nas systems) out there
that run on a 300MHz ARM chip and have little RAM. Some of them use
ext[234].
For example: http://iomega.nas-central.org/wiki/Main_Page
64Mb RAM. I'm not sure whether that CPU is capable of virtual memory.
I just mentioned this one because a friend brought one into my office
last week. I don't think it happens to run Linux. On the other hand,
some of the "competition" do run Linux.
As to the "disadvantage" of using a large hash value:
As far as I can see, the library just seeks to that position in the
tdb file. With 32-bit file offsets (which is hardcoded into tdb), that
means the penalty is 4*hash_size of extra disk space. So with my
currently suggested value that comes to 4Mb.
As my current tdb database amounts to 1.5Gb I think the cost is
acceptable.
With the number of keys up to 1 million, we can expect a speedup of
1M/131 = about 7500. Above that we won't gain much anymore.
This is assuming that we have a next-to-perfect hash function. In fact
we don't because I see about a 30% hash bucket usage. And I surely
think my fsck has used over 1M of the keys....
I just tested the hash function: I hashed the first 10 million numbers
and got 91962 unique results (out of a possible 99931). That's only
about 10%. That's a lot worse than what e2fsck is seeing. And this is
the simplest case to get right.
Here is my test program.
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int u32;

/* This is based on the hash algorithm from gdbm */
static unsigned int default_tdb_hash(unsigned char *key)
{
    u32 value;  /* Used to compute the hash value.  */
    u32 i;      /* Used to cycle through random values. */

    /* Set the initial value from the key size. */
    for (value = 0x238F13AF * 4, i = 0; i < 4; i++)
        value = (value + (key[i] << (i*5 % 24)));

    return (1103515243 * value + 12345);
}

int main(int argc, char **argv)
{
    int i;
    int max = 1000000;

    if (argc > 1)
        max = atoi(argv[1]);

    for (i = 0; i < max; i++) {
        printf("%u %u\n", i, default_tdb_hash((unsigned char *)&i) % 99931);
    }
    exit(0);
}
and here is the commandline I used to watch the results.
./a.out 10000000 | awk '{print $2}' | sort | uniq -c |sort -n | less
It seems my "prime generator" program is wrong too. I had thought I'd
chosen a prime with 99931, but apparently it's not prime (13*7687).
For hashing that should not be too bad, but I'll go look for a real
prime and check again. OK: hash bucket usage shot up to 16%.
I just "designed" a new hash function, based on the "hash" page on
Wikipedia.
static unsigned int my_tdb_hash(unsigned char *key)
{
    u32 value;  /* Used to compute the hash value. */
    u32 i;      /* Used to cycle through random values. */

    /* Set the initial value from the key size. */
    for (value = 0, i = 0; i < 4; i++)
        value = value * 256 + key[i] + (value >> 24) * 241;

    return value;
}
It behaves MUCH better than the "default_tdb_hash" in that it has 100%
bucket usage (not almost, but exactly 100%). It's not that hard to get
right.
The "hash" at the end (times BIGPRIME + RANDOMVALUE) in the original
is redundant. It only serves to make the results less obvious to
humans, but there is no computer-science relevant reason.
I'll shoot off an Email to the TDB guys as well.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Feb 22, 2011, at 11:44 PM, Rogier Wolff wrote:
>
> I'll shoot off an Email to the TDB guys as well.
I'm pretty sure this won't come as a surprise to them. I'm using the last version of TDB which was licensed under the GPLv2, and they relicensed to GPLv3 quite a while ago. I remember hearing they had added a new hash algorithm to TDB since the relicensing, but those newer versions aren't available to e2fsprogs....
-- Ted
On Wed, Feb 23, 2011 at 06:32:17AM -0500, Theodore Tso wrote:
>
> On Feb 22, 2011, at 11:44 PM, Rogier Wolff wrote:
>
> >
> > I'll shoot off an Email to the TDB guys as well.
> I'm pretty sure this won't come as a surprise to them. I'm using
> the last version of TDB which was licensed under the GPLv2, and they
> relicensed to GPLv3 quite a while ago. I remember hearing they had
> added a new hash algorithm to TDB since the relicensing, but those
> newer versions aren't available to e2fsprogs....
Well then....
You're free to use my "new" hash function, provided it is kept under
GPLv2 and not under GPLv3.
My implementation has been a "cleanroom" implementation in that I've
only looked at the specifications and implemented it from
there. Although no external attestation is available that I have been
completely shielded from the newer GPLv3 version...
On a slightly different note:
A pretty good estimate of the number of inodes is available in the
superblock (tot inodes - free inodes). A good hash size would be: "a
rough estimate of the number of inodes." Two or three times more or
less doesn't matter much. CPU is cheap. I'm not sure what the
estimate for the "dircount" tdb should be.
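(A sketch of what that could look like, assuming libext2fs's ext2_filsys
handle; the function name and the floor value are made up for illustration:

#include <ext2fs/ext2fs.h>

/* Derive a tdb hash size from the superblock's inode counts, with a
 * floor so a freshly created (and since heavily filled) filesystem
 * doesn't end up with a tiny hash table.  The floor is arbitrary. */
#define ICOUNT_HASH_FLOOR 99991

static unsigned int guess_icount_hash_size(ext2_filsys fs)
{
    unsigned int used = fs->super->s_inodes_count -
                        fs->super->s_free_inodes_count;

    return (used > ICOUNT_HASH_FLOOR) ? used : ICOUNT_HASH_FLOOR;
}

)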
The amount of disk space that the tdb will use is at least:
overhead + hash_size * 4 + numrecords * (keysize + datasize +
perrecordoverhead)
There must also be some overhead to store the size of the keys and
data as both can be variable length. By implementing the "database"
ourselves we could optimize that out. I don't think it's worth the
trouble.
With keysize equal 4, datasize also 4 and hash_size equal to numinodes
or numrecords, we would get
overhead + numinodes * (12 + perrecordoverhead).
In fact, my icount database grew to about 750Mb, with only 23M inodes,
so that means that apparently the perrecordoverhead is about 20 bytes.
This is the price you pay for using a much more versatile database
than what you really need. Disk is cheap (except when checking a root
filesystem!)
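(Spelling out that 20-byte estimate: 750Mb / 23M records is roughly 33 bytes
per record; subtract the ~12 bytes of key + data + stored lengths from the
formula above, and about 20 bytes of per-record tdb overhead remain.)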
So...
-- I suggest that for the icount tdb we move to using the superblock
info as the hash size.
-- I suggest that we use our own hash function. tdb allows us to
specify our own hash function. Instead of modifying the bad tdb, we'll
just keep it intact, and pass a better (local) hash function.
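(For what it's worth, here's a sketch of such a local hash in the shape tdb
expects for a replacement hash callback. This assumes the tdb_hash_func
typedef in the bundled lib/ext2fs/tdb.h takes a TDB_DATA pointer and returns
an unsigned int, and that tdb_open_ex() is the entry point that accepts it;
check the header, since tdb versions differ:

#include "tdb.h"        /* e2fsprogs carries this as lib/ext2fs/tdb.h */

/* The same mixing step as my_tdb_hash() above, but walking the whole
 * key instead of assuming it is exactly 4 bytes, so it also behaves
 * for keys that are not plain inode numbers. */
static unsigned int better_tdb_hash(TDB_DATA *key)
{
    unsigned int value = 0;
    size_t i;

    for (i = 0; i < key->dsize; i++)
        value = value * 256 + key->dptr[i] + (value >> 24) * 241;

    return value;
}

)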
Does anybody know what the "dircount" tdb database holds, and what is
an estimate for the number of elements eventually in the database? (I
could find out myself: I have the source. But I'm lazy. I'm a
programmer you know...).
On a separate note, my filesystem finished the fsck (33 hours (*)),
and I started the backups again... :-)
Roger.
*) that might include an estimated 1-5 hours of "Fix <y>?" waiting.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On 2011-02-23, at 1:53 PM, Rogier Wolff wrote:
> My implementation has been a "cleanroom" implementation in that I've
> only looked at the specifications and implemented it from
> there. Although no external attestation is available that I have been
> completely shielded from the newer GPLv3 version...
>
> On a slightly different note:
>
> A pretty good estimate of the number of inodes is available in the
> superblock (tot inodes - free inodes). A good hash size would be: "a
> rough estimate of the number of inodes." Two or three times more or
> less doesn't matter much. CPU is cheap. I'm not sure what the
> estimate for the "dircount" tdb should be.
The dircount can be extracted from the group descriptors, which count the number of allocated directories in each group. Since the superblock "free inodes" count is no longer updated except at unmount time, the code would need to walk all of the group descriptors to get this number anyway.
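(A sketch of that walk, against the pre-64-bit libext2fs API that e2fsprogs
shipped at the time; newer releases hide the descriptors behind accessor
functions, so treat the direct field access as an assumption and adjust to
the headers you build against:

#include <stdio.h>
#include <ext2fs/ext2fs.h>

/* Count allocated directories by summing bg_used_dirs_count over all
 * block group descriptors. */
int main(int argc, char **argv)
{
    ext2_filsys fs;
    unsigned long dirs = 0;
    dgrp_t i;

    if (argc != 2 ||
        ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs))
        return 1;

    for (i = 0; i < fs->group_desc_count; i++)
        dirs += fs->group_desc[i].bg_used_dirs_count;

    printf("allocated directories: %lu\n", dirs);
    ext2fs_close(fs);
    return 0;
}

)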
> The amount of disk space that the tdb will use is at least:
> overhead + hash_size * 4 + numrecords * (keysize + datasize +
> perrecordoverhead)
>
> There must also be some overhead to store the size of the keys and
> data as both can be variable length. By implementing the "database"
> ourselves we could optimize that out. I don't think it's worth the
> trouble.
>
> With keysize equal 4, datasize also 4 and hash_size equal to numinodes
> or numrecords, we would get
>
> overhead + numinodes * (12 + perrecordoverhead).
>
> In fact, my icount database grew to about 750Mb, with only 23M inodes,
> so that means that apparently the perrecordoverhead is about 20 bytes.
> This is the price you pay for using a much more versatile database
> than what you really need. Disk is cheap (except when checking a root
> filesystem!)
>
> So...
>
> -- I suggest that for the icount tdb we move to using the superblock
> info as the hash size.
>
> -- I suggest that we use our own hash function. tdb allows us to
> specify our own hash function. Instead of modifying the bad tdb, we'll
> just keep it intact, and pass a better (local) hash function.
>
>
> Does anybody know what the "dircount" tdb database holds, and what is
> an estimate for the number of elements eventually in the database? (I
> could find out myself: I have the source. But I'm lazy. I'm a
> programmer you know...).
>
>
> On a separate note, my filesystem finished the fsck (33 hours (*)),
> and I started the backups again... :-)
If you have the opportunity, I wonder whether the entire need for tdb can be avoided in your case by using swap and the icount optimization patches previously posted? I'd really like to get that patch included upstream, but it needs testing in an environment like yours where icount is a significant factor. This would avoid all of the tdb overhead.
Cheers, Andreas
On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>
> If you have the opportunity, I wonder whether the entire need for
> tdb can be avoided in your case by using swap and the icount
> optimization patches previously posted?
Unfortunately, there are people who are still using 32-bit CPU's, so
no, swap is not a solution here.
> I'd really like to get that patch included upstream, but it needs
> testing in an environment like yours where icount is a significant
> factor. This would avoid all of the tdb overhead.
Adjusting the tdb hash parameters, and changing the tdb hash functions
shouldn't be hard to get into upstream. We should really improve our
testing for [scratch files], but that's always been true....
- Ted
On 2011-02-23, at 4:17 PM, Ted Ts'o wrote:
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>>
>> If you have the opportunity, I wonder whether the entire need for
>> tdb can be avoided in your case by using swap and the icount
>> optimization patches previously posted?
>
> Unfortunately, there are people who are still using 32-bit CPU's, so
> no, swap is not a solution here.
I agree it isn't a solution in all cases, but avoiding GB-sized realloc() in the code was certainly enough to fix problems for the original people who hit them. It likely also avoids a lot of memcpy() (depending on how realloc is implemented).
Cheers, Andreas
On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> The dircount can be extracted from the group descriptors, which
> count the number of allocated directories in each group. Since the
OK.
> superblock "free inodes" count is no longer updated except at
> unmount time, the code would need to walk all of the group
> descriptors to get this number anyway.
No worries. It matters a bit for performance, but if that free inode
count in the superblock is outdated, we'll just use that outdated
one. The one case that I'm afraid of is that someone creates a new
filesystem (superblock inodes-in-use =~= 0), then copies on millions
of files, and then crashes his system....
I'll add a minimum of 999931, causing an overhead of around 4Mb of
disk space usage if this was totally unnecessary.
> If you have the opportunity, I wonder whether the entire need for
> tdb can be avoided in your case by using swap and the icount
> optimization patches previously posted? I'd really like to get that
> patch included upstream, but it needs testing in an environment like
> yours where icount is a significant factor. This would avoid all of
> the tdb overhead.
First: I don't think it will work. The largest amount of memory that
e2fsck had allocated was 2.5GB. At that point it also had around 1.5G
of disk space in use for tdb's, for a total of 4G. On the other hand,
we've established that the overhead in tdb is about 24 bytes per 8
bytes of real data.... So maybe we would only have needed 200M of
in-memory data structures to handle this. Two of those (400M), together
with the dircount (a 750M tdb; assume the same ratio) come to about
600M, which on top of what is already allocated still puts us above 3G.
Second: e2fsck is too fragile as it is. It should be able to handle
big filesystems on little systems. I have a puny little 2GHz Athlon
system that currently has 3T of disk storage and 1G RAM. Embedded
Linux systems can be running those amounts of storage with only 64
or 128 MB of RAM.
Even if MY filesystem happens to pass with a little less memory use,
there is a slightly larger filesystem out there that won't.
I have a server that has 4x2T drives instead of this one's 4x1T. It
uses the same backup strategy, so it too has lots and lots of files.
In fact it has 84M inodes in use. (I thought 96M inodes would be
plenty... wrong! I HAVE run out of inodes on that thing!)
That one too may need to fsck the filesystem...
I remember hearing about a tool that would extract all the filesystem
meta-info, so that I can make an image that I can then test e.g. fsck
upon? Inodes, directory blocks, indirect blocks etc.?
Then I could make an image where I could test this. I don't really
want to put this offline again for multiple days.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Thu, Feb 24, 2011 at 9:29 AM, Rogier Wolff <[email protected]> wrote:
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>
> ...
>
> I remember hearing about a tool that would extract all the filesystem
> meta-info, so that I can make an image that I can then test e.g. fsck
> upon? Inodes, directory blocks, indirect blocks etc.?
>
That tool is e2image -r, which creates a sparse file image of your fs
(only the metadata is written, the rest is holes), so you need to be
careful when copying/transferring it to another machine to preserve
the sparseness (e.g. compress it first, or dd it directly onto a new
HDD).
Not sure what you will do if fsck fixes errors on that image...
Mostly (as long as it didn't clone multiply-claimed blocks, for
example) you would be able to write the fixed image back onto your
original fs, but that would be risky.
> Then I could make an image where I could test this. I don't really
> want to put this offline again for multiple days.
>
On Wed, Feb 23, 2011 at 05:41:31PM -0700, Andreas Dilger wrote:
> On 2011-02-23, at 4:17 PM, Ted Ts'o wrote:
> > On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
> >>
> >> If you have the opportunity, I wonder whether the entire need for
> >> tdb can be avoided in your case by using swap and the icount
> >> optimization patches previously posted?
> >
> > Unfortunately, there are people who are still using 32-bit CPU's, so
> > no, swap is not a solution here.
>
> I agree it isn't a solution in all cases, but avoiding GB-sized
> realloc() in the code was certainly enough to fix problems for the
> original people who hit them. It likely also avoids a lot of
> memcpy() (depending on how realloc is implemented).
So, assume that the biggest alloc is 1GB.
Assuming that we realloc (I haven't seen the code) at twice the size
every time, we'll alloc 1M, then 2M, then 4M etc. up to 1G. In the
last step we realloc the 512M pointer to a 1G region. Note that this
requires a contiguous 1G area of free address space within the 3G
total available address space, but let's ignore that problem for now.
So for the 1G alloc we'll have to memcpy 512MB of existing data, the
previous one required a memcpy of 256MB, and so on. The total is
just under 1G.
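To put numbers on that, a toy model of the assumed doubling pattern
(this is not the actual e2fsck code, just the growth strategy assumed
above):

#include <stdio.h>

/*
 * Toy model: start at 1MB, double until 1GB, and count the bytes an
 * implementation would have to memcpy if every realloc() had to move
 * the block.
 */
int main(void)
{
	unsigned long long size = 1ULL << 20;	/* 1MB */
	unsigned long long target = 1ULL << 30;	/* 1GB */
	unsigned long long copied = 0;

	while (size < target) {
		copied += size;		/* old contents moved to the new block */
		size *= 2;
	}
	printf("final size %llu MB, total copied %llu MB\n",
	       size >> 20, copied >> 20);
	/* prints: final size 1024 MB, total copied 1023 MB */
	return 0;
}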
So you're proposing to optimize out a memcpy of 1G of my main memory.
When it boots, my system says: pIII_sse : 4884.000 MB/sec
So it can handle xor at almost 5G/second. It should be able to do
memcpy (xor with a bunch of zeroes) at that speed. But let's assume
that the libc guys are stupid and managed to make it 10 times slower.
So you're proposing to optimize out 1G of memcpy at 0.5G/second, or
two seconds of CPU time, on an fsck that takes over 24
hours. Congratulations! You've made e2fsck about 0.0023 percent
faster!
Andreas, I really value your efforts to improve e2fsck. But optimizing
code can be done by looking at the code and saying: "this looks
inefficient, let's fix it up". However, that way you're quickly going
to be spending time on optimizations that don't really matter.
(My second computer was a DOS 3.x machine. DOS came with a utility
called "sort". It does what you expect from a DOS program: it refuses
to sort data files larger than 64k. So I rewrote it. Turns out my
implementation was 100 times slower at reading in the dataset than the
original version. I did manage to sort 100 times faster than the
original version. End result? Mine was 10 times faster than the
original. They optimized something that didn't matter. I just read
some decades-old literature on sorting and implemented that.)
I firmly believe that a factor of ten performance improvement can be
achieved for fsck on my filesystem. It should be possible to fsck the
filesystem in 3.3 hours.
There are a total of 342M inodes. At 256 bytes each, that's 87GB of
inode tables. Reading that at a leisurely 50M/second gives us 1700
seconds, or half an hour. (It should be possible to do better: I have
4 drives each doing 90M/sec, allowing a total of over 300M/sec.)
Then I have 2.7T of data. With old ext2/ext3 that requires indirect
blocks worth 2.7G of data. Reading that at 10M/sec (it will be
scattered all over the disk) requires 270 seconds, or 5 minutes.
I have quite a lot of directories, so those might take some time. It
should be possible to overlap the CPU time of actually doing the
checks with the I/O.
Anyway, although in theory 10x should be possible, I expect that 5x is
a more realistic goal.
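For reference, the back-of-the-envelope numbers in one place (the
256-byte inode size, the one-indirect-block-per-1024-data-blocks
ratio and the 50M/s and 10M/s read rates are the assumptions of this
estimate):

#include <stdio.h>

/* Back-of-the-envelope version of the estimate above. */
int main(void)
{
	double inodes = 342e6;				/* total inodes */
	double inode_bytes = inodes * 256.0;		/* inode tables */
	double indirect_bytes = 2.7e12 / 1024.0;	/* indirect blocks */

	printf("inode tables: %.0f GB, ~%.0f s at 50M/s\n",
	       inode_bytes / 1e9, inode_bytes / 50e6);
	printf("indirect blocks: %.1f GB, ~%.0f s at 10M/s\n",
	       indirect_bytes / 1e9, indirect_bytes / 10e6);
	/* prints roughly: 88 GB / ~1751 s, and 2.6 GB / ~264 s */
	return 0;
}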
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:
> That tool is e2image -r, which creates a sparse file image of your
> fs (only metadata is written, the rest is holes), so you need to be
> careful when copying/transferring it to another machine to do it
> wisely (i.e. bzip or dd directly to a new HDD) Not sure what you
> will do if fsck fixes errors on that image... Mostly (if it didn't
> clone multiply claimed blocks for example), you would be able to
> write the fixed image back onto your original fs, but that would be
> risky.
I can then run the fsck tests on the image. I expect fsck to find
errors: I'm using the filesystem when I'm making that image.... It
won't be consistent.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
On Thu, Feb 24, 2011 at 11:02 AM, Rogier Wolff <[email protected]> wrote:
> On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:
>
>> That tool is e2image -r, which creates a sparse file image of your
>> fs (only metadata is written, the rest is holes), so you need to be
>> careful when copying/transferring it to another machine to do it
>> wisely (i.e. bzip or dd directly to a new HDD) Not sure what you
>> will do if fsck fixes errors on that image... ?Mostly (if it didn't
>> clone multiply claimed blocks for example), you would be able to
>> write the fixed image back onto your original fs, but that would be
>> risky.
>
> I can then run the fsck tests on the image. I expect fsck to find
> errors: I'm using the filesystem when I'm making that image.... It
> won't be consistent.
>
So you probably won't learn a lot from the fsck results, unless you
only want to provide memory-usage/runtime statistics as per Andreas'
request.
You have the option to use NEXT3 to take a snapshot of your fs while
it is online, but I don't suppose you would want to experiment on
your backup server.
Amir.
On Thu, Feb 24, 2011 at 10:59:23AM +0200, Amir Goldstein wrote:
> That tool is e2image -r, which creates a sparse file image of your fs
Ah... I got:
/home/wolff/e2image: File too large error writing block 535822337
Sigh. 2TB max file size on ext3. :-(
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Rogier Wolff
> Sent: Wednesday, February 23, 2011 11:30 PM
> To: Andreas Dilger
> Cc: [email protected]
> Subject: Re: fsck performance.
>
> On Wed, Feb 23, 2011 at 03:24:18PM -0700, Andreas Dilger wrote:
>
...
>
> Second: e2fsck is too fragile as it is. It should be able to handle
> big filesystems on little systems. I have a puny little 2GHz Athlon
> system that currently has 3T of disk storage and 1G RAM. Embedded
> Linux systems can be running those amounts of storage with only 64
> or 128 Mb of RAM.
I have to second this comment. One of our NAS units has 256 MBytes of
RAM (and they wanted 64) with a 3TB disk, 2.996TB of which is an EXT4
file system. With our 2.6.32.11 kernel and e2fsprogs version 1.41.3-1,
all I get is a segfault when I run fsck.ext4.