2010-12-25 03:02:00

by Theodore Ts'o

Subject: Problems with e4defrag -c


I've started taking a closer look at e4defrag -c, and I'm a bit troubled
by both its implementation and the results that it reports.

On the implementation side, it requires root, and in fact encourages
people to run it as root. As near as I can tell, it's doing this just
so it can calculate statistics based on the block size, blocks per
group, and flex_bg size (if flex_bg is enabled). As we'll see in a
bit, I'm not sure these statistics are all that useful.

The code is also sprinkled with rather ugly code style:

        if (current_uid != ROOT_UID &&
                buf->st_uid != current_uid) {
                if (mode_flag & DETAIL) {
                        printf("\033[79;0H\033[K[%u/%u] \"%s\"\t\t"
                                " extents: %d -> %d\n", defraged_file_count,
                                total_count, file, extents, extents);
                        IN_FTW_PRINT_ERR_MSG(
                                "File is not current user's file"
                                " or current user is not root");
                }
                return -1;
        }

First of all, explicit comparisons against the current uid are bad. A
non-root user might have read/write access to the raw device where a
file system is located. It's bad to encode an assumption one way or
another into a userspace program. Secondly, whenever a userspace program
is explicitly trying to encode permission checking, that's a red flag.
I'm not sure why checking to see if a file's st_uid matches the
current_uid has any validity at all.
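
If the concern is whether the defrag is allowed at all, the simplest
approach is to just attempt the operation and let the kernel's own
permission checks decide. Something along these lines (purely
illustrative, reusing the variables from the snippet above):

        fd = open(file, O_RDWR);
        if (fd < 0) {
                /* the kernel has already done the real permission check */
                if (mode_flag & DETAIL)
                        perror(file);
                return -1;
        }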

In addition, the only reason why root is needed is apparently to
calculate the "best # of extents", which I think is not all that useful
as a statistic. Certainly it's not worth requiring raw access to read
the file system.

What really matters are the number of extents which are non-tail
extents, and smaller than some threshold (probably around 256 MB for
most HDD's), and not forced by skips in the logical block numbering
(i.e., caused by a file being sparse). The basic idea here is to go
back to why fragments are bad, which is that they slow down file access.
If every few hundred megabytes, you need to seek to another part of the
disk, it's really not the end of the world.
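
To make that concrete, here's the kind of counting I have in mind, as a
rough, untested sketch using FIEMAP (the function name, the fixed-size
extent buffer, and the 256MB threshold are all just placeholders):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/*
 * Rough, untested sketch: count only the extents that actually hurt,
 * i.e. non-tail extents that are smaller than some threshold and whose
 * end is not simply forced by a hole in the logical block numbering
 * (a sparse file).  A real implementation would loop if the file has
 * more than MAX_EXTENTS extents.
 */
#define MAX_EXTENTS     256
#define THRESHOLD_BYTES (256ULL << 20)  /* arbitrary 256MB threshold */

static int count_bad_extents(int fd)
{
        char buf[sizeof(struct fiemap) +
                 MAX_EXTENTS * sizeof(struct fiemap_extent)];
        struct fiemap *fm = (struct fiemap *) buf;
        struct fiemap_extent *fe, *next;
        unsigned int i;
        int bad = 0;

        memset(buf, 0, sizeof(buf));
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = MAX_EXTENTS;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return -1;

        for (i = 0; i < fm->fm_mapped_extents; i++) {
                fe = &fm->fm_extents[i];
                /* The tail extent is allowed to be short. */
                if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                        continue;
                /* A gap in the logical offsets means the file is sparse
                 * here, so this break isn't the allocator's fault. */
                if (i + 1 < fm->fm_mapped_extents) {
                        next = &fm->fm_extents[i + 1];
                        if (next->fe_logical >
                            fe->fe_logical + fe->fe_length)
                                continue;
                }
                if (fe->fe_length < THRESHOLD_BYTES)
                        bad++;
        }
        return bad;
}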

Furthermore, the "average size of extents" is highly misleading. If the
directory has a large number of small files, then by definition this
will drag down the average size of the extents.
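
(To make up some numbers: a directory holding a thousand perfectly
contiguous 4KB files plus one 1GB file split into two 512MB extents has
an "average extent size" of roughly (1000 x 4 KB + 1 GB) / 1002 extents,
or about 1MB, which tells you nothing about whether anything there
actually needs defragmenting.)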

I tried running e4defrag -c on a scratch partition of mine. Here's
what it gave as results:

<Fragmented files>                             now/best       size/ext
1. /kbuild/e2fsprogs-64bits/build.static/lib/blkid/tests/tmp/fat32_label_64MB.img.11222
                                                 5/1              4 KB
2. /kbuild/e2fsprogs-64bits/tests/f_badjour_indblks/image
                                                 8/1              6 KB
3. /kbuild/e2fsprogs-64bits/tests/f_16384_block/image.gz
                                                 2/1              4 KB
4. /kbuild/e2fsprogs-64bits/tests/f_8192_block/image.gz
                                                 2/1              4 KB
5. /kbuild/e2fsprogs-64bits/tests/f_badjour_indblks/image.gz
                                                 2/1              4 KB

*These* were the worst fragmented file systems? Oh, really? Files #3,
#4, and #5 are just small files that happened to be fragmented. They're
simply not interesting. Given the number of seeks that would be needed
to read small files in general (to the directory, to the inode table, to
the data block, etc.), one more seek because an 8k file happened to
be split isn't interesting.

The second one (f_badjour_indblks/image) is somewhat interesting:

# filefrag -v /kbuild/e2fsprogs-64bits/tests/f_badjour_indblks/image
Filesystem type is: ef53
File size of /kbuild/e2fsprogs-64bits/tests/f_badjour_indblks/image is 8388608 (2048 blocks, blocksize 4096)
 ext  logical  physical  expected  length  flags
   0        0   3312640                 1
   1        8   3312648   3312640       2
   2       17   3312657   3312649       4
   3       23   3312663   3312660       1
   4       87   3312727   3312663       2
   5      152   3312792   3312728       1
   6      216   3312856   3312792       1
   7     2047   3310082   3312856       1  eof

As you can see, it's a sparse file. Git is apparently smart enough to
write files that have large blocks of zero as sparse files. Great. So
the fact that this file is sparse means that reading this 8 megabyte
file will be pretty fast, even though the individual blocks are
scattered.

(As an aside, this is a case where the file system's block allocation
algorithm is too smart for its own good. If you look closely at the
physical blocks, nearly all of them are allocated out of the same chunk
of free space, starting at 3312640. What happened was that ext4 allocated
logical block N using physical block 3312640+N, just in case this was a
case of the compiler or linker writing an ELF object section by section
using a random access pattern, which would eventually result in a fully
contiguous, non-sparse file. In this case git is writing a sparse file
that in practice will never be written into afterwards, so it would be
ideal if ext4 instead allocated blocks #8 and #9 at 3312641 and 3312642,
block #17 at 3312643, and so on. Unfortunately, without getting a hint
from userspace, it's very hard to do this. I suppose what we could do is
decide that if we're doing delayed allocation, and the file handle is
closed, we shouldn't assume a libelf random write pattern, but rather a
"we're writing a sparse file and we're done" pattern. Ah, heuristics...)

There's a more general question, which is that I'm not sure how much of
the functionality of e4defrag -c really belongs in e4defrag. I'm
thinking that perhaps this functionality should instead go into
filefrag, and/or into e2fsck, which can do the job much more efficiently
since it by definition has direct access to the file system, and so can
scan the inode tables in order.

What do people think?

- Ted


2011-01-06 07:27:39

by Kazuya Mio

Subject: Re: Problems with e4defrag -c

Hi Ted,
Thanks for your comments.

> First of all, explicit comparisons against the current uid are bad. A
> non-root user might have read/write access to the raw device where a
> file system is located. It's bad to encode an assumption one way or
> another into a userspace program. Secondly, whenever a userspace program
> is explicitly trying to encode permission checking, that's a red flag.

I will fix it.

> I'm not sure why checking to see if a file's st_uid matches the
> current_uid has any validity at all.

e4defrag changes the location of data blocks, so I assumed that non-root
users should run e4defrag only on their own files. It would be better to
allow any user who has read/write permission to run e4defrag on the file.

> What really matters are the number of extents which are non-tail
> extents, and smaller than some threshold (probably around 256 MB for
> most HDD's), and not forced by skips in the logical block numbering
> (i.e., caused by a file being sparse). The basic idea here is to go
> back to why fragments are bad, which is that they slow down file access.
> If every few hundred megabytes, you need to seek to another part of the
> disk, it's really not the end of the world.

What does 256MB mean? If "some threshold" means the maximum size of one extent,
I think the size is 128MB.
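(An ext4 extent can describe at most 32768 blocks, so with a 4KB block
size that is 32768 x 4 KB = 128MB.)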

> There's a more general question, which is that I'm not sure how much
> of the functionality of e4defrag -c really belongs in e4defrag. I'm
> thinking that perhaps this functionality should instead go into
> filefrag, and/or into e2fsck, which can do the job much more
> efficiently since it by definition has direct access to the file
> system, and so can scan the inode tables in order.

Currently, e2fsprogs has two commands that report how badly fragmented
a file might be, so it makes sense for e2fsprogs to drop the -c option
from e4defrag. e4defrag -c shows whether we need to run e4defrag or not;
for this, I think we should add the "fragmentation score" reported by
e4defrag -c to the output of filefrag.

However, sometimes we want to check the fragmentation not only of a
single file but also of many files in the same directory. e4defrag -c
gets the extent information of all files in a directory, and calculates
the fragmentation score based on this information. But I'm not sure
whether I should add this feature to filefrag as a new option or in some
other way.

Regards,
Kazuya Mio


2011-01-07 19:38:15

by Theodore Ts'o

Subject: Re: Problems with e4defrag -c

> > What really matters are the number of extents which are non-tail
> > extents, and smaller than some threshold (probably around 256 MB for
> > most HDD's), and not forced by skips in the logical block numbering
> > (i.e., caused by a file being sparse). The basic idea here is to go
> > back to why fragments are bad, which is that they slow down file access.
> > If every few hundred megabytes, you need to seek to another part of the
> > disk, it's really not the end of the world.
>
> What does 256MB mean? If "some threshold" means the maximum size of
> one extent, I think the size is 128MB.

256MB was an arbitrary number I picked out of thin air. My point was
that when we start thinking about defragmentation, especially from a
holistic optimize-the-entire-filesystem perspective, one of the things
that we need to think about is whether a file's fragmentation is "good
enough". In fact, depending on the flex blockgroup size, it might not
be possible to have contiguous block allocations larger than 32MB or
128MB.

The reason this matters is that humans will see the fragmentation
score, not just programs, and it's better if the fragmentation score
more accurately reflects the desired end result. What I didn't like was
the fact that files which were actually contiguous, but small (say, a
6k file that was contiguously allocated), were scored worse than a
medium-sized file that was broken into two pieces. And if we have a
really, really large file, say 2TB, broken into chunks of 256MB each
--- how should these three example files be scored?

A file which is contiguous is obviously perfect, and there's no reason
to defrag it, so it should have a very good score. A really large file
that is broken up into large fragments of 128MB or 256MB each (aside
from the last "tail" fragment) should, I'd argue, also be left alone,
and so it should also get a very good score.

If we do it that way, I'm not sure we really need to have access to
the superblock to get various file system values. I can imagine
requesting certain parameters --- if you have root access, you can
grab the superblock and adjust the "threshold of perfection" down from
256MB to 32MB if flex_bg is not enabled, or based on the size of the
flex_bg groups. But if you don't have access, it might be smarter
just to use some default "threshold of perfection", as opposed to
having lots of "do we have root" checks sprinkled all over the
program.
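
To sketch what I mean (the function name, the default value, and the
exact flex_bg math below are just placeholders; units are file system
blocks):

#include <ext2fs/ext2fs.h>

/*
 * Illustrative sketch only: pick the "threshold of perfection", in
 * file system blocks.  If we couldn't open the file system (no root,
 * no read access to the device), fall back to a fixed default rather
 * than sprinkling "do we have root" checks around the program.
 */
#define DEFAULT_THRESHOLD       65536   /* 256MB worth of 4KB blocks */

static blk64_t perfection_threshold(ext2_filsys fs)
{
        if (!fs)                        /* superblock not readable */
                return DEFAULT_THRESHOLD;
        if (fs->super->s_log_groups_per_flex)
                /* contiguous allocations are bounded by the flex group */
                return (blk64_t) fs->super->s_blocks_per_group <<
                        fs->super->s_log_groups_per_flex;
        /* without flex_bg, a single block group is the upper bound */
        return fs->super->s_blocks_per_group;
}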

> Currently, e2fsprogs has two commands that report how badly
> fragmented a file might be, so it makes sense for e2fsprogs to drop
> the -c option from e4defrag. e4defrag -c shows whether we need to run
> e4defrag or not; for this, I think we should add the "fragmentation
> score" reported by e4defrag -c to the output of filefrag.

Hmm, maybe the right answer is that we have a single function, located
in libe2p, that calculates the "fragmentation score". We can separate
that out from the e4defrag code, and make it be a library function.
The programs that want to use it can call that library function.
(Parameters to the fragmentation score, such as the "threshold of
perfection", would be passed into the library function, along with a
file descriptor to the file so that FIEMAP could be called on that
file.)
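
Just to illustrate the shape of the thing (the name and signature here
are made up, not a settled API), I'm imagining something like:

/* lib/e2p/fragment_score.c --- hypothetical sketch */

/*
 * Calculate a fragmentation score for the file referred to by fd,
 * using FIEMAP to fetch its extent list.  "threshold" is the
 * threshold of perfection, in bytes; non-tail extents at least that
 * large don't count against the score.  Returns 0 on success and
 * stores the score in *score, or -1 on error.
 */
extern int e2p_get_fragment_score(int fd, unsigned long long threshold,
                                  double *score);

and then a caller (filefrag, e4defrag, or e2fsck) would just do
something like:

        double score;

        if (e2p_get_fragment_score(fd, 256ULL << 20, &score) == 0)
                printf("fragmentation score: %.1f\n", score);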

> However, sometimes we want to check the fragmentation not only of a
> single file but also of many files in the same directory. e4defrag
> -c gets the extent information of all files in a directory, and
> calculates the fragmentation score based on this information. But
> I'm not sure whether I should add this feature to filefrag as a new
> option or in some other way.

I'm not sure how useful it is to do a recursive tree walk just to
display the information for all the files in the directory. Filefrag
will already take a list of files on the command-line, and if you want
to do a recursive tree walk, you can do a "find /path/to/dir -type f |
xargs filefrag".

The main reason why I could see e4defrag wanting to know the
fragmentation scores of all of the files in the directory is so it
could make decisions about whether or not to even attempt to defrag a
file. If a file has a very good score, then maybe we should just
leave well enough alone and not even try to defrag it. I'm reminded
of the earliest Norton Utilities, that would spend hours trying to
defrag a disk to perfection. Later versions added the concept of
"good enough"; if the disk was good enough, sometimes it's better to
leave well enough alone, as opposed to spending hours and hours, and
lots of disk bandwidth, striving for perfection.

- Ted

2011-01-19 08:19:43

by Kazuya Mio

Subject: Re: Problems with e4defrag -c

2011/01/08 4:38, Ted Ts'o wrote:
> If we do it that way, I'm not sure we really need to have access to
> the superblock to get various file system values. I can imagine requesting
> certain parameters --- if you have root access, you can grab the superblock
> and adjust the "threshold of perfection" down from 256MB to 32MB
> if flex_bg is not enabled, or based on the size of the flex_bg groups. But if
> you don't have access, it might be smarter just to use some default
> "threshold of perfection", as opposed to having lots of "do we have root"
> checks sprinkled all over the program.

I agree with using some default "threshold of perfection". In this case,
256MB is too big, because all files will get the worst score if flex_bg
is disabled. So I think "blocks_per_group - 2048" would be a good
threshold; 2048 blocks is the maximum block region that mballoc handles
(with the exception of block sizes larger than 8KB). If we allocate
contiguous blocks across some metadata blocks (a backup superblock,
block group descriptors, etc.), this threshold will still judge that
there is no fragmentation in the file.
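(With the default 4KB block size and 32768 blocks per group, that works
out to 32768 - 2048 = 30720 blocks, or 120MB.)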

e.g. Allocate 512MB contiguous blocks including some metadata blocks

# dd if=/dev/zero of=/mnt/mp1/file bs=1M count=512 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 536870912 (131072 blocks, blocksize 4096)
 ext  logical  physical  expected  length  flags
   0        0     33280             30720
   1    30720     65536     63999   32768
   2    63488    100352     98303   32768
   3    96256    133120             30720
   4   126976    165888    163839    4096  eof

> Hmm, maybe the right answer is that we have a single function, located
> in libe2p, that calculates the "fragmentation score". We can separate that
> out from the e4defrag code, and make it be a library function. The programs
> that want to use it can call that library function. (Parameters to the
> fragmentation score, such as the "threshold of perfection", would be passed
> into the library function, along with a file descriptor to the file so that
> FIEMAP could be called on that file.)

OK. I'll create a new file, "fragment_score.c", in lib/e2p and add the
function there. Callers will be able to get the percentage of small
extents in the target file by calling this function.
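(In other words, something like score = 100 * (number of extents smaller
than the threshold) / (total number of extents); the exact definition is
still open, of course.)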

> I'm not sure how useful it is to do a recursive tree walk just to display
> the information for all the files in the directory. Filefrag will already
> take a list of files on the command-line, and if you want to do a recursive
> tree walk, you can do a "find /path/to/dir -type f | xargs filefrag".

It's true that we should use the find command if we really want to know
the fragmentation scores of all of the files in a directory, so I won't
add a recursive tree walk to filefrag.

Regards,
Kazuya Mio