I realize that it is generally not a good idea to tune
an operating system, or subsystem, for benchmarking, but
there's something that I don't understand about ext[234]
that is badly affecting our product. File placement on
newly-created file systems is inconsistent. I can't,
yet, call it a bug, but I really need to understand what
is happening, and I cannot find, in the source code, the
source of the randomization (related to "goal"???).
Disk drive performance for writing/reading large files
is rather sensitive to outer-/inner-diameter cylinder
placement. When I create the same file multiple times
on newly-created ext[234] file systems on the same disk
partition, I find that it does not consistently occupy
the same blocks. In fact, there is enough difference in
location to cause real differences in performance from
test to test, which I cannot justify to management.
We are currently on 2.6.32.12, using a 32-bit powerpc. The
system is booted from tftp and the root file system is NFS
for the test. The partition used is always the same one,
and it is the only one mounted from the disk. There is
always exactly one (5G) file created using the same command
"for i in 1 2 3 4 5; do dd if=/hex.txt bs=64K; \
done >>/DataVolume/hex.txt", where /hex.txt is a 1G file
and /DataVolume is the mounted disk partition.
I have tried, as I said, ext[234], and have tinkered with
most of the options, including orlov/oldallocator, and the
behavior doesn't change. Here's a sample of dumpe2fs
output from three runs, in a diff3:
====
1:51,52c
44750 free blocks, 65268 free inodes, 2 directories
Free blocks: 295-45044
2:51,52c
11990 free blocks, 65268 free inodes, 2 directories
Free blocks: 295-12284
3:51,52c
40655 free blocks, 65268 free inodes, 2 directories
Free blocks: 295-40949
====
1:59,60c
3794 free blocks, 65280 free inodes, 0 directories
Free blocks: 65819-65823, 127267-131055
2:59,60c
36554 free blocks, 65280 free inodes, 0 directories
Free blocks: 65819-65823, 94507-131055
3:59,60c
7889 free blocks, 65280 free inodes, 0 directories
Free blocks: 65819-65823, 123172-131055
Thanks for any help,
Dan
Daniel Taylor wrote:
> I realize that it is generally not a good idea to tune
> an operating system, or subsystem, for benchmarking, but
> there's something that I don't understand about ext[234]
> that is badly affecting our product. File placement on
> newly-created file systems is inconsistent. I can't,
> yet, call it a bug, but I really need to understand what
> is happening, and I cannot find, in the source code, the
> source of the randomization (related to "goal"???).
>
> Disk drive performance for writing/reading large files
> is rather sensitive to outer-/inner-diameter cylinder
> placement. When I create the same file multiple times
> on newly-created ext[234] file systems on the same disk
> partition, I find that it does not consistently occupy
> the same blocks. In fact, there is enough difference in
> location to cause real differences in performance from
> test to test, which I cannot justify to management.
>
> We are currently on 2.6.32.12, using a 32-bit powerpc. The
> system is booted from tftp and the root file system is NFS
> for the test. The partition used is always the same one,
> and it is the only one mounted from the disk. There is
> always exactly one (5G) file created using the same command
> "for i in 1 2 3 4 5; do dd if=/hex.txt bs=64K; \
> done >>/DataVolume/hex.txt", where /hex.txt is a 1G file
> and /DataVolume is the mounted disk partition.
>
> I have tried, as I said, ext[234], and have tinkered with
> most of the options, including orlov/oldallocator, and the
> behavior doesn't change. Here's a sample of dumpe2fs
> output from three runs, in a diff3:
>
> ====
> 1:51,52c
> 44750 free blocks, 65268 free inodes, 2 directories
> Free blocks: 295-45044
> 2:51,52c
> 11990 free blocks, 65268 free inodes, 2 directories
> Free blocks: 295-12284
> 3:51,52c
> 40655 free blocks, 65268 free inodes, 2 directories
> Free blocks: 295-40949
> ====
> 1:59,60c
> 3794 free blocks, 65280 free inodes, 0 directories
> Free blocks: 65819-65823, 127267-131055
> 2:59,60c
> 36554 free blocks, 65280 free inodes, 0 directories
> Free blocks: 65819-65823, 94507-131055
> 3:59,60c
> 7889 free blocks, 65280 free inodes, 0 directories
> Free blocks: 65819-65823, 123172-131055
>
> Thanks for any help,
Using a recent e2fsprogs, and the "filefrag -v" command, will
give you much more interesting layout information:
# filefrag -v testfile
Filesystem type is: ef53
File size of testfile is 1073741824 (262144 blocks, blocksize 4096)
 ext   logical  physical  expected   length  flags
   0         0   1865728              32768
   1     32768   1898496              32768
   2     65536   1931264              32768
   3     98304   1964032              32768
   4    131072   1996800               2048
   5    133120   2000896   1998847    32768
   6    165888   2033664              32768
   7    198656   2066432              30720
   8    229376   2236416   2097151     8192
   9    237568   2252800   2244607    24576  eof
testfile: 4 extents found
(hm, not sure about that 4 extent business, it must be merging
adjacent extents)
Anyway, that's easier than going backwards from free blocks.
Also, ext3 vs. ext4 will likely have very different allocator
behavior, so a full specification of your testing, with the filefrag
output, would probably best characterize what you're seeing.
-Eric
On Tue, Jul 6, 2010 at 3:49 AM, Daniel Taylor <[email protected]> wrote:
> I realize that it is generally not a good idea to tune
> an operating system, or subsystem, for benchmarking, but
> there's something that I don't understand about ext[234]
> that is badly affecting our product. File placement on
> newly-created file systems is inconsistent. I can't,
> yet, call it a bug, but I really need to understand what
> is happening, and I cannot find, in the source code, the
> source of the randomization (related to "goal"???).
>
> Disk drive performance for writing/reading large files
> is rather sensitive to outer-/inner-diameter cylinder
> placement. When I create the same file multiple times
> on newly-created ext[234] file systems on the same disk
> partition, I find that it does not consistently occupy
> the same blocks. In fact, there is enough difference in
> location to cause real differences in performance from
> test to test, which I cannot justify to management.
>
The ext[23] allocator (and I suppose ext4's as well) uses the process
pid % 16 to define a 'color' for the process.
A new file's first block goal depends on that 'color' - the goal is one
of 16 different offsets in the block group where the new file's inode
was allocated (usually the block group of its parent directory).
The logic behind this allocator is that multiple files created
concurrently in the same directory have less chance of stepping over
each other's allocations.
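In rough terms, the goal computation looks something like this (a
simplified sketch of the ext2/3 logic, not verbatim kernel code; the
helper and the example pids are just for illustration):

#include <stdio.h>
#include <sys/types.h>

/*
 * Simplified sketch (not verbatim kernel code) of how ext2/ext3 pick the
 * starting block "goal" for a new file's data: an offset into the block
 * group of the file's inode, derived from the creating process's pid.
 */
static unsigned long first_block_goal(unsigned long group_first_block,
                                      unsigned long blocks_per_group,
                                      pid_t creator_pid)
{
        /* 16 "colours"; each pid class starts in its own slice of the group */
        unsigned long colour = (creator_pid % 16) * (blocks_per_group / 16);

        return group_first_block + colour;
}

int main(void)
{
        unsigned long blocks_per_group = 32768;  /* typical with 4K blocks */

        /* two different pids land 1/16th of a block group apart */
        printf("pid 100 -> goal %lu\n", first_block_goal(0, blocks_per_group, 100));
        printf("pid 101 -> goal %lu\n", first_block_goal(0, blocks_per_group, 101));
        return 0;
}

A different pid from run to run therefore means a different starting
goal from run to run.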
I am not sure what you are trying to test or how this behavior badly
affects your product.
If you specify your needs maybe someone can help you solve your problem.
I think that ext4 has some advanced features, like pre-allocation,
that may be able to help you.
Amir.
On Mon, Jul 05, 2010 at 06:49:34PM -0700, Daniel Taylor wrote:
> I realize that it is generally not a good idea to tune
> an operating system, or subsystem, for benchmarking, but
> there's something that I don't understand about ext[234]
> that is badly affecting our product. File placement on
> newly-created file systems is inconsistent. I can't,
> yet, call it a bug, but I really need to understand what
> is happening, and I cannot find, in the source code, the
> source of the randomization (related to "goal"???).
In ext3, it really is random. The randomness you're looking for can
be found in fs/ext3/ialloc.c:find_group_orlov(), when it calls
get_random_bytes(). This is responsible for "spreading" directories
across the block groups, to try to prevent fragmented files. Yes, if
all you care about is benchmarks which only use 10% of the entire
file system and which don't adequately simulate file system aging,
the algorithms in ext3 will cause a lot of variability.
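In outline, the spreading idea works something like the following
userspace sketch (the group_desc table, thresholds, and rand() are
made-up stand-ins; the real code in fs/ext3/ialloc.c uses
get_random_bytes() and several more heuristics):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Userspace sketch of the "spread directories" idea behind
 * find_group_orlov(): pick a random starting block group, then scan
 * forward for one with enough free space.  Not kernel code.
 */
struct group_desc {
        unsigned int free_blocks;
        unsigned int free_inodes;
};

static int find_group_for_dir(const struct group_desc *groups,
                              unsigned int ngroups,
                              unsigned int min_blocks,
                              unsigned int min_inodes)
{
        unsigned int start = (unsigned int)rand() % ngroups; /* stand-in for get_random_bytes() */
        unsigned int i;

        for (i = 0; i < ngroups; i++) {
                unsigned int g = (start + i) % ngroups;

                if (groups[g].free_blocks >= min_blocks &&
                    groups[g].free_inodes >= min_inodes)
                        return (int)g;
        }
        return -1;      /* no suitable group */
}

int main(void)
{
        struct group_desc groups[8] = {
                { 30000, 8000 }, { 31000, 8000 }, { 29000, 8000 }, { 32000, 8000 },
                { 28000, 8000 }, { 31500, 8000 }, { 30500, 8000 }, { 32768, 8192 },
        };

        srand((unsigned int)time(NULL));
        printf("new directory inode goes in group %d\n",
               find_group_for_dir(groups, 8, 1000, 100));
        return 0;
}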
Yes, if you use FAT-style algorithms which try to use the first free
inode, and first free block which is available, for the purposes of
competitive benchmarking (especially if the benchmarks are crap), you
can probably win against the competition. Unfortunately, long-term
your product will probably be far more likely to suffer from file system
aging as the blocks at the beginning of the file system are badly
fragmented. Please don't do that, though (or, if you must, please
have a switch so that users can switch it from "competitive
benchmarking mode" to "friendly to real life users" mode).
Ext4 uses very different algorithms, and it's not, strictly speaking,
random, since it uses a cut-down md4 hash of the directory name to
decide where to place the directory inode (and the location of the
directory inode affects both the files created in that directory as
well as the blocks allocated to those files, as in ext3). So as long
as the directory hash seed in the superblock stays constant, and the
directory and file names created stay constant, the inode and block
layout will also be consistent.
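The idea can be illustrated with a toy sketch like the one below; the
hash here is NOT the half-md4 hash ext4 really uses, it only shows that
a constant (seed, name) pair always yields the same goal group:

#include <stdio.h>
#include <string.h>

/*
 * Toy illustration of name-hash-based directory placement: hash the
 * directory name together with the superblock's hash seed and use the
 * result to pick a block group.  The hash function is a made-up
 * stand-in, not ext4's half-md4.
 */
static unsigned int toy_name_hash(const unsigned int seed[4], const char *name)
{
        unsigned int h = seed[0] ^ seed[1] ^ seed[2] ^ seed[3];
        size_t i, len = strlen(name);

        for (i = 0; i < len; i++)
                h = h * 31 + (unsigned char)name[i];
        return h;
}

int main(void)
{
        const unsigned int hash_seed[4] = { 0x1234, 0x5678, 0x9abc, 0xdef0 };
        unsigned int ngroups = 64;

        /* same seed + same name => same group, run after run */
        printf("\"somedir\" -> block group %u\n",
               toy_name_hash(hash_seed, "somedir") % ngroups);
        return 0;
}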
All of this having been said, it may very well be possible to improve
on the anti-fragmentation algorithms while still trying to allocate
block groups closer to the beginning of the disk to take advantage of
the inner-diameter/outer-diameter placement effect. There's probably
room for some research work here. But please do be careful before
twiddling too much with the allocator algorithms; they are somewhat
subtle....
- Ted
[email protected] wrote:
> On Mon, Jul 05, 2010 at 06:49:34PM -0700, Daniel Taylor wrote:
>> I realize that it is generally not a good idea to tune
>> an operating system, or subsystem, for benchmarking, but
>> there's something that I don't understand about ext[234]
>> that is badly affecting our product. File placement on
>> newly-created file systems is inconsistent. I can't,
>> yet, call it a bug, but I really need to understand what
>> is happening, and I cannot find, in the source code, the
>> source of the randomization (related to "goal"???).
>
> In ext3, it really is random. The randomness you're looking for can
> be found in fs/ext3/ialloc.c:find_group_orlov(), when it calls
> get_random_bytes(). This is responsible for "spreading" directories
> across the block groups, to try to prevent fragmented files. Yes, if
> all you care about is benchmarks which only use 10% of the entire
> file system and which don't adequately simulate file system aging,
> the algorithms in ext3 will cause a lot of variability.
However, from the test description it looks like it is writing
a file to the root dir, so there should be no parent-dir random spreading,
right?
-Eric
On Tue, Jul 06, 2010 at 01:59:34PM -0500, Eric Sandeen wrote:
> However, from the test description it looks like it is writing
> a file to the root dir, so there should be no parent-dir random spreading,
> right?
Hmm, yes, I missed that part of Daniel's e-mail. He's just writing a
single file. In that case, Amir is right, the only thing which would
be causing this is the colour offset, at least for ext2 and ext3.
This is to avoid fragmented files caused by two or more processes running
on different CPUs all writing into the same block group.
In the case of ext4, we don't use a pid-determined colour algorithm if
delayed allocation is used, and the randomness is caused by the
writeback system deciding to write out different chunks of pages
first. The way to fix this when writing large files is to use
fallocate(2), so the file can be allocated contiguously.
In any case, Daniel, if you want the best results for your benchmark,
use ext4, and tweak the script slightly:
touch /DataVolume/hex.txt
fallocate -l 5G /DataVolume/hex.txt
for i in 0 1 2 3 4
do
dd if=/hex.txt of=/DataVolume/hex.txt bs=64k conv=notrunc \
oflag=direct,append
done
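If the writing program can itself be changed (rather than going through
the fallocate(8) utility), the same preallocate-then-write idea from C
might look roughly like this; a minimal sketch, assuming a glibc new
enough to expose fallocate(2) (posix_fallocate() is the portable
fallback, though on older glibc it may degrade to writing zeroes). The
path and size are just the values from this thread:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64    /* 64-bit off_t, needed for a 5G length on 32-bit */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/DataVolume/hex.txt";
        off_t len = 5LL * 1024 * 1024 * 1024;   /* 5G, as in the test above */
        int fd = open(path, O_CREAT | O_WRONLY, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* ask the filesystem to reserve the whole file up front */
        if (fallocate(fd, 0, 0, len) != 0) {
                /* e.g. ext3 or an old kernel: fall back to the portable call */
                int err = posix_fallocate(fd, 0, len);
                if (err != 0) {
                        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
                        close(fd);
                        return 1;
                }
        }

        close(fd);
        return 0;
}

Once the blocks are reserved, the dd pass simply fills them in, so the
final layout no longer depends on the order in which writeback flushes
the pages.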
Best regards,
- Ted
> -----Original Message-----
> From: Eric Sandeen [mailto:[email protected]]
> Sent: Tuesday, July 06, 2010 12:00 PM
> To: [email protected]
> Cc: Daniel Taylor; [email protected]
> Subject: Re: inconsistent file placement
>
> [email protected] wrote:
> > On Mon, Jul 05, 2010 at 06:49:34PM -0700, Daniel Taylor wrote:
> >> I realize that it is generally not a good idea to tune
> >> an operating system, or subsystem, for benchmarking, but
> >> there's something that I don't understand about ext[234]
> >> that is badly affecting our product. File placement on
> >> newly-created file systems is inconsistent. I can't,
> >> yet, call it a bug, but I really need to understand what
> >> is happening, and I cannot find, in the source code, the
> >> source of the randomization (related to "goal"???).
> >
> > In ext3, it really is random. The randomness you're looking for can
> > be found in fs/ext3/ialloc.c:find_group_orlov(), when it calls
> > get_random_bytes(). This is responsible for "spreading" directories
> > across the block groups, to try to prevent fragmented files. Yes, if
> > all you care about is benchmarks which only use 10% of the entire
> > file system and which don't adequately simulate file system aging,
> > the algorithms in ext3 will cause a lot of variability.
>
> However, from the test description it looks like it is writing
> a file to the root dir, so there should be no parent-dir random spreading,
> right?
>
> -Eric
>
>
In all of my recent tests, there has only been one file created, in
the root directory of the freshly created and mounted file system.
mkfs.ext[234] -b 65536 /dev/sda4
mount <some options tested> /dev/sda4 /DataVolume
touch /DataVolume/hex.txt
"for i in 1 2 3 4 5; do dd if=/hex.txt bs=64K; \
done >>/DataVolume/hex.txt"
umount /DataVolume
dumpe2fs /dev/sda4 >/<log file>
where /hex.txt is a 1G file on the NFS root.
I tried with, and without, orlov on ext3 (-o orlov and -o oldalloc)
and didn't see any change in the behavior. In ext4, there seemed
to be less variability, but it is still present, and the "less" may
just be the small sample size.
Now, at least, I understand that the placement algorithm does not
always start at the first free block.
It is an unfortunate fact of life that simplistic benchmarks often
drive sales. This product will be a consumer NAS and when our
internal runs of the common NAS benchmarks get inconsistent results,
it creates a lot of concern.
There's an option for ext4 (delayed allocation) that looks like it
bypasses the "pid % 16" coloration. I'll tinker some more with
that and see how it goes.
Thank you all for your input.
On Tue, Jul 06, 2010 at 03:15:00PM -0700, Daniel Taylor wrote:
>
> It is an unfortunate fact of life that simplistic benchmarks often
> drive sales. This product will be a consumer NAS and when our
> internal runs of the common NAS benchmarks get inconsistent results,
> it creates a lot of concern.
Out of curiosity, what *are* the "common NAS benchmarks" in use today,
and who chooses them?
There have been times in the past when "common benchmarks" promulgated
by reviewers have done active harm in the industry, driving disk drive
manufacturers to choose unsafe defaults, all because the only thing
people paid attention to was crappy benchmarks.
Sometimes the right answer is to put a spotlight on deficient
benchmarks, and to try to change them...
> There's an option for ext4 (delayed allocation) that looks like it
> bypasses the "pid % 16" coloration. I'll tinker some more with
> that and see how it goes.
Delayed allocation is the default for ext4. If you are seeing random
behaviour there it's probably because you need to be smarter in how
you write them --- see my previous e-mail about using fallocate.
Speaking of fallocate.... if this is a NAS box then the file is
probably written using CIFS, right? Are you using a modern version of
Samba? If you use a new enough libc (that understands the
fallocate system call) and a new enough version of Samba, the
userspace should be using fallocate() to more efficiently allocate the
space. This is a feature which is not in ext3, but it is supported by
ext4, and it's a major win. The basic idea was discovered a while
ago, and was written up here:
http://software.intel.com/en-us/articles/windows-client-cifs-behavior-can-slow-linux-nas-performance/
(This was a 2007 report, and back then ext4 wasn't ready, so the only
file system available was XFS, which did have both delayed allocation
and fallocate support for preallocation. XFS is a good filesystem,
although it often tends to be a bit memory-hungry for many bookshelf
NAS systems.)
See also here for a patch (but I'm pretty sure this functionality
is already in the most recent version of Samba if I recall correctly):
https://bugzilla.redhat.com/show_bug.cgi?id=525532
I know a fair number of folks on the Samba core team; most of them
have been hired by companies to work full-time on CIFS support
(usually using Samba), but some of them may still be available to help
out on a consulting basis... let me know if you'd like me to make some
introductions.
- Ted
P.S. Amir, this is one of the reasons why you folks should seriously think
about merging Next3 support into ext4. :-)
Daniel Taylor wrote:
>
>
>> -----Original Message-----
>> From: Eric Sandeen [mailto:[email protected]]
>> Sent: Tuesday, July 06, 2010 12:00 PM
>> To: [email protected]
>> Cc: Daniel Taylor; [email protected]
>> Subject: Re: inconsistent file placement
>>
>> [email protected] wrote:
>>> On Mon, Jul 05, 2010 at 06:49:34PM -0700, Daniel Taylor wrote:
>>>> I realize that it is generally not a good idea to tune
>>>> an operating system, or subsystem, for benchmarking, but
>>>> there's something that I don't understand about ext[234]
>>>> that is badly affecting our product. File placement on
>>>> newly-created file systems is inconsistent. I can't,
>>>> yet, call it a bug, but I really need to understand what
>>>> is happening, and I cannot find, in the source code, the
>>>> source of the randomization (related to "goal"???).
>>> In ext3, it really is random. The randomness you're looking for can
>>> be found in fs/ext3/ialloc.c:find_group_orlov(), when it calls
>>> get_random_bytes(). This is responsible for "spreading" directories
>>> across the block groups, to try to prevent fragmented files. Yes, if
>>> all you care about is benchmarks which only use 10% of the entire
>>> file system and which don't adequately simulate file system aging,
>>> the algorithms in ext3 will cause a lot of variability.
>> However, from the test description it looks like it is writing
>> a file to the root dir, so there should be no parent-dir random spreading,
>> right?
>>
>> -Eric
>>
>>
>
> In all of my recent tests, there has only been one file created, in
> the root directory of the freshly created and mounted file system.
>
> mkfs.ext[234] -b 65536 /dev/sda4
> mount <some options tested> /dev/sda4 /DataVolume
> touch /DataVolume/hex.txt
> "for i in 1 2 3 4 5; do dd if=/hex.txt bs=64K; \
> done >>/DataVolume/hex.txt"
> umount /DataVolume
> dumpe2fs /dev/sda4 >/<log file>
>
> where /hex.txt is a 1G file on the NFS root.
>
> I tried with, and without, orlov on ext3 (-o orlov and -o oldalloc)
> and didn't see any change in the behavior. In ext4, there seemed
> to be less variability, but it is still present, and the "less" may
> just be the small sample size.
orlov is an inode allocator for directory inodes; since you
are creating 1 file in the root dir, those options won't matter.
It affects file placement because files prefer to be close to their
parent dir, more or less, but in your case you are never allocating
a directory so the point is moot.
-o orlov is the default, FWIW.
> Now, at least, I understand that the placement algorithm does not
> always start at the first free block.
>
> It is an unfortunate fact of life that simplistic benchmarks often
> drive sales. This product will be a consumer NAS and when our
> internal runs of the common NAS benchmarks get inconsistent results,
> it creates a lot of concern.
>
> There's an option for ext4 (delayed allocation) that looks like it
> bypasses the "pid % 16" coloration. I'll tinker some more with
> that and see how it goes.
delalloc is the default as well.
filefrag -v output would be much more enlightening than what you've
shown so far...
-Eric
> Thank you all for your input.
[email protected] wrote:
> On Tue, Jul 06, 2010 at 03:15:00PM -0700, Daniel Taylor wrote:
...
>
> Speaking of fallocate.... if this is a NAS box then the file is
> probably written using CIFS, right? Are you using a modern version of
> Samba? If you use a new enough libc (that understands the
> fallocate system call) and a new enough version of Samba, the
> userspace should be using fallocate() to more efficiently allocate the
> space. This is a feature which is not in ext3, but it is supported by
> ext4, and it's a major win. The basic idea was discovered a while
> ago, and was written up here:
>
> http://software.intel.com/en-us/articles/windows-client-cifs-behavior-can-slow-linux-nas-performance/
>
> (This was a 2007 report, and back then ext4 wasn't ready, so the only
> file system available was XFS, which did have both delayed allocation
> and fallocate support for preallocation. XFS is a good filesystem,
> although it often tends to be a bit memory-hungry for many bookshelf
> NAS systems.)
XFS is actually a favorite of the ARM embedded NAS space :)
> See also here for a patch (but I'm pretty sure this functionality
> is already in the most recent version of Samba if I recall correctly):
>
> https://bugzilla.redhat.com/show_bug.cgi?id=525532
that patch is rather simplistic, FWIW; at least for XFS it -hurt- perf
due to the unwritten->written conversion and the relatively small, frequent
preallocations.
More smarts to merge up multiple 1-byte-writes into a large preallocation
might help, as the bug mentions.
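For what it's worth, the kind of "merge small writes into one big
reservation" smarts being described could look roughly like this; a
hypothetical sketch, not Samba code, and the helper name and chunk size
are made up for illustration:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01        /* from <linux/falloc.h> */
#endif

#define PREALLOC_CHUNK (8LL * 1024 * 1024)      /* reserve 8MB at a time (made-up number) */

/*
 * Hypothetical helper, not Samba code: instead of preallocating a few
 * bytes per client write, extend the reservation in big chunks whenever
 * a write runs past what has been reserved so far.  FALLOC_FL_KEEP_SIZE
 * keeps st_size unchanged so reads at EOF still behave normally.
 */
static off_t maybe_extend_prealloc(int fd, off_t reserved_end,
                                   off_t write_off, off_t write_len)
{
        off_t needed = write_off + write_len;
        off_t new_end;

        if (needed <= reserved_end)
                return reserved_end;            /* still inside the reservation */

        /* round the new reservation end up to a whole chunk */
        new_end = ((needed + PREALLOC_CHUNK - 1) / PREALLOC_CHUNK) * PREALLOC_CHUNK;

        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, reserved_end,
                      new_end - reserved_end) != 0)
                return reserved_end;            /* not supported: just keep writing */

        return new_end;
}

int main(void)
{
        int fd = open("prealloc-demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        off_t reserved = 0, off;
        const char byte = 'x';

        if (fd < 0)
                return 1;

        /* simulate a client dribbling in tiny writes */
        for (off = 0; off < 32; off++) {
                reserved = maybe_extend_prealloc(fd, reserved, off, 1);
                if (pwrite(fd, &byte, 1, off) != 1)
                        break;
        }
        close(fd);
        return 0;
}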
But ... is something like it already in Samba? That'd be nifty, but I wasn't
aware of that. There is a preallocation-sounding switch but I think it doesn't
do what you think it does. I'd have to go look up details, though.
-Eric
> I know a fair number of folks on the Samba core team; most of them
> have been hired by companies to work full-time on CIFS support
> (usually using Samba), but some of them may still be available to help
> out on a consulting basis... let me know if you'd like me to make some
> introductions.
>
> - Ted
>
> P.S. Amir, this is one of the reasons why you folks should seriously think
> about merging Next3 support into ext4. :-)
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Tuesday, July 06, 2010 4:14 PM
> To: Daniel Taylor
> Cc: Eric Sandeen; [email protected]; [email protected]
> Subject: Re: inconsistent file placement
>
> On Tue, Jul 06, 2010 at 03:15:00PM -0700, Daniel Taylor wrote:
> >
> > It is an unfortunate fact of life that simplistic benchmarks often
> > drive sales. This product will be a consumer NAS and when our
> > internal runs of the common NAS benchmarks get inconsistent results,
> > it creates a lot of concern.
>
> Out of curiosity, what *are* the "common NAS benchmarks" in use today,
> and who chooses them?
The benchmarks are chosen by individual reviewers (probably looking over
each other's shoulders). "smallnetbuilder.com" is a fairly good example.
FWIW:
1) NASPT, PC only
2) IOzone, Mac & PC
3) IOmeter, PC
BTW, the simple test sequence was trying to distill something that our
in-house performance tester was seeing in some SATA traces. It is NOT
one of the "real" benchmarks.
> Speaking of fallocate.... if this is a NAS box then the file is
> probably written using CIFS, right? Are you using a modern version of
> Samba?
Currently, we're on 3.2.5 of smbd, but that's because the later versions
work less well with ext3. We will be testing them with ext4 now that
we see the other options it offers.
Soon as I can get the fallocate utility cross-built, there are some
experiments that I want to run, but those will take a couple of days.
Thanks again for all of your help.
Daniel Taylor wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
...
>> Speaking of fallocate.... if this is a NAS box then the file is
>> probably written using CIFS, right? Are you using a modern version of
>> Samba?
>
> Currently, we're on 3.2.5 of smbd, but that's because the later versions
> work less well with ext3. We will be testing them with ext4 now that
> we see the other options it offers.
>
> Soon as I can get the fallocate utility cross-built, there are some
> experiments that I want to run, but those will take a couple of days.
>
> Thanks again for all of your help.
http://sandeen.fedorapeople.org/utilities/fallocate.c
should be a simple compilable utility, or if for some reason
you don't have it in your glibc, you can call the syscall directly
with:
http://sandeen.fedorapeople.org/utilities/fallocate-via-syscall.c
(you may need to define SYS_fallocate & massage appropriately
depending on the architecture)
The tool in util-linux-ng is similar, and I think it started with
this code, but it's a quick hack so look out for bugs ;)
-Eric