Hi,
Recently I've been experimenting with O_DIRECT in ext4 to get a
feel for how much file fragmentation will be generated.
On a newly formatted ext4 partition (no journal), I created a top-level
directory, and under this top-level directory I ran a test program to
generate some files.
The test program does the following:
-- create multiple threads (in my test case: 16 threads)
-- each thread creates a file with the O_DIRECT flag and keeps
extending the file to 1MB
Since these threads run concurrently, they compete in block allocation.
After the program ran to completion, I ran filefrag on each file and
measured how many extents there are in the file.
And here is a sample result:
file0: 6 extents found
file1: 20 extents found
file2: 7 extents found
file3: 6 extents found
file4: 6 extents found
file5: 5 extents found
file6: 6 extents found
file7: 20 extents found
file8: 20 extents found
file9: 20 extents found
file10: 20 extents found
file11: 20 extents found
file12: 20 extents found
file13: 19 extents found
file14: 19 extents found
file15: 19 extents found
Looks like these files are quite heavily fragmented.
For comparison, I did the same experiment on an ext2 partition,
resulting in each file having only 1 extent.
I also ran the same experiments using buffered writes (by removing the
O_DIRECT flag) on ext2 and ext4; both resulted in each file having
only 1 extent.
I am wondering whether this kind of file fragmentation is a known
issue in ext4 when O_DIRECT is used. Is it by design?
Since ext2 does not seem to have this issue in my test case, should we
make the behavior of ext4 similar to ext2 in situations like this?
Thanks,
Xiang
Xiang Wang wrote:
> Hi,
>
> Recently I've been experimenting with O_DIRECT in ext4 to get a
> feeling of how much file fragmentation will be generated.
>
> On a newly formatted ext4 partition (no journal), I created a top-level
> directory and under this top-level directory I ran a test program to
> generate some files.
>
> The test program does the following:
> -- create multiple threads (in my test case: 16 threads)
> -- each thread creates a file with the O_DIRECT flag and keeps
> extending the file to 1MB
> Since these threads run concurrently, they compete in block allocation.
>
> After the program ran to completion, I ran filefrag on each file and
> measured how many extents there are in the file.
> And here is a sample result:
> file0: 6 extents found
> file1: 20 extents found
> file2: 7 extents found
> file3: 6 extents found
> file4: 6 extents found
> file5: 5 extents found
> file6: 6 extents found
> file7: 20 extents found
> file8: 20 extents found
> file9: 20 extents found
> file10: 20 extents found
> file11: 20 extents found
> file12: 20 extents found
> file13: 19 extents found
> file14: 19 extents found
> file15: 19 extents found
>
> Looks like these files are quite heavily fragmented.
Multiple parallel extending DIOs in a single dir is a tough case for a
filesystem - it has no hints about what to do, and can't use delalloc to
wait to see what's happening; it just has to allocate things as they
come, more or less.
> For comparison, I did the same experiment on an ext2 partition,
> resulting in each file having only 1 extent.
Interesting, not sure I would have expected that.
> I also did the experiments of using buffered writes (by removing the
> O_DIRECT flag) on ext2 and ext4, both resulting in each file having
> only 1 extent.
Delayed allocation at work, I suppose.
> I am wondering whether this kind of file fragmentation is already a
> known issue in ext4 when O_DIRECT is used? Is it something by design?
> Since it seems like ext2 does not have this issue under my test case,
> is it necessary that we make the behavior of ext4 similar to ext2
> under situations like this?
Is this representative of a real workload?
-Eric
> Thanks,
> Xiang
On Mon, Jul 20, 2009 at 8:41 PM, Eric Sandeen<[email protected]> wrote:
> Xiang Wang wrote:
>> Hi,
>>
>> Recently I've been experimenting with O_DIRECT in ext4 to get a
>> feeling of how much file fragmentation will be generated.
>>
>> On a newly formatted ext4 partition (no journal), I created a top-level
>> directory and under this top-level directory I ran a test program to
>> generate some files.
>>
>> The test program does the following:
>> -- create multiple threads (in my test case: 16 threads)
>> -- each thread creates a file with the O_DIRECT flag and keeps
>> extending the file to 1MB
>> Since these threads run concurrently, they compete in block allocation.
>>
>> After the program ran to completion, I ran filefrag on each file and
>> measured how many extents there are in the file.
>> And here is a sample result:
>> file0: 6 extents found
>> file1: 20 extents found
>> file2: 7 extents found
>> file3: 6 extents found
>> file4: 6 extents found
>> file5: 5 extents found
>> file6: 6 extents found
>> file7: 20 extents found
>> file8: 20 extents found
>> file9: 20 extents found
>> file10: 20 extents found
>> file11: 20 extents found
>> file12: 20 extents found
>> file13: 19 extents found
>> file14: 19 extents found
>> file15: 19 extents found
>>
>> Looks like these files are quite heavily fragmented.
>
> Multiple parallel extending DIOs in a single dir is a tough case for a
> filesystem - it has no hints about what to do, and can't use delalloc to
> wait to see what's happening; it just has to allocate things as they
> come, more or less.
>
>> For comparison, I did the same experiment on an ext2 partition,
>> resulting in each file having only 1 extent.
>
> Interesting, not sure I would have expected that.
Same with us; we're looking into more variables to understand it.
>> I also did the experiments of using buffered writes (by removing the
>> O_DIRECT flag) on ext2 and ext4, both resulting in each file having
>> only 1 extent.
>
> delayed allocation at work I suppose.
>
>> I am wondering whether this kind of file fragmentation is already a
>> known issue in ext4 when O_DIRECT is used? Is it something by design?
>> Since it seems like ext2 does not have this issue under my test case,
>> is it necessary that we make the behavior of ext4 similar to ext2
>> under situations like this?
>
> Is this representative of a real workload?
Not exactly perhaps, but we do have apps that are showing
significantly more fragmentation in their files on ext4 than with
ext2, while using O_DIRECT (e.g., 8 extents on ext4 vs 1 on ext2, as
reported by filefrag). The experiment above is synthetic, but fairly
representative.
(Hence the related questions about fallocate, since this is one
possible, though ugly, workaround.)
Curt
Curt Wohlgemuth wrote:
> On Mon, Jul 20, 2009 at 8:41 PM, Eric Sandeen<[email protected]> wrote:
>> Xiang Wang wrote:
>>> For comparison, I did the same experiment on an ext2 partition,
>>> resulting in each file having only 1 extent.
>> Interesting, not sure I would have expected that.
>
> Same with us; we're looking into more variables to understand it.
To be clear, what I meant is that I would not have expected ext2 to
deal well with it either ;) I'm not terribly surprised that ext4
gets fragmented.
For the numbers posted, how big were the files (how many 1MB chunks
were written)?
Just FWIW; I did something like:
# for I in `seq 1 16`; do dd if=/dev/zero of=testfile$I bs=1M count=16
oflag=direct & done
on a rhel5.4 beta kernel and got:
~5 extents per file on ext4 (per filefrag output)
between 41 and 234 extents on ext2.
~6 extents per file on ext3.
~16 extents per file on xfs
if I created a subdir for each file:
# for I in `seq 1 16`; do mkdir dir$I; dd if=/dev/zero
of=dir$I/testfile$I bs=1M count=16 oflag=direct & done
~5 extents per file on ext4
1 or 2 extents per file on ext2
1 or 2 extents per file on ext3
~16 extents per file on xfs.
-Eric
On Tue, Jul 21, 2009 at 9:38 AM, Eric Sandeen<[email protected]> wrote:
> Curt Wohlgemuth wrote:
>> On Mon, Jul 20, 2009 at 8:41 PM, Eric Sandeen<[email protected]> wrote:
>>> Xiang Wang wrote:
>
>>>> For comparison, I did the same experiment on an ext2 partition,
>>>> resulting in each file having only 1 extent.
>>> Interesting, not sure I would have expected that.
>>
>> Same with us; we're looking into more variables to understand it.
>
> To be more clear, I would not have expected ext2 to deal well with it
> either, is more what I meant ;) I'm not terribly surprised that ext4
> gets fragmented.
>
> For the numbers posted, how big were the files (how many 1m chunks were
> written?)
>
> Just FWIW; I did something like:
>
> # for I in `seq 1 16`; do dd if=/dev/zero of=testfile$I bs=1M count=16
> oflag=direct & done
>
> on a rhel5.4 beta kernel and got:
>
> ~5 extents per file on ext4 (per filefrag output)
> between 41 and 234 extents on ext2.
> ~6 extents per file on ext3.
> ~16 extents per file on xfs
>
I repeated this test (bs=1M count=16) by tuning some parameters in my
test program, and I got the following results (per filefrag output):
ext4:
5 extents per file
ext2:
file0: 5 extents found, perfection would be 1 extent
file1: 5 extents found, perfection would be 1 extent
file2: 6 extents found, perfection would be 1 extent
file3: 4 extents found, perfection would be 1 extent
file4: 4 extents found, perfection would be 1 extent
file5: 6 extents found, perfection would be 1 extent
file6: 4 extents found, perfection would be 1 extent
file7: 5 extents found, perfection would be 1 extent
file8: 6 extents found, perfection would be 1 extent
file9: 4 extents found, perfection would be 1 extent
file10: 5 extents found, perfection would be 1 extent
file11: 6 extents found, perfection would be 1 extent
file12: 6 extents found, perfection would be 1 extent
file13: 8 extents found, perfection would be 1 extent
file14: 4 extents found, perfection would be 1 extent
file15: 7 extents found, perfection would be 1 extent
The results on ext4 look comparable to yours, while the results on
ext2 look very different.
I am attaching the test program I used in case you want to try it; it
is at the end of the message.
I invoked it as "./mt_write 16 1" to have 16 threads writing with O_DIRECT.
> if I created a subdir for each file:
>
> # for I in `seq 1 16`; do mkdir dir$I; dd if=/dev/zero
> of=dir$I/testfile$I bs=1M count=16 oflag=direct & done
>
> ~5 extents per file on ext4
> 1 or 2 extents per file on ext2
> 1 or 2 extents per file on ext3
> ~16 extents per file on xfs.
>
> -Eric
>
======
/*
 * mt_write.c -- multiple threads extending files concurrently.
 */
/* The feature-test macro must come before the includes so that
 * O_DIRECT (fcntl.h) and posix_memalign (stdlib.h) are declared. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <pthread.h>
#include <sys/stat.h>
#include <fcntl.h>

#define MAX_THREAD 1000
#define BUFSIZE 1048576
#define COUNT 16

typedef struct {
    int id;
    int odirect;
} parm;

void *expand(void *arg)
{
    char *buf;
    char fname[16];
    int fd;
    int i, count;
    parm *p = (parm *)arg;

    /* O_DIRECT needs to work with aligned memory; 512 bytes matches
     * the logical block size of most disks. */
    if (posix_memalign((void **)&buf, 512, BUFSIZE) != 0) {
        fprintf(stderr, "cannot allocate aligned mem!\n");
        return NULL;
    }
    sprintf(fname, "file%d", p->id);
    /* O_CREAT requires a mode argument. */
    if (p->odirect)
        fd = open(fname, O_RDWR|O_CREAT|O_APPEND|O_DIRECT, 0644);
    else
        fd = open(fname, O_RDWR|O_CREAT|O_APPEND, 0644);
    if (fd == -1) {
        fprintf(stderr, "Open %s failed!\n", fname);
        free(buf);
        return NULL;
    }
    for (i = 0; i < COUNT; i++) {
        count = write(fd, buf, BUFSIZE);
        if (count == -1) {
            fprintf(stderr, "Only able to finish %d blocks of data\n", i);
            close(fd);
            free(buf);
            return NULL;
        }
    }
    if (!p->odirect)
        fsync(fd);
    printf("Done with writing %d blocks of data\n", COUNT);
    close(fd);
    free(buf);
    return NULL;
}

int main(int argc, char *argv[])
{
    int n, i, odirect;
    pthread_t *threads;
    pthread_attr_t pthread_custom_attr;
    parm *p;

    if (argc != 3) {
        printf("Usage: %s <# of threads> <O_DIRECT? 1:0>\n", argv[0]);
        exit(1);
    }
    n = atoi(argv[1]);
    odirect = atoi(argv[2]);
    if ((n < 1) || (n > MAX_THREAD)) {
        printf("The # of threads should be between 1 and %d.\n", MAX_THREAD);
        exit(1);
    }
    threads = (pthread_t *)malloc(n * sizeof(*threads));
    pthread_attr_init(&pthread_custom_attr);
    p = (parm *)malloc(sizeof(parm) * n);

    /* Start up the threads. */
    for (i = 0; i < n; i++) {
        p[i].id = i;
        p[i].odirect = odirect;
        pthread_create(&threads[i], &pthread_custom_attr,
                       expand, (void *)(p + i));
    }
    /* Synchronize the completion of each thread. */
    for (i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    free(p);
    return 0;
}
On Tue, 2009-07-21 at 11:38 -0500, Eric Sandeen wrote:
> Curt Wohlgemuth wrote:
> > On Mon, Jul 20, 2009 at 8:41 PM, Eric Sandeen<[email protected]> wrote:
> >> Xiang Wang wrote:
>
> >>> For comparison, I did the same experiment on an ext2 partition,
> >>> resulting in each file having only 1 extent.
> >> Interesting, not sure I would have expected that.
> >
> > Same with us; we're looking into more variables to understand it.
>
> To be more clear, I would not have expected ext2 to deal well with it
> either, is more what I meant ;) I'm not terribly surprised that ext4
> gets fragmented.
Ext2 deals with it via the block reservation code added some time ago.
It turns out it works pretty well for this case. Ext4, of course,
doesn't use the block reservation code.
--
Frank Mayhar <[email protected]>
Google, Inc.
Frank Mayhar wrote:
> On Tue, 2009-07-21 at 11:38 -0500, Eric Sandeen wrote:
>
>> Curt Wohlgemuth wrote:
>>
>>> On Mon, Jul 20, 2009 at 8:41 PM, Eric Sandeen<[email protected]> wrote:
>>>
>>>> Xiang Wang wrote:
>>>>
>>>>> For comparison, I did the same experiment on an ext2 partition,
>>>>> resulting in each file having only 1 extent.
>>>>>
>>>> Interesting, not sure I would have expected that.
>>>>
>>> Same with us; we're looking into more variables to understand it.
>>>
>> To be more clear, I would not have expected ext2 to deal well with it
>> either, is more what I meant ;) I'm not terribly surprised that ext4
>> gets fragmented.
>>
>
> Ext2 deals with it via the block reservation code added some time ago.
> It turns out it works pretty well for this case. Ext4, of course,
> doesn't use the block reservation code.
>
The ext4 mballoc code uses per-CPU preallocation, so all threads
running on the same CPU that need new blocks will be assigned blocks
next to each other. As a result, the files created by those threads
interleave with each other, causing fragmentation. Preallocation would
help, but it would have to be persistent preallocation.
Mingming