2010-02-01 14:14:12

by Paul Chavent

Subject: What represent 646345728 bytes

Hi

I'm writing an application that writes a stream of fixed-size pictures to a disk.

My app runs on a self-integrated GNU/Linux system (based on a 2.6.31.6-rt19 kernel).

My media is formatted with

# mke2fs -t ext4 -L DATA -O large_file,^has_journal,extent -v /dev/sda3
[...]

And it is mounted with

# mount -t ext4 /dev/sda3 /var/data/
EXT4-fs (sda3): no journal
EXT4-fs (sda3): delayed allocation enabled
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled
EXT4-fs (sda3): mounted filesystem without journal

My app opens the file with "O_WRONLY | O_CREAT | O_TRUNC | O_SYNC | O_DIRECT" flags.

Each write takes ~4.2 ms for 304K (which is very good, since that matches the write bandwidth of my hard drive). There is a write every 100 ms.

But exactly every 646345728 bytes, the write takes ~46 ms.

I had the same problem with ext2, but there it was every ~620 MB (and the amount wasn't as constant).

I also tried posix_fallocate() (e.g. with 2 GB), and the first write overhead then comes at that limit. I would like to avoid preallocating.
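
For reference, the write path of the app is roughly the following (a simplified sketch, not the real code; the file name and chunk size are illustrative):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (304 * 1024)     /* one picture per write, a 512-byte multiple */

int main(void)
{
    void *buf;
    int fd;

    /* O_DIRECT needs a suitably aligned buffer */
    if (posix_memalign(&buf, 4096, CHUNK))
        return 1;
    memset(buf, 0, CHUNK);

    fd = open("/var/data/stream.bin",
              O_WRONLY | O_CREAT | O_TRUNC | O_SYNC | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    for (;;) {
        if (write(fd, buf, CHUNK) != CHUNK)  /* ~4.2 ms usually, ~46 ms at the bad offsets */
            break;
        usleep(100000);                      /* one picture every 100 ms */
    }

    close(fd);
    free(buf);
    return 0;
}

(In the posix_fallocate variant, a posix_fallocate(fd, 0, 2LL << 30) call right after the open() is what pushes the first slow write out to that 2 GB limit.)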

I suppose it is some kind of block allocation issue, but I would like to have your opinion:
 - what exactly does this amount of bytes correspond to?
 - can I do something to get a "constant" write time from the user-space point of view?
 - is this a "problem" only for me?

Thank you for reading.

Paul.




2010-02-01 15:06:19

by Eric Sandeen

Subject: Re: What represent 646345728 bytes

[email protected] wrote:
> Hi
>
> I'm writing an application that writes a stream of fixed-size
> pictures to a disk.
>
> My app runs on a self-integrated GNU/Linux system (based on a
> 2.6.31.6-rt19 kernel).
>
> My media is formatted with
>
> # mke2fs -t ext4 -L DATA -O large_file,^has_journal,extent -v
> /dev/sda3 [...]
>
> And it is mounted with
>
> # mount -t ext4 /dev/sda3 /var/data/
> EXT4-fs (sda3): no journal
> EXT4-fs (sda3): delayed allocation enabled
> EXT4-fs: file extents enabled
> EXT4-fs: mballoc enabled
> EXT4-fs (sda3): mounted filesystem without journal
>
> My app opens the file with "O_WRONLY | O_CREAT | O_TRUNC | O_SYNC |
> O_DIRECT" flags.
>
> Each write takes ~4.2 ms for 304K (which is very good, since that
> matches the write bandwidth of my hard drive). There is a write every 100 ms.

So you are doing streaming writes in 304k chunks? Or each file
gets 304K? It sounds like you are writing multiple 304k pictures
to a single file, right?

> But exactly every 646345728 bytes, the write takes ~46 ms.

Do you mean every 304K written to the fs, or to the file?

304k doesn't divide evenly into 646345728 bytes so I'm not sure...

> I had the same problem with ext2, but there it was every ~620 MB (and
> the amount wasn't as constant).
>
> I also tried posix_fallocate() (e.g. with 2 GB), and the first write
> overhead then comes at that limit. I would like to avoid preallocating.

Preallocation -should- be helpful in general, so if it's not ...

> I suppose it is some kind of block allocation issue, but I would like
> to have your opinion:
> - what exactly does this amount of bytes correspond to?
> - can I do something to get a "constant" write time from the
>   user-space point of view?
> - is this a "problem" only for me?

It might be interesting to know what the geometry of your filesystem is;
dumpe2fs -h would provide that. Also, if you could mock up a testcase
that demonstrates the behavior, it would help in debugging.

Actually... as a first step, I would redo the test with -O ^uninit_bg
at mkfs time to see whether the bitmap initialization is causing the
delay.
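
I.e. something along these lines (same feature set as your original mkfs
command, just with uninit_bg turned off; device name as in your example):

# dumpe2fs -h /dev/sda3
# mke2fs -t ext4 -L DATA -O large_file,^has_journal,extent,^uninit_bg -v /dev/sda3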

-Eric

> Thank you for reading.
>
> Paul.



2010-02-01 17:06:57

by Paul Chavent

Subject: Re: What represent 646345728 bytes

Thank you Eric for your reply.

My problem with preallocation is that I don't know the size of the final tar archive, so I would prefer not to make any assumption.

Yes, I write a stream of 640x480x1 PNM images (307215 bytes each) to a single tar file. So let's say each write is 307712 bytes (307215 rounded up to the next multiple of 512, since tar pads records to 512-byte boundaries).

Here is a test bench program that reproduces the behaviour of the real app.

The program logs some write overhead at 645579776 bytes and another at 645887488 bytes, so not exactly the same value as in the real app.

You will be able to tell me whether my test bench is correct (metric, compilation options, etc.).
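
The metric is basically the elapsed time around each write() call, roughly like this (a sketch, not a verbatim excerpt from main.c; I assume clock_gettime(CLOCK_MONOTONIC) here, and the variable names are illustrative):

/* needs <time.h>; link with -lrt on older glibc */
struct timespec t0, t1;
long long dt_ns;

clock_gettime(CLOCK_MONOTONIC, &t0);
ret = write(fd, buf, chunk);
clock_gettime(CLOCK_MONOTONIC, &t1);

dt_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
      + (t1.tv_nsec - t0.tv_nsec);
/* dt_ns feeds a running average; writes well above it are logged */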

The system on which I run the test bench has no other workload and no other disk access.

Please find attached:
- the test bench source
- the dumpe2fs log

Tonight I will try with "-O ^uninit_bg" at mkfs time.

Thanks.

Paul.



Attachments:
dumpe2fs.txt (1.66 kB)
main.c (3.90 kB)

2010-02-01 17:20:16

by Aneesh Kumar K.V

Subject: Re: What represent 646345728 bytes

On Mon, 1 Feb 2010 15:08:24 +0100 (CET), <[email protected]> wrote:
> Hi
>
> I'm writing an application that writes a stream of fixed-size pictures to a disk.
>
> My app runs on a self-integrated GNU/Linux system (based on a 2.6.31.6-rt19 kernel).
>
> My media is formatted with
>
> # mke2fs -t ext4 -L DATA -O large_file,^has_journal,extent -v /dev/sda3
> [...]
>
> And it is mounted with
>
> # mount -t ext4 /dev/sda3 /var/data/
> EXT4-fs (sda3): no journal
> EXT4-fs (sda3): delayed allocation enabled
> EXT4-fs: file extents enabled
> EXT4-fs: mballoc enabled
> EXT4-fs (sda3): mounted filesystem without journal
>
> My app opens the file with "O_WRONLY | O_CREAT | O_TRUNC | O_SYNC | O_DIRECT" flags.
>
> Each write takes ~4.2 ms for 304K (which is very good, since that matches the write bandwidth of my hard drive). There is a write every 100 ms.
>
> But exactly every 646345728 bytes, the write takes ~46 ms.

I guess that would be balance_dirty_pages() starting to write back the
delayed-allocation pages. You can check whether that changes anything by
adjusting /proc/sys/vm/dirty_ratio.
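
For example (5 is just an illustration; the default is typically higher):

# echo 5 > /proc/sys/vm/dirty_ratio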

-aneesh

2010-02-01 17:34:58

by Eric Sandeen

Subject: Re: What represent 646345728 bytes

Aneesh Kumar K. V wrote:
> On Mon, 1 Feb 2010 15:08:24 +0100 (CET), <[email protected]> wrote:
>> Hi
>>
>> I'm writing an application that writes a stream of fixed-size pictures to a disk.
>>
>> My app runs on a self-integrated GNU/Linux system (based on a 2.6.31.6-rt19 kernel).
>>
>> My media is formatted with
>>
>> # mke2fs -t ext4 -L DATA -O large_file,^has_journal,extent -v /dev/sda3
>> [...]
>>
>> And it is mounted with
>>
>> # mount -t ext4 /dev/sda3 /var/data/
>> EXT4-fs (sda3): no journal
>> EXT4-fs (sda3): delayed allocation enabled
>> EXT4-fs: file extents enabled
>> EXT4-fs: mballoc enabled
>> EXT4-fs (sda3): mounted filesystem without journal
>>
>> My app opens the file with "O_WRONLY | O_CREAT | O_TRUNC | O_SYNC | O_DIRECT" flags.
>>
>> Each write takes ~4.2 ms for 304K (which is very good, since that matches the write bandwidth of my hard drive). There is a write every 100 ms.
>>
>> But exactly every 646345728 bytes, the write takes ~46 ms.
>
> I guess that would be balance_dirty_pages() starting to write back
> the delayed-allocation pages. You can check whether that changes
> anything by adjusting /proc/sys/vm/dirty_ratio.

But he's doing direct IO...

-Eric

2010-02-01 20:07:30

by Eric Sandeen

Subject: Re: What represent 646345728 bytes

[email protected] wrote:
> Thank you Eric for your reply.
>
> My problem with preallocation is that I don't know the size of the
> final tar archive, so I would prefer not to make any assumption.
>
> Yes, I write a stream of 640x480x1 PNM images (307215 bytes each) to
> a single tar file. So let's say each write is 307712 bytes (307215
> rounded up to the next multiple of 512, since tar pads records to
> 512-byte boundaries).
>
> Here is a test bench program that reproduces the behaviour of the
> real app.
>
> The program logs some write overhead at 645579776 bytes and another
> at 645887488 bytes, so not exactly the same value as in the real app.
>
> You will be able to tell me whether my test bench is correct (metric,
> compilation options, etc.).

I changed the test's check for an abnormally long write to printf whenever
the current write time is more than twice the running average; the hardcoded
threshold didn't trip on my run (faster disk?)
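
Roughly (illustrative, not the exact change to the testcase):

/* flag writes that take more than twice the running average */
if (write_time > 2 * avg_time)
        printf("size %lld time %lld avg %lld\n",
               (long long)total_bytes,
               (long long)write_time,
               (long long)avg_time);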

uninit_bg

size 480338432 time 10170277 avg 1533119
size 661580800 time 6651861 avg 1559238
size 1025296384 time 16295994 avg 1585389
size 1206538752 time 5157166 avg 1591501
size 1388704256 time 10096621 avg 1596869
size 1570562048 time 6523130 avg 1600365
size 1752727552 time 11047298 avg 1603791
size 1935200768 time 7421506 avg 1606132
size 1963202560 time 13636110 avg 1608181
size 2116750848 time 4634594 avg 1609417
size 2299224064 time 17226228 avg 1612249
diff min : 1468219
diff moy : 1612779
diff max : 17226228
8035 iterations

^uninit_bg

size 125854208 time 11192339 avg 1531071
size 480646144 time 8405476 avg 1537763 *
size 662196224 time 5187685 avg 1562247 *
size 843746304 time 10184477 avg 1577636
size 1025911808 time 15772404 avg 1589258 *
size 1207154176 time 3354220 avg 1594480 *
size 1389011968 time 17024332 avg 1601196 *
size 1570562048 time 5321635 avg 1604032 *
size 1752727552 time 10858699 avg 1607154 *
size 1934585344 time 6316573 avg 1609076 *
size 1963202560 time 12102167 avg 1610854 *
size 1963510272 time 9019437 avg 1612015 *
size 2116135424 time 3464465 avg 1612853 *
size 2297993216 time 8579660 avg 1614396 *
diff min : 1469909
diff moy : 1614416
diff max : 17024332
7493 iterations

* = roughly same offset as with uninit_bg

So uninit_bg doesn't seem to help. (this was on 2.6.32-ish)

some oprofiling may be in order ...

-Eric

> The system on which I run the test bench has no other workload and
> no other disk access.
>
> Please find attached:
> - the test bench source
> - the dumpe2fs log
>
> Tonight I will try with "-O ^uninit_bg" at mkfs time.
>
> Thanks.
>
> Paul.
>
>
>


2010-02-01 22:36:55

by Eric Sandeen

Subject: Re: What represent 646345728 bytes

Eric Sandeen wrote:

> So uninit_bg doesn't seem to help. (this was on 2.6.32-ish)
>
> some oprofiling may be in order ...

Or stapping ... thanks to hints from the stap guys, using a modified version of
http://sourceware.org/systemtap/examples/profiling/fntimes.stp
I printed out function times higher than average and correlated them with the
above-average writes from the testcase (cleaned up manually a bit):

1265062212927230 function ext4_mb_load_buddy well over average time (42303 vs 2)
1265062212927399 function ext4_mb_regular_allocator well over average time (42476 vs 5495)
1265062212949252 function ext4_mb_mark_diskspace_used well over average time (21739 vs 196)
1265062212949306 function ext4_mb_new_blocks well over average time (64738 vs 2457)
1265062212949647 function ext4_ext_get_blocks well over average time (65211 vs 1106)
1265062212949678 function ext4_get_blocks well over average time (65357 vs 1844)
1265062212949695 function ext4_get_block well over average time (65479 vs 683)
1265062212951284 function ext4_ind_direct_IO well over average time (68891 vs 3034)
1265062212951299 function ext4_direct_IO well over average time (68908 vs 3046)
1265062212951497 function ext4_file_write well over average time (69437 vs 3590)

1265062212951534 size 134470144 time 69477310 avg 3601820

Unfortunately, under ext4_mb_load_buddy there are a lot of static/inlined
functions, so no more detailed info yet.

But loading the buddy bitmap for a new group seems to be the big
hitter here - I'll keep digging, or maybe Aneesh, who groks mballoc
better than I do (I think) might have an idea.

But yeah, writes randomly taking 20x longer than usual sounds bad. :(

-Eric

2010-02-01 23:01:43

by Andreas Dilger

Subject: Re: What represent 646345728 bytes

On 2010-02-01, at 15:36, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> So uninit_bg doesn't seem to help. (this was on 2.6.32-ish)
>>
>> some oprofiling may be in order ...
>
> Or stapping ... thanks to hints from the stap guys, using a modified
> version of
> http://sourceware.org/systemtap/examples/profiling/fntimes.stp
> I printed out function times higher than avg and correlated to writes
> above average from the testcase (cleaned up manually a bit):
>
> 1265062212927230 function ext4_mb_load_buddy well over average time
> (42303 vs 2)
> 1265062212927399 function ext4_mb_regular_allocator well over
> average time (42476 vs 5495)
> 1265062212949252 function ext4_mb_mark_diskspace_used well over
> average time (21739 vs 196)
> 1265062212949306 function ext4_mb_new_blocks well over average time
> (64738 vs 2457)
> 1265062212949647 function ext4_ext_get_blocks well over average time
> (65211 vs 1106)
> 1265062212949678 function ext4_get_blocks well over average time
> (65357 vs 1844)
> 1265062212949695 function ext4_get_block well over average time
> (65479 vs 683)
> 1265062212951284 function ext4_ind_direct_IO well over average time
> (68891 vs 3034)
> 1265062212951299 function ext4_direct_IO well over average time
> (68908 vs 3046)
> 1265062212951497 function ext4_file_write well over average time
> (69437 vs 3590)
>
> 1265062212951534 size 134470144 time 69477310 avg 3601820
>
> Unfortunately under ext4_mb_load_buddy is a lot of static/inlined
> functions
> so no more detailed info yet.
>
> But loading the buddy bitmap for a new group seems to be the big
> hitter here - I'll keep digging, or maybe Aneesh, who groks mballoc
> better than I do (I think) might have an idea.

Since these are on-demand loads, one option is, when the current group is
~80% full, to submit an async read for the next group's bitmap so that it
will be in memory by the next time we need it.
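
In illustrative pseudocode (not actual ext4 code; group_free_blocks() and
group_bitmap_block() are stand-ins for the real accessors such as
ext4_free_blks_count() and ext4_block_bitmap()):

/* Sketch only: when the group we are currently allocating from is
 * nearly full, kick off an asynchronous read of the next group's
 * block bitmap so that ext4_mb_load_buddy() later finds it in memory.
 */
static void maybe_readahead_next_bitmap(struct super_block *sb,
                                        ext4_group_t group)
{
        ext4_group_t next = group + 1;

        if (next >= ext4_get_groups_count(sb))
                return;

        /* "~80% full" heuristic: less than 1/5 of the group still free */
        if (group_free_blocks(sb, group) * 5 < EXT4_BLOCKS_PER_GROUP(sb))
                sb_breadahead(sb, group_bitmap_block(sb, next));
}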

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.