2013-05-18 10:50:04

by frankcmoeller

Subject: Aw: Re: Ext4: Slow performance on first write after mount

Hi Andrei,

thanks for your quick answer!
Perhaps you misunderstood me. The general write performance is quite good: we can record more than 4 HD channels at the same time without problems, except for the problem with the first write after mount. And there are also some users who have problems 1-2 times during a recording.
I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when an unmount takes place?

With fallocate the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown. So I cannot preallocate space for the complete file, and after the preallocated space is consumed the same initialization problem arises until all groups are initialized.

I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast. You also have to take care of alignment, and there are several threads on the internet explaining why you shouldn't use it (or only in very special situations, and I don't think mine is one of them). And ext4 group initialization also takes place when using O_DIRECT (as said before, perhaps I did something wrong).
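
(For reference, the alignment requirements boil down to something like the following minimal sketch; the 4 KiB alignment and the test file name are only assumptions for illustration:)

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;          /* logical block size of the device   */
        const size_t len   = 64 * 4096;     /* transfer length, multiple of align */
        void *buf;

        /* with O_DIRECT the buffer address, the file offset and the length
         * must all be aligned to the logical block size, otherwise the
         * write fails with EINVAL */
        if (posix_memalign(&buf, align, len))
            return 1;
        memset(buf, 0, len);

        int fd = open("/tmp/odirect-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        ssize_t n = write(fd, buf, len);

        close(fd);
        free(buf);
        return n == (ssize_t)len ? 0 : 1;
    }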

Regards,
Frank

----- Original Message ----
From: "Sidorov, Andrei" <[email protected]>
To: "[email protected]" <[email protected]>, ext4 development <[email protected]>
Date: 17.05.2013 23:18
Subject: Re: Ext4: Slow performance on first write after mount

> Hi Frank,
>
> Consider using bigalloc feature (requires reformat), preallocate space
> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
> are too small for good throughput with O_DIRECT. You might also want to
> adjust max_sectors_kb to something larger than 512k.
>
> We're doing 6in+6out 20Mbps streams just fine.
>
> Regards,
> Andrei.
>


2013-05-18 20:34:20

by Sidorov, Andrei

Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

Frank,

Well, the main point was to use bigalloc. Unfortunately it requires a
reformat.
Without bigalloc there will be ~7800 block groups for a 1 TB drive. Those
groups take 32M of on-disk data and up to 64M of RAM because of the
runtime buddy bitmaps. I don't think it is worth storing buddy bitmaps on
the drive. It's no surprise that it can take a long time to read lots of
block bitmaps scattered over the drive and construct buddies out of them,
and it's no surprise that some of these pages are evicted under high
memory pressure.
With a bigalloc cluster size of 1M you get 256 times less metadata (128K
instead of 32M), and you get all the benefits of faster allocation, faster
truncation and less fragmentation.
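
(A rough sanity check of those figures, assuming 4 KiB blocks and one 4 KiB
block bitmap per 128 MiB group:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long long groups = 7800;   /* ~7800 groups of 128 MiB each on a 1 TB drive */
        unsigned long long bitmap = 4096;   /* one 4 KiB block bitmap per group             */

        printf("block bitmaps on disk: ~%llu MiB\n", groups * bitmap >> 20);          /* ~30 MiB  */

        /* bigalloc with 1 MiB clusters: one bitmap bit covers 256 blocks instead of 1,
         * so the bitmap metadata shrinks by a factor of 256 */
        printf("with 1 MiB clusters:   ~%llu KiB\n", (groups * bitmap / 256) >> 10);  /* ~120 KiB */
        return 0;
    }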

Yes, you don't know the file size in advance, but speculatively
preallocating in, say, 128M steps is clearly a benefit. Truncate to the
real file size once the recording is finished to release the unused
preallocated space.
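
(A rough sketch of that pattern; FALLOC_FL_KEEP_SIZE is the relevant flag, and
the 128 MiB chunk size and helper names are just examples:)

    #define _GNU_SOURCE              /* for fallocate() */
    #include <fcntl.h>
    #include <linux/falloc.h>        /* FALLOC_FL_KEEP_SIZE */
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK (128ULL << 20)     /* preallocate in 128 MiB steps */

    /* write one buffer, keeping at least one chunk preallocated ahead */
    static int write_record(int fd, const void *buf, size_t len,
                            off_t *written, off_t *prealloc_end)
    {
        if (*written + (off_t)len > *prealloc_end) {
            /* extend the preallocation without changing i_size */
            if (fallocate(fd, FALLOC_FL_KEEP_SIZE, *prealloc_end, CHUNK) == 0)
                *prealloc_end += CHUNK;
        }
        ssize_t n = pwrite(fd, buf, len, *written);
        if (n > 0)
            *written += n;
        return n == (ssize_t)len ? 0 : -1;
    }

    /* once the recording is finished, truncate to the real size so the
     * unused preallocated tail is released */
    static int finish_record(int fd, off_t written)
    {
        return ftruncate(fd, written);
    }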
There are some caveats with O_DIRECT, but it is faster if done correctly.

Regards,
Andrei.



2013-05-18 22:34:35

by frankcmoeller

Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

Hi Andrei,

thanks for the information! I didn't know that it is around 32 MB of data for a 1 TB disk.

Regarding bigalloc: I read on the ext4 website (https://ext4.wiki.kernel.org/index.php/Bigalloc) this:
"The bigalloc feature first appeared in the v3.2 kernel. As of this writing (in the v3.7 kernel)
bigalloc still has some problems if the delayed allocation is enabled, especially if the file
system is close to full."
Is bigalloc really stable? Since when has it been stable? Were there major bugs in some versions?
I ask because the software (OpenPli) we use runs different kernel versions on different boxes.
Some boxes use the 3.8.7 kernel, some 3.3.8, and so on (this is not changeable because of closed-source
drivers).

Is an ext4 bigalloc partition resizeable? I saw a bug report and a patch in January 2013 regarding this.
If it works well, I could resize my partition and create a new bigalloc one, then move the files and resize
again. Or is a reformat the only possibility?

Regards,
Frank


2013-05-19 01:49:06

by Andreas Dilger

Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

On 2013-05-18, at 4:50, [email protected] wrote:
> thanks for your quick answer!
> Perhaps you misunderstood me. The general write performance is quite good: we can record more than 4 HD channels at the same time without problems, except for the problem with the first write after mount. And there are also some users who have problems 1-2 times during a recording.
> I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when an unmount takes place?

Part of the problem is that filesystems are rarely unmounted cleanly, so it means that this information would need to be updated periodically to disk so that it is available after a crash.

I wouldn't object to some kind of "lazy" updating of group information on disk that at least gives the newly-mounted filesystem a rough idea of what each group's usage is. It wouldn't have to be totally accurate (it wouldn't replace the bitmaps), but maybe 2 bits per group would be enough as a starting point?

For a 32 TB filesystem that would be about 16 4kB blocks of bits that would be updated periodically (e.g. every five minutes or so). Since the allocator will typically work in successive groups that might not cause too much churn.
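
(The arithmetic behind that, assuming the default 128 MiB block groups:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long long fs     = 32ULL << 40;     /* 32 TiB filesystem       */
        unsigned long long group  = 128ULL << 20;    /* 128 MiB per block group */
        unsigned long long groups = fs / group;      /* 262144 groups           */
        unsigned long long bytes  = groups * 2 / 8;  /* 2 bits per group        */

        printf("%llu groups -> %llu bytes -> %llu 4 KiB blocks\n",
               groups, bytes, bytes / 4096);         /* prints 16 blocks        */
        return 0;
    }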

> With fallocate the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown.

It would be possible to fallocate() at some expected size (e.g. the average file size) and then either truncate off the unused space, or fallocate() some more in another thread when you are close to running out.

> So I cannot preallocate space for the complete file, and after the preallocated space is consumed the same initialization problem arises until all groups are initialized.

If the fallocate() is done in a separate thread the latency can be hidden from the main application?
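
(A sketch of what such a helper thread could look like, using POSIX threads; the
struct, the 128 MiB chunk size and the function names are made up for
illustration. The recorder only signals the helper, so the fallocate() latency
stays off the write path:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <sys/types.h>

    #define CHUNK (128ULL << 20)

    struct prealloc {
        int             fd;
        off_t           end;    /* end of the preallocated region      */
        bool            want;   /* recorder asked for one more chunk   */
        bool            stop;
        pthread_mutex_t lock;   /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  cond;   /* init with PTHREAD_COND_INITIALIZER  */
    };

    /* helper thread: extends the preallocation whenever asked */
    static void *prealloc_thread(void *arg)
    {
        struct prealloc *p = arg;

        pthread_mutex_lock(&p->lock);
        while (!p->stop) {
            while (!p->want && !p->stop)
                pthread_cond_wait(&p->cond, &p->lock);
            if (p->stop)
                break;
            off_t at = p->end;
            pthread_mutex_unlock(&p->lock);

            /* the slow part runs outside the lock, off the write path */
            int rc = fallocate(p->fd, FALLOC_FL_KEEP_SIZE, at, CHUNK);

            pthread_mutex_lock(&p->lock);
            if (rc == 0)
                p->end += CHUNK;
            p->want = false;
        }
        pthread_mutex_unlock(&p->lock);
        return NULL;
    }

    /* recorder calls this after each write; it never waits for fallocate() */
    static void maybe_request_prealloc(struct prealloc *p, off_t written)
    {
        pthread_mutex_lock(&p->lock);
        if (p->end - written < (off_t)CHUNK && !p->want) {
            p->want = true;
            pthread_cond_signal(&p->cond);
        }
        pthread_mutex_unlock(&p->lock);
    }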
>
> I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast.

That is true, and depends heavily on your workload.

Cheers, Andreas


2013-05-19 10:01:55

by frankcmoeller

Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

Hi Andreas,

> Part of the problem is that filesystems are rarely unmounted cleanly, so it
> means that this information would need to be updated periodically to disk so
> that it is available after a crash.
> I wouldn't object to some kind of "lazy" updating of group information on
> disk that at least gives the newly-mounted filesystem a rough idea of what
> each group's usage is. It wouldn't have to be totally accurate (it wouldn't
> replace the bitmaps), but maybe 2 bits per group would be enough as a
> starting point?
> For a 32 TB filesystem that would be about 16 4kB blocks of bits that would
> be updated periodically (e.g. every five minutes or so). Since the allocator
> will typically work in successive groups that might not cause too much
> churn.

Yes, you're right, the stored data wouldn't be 100% reliable. And yes, it would be really good if,
right after mount, the filesystem knew a bit more so it could find a good group quicker.
What do you think of this:
1. I already read this in some discussions: you already store the amount of free space for every
group. Why not also store how big the biggest contiguous free-space block in a group is? Then you
wouldn't have to read the whole group.
2. What about a list (in memory and also stored on disk) of all unused groups (1 bit per group)?
If the allocator cannot find a good group within, let's say, half a second, a group from this list is used.
The list is also not 100% reliable (because of the mentioned unclean unmounts), so you have to verify
the group taken from the list. If no good group is found in the list, the allocator can continue searching.
This doesn't help in all situations (e.g. an almost full disk, or every group containing a small amount of data),
but in many cases it should be much faster, as long as the list is not totally outdated.

> It would be possible to fallocate() at some expected size (e.g. the average file
> size) and then either truncate off the unused space, or fallocate() some
> more in another thread when you are close to running out.
> If the fallocate() is done in a separate thread the latency can be hidden
> from the main application?
Adding a new thread for fallocate shouldn't be a big problem. But fallocate might
generate high disk usage (while searching for a good group), and I don't know whether
parallel writing from the other thread would be quick enough.

One question regarding fallocate: I create a new file and do a 100 MB fallocate
with FALLOC_FL_KEEP_SIZE. Then I write only 70 MB to that file and close it.
Is the 30 MB of unused preallocated space still preallocated for that file after
closing it? Or does a close release the preallocated space?
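
(A small test along these lines can answer that; the file name is made up, and
comparing st_blocks before and after close() shows whether the extra allocation
survives:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static char buf[1 << 20];                 /* 1 MiB of zeroes */

    int main(void)
    {
        struct stat st;
        int fd = open("/tmp/prealloc-test", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 100ULL << 20);  /* preallocate 100 MiB */

        for (int i = 0; i < 70; i++)                          /* write only 70 MiB   */
            write(fd, buf, sizeof(buf));

        fstat(fd, &st);        /* st_size is 70 MiB, st_blocks covers ~100 MiB */
        printf("before close: size=%lld blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);
        close(fd);

        stat("/tmp/prealloc-test", &st);
        printf("after close:  size=%lld blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);
        return 0;
    }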

Regards,
Frank


2013-05-19 13:00:05

by frankcmoeller

Subject: Aw: Re: Aw: Re: Ext4: Slow performance on first write after mount

Hi,

> One question regarding fallocate: I create a new file and do a 100MB
> fallocate
> with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
> Is the 30 MB unused preallocated space still preallocated for that file
> after closing
> it? Or does a close release the preallocated space?

I did some tests and now I can answer it myself ;-)
The space stays preallocated after closing the file. Even an unmount doesn't release
the space. Interesting!

I was testing concurrent fallocates and writes to the same file descriptor. It
seems to work; whether it is quick enough I cannot say at the moment.

Regards,
Frank


2013-05-20 07:04:01

by Andreas Dilger

Subject: Re: Ext4: Slow performance on first write after mount

On 2013-05-19, at 7:00, [email protected] wrote:
>> One question regarding fallocate: I create a new file and do a 100MB
>> fallocate
>> with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
>> Is the 30 MB unused preallocated space still preallocated for that file
>> after closing
>> it? Or does a close release the preallocated space?
>
> I did some tests and now I can answer it by myself ;-)
> The space stays preallocated after closing the file. Even an unmount doesn't
> release the space. Interesting!

Yes, this is how it is expected to work. Your application would need
to truncate the file to the final size when it is finished writing to it.

> I was testing concurrent fallocates and writes to the same file descriptor. It
> seems to work. If it is quick enough I cannot say at the moment.
>
>> it would be really good if
>> right after mount the filesystem knew a bit more
>> so it could find a good group quicker.
>> What do you think of this:
>> 1. I already read this in some discussions: you already store the
>> amount of free space for every group. Why not also store how
>> big the biggest contiguous free-space block in a group is?
>> Then you wouldn't have to read the whole group.

Yes, this is done in memory already, and updating it on disk is no
more effort than updating the free block count when blocks are allocated
or freed in that group.

One option would be to store the first 32 bits of the buddy bitmap
in the bg_reserved field for each group. That would give us the
distribution down to 4 MB chunks in each group (if I calculate correctly).
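
(Quick check, assuming the default 4 KiB block size and 32768-block groups: a
group spans 32768 x 4 KiB = 128 MiB, and 128 MiB / 32 bits = 4 MiB per bit, so
the 4 MB figure does work out.)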

That would consume the last free field in the group descriptor, but it
might be worthwhile? Alternatively, it could be put into a separate
file, but that would cause more IO.

>> 2. What about a list (in memory and also stored on disk) with all unused
>> groups (1 bit for every group).

Having only 1 bit per group is useless. The full/not full information
can already be had from the free blocks counter in the group descriptor,
which is always in memory.

The problem is with groups that appear to have _some_ free space,
but need the bitmap to be read to see if it is contiguous or not. Some
heuristics might be used to improve this scanning, but having part of
the buddy bitmap loaded would be more useful.

>> If the allocator cannot find a good group within, let's say, half a second, a
>> group from this list is used.
>> The list is also not 100% reliable (because of the mentioned unclean
>> unmounts), so you have to verify the group taken from the list.
>> If no good group is found in the list, the allocator can continue searching.
>> This doesn't help in all situations (e.g. an almost full disk, or every group
>> containing a small amount of data),
>> but in many cases it should be much faster, as long as the list is not totally
>> outdated.

I think this could be an administrator tunable, if latency is more
important than space efficiency. It can already do this from the
data in the group descriptors that are loaded at mount time.

Cheers, Andreas