2012-02-27 09:04:03

by Zheng Liu

Subject: [RFC] ext4: block reservation allocation

Hi list,

Now, in ext4, we have multi-block allocation and delayed allocation. They work
well for most scenarios. However, in some specific scenarios, they cannot help
us optimize block allocation. For example, a user may want a set of files to be
allocated at the beginning of the disk, because throughput in that position is
higher than at the end of the disk.

I have done the following experiment. The experiment is on my own server, which
has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz cores, 48GB of memory, and a 1TB
SAS disk. I split this disk into two partitions, one of 900GB and another of
100GB. Then I used dd to measure the read/write speed. The results are as
follows.

[READ]
# dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s

# dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s

[WRITE]
# dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s

# dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s

So the filesystem could provide a new feature that lets the user specify a
number of blocks to reserve at the beginning of the disk. When the user needs
to allocate blocks for an important file that must be read/written as quickly
as possible, the user can use ioctl(2) and/or other means to ask the
filesystem to allocate those blocks from the reservation area. Thereby, the
user can obtain higher performance when manipulating this file set.

This idea is quite simple, so any comments or suggestions are appreciated.

Regards,
Zheng


2012-02-27 12:00:17

by Lukas Czerner

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, 27 Feb 2012, Zheng Liu wrote:

> Hi list,
>
> Now, in ext4, we have multi-block allocation and delay allocation. They work
> well for most scenarios. However, in some specific scenarios, they cannot help
> us to optimize block allocation. For example, the user may want to indicate some
> file set to be allocated at the beginning of the disk because its speed in this
> position is faster than its speed at the end of disk.
>
> I have done the following experiment. The experiment is on my own server, which
> has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> split this disk into two partitions, one has 900G, and another has 100G. Then I
> use dd to get the speed of read/write. The result is as following.
>
> [READ]
> # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
>
> # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
>
> [WRITE]
> # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
>
> # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
>
> So filesystem can provide a new feature to let the user to indicate a value
> for reserving some blocks from the beginning of the disk. When the user needs
> to allocate some blocks for an important file that needs to be read/write as
> quick as possible, the user can use ioctl(2) and/or other ways to notify
> filesystem to allocate these blocks in the reservation area. Thereby, the user
> can obtain the higher performance for manipulating this file set.
>
> This idea is very trivial. So any comments or suggestions are appreciated.
>
> Regards,
> Zheng
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Hi Zheng,

I have to admit I do not like it :). I think that this kind of
optimization is useless in the long run. There are several reasons for
this:

- the test you've done is purely fabricated and does not correspond to a
  real workload at all, especially because it is done on huge files.
  I can imagine this approach improving boot speed, but there you
  usually have to load just small files, so for a single file it does not
  make much sense. Moreover, with small files more seeks would have to
  be done, hugely reducing the advantage you can see with dd.
- an HDD might have more platters than just one
- your file system might span several drives
- on thinly provisioned storage this does not make sense at all
- SSDs are more and more common, and this optimization is useless for
  them.

Is there any 'real' problem you would want to solve with this? Or is it
just something that came to your mind? I agree that we want to improve
our allocators, but IMHO especially for better scalability, not to cover
this disputable niche.

Anyway, you may try to come up with a better experiment, something which
would actually show how much we can gain on a more realistic workload,
rather than showing that contiguous serial writes are faster near the
outer edge of the disk platter; we know that.

Thanks!
-Lukas

2012-02-27 13:13:31

by Zheng Liu

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> On Mon, 27 Feb 2012, Zheng Liu wrote:
>
> > Hi list,
> >
> > Now, in ext4, we have multi-block allocation and delay allocation. They work
> > well for most scenarios. However, in some specific scenarios, they cannot help
> > us to optimize block allocation. For example, the user may want to indicate some
> > file set to be allocated at the beginning of the disk because its speed in this
> > position is faster than its speed at the end of disk.
> >
> > I have done the following experiment. The experiment is on my own server, which
> > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > use dd to get the speed of read/write. The result is as following.
> >
> > [READ]
> > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> >
> > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> >
> > [WRITE]
> > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> >
> > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> >
> > So filesystem can provide a new feature to let the user to indicate a value
> > for reserving some blocks from the beginning of the disk. When the user needs
> > to allocate some blocks for an important file that needs to be read/write as
> > quick as possible, the user can use ioctl(2) and/or other ways to notify
> > filesystem to allocate these blocks in the reservation area. Thereby, the user
> > can obtain the higher performance for manipulating this file set.
> >
> > This idea is very trivial. So any comments or suggestions are appreciated.
> >
> > Regards,
> > Zheng
> >
>
> Hi Zheng,
>
> I have to admit I do not like it :). I think that this kind of
> optimization is useless in the long run. There are several reasons for
> this:

Hi Lukas,

Thank you for your opinion. ;-)

>
> - the test you've done is purely fabricated and does not respond to
> real workload at all. Especially because it is done on a huge files.
> I can imagine this approach improving boot speed, but you will
> usually have to load just small files, so for single file it does not
> make much sense. Moreover with small files more seeks would have to
> be done hugely reducing the advantage you can see with dd.

I will describe the problem that we encountered. The problem shows that
even when files are small, performance can be improved in some specific
scenarios using this kind of block allocation.

> - HDD might have more platters than just one
> - Your file system might span across several drives
> - On thinly provisioned storage this does not make sense at all
> - SSD's are more and more common and this optimization is useless for
> them.
>
> Is there any 'real' problem you would want to solve with this ? Or is it
> just something that came to you mind ? I agree that we want to improve
> our allocators, but IMHO especially for better scalability, not to cover
> this disputable niche.

We encountered a problem in our production system. On a 2TB SATA disk, the
files can be divided into two categories: index files and block files. The
average size of an index file is about 128KB and will increase as time goes
on. Block files are 70MB in size and are created with fallocate(2). Thus, the
index files end up allocated at the end of the disk. When the application
starts up, it needs to load all of the index files into memory, which costs
too much time. If we could allocate the index files at the beginning of the
disk, we would cut down the startup time and increase the service time of
this application.

Therefore, I think this could serve as a generic mechanism for other users
with similar requirements.

Regards,
Zheng

>
> Anyway, you may try to come up with better experiment. Something which
> would actually show how much can we get from the more realistic workload
> rather than showing that contiguous serial writes are faster closely to
> the center of the disk platter, we know that.
>
> Thanks!
> -Lukas

2012-02-27 13:33:32

by Lukas Czerner

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, 27 Feb 2012, Zheng Liu wrote:

> On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> > On Mon, 27 Feb 2012, Zheng Liu wrote:
> >
> > > Hi list,
> > >
> > > Now, in ext4, we have multi-block allocation and delay allocation. They work
> > > well for most scenarios. However, in some specific scenarios, they cannot help
> > > us to optimize block allocation. For example, the user may want to indicate some
> > > file set to be allocated at the beginning of the disk because its speed in this
> > > position is faster than its speed at the end of disk.
> > >
> > > I have done the following experiment. The experiment is on my own server, which
> > > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > > use dd to get the speed of read/write. The result is as following.
> > >
> > > [READ]
> > > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> > >
> > > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> > >
> > > [WRITE]
> > > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> > >
> > > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> > >
> > > So filesystem can provide a new feature to let the user to indicate a value
> > > for reserving some blocks from the beginning of the disk. When the user needs
> > > to allocate some blocks for an important file that needs to be read/write as
> > > quick as possible, the user can use ioctl(2) and/or other ways to notify
> > > filesystem to allocate these blocks in the reservation area. Thereby, the user
> > > can obtain the higher performance for manipulating this file set.
> > >
> > > This idea is very trivial. So any comments or suggestions are appreciated.
> > >
> > > Regards,
> > > Zheng
> > >
> >
> > Hi Zheng,
> >
> > I have to admit I do not like it :). I think that this kind of
> > optimization is useless in the long run. There are several reasons for
> > this:
>
> Hi Lukas,
>
> Thank you for your opinion. ;-)
>
> >
> > - the test you've done is purely fabricated and does not respond to
> > real workload at all. Especially because it is done on a huge files.
> > I can imagine this approach improving boot speed, but you will
> > usually have to load just small files, so for single file it does not
> > make much sense. Moreover with small files more seeks would have to
> > be done hugely reducing the advantage you can see with dd.
>
> I will describe the problem that we encounter. the problem shows that
> even if files are small, the performance can be improved in some
> specific scenarios using this block allocation.
>
> > - HDD might have more platters than just one
> > - Your file system might span across several drives
> > - On thinly provisioned storage this does not make sense at all
> > - SSD's are more and more common and this optimization is useless for
> > them.
> >
> > Is there any 'real' problem you would want to solve with this ? Or is it
> > just something that came to you mind ? I agree that we want to improve
> > our allocators, but IMHO especially for better scalability, not to cover
> > this disputable niche.
>
> We encounter a problem in our product system. In a 2TB sata disk, the
> file can be divided into two categories. One is index file, and another
> is block file. The average size of index files is about 128k and will
> increase as time goes on. The size of block files is 70M and they are
> created by fallocate(2). Thus, index file is allocated at the end of the
> disk. When application starts up, it needs to load all of index files
> into memory. So it costs too much time. If we can allocate index files
> at the beginning of the disk, we will cut down the startup time and
> increase the service time of this application.
>
> Therefore, I think that it might be as a generic mechanism to provide
> other users that have the similar requirement.

Ok, so this seems like a valid use case. However I think that this is
exactly the kind of thing that can be quite easily solved without having to
modify file system code, right?

You can simply use a separate drive for the index files, or even RAID. Or
you could actually use an SSD for this, which I believe will give you *a
lot* better performance, and you won't be bothered by the size/price ratio
of SSDs, as you would only store indexes there, right?

Or, if you really do not want to, or cannot, buy new hardware for some
reason, you can always partition a 2TB disk and put all your index files on
a smaller partition at the fast outer edge of the disk. I really do not see
a reason to modify the code.

What might be even more interesting is that you might generally benefit
from splitting the index and data files into separate file systems. The
reason is that both file systems might benefit from bigalloc if you split
them, because you can set a different cluster size on each file system
depending on the file sizes you would actually store there, since, as I
understand it, the index and data files differ significantly in size.

How much of a performance boost do you expect from doing it your way, by
modifying the file system? Note that dd will not tell you that, as I
explained earlier. It surely would not come close to matching an SSD for
the index files.

What do you think?

Thanks!
-Lukas



>
> Regards,
> Zheng
>
> >
> > Anyway, you may try to come up with better experiment. Something which
> > would actually show how much can we get from the more realistic workload
> > rather than showing that contiguous serial writes are faster closely to
> > the center of the disk platter, we know that.
> >
> > Thanks!
> > -Lukas
>

--

2012-02-27 13:36:53

by Yongqiang Yang

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 9:18 PM, Zheng Liu <[email protected]> wrote:
> On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> We encounter a problem in our product system. In a 2TB sata disk, the
> file can be divided into two categories. One is index file, and another
> is block file. The average size of index files is about 128k and will
> increase as time goes on. The size of block files is 70M and they are
> created by fallocate(2). Thus, index file is allocated at the end of the
> disk. When application starts up, it needs to load all of index files
> into memory. So it costs too much time. If we can allocate index files
> at the beginning of the disk, we will cut down the startup time and
> increase the service time of this application.
>
> Therefore, I think that it might be as a generic mechanism to provide
> other users that have the similar requirement.
Hi There,

Some filesystems allocate faster blocks for metadata. IMHO, we could
reserve the faster blocks as far as possible and allocate them to files
which have a certain flag set. The flag could also be inherited from the
directory.

Yongqiang.
>
> Regards,
> Zheng
>
--
Best Wishes
Yongqiang Yang

2012-02-27 15:04:09

by Zheng Liu

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 02:33:28PM +0100, Lukas Czerner wrote:
> On Mon, 27 Feb 2012, Zheng Liu wrote:
>
> > On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> > > On Mon, 27 Feb 2012, Zheng Liu wrote:
> > >
> > > > Hi list,
> > > >
> > > > Now, in ext4, we have multi-block allocation and delay allocation. They work
> > > > well for most scenarios. However, in some specific scenarios, they cannot help
> > > > us to optimize block allocation. For example, the user may want to indicate some
> > > > file set to be allocated at the beginning of the disk because its speed in this
> > > > position is faster than its speed at the end of disk.
> > > >
> > > > I have done the following experiment. The experiment is on my own server, which
> > > > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > > > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > > > use dd to get the speed of read/write. The result is as following.
> > > >
> > > > [READ]
> > > > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> > > >
> > > > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> > > >
> > > > [WRITE]
> > > > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> > > >
> > > > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> > > >
> > > > So filesystem can provide a new feature to let the user to indicate a value
> > > > for reserving some blocks from the beginning of the disk. When the user needs
> > > > to allocate some blocks for an important file that needs to be read/write as
> > > > quick as possible, the user can use ioctl(2) and/or other ways to notify
> > > > filesystem to allocate these blocks in the reservation area. Thereby, the user
> > > > can obtain the higher performance for manipulating this file set.
> > > >
> > > > This idea is very trivial. So any comments or suggestions are appreciated.
> > > >
> > > > Regards,
> > > > Zheng
> > > >
> > >
> > > Hi Zheng,
> > >
> > > I have to admit I do not like it :). I think that this kind of
> > > optimization is useless in the long run. There are several reasons for
> > > this:
> >
> > Hi Lukas,
> >
> > Thank you for your opinion. ;-)
> >
> > >
> > > - the test you've done is purely fabricated and does not respond to
> > > real workload at all. Especially because it is done on a huge files.
> > > I can imagine this approach improving boot speed, but you will
> > > usually have to load just small files, so for single file it does not
> > > make much sense. Moreover with small files more seeks would have to
> > > be done hugely reducing the advantage you can see with dd.
> >
> > I will describe the problem that we encounter. the problem shows that
> > even if files are small, the performance can be improved in some
> > specific scenarios using this block allocation.
> >
> > > - HDD might have more platters than just one
> > > - Your file system might span across several drives
> > > - On thinly provisioned storage this does not make sense at all
> > > - SSD's are more and more common and this optimization is useless for
> > > them.
> > >
> > > Is there any 'real' problem you would want to solve with this ? Or is it
> > > just something that came to you mind ? I agree that we want to improve
> > > our allocators, but IMHO especially for better scalability, not to cover
> > > this disputable niche.
> >
> > We encounter a problem in our product system. In a 2TB sata disk, the
> > file can be divided into two categories. One is index file, and another
> > is block file. The average size of index files is about 128k and will
> > increase as time goes on. The size of block files is 70M and they are
> > created by fallocate(2). Thus, index file is allocated at the end of the
> > disk. When application starts up, it needs to load all of index files
> > into memory. So it costs too much time. If we can allocate index files
> > at the beginning of the disk, we will cut down the startup time and
> > increase the service time of this application.
> >
> > Therefore, I think that it might be as a generic mechanism to provide
> > other users that have the similar requirement.
>
> Ok, so this seems like a valid use case. However I think that this is
> exactly something that can be quite easily solved without having to
> modify file system code, right ?
>
> You can simply use separate drive for the index files, or even raid. Or
> you can actually use an SSD for this, which I believe will give you *a
> lot* better performance improvements and you wont be bothered by the
> size/price ratio for SSD as you would only store indexes there, right ?
>
> Or, if you really do not want to, or can not, but a new hardware for
> some reason, you can always partition a 2TB disk and put all your index
> files on the smaller, close to the disk center partition. I really do
> not see a reason to modify the code.
>
> What might be even more interesting is, that you might generally benefit
> from splitting the index/data file systems. The reason is that your data
> file and your index file filesystem might benefit from bigalloc if you
> split them, because you can set different cluster sizes on both file
> system depending on the file sizes you would actually store there, since
> as I understand the index and data files differs in size significantly.

You are right. I am trying this solution in our test environment. I have
split a 2TB disk into two partitions: one for index files, formatted with
bigalloc, and another for block files.

>
> How much of the performance boost do you expect by doing this your way -
> modifying the file system? Note that dd will not tell you that, as I
> explained earlier. I surely would not match using SSD for index files by
> far.
>
> What do you think?

As Yongqiang said, maybe we can allocate faster blocks for a file which
needs fast reads/writes when the user sets a flag to notify the file
system. Maybe we don't need to implement a new block allocation algorithm;
we only need to modify the current block allocator to provide this
mechanism.

Regards,
Zheng

>
> Thanks!
> -Lukas
>
>
>
> >
> > Regards,
> > Zheng
> >
> > >
> > > Anyway, you may try to come up with better experiment. Something which
> > > would actually show how much can we get from the more realistic workload
> > > rather than showing that contiguous serial writes are faster closely to
> > > the center of the disk platter, we know that.
> > >
> > > Thanks!
> > > -Lukas
> >
>
> --

2012-02-27 15:16:50

by Lukas Czerner

Subject: Re: [RFC] ext4: block reservation allocation

On Mon, 27 Feb 2012, Zheng Liu wrote:

> On Mon, Feb 27, 2012 at 02:33:28PM +0100, Lukas Czerner wrote:
> > On Mon, 27 Feb 2012, Zheng Liu wrote:
> >
> > > On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> > > > On Mon, 27 Feb 2012, Zheng Liu wrote:
> > > >
> > > > > Hi list,
> > > > >
> > > > > Now, in ext4, we have multi-block allocation and delay allocation. They work
> > > > > well for most scenarios. However, in some specific scenarios, they cannot help
> > > > > us to optimize block allocation. For example, the user may want to indicate some
> > > > > file set to be allocated at the beginning of the disk because its speed in this
> > > > > position is faster than its speed at the end of disk.
> > > > >
> > > > > I have done the following experiment. The experiment is on my own server, which
> > > > > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > > > > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > > > > use dd to get the speed of read/write. The result is as following.
> > > > >
> > > > > [READ]
> > > > > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > > > > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> > > > >
> > > > > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > > > > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> > > > >
> > > > > [WRITE]
> > > > > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > > > > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> > > > >
> > > > > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > > > > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> > > > >
> > > > > So filesystem can provide a new feature to let the user to indicate a value
> > > > > for reserving some blocks from the beginning of the disk. When the user needs
> > > > > to allocate some blocks for an important file that needs to be read/write as
> > > > > quick as possible, the user can use ioctl(2) and/or other ways to notify
> > > > > filesystem to allocate these blocks in the reservation area. Thereby, the user
> > > > > can obtain the higher performance for manipulating this file set.
> > > > >
> > > > > This idea is very trivial. So any comments or suggestions are appreciated.
> > > > >
> > > > > Regards,
> > > > > Zheng
> > > > >
> > > >
> > > > Hi Zheng,
> > > >
> > > > I have to admit I do not like it :). I think that this kind of
> > > > optimization is useless in the long run. There are several reasons for
> > > > this:
> > >
> > > Hi Lukas,
> > >
> > > Thank you for your opinion. ;-)
> > >
> > > >
> > > > - the test you've done is purely fabricated and does not respond to
> > > > real workload at all. Especially because it is done on a huge files.
> > > > I can imagine this approach improving boot speed, but you will
> > > > usually have to load just small files, so for single file it does not
> > > > make much sense. Moreover with small files more seeks would have to
> > > > be done hugely reducing the advantage you can see with dd.
> > >
> > > I will describe the problem that we encounter. the problem shows that
> > > even if files are small, the performance can be improved in some
> > > specific scenarios using this block allocation.
> > >
> > > > - HDD might have more platters than just one
> > > > - Your file system might span across several drives
> > > > - On thinly provisioned storage this does not make sense at all
> > > > - SSD's are more and more common and this optimization is useless for
> > > > them.
> > > >
> > > > Is there any 'real' problem you would want to solve with this ? Or is it
> > > > just something that came to you mind ? I agree that we want to improve
> > > > our allocators, but IMHO especially for better scalability, not to cover
> > > > this disputable niche.
> > >
> > > We encounter a problem in our product system. In a 2TB sata disk, the
> > > file can be divided into two categories. One is index file, and another
> > > is block file. The average size of index files is about 128k and will
> > > increase as time goes on. The size of block files is 70M and they are
> > > created by fallocate(2). Thus, index file is allocated at the end of the
> > > disk. When application starts up, it needs to load all of index files
> > > into memory. So it costs too much time. If we can allocate index files
> > > at the beginning of the disk, we will cut down the startup time and
> > > increase the service time of this application.
> > >
> > > Therefore, I think that it might be as a generic mechanism to provide
> > > other users that have the similar requirement.
> >
> > Ok, so this seems like a valid use case. However I think that this is
> > exactly something that can be quite easily solved without having to
> > modify file system code, right ?
> >
> > You can simply use separate drive for the index files, or even raid. Or
> > you can actually use an SSD for this, which I believe will give you *a
> > lot* better performance improvements and you wont be bothered by the
> > size/price ratio for SSD as you would only store indexes there, right ?
> >
> > Or, if you really do not want to, or can not, but a new hardware for
> > some reason, you can always partition a 2TB disk and put all your index
> > files on the smaller, close to the disk center partition. I really do
> > not see a reason to modify the code.
> >
> > What might be even more interesting is, that you might generally benefit
> > from splitting the index/data file systems. The reason is that your data
> > file and your index file filesystem might benefit from bigalloc if you
> > split them, because you can set different cluster sizes on both file
> > system depending on the file sizes you would actually store there, since
> > as I understand the index and data files differs in size significantly.
>
> You are right. I am trying this solution in our test environment. I have
> splitted a 2TB disk into 2 partitions. One is for index file and is
> formated with big alloc, and another is for block file.

That's good to hear. So perhaps you have your solution?

>
> >
> > How much of the performance boost do you expect by doing this your way -
> > modifying the file system? Note that dd will not tell you that, as I
> > explained earlier. I surely would not match using SSD for index files by
> > far.
> >
> > What do you think?
>
> As Yongqiang said, maybe we can allocate faster block for the file which
> needs to be fast read/write when the user sets a flag to notify the file
> system. Maybe we don't need to implement a new block allocation
> algorithm. We only need to modify the current block allocation to
> provide this mechansim.
>
> Regards,
> Zheng

I am not sure what Yongqiang meant by that. I know that there is a
REQ_META flag which is supposed to set a higher priority for metadata
reads. However, how do you expect this to work? It would have to be set
*only* by root, because from a user's perspective *every* file is a
priority above other users' files :). But doing this as root greatly
limits its use.

If the REQ_META thing is what Yongqiang meant, I am not sure if it is
such a good idea to exploit this flag like that.

Thanks!
-Lukas

>
> >
> > Thanks!
> > -Lukas
> >
> >
> >
> > >
> > > Regards,
> > > Zheng
> > >
> > > >
> > > > Anyway, you may try to come up with better experiment. Something which
> > > > would actually show how much can we get from the more realistic workload
> > > > rather than showing that contiguous serial writes are faster closely to
> > > > the center of the disk platter, we know that.
> > > >
> > > > Thanks!
> > > > -Lukas
> > >
> >
> > --
>

--

2012-02-27 15:24:28

by Lukas Czerner

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On Mon, 27 Feb 2012, Lukas Czerner wrote:

> On Mon, 27 Feb 2012, Zheng Liu wrote:
>
> > On Mon, Feb 27, 2012 at 02:33:28PM +0100, Lukas Czerner wrote:
> > > On Mon, 27 Feb 2012, Zheng Liu wrote:
> > >
> > > > On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> > > > > On Mon, 27 Feb 2012, Zheng Liu wrote:
> > > > >
> > > > > > Hi list,
> > > > > >
> > > > > > Now, in ext4, we have multi-block allocation and delayed allocation. They work
> > > > > > well for most scenarios. However, in some specific scenarios, they cannot help
> > > > > > us optimize block allocation. For example, the user may want to indicate a
> > > > > > file set to be allocated at the beginning of the disk because the disk is
> > > > > > faster in this position than at the end of the disk.
> > > > > >
> > > > > > I have done the following experiment. The experiment is on my own server, which
> > > > > > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > > > > > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > > > > > use dd to get the speed of read/write. The result is as following.
> > > > > >
> > > > > > [READ]
> > > > > > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > > > > > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> > > > > >
> > > > > > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > > > > > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> > > > > >
> > > > > > [WRITE]
> > > > > > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > > > > > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> > > > > >
> > > > > > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > > > > > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> > > > > >
> > > > > > So the filesystem can provide a new feature that lets the user indicate a value
> > > > > > for reserving some blocks at the beginning of the disk. When the user needs
> > > > > > to allocate some blocks for an important file that needs to be read/written as
> > > > > > quickly as possible, the user can use ioctl(2) and/or other ways to notify the
> > > > > > filesystem to allocate these blocks in the reservation area. Thereby, the user
> > > > > > can obtain higher performance for manipulating this file set.
> > > > > >
> > > > > > This idea is very trivial. So any comments or suggestions are appreciated.
> > > > > >
> > > > > > Regards,
> > > > > > Zheng
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > > > > the body of a message to [email protected]
> > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > > >
> > > > >
> > > > > Hi Zheng,
> > > > >
> > > > > I have to admit I do not like it :). I think that this kind of
> > > > > optimization is useless in the long run. There are several reasons for
> > > > > this:
> > > >
> > > > Hi Lukas,
> > > >
> > > > Thank you for your opinion. ;-)
> > > >
> > > > >
> > > > > - the test you've done is purely fabricated and does not correspond to
> > > > > a real workload at all, especially because it is done on huge files.
> > > > > I can imagine this approach improving boot speed, but you will
> > > > > usually have to load just small files, so for a single file it does not
> > > > > make much sense. Moreover, with small files more seeks would have to
> > > > > be done, hugely reducing the advantage you can see with dd.
> > > >
> > > > I will describe the problem that we encountered. The problem shows that
> > > > even if files are small, the performance can be improved in some
> > > > specific scenarios using this block allocation.
> > > >
> > > > > - HDD might have more platters than just one
> > > > > - Your file system might span across several drives
> > > > > - On thinly provisioned storage this does not make sense at all
> > > > > - SSD's are more and more common and this optimization is useless for
> > > > > them.
> > > > >
> > > > > Is there any 'real' problem you would want to solve with this ? Or is it
> > > > > just something that came to you mind ? I agree that we want to improve
> > > > > our allocators, but IMHO especially for better scalability, not to cover
> > > > > this disputable niche.
> > > >
> > > > We encountered a problem in our production system. On a 2TB SATA disk, the
> > > > files can be divided into two categories. One is index files, and another
> > > > is block files. The average size of index files is about 128k and will
> > > > increase as time goes on. The size of block files is 70M and they are
> > > > created by fallocate(2). Thus, index files are allocated at the end of the
> > > > disk. When the application starts up, it needs to load all of the index files
> > > > into memory, so it costs too much time. If we can allocate index files
> > > > at the beginning of the disk, we will cut down the startup time and
> > > > increase the service time of this application.
> > > >
> > > > Therefore, I think that it might serve as a generic mechanism for
> > > > other users that have a similar requirement.
> > >
> > > Ok, so this seems like a valid use case. However I think that this is
> > > exactly something that can be quite easily solved without having to
> > > modify file system code, right ?
> > >
> > > You can simply use a separate drive for the index files, or even RAID. Or
> > > you can actually use an SSD for this, which I believe will give you *a
> > > lot* better performance improvements, and you won't be bothered by the
> > > size/price ratio for SSD as you would only store indexes there, right ?
> > >
> > > Or, if you really do not want to, or cannot, buy new hardware for
> > > some reason, you can always partition a 2TB disk and put all your index
> > > files on the smaller partition, close to the disk center. I really do
> > > not see a reason to modify the code.
> > >
> > > What might be even more interesting is that you might generally benefit
> > > from splitting the index/data file systems. The reason is that your data
> > > file and your index file filesystem might benefit from bigalloc if you
> > > split them, because you can set different cluster sizes on both file
> > > systems depending on the file sizes you would actually store there, since,
> > > as I understand it, the index and data files differ in size significantly.
> >
> > You are right. I am trying this solution in our test environment. I have
> > split a 2TB disk into 2 partitions. One is for index files and is
> > formatted with bigalloc, and another is for block files.
>
> That's good to hear. So you have your solution maybe ?

You probably know all of that already, but just in case... for the
sake of good performance make sure that your partitions are properly aligned,
because the drive might very well have a 4k sector size.

-Lukas

>
> >
> > >
> > > How much of the performance boost do you expect by doing this your way -
> > > modifying the file system? Note that dd will not tell you that, as I
> > > explained earlier. It surely would not match using an SSD for the index
> > > files, by far.
> > >
> > > What do you think?
> >
> > As Yongqiang said, maybe we can allocate faster blocks for a file that
> > needs to be read/written quickly when the user sets a flag to notify the
> > file system. Maybe we don't need to implement a new block allocation
> > algorithm; we only need to modify the current block allocator to
> > provide this mechanism.
> >
> > Regards,
> > Zheng
>
> I am not sure what Yongqiang meant by that. I know that there is a
> REQ_META flag which is supposed to set higher priority for metadata
> reads. However how do you expect this to work ? It would have to be set
> *only* by root, because from the user's perspective *every* file is a priority
> above other users' files :). But doing this as root greatly limits its
> use.
>
> If the REQ_META thing is what Yongqiang meant, I am not sure if it is
> such a good idea to exploit this flag like that.
>
> Thanks!
> -Lukas
>
> >
> > >
> > > Thanks!
> > > -Lukas
> > >
> > >
> > >
> > > >
> > > > Regards,
> > > > Zheng
> > > >
> > > > >
> > > > > Anyway, you may try to come up with better experiment. Something which
> > > > > would actually show how much can we get from the more realistic workload
> > > > > rather than showing that contiguous serial writes are faster closely to
> > > > > the center of the disk platter, we know that.
> > > > >
> > > > > Thanks!
> > > > > -Lukas
> > > >
> > >
> > > --
> >
>
>

--

2012-02-27 15:37:34

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On 2/27/12 3:09 AM, Zheng Liu wrote:
> Hi list,
>
> Now, in ext4, we have multi-block allocation and delayed allocation. They work
> well for most scenarios. However, in some specific scenarios, they cannot help
> us optimize block allocation. For example, the user may want to indicate a
> file set to be allocated at the beginning of the disk because the disk is
> faster in this position than at the end of the disk.

I agree with Lukas - please, no.

You can play tricks with your storage to accomplish much the same thing,
by making filesystems on faster & slower devices and mounting them on
directories which your application can recognize as faster/slower.

If you want "fast" for metadata, adilger has a recipe out there for using
lvm to interleave ssd blocks with spinning blocks to get metadata to line
up on the ssd.

A filesystem-specific hack for a custom application has no place in EXT4,
IMHO, sorry.

Essentially this would move allocation decisions to userspace, and I don't
think that sounds like a good idea. If nothing else, the application shouldn't
assume that it "knows" anything at all about which regions of a filesystem may
be faster or slower...

-Eric


> I have done the following experiment. The experiment is on my own server, which
> has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> split this disk into two partitions, one has 900G, and another has 100G. Then I
> use dd to get the speed of read/write. The result is as following.
>
> [READ]
> # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
>
> # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
>
> [WRITE]
> # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
>
> # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
>
> So the filesystem can provide a new feature that lets the user indicate a value
> for reserving some blocks at the beginning of the disk. When the user needs
> to allocate some blocks for an important file that needs to be read/written as
> quickly as possible, the user can use ioctl(2) and/or other ways to notify the
> filesystem to allocate these blocks in the reservation area. Thereby, the user
> can obtain higher performance for manipulating this file set.
>
> This idea is very trivial. So any comments or suggestions are appreciated.
>
> Regards,
> Zheng


2012-02-27 17:44:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 09:37:32AM -0600, Eric Sandeen wrote:
>
> Essentially this would move allocation decisions to userspace, and I don't
> think that sounds like a good idea. If nothing else, the application shouldn't
> assume that it "knows" anything at all about which regions of a filesystem may
> be faster or slower...

What I *can* imagine is passing hints to the file system:

* This file will be accessed a lot --- vs --- this file will
be written once and then will be mostly cold storage

* This file won't be extended once originally written --- vs
--- this file will be extended often (i.e., it is a log file
or a unix mail directory file)

* This file is mostly ephemeral --- vs --- this file will be
sticking around for a long time.

* This file will be read mostly sequentially --- vs --- this
file will be read mostly via random access.

Obviously, these can be combined in various interesting ways; consider
for example an application journal file which is rarely read (except
in recovery circumstances, after a system crash, where speed might not
be the most important thing), and so even though the file is being
appended to regularly, contiguous block allocations might not matter
that much --- especially if the file is also being regularly fsync'ed,
so it would be more important if the blocks are located close to the
inode table. This isn't a hypothetical situation, by the way; I once
saw a performance regression of ext4 vs. ext2 that was traced down to
the fact that ext2 would greedily allocate the block closest to the
inode table, whereas ext4 would optimize for reading the file later,
and so allocating a large contiguous block far, far away from the
inode table was what ext4 chose to do. However, in this particular
case, optimizing for the frequent small write/fsync case would have
been a better choice.


In some cases the file system can infer some of these characteristics
(e.g. if the file was opened O_APPEND, it's probably a file that will
be extended often).

In other cases it makes sense for this sort of thing to be declared
via an fcntl or fadvise when the file is first opened. Indeed we have
some of this already via fadvise's FADV_RANDOM vs. FADV_SEQUENTIAL,
although currently the expectation of this interface is that it's
mostly used for applications to declare how they plan to read a
particular file from the perspective of enabling or disabling
readahead, and not from the perspective of influencing how the file
system should handle its allocation policy.

I definitely agree that we don't want to go down the path of having
applications try to directly decide where blocks should be placed on
the disk. That way lies madness. However, having some way of
specifying the behaviour of how the file is going to be used can be
very useful indeed.

There are still some interesting policy/security questions, though.
Do you trust any application or any user id to be able to declare that
"this file is going to be used a lot"? After all, if everyone
declares that their file is accessed a lot, and thus deserving of
being in the beginning third of the HDD (which can be significantly
faster than the rest of the disk), then the whole scheme falls apart.

"That King, although no one denies
His heart was of abnormal size,
Yet he'd have acted otherwise
If he had been acuter.
The end is easily foretold,
When every blessed thing you hold
Is made of silver, or of gold,
You long for simple pewter.
When you have nothing else to wear
But cloth of gold and satins rare,
For cloth of gold you cease to care--
Up goes the price of shoddy.

In short, whoever you may be,
To this conclusion you'll agree,
When every one is somebodee,
Then no one's anybody!"

-- Gilbert and Sullivan, The Gondoliers
http://lyricsplayground.com/alpha/songs/t/therelivedaking.shtml

Do we simply not care? Do we reserve the ability to set certain file
usage declarations only to root, or via some cgroup? The answers are
not obvious.... For some parameters it probably won't matter if we
let unprivileged users declare whether or not their file is mostly
accessed sequentially or random access. But for others, it might
matter a lot if you have bad actors, or worse, bad application writers
who assume that their web browser or GUI file system navigator, or
chat program should have the very best and highest priority blocks for
their sqlite files.

- Ted


2012-02-27 21:11:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On 2012-02-27, at 5:00 AM, Lukas Czerner wrote:
> On Mon, 27 Feb 2012, Zheng Liu wrote:
>>
>> Now, in ext4, we have multi-block allocation and delayed allocation. They work
>> well for most scenarios. However, in some specific scenarios, they cannot help
>> us optimize block allocation. For example, the user may want to indicate a
>> file set to be allocated at the beginning of the disk because the disk is
>> faster in this position than at the end of the disk.
>>
>> I have done the following experiment. The experiment is on my own server, which
>> has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
>> split this disk into two partitions, one has 900G, and another has 100G. Then I
>> use dd to get the speed of read/write. The result is as following.
>>
>> [READ]
>> # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
>> 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
>>
>> # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
>> 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
>>
>> [WRITE]
>> # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
>> 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
>>
>> # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
>> 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
>>
>> So the filesystem can provide a new feature that lets the user indicate a value
>> for reserving some blocks at the beginning of the disk. When the user needs
>> to allocate some blocks for an important file that needs to be read/written as
>> quickly as possible, the user can use ioctl(2) and/or other ways to notify the
>> filesystem to allocate these blocks in the reservation area. Thereby, the user
>> can obtain higher performance for manipulating this file set.
>>
>> This idea is very trivial. So any comments or suggestions are appreciated.
>
> Hi Zheng,
>
> I have to admit I do not like it :). I think that this kind of
> optimization is useless in the long run. There are several reasons for
> this:
>
> - the test you've done is purely fabricated and does not correspond to
> a real workload at all, especially because it is done on huge files.
> I can imagine this approach improving boot speed, but you will
> usually have to load just small files, so for a single file it does not
> make much sense. Moreover, with small files more seeks would have to
> be done, hugely reducing the advantage you can see with dd.

Lukas,
I think the generalization "you will usually have to load just small files"
is not something that you can make. Some workloads have small files (e.g.
distro installers) and some have large files (e.g. media servers, HPC, etc).

> - HDD might have more platters than just one
> - Your file system might span across several drives

These two typically do not matter, since the tracks on the HDD are still
interleaved between the drive heads, so that there is never a huge seek
between block N and N+1 just because N was at the outside of one platter
and N+1 was on the inside of the next platter.

Similarly, with RAID storage the blocks are typically interleaved between
drives in a sequential manner for the same reason. It's true that this
not guaranteed for LVM, but it is still _generally_ true, since LVM will
allocate LEs from lower numbers to higher numbers unless there is some
reason not to do so.

> - On thinly provisioned storage this does not make sense at all

True, but even with thin provisioned storage the backing file blocks will
typically try to allocate sequentially if possible. I don't think this
can be used as an argument _against_ giving the filesystem more allocation
information, since it cannot be worse than having no information at all.

> - SSD's are more and more common and this optimization is useless for
> them.

Sure, but it also isn't _harmful_ for SSDs, and not everyone can afford
huge amounts of SSDs.

> Is there any 'real' problem you would want to solve with this ? Or is it
> just something that came to you mind ? I agree that we want to improve
> our allocators, but IMHO especially for better scalability, not to cover
> this disputable niche.
>
> Anyway, you may try to come up with better experiment. Something which
> would actually show how much can we get from the more realistic workload
> rather than showing that contiguous serial writes are faster closely to
> the center of the disk platter, we know that.
>
> Thanks!
> -Lukas


Cheers, Andreas






2012-02-27 21:17:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On 2012-02-27, at 6:33 AM, Lukas Czerner wrote:
> On Mon, 27 Feb 2012, Zheng Liu wrote:
>> On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
>>> On Mon, 27 Feb 2012, Zheng Liu wrote:
>>>
>>>> Hi list,
>>>>
>>>> Now, in ext4, we have multi-block allocation and delayed allocation. They work
>>>> well for most scenarios. However, in some specific scenarios, they cannot
>>>> help us optimize block allocation. For example, the user may want to
>>>> indicate a file set to be allocated at the beginning of the disk because
>>>> the disk is faster in this position than at the end of the disk.
>>>>
>>>> I have done the following experiment. The experiment is on my own server, which has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T SAS
>>>> disk. I split this disk into two partitions, one has 900G, and another has
>>>> 100G. Then I use dd to get the speed of read/write. The result is as
>>>> following.
>>>>
>>>> [READ]
>>>> # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
>>>> 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
>>>>
>>>> # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
>>>> 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
>>>>
>>>> [WRITE]
>>>> # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
>>>> 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
>>>>
>>>> # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
>>>> 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
>>>>
>>>> So the filesystem can provide a new feature that lets the user indicate a
>>>> value for reserving some blocks at the beginning of the disk. When the user
>>>> needs to allocate some blocks for an important file that needs to be
>>>> read/written as quickly as possible, the user can use ioctl(2) and/or other
>>>> ways to notify the filesystem to allocate these blocks in the reservation
>>>> area. Thereby, the user can obtain higher performance for manipulating this
>>>> file set.
>>>>
>>>> This idea is very trivial. So any comments or suggestions are appreciated.
>>>
>>> Hi Zheng,
>>>
>>> I have to admit I do not like it :). I think that this kind of
>>> optimization is useless in the long run. There are several reasons for
>>> this:
>>
>> Hi Lukas,
>>
>> Thank you for your opinion. ;-)
>>
>>>
>>> - the test you've done is purely fabricated and does not correspond to
>>> a real workload at all, especially because it is done on huge files.
>>> I can imagine this approach improving boot speed, but you will
>>> usually have to load just small files, so for a single file it does not
>>> make much sense. Moreover, with small files more seeks would have to
>>> be done, hugely reducing the advantage you can see with dd.
>>
>> I will describe the problem that we encountered. The problem shows that
>> even if files are small, the performance can be improved in some
>> specific scenarios using this block allocation.
>>
>>> - HDD might have more platters than just one
>>> - Your file system might span across several drives
>>> - On thinly provisioned storage this does not make sense at all
>>> - SSD's are more and more common and this optimization is useless for
>>> them.
>>>
>>> Is there any 'real' problem you would want to solve with this ? Or is it
>>> just something that came to you mind ? I agree that we want to improve
>>> our allocators, but IMHO especially for better scalability, not to cover
>>> this disputable niche.
>>
>> We encountered a problem in our production system. On a 2TB SATA disk, the
>> files can be divided into two categories. One is index files, and another
>> is block files. The average size of index files is about 128k and will
>> increase as time goes on. The size of block files is 70M and they are
>> created by fallocate(2). Thus, index files are allocated at the end of the
>> disk. When the application starts up, it needs to load all of the index files
>> into memory, so it costs too much time. If we can allocate index files
>> at the beginning of the disk, we will cut down the startup time and
>> increase the service time of this application.
>>
>> Therefore, I think that it might serve as a generic mechanism for
>> other users that have a similar requirement.
>
> Ok, so this seems like a valid use case. However I think that this is
> exactly something that can be quite easily solved without having to
> modify file system code, right ?
>
> You can simply use a separate drive for the index files, or even RAID. Or
> you can actually use an SSD for this, which I believe will give you *a
> lot* better performance improvements, and you won't be bothered by the
> size/price ratio for SSD as you would only store indexes there, right ?
>
> Or, if you really do not want to, or cannot, buy new hardware for
> some reason, you can always partition a 2TB disk and put all your index
> files on the smaller partition, close to the disk center. I really do
> not see a reason to modify the code.

This introduces more administration complexity, and is one of the reasons
why desktop systems install with a single huge partition instead of many
separate partitions. Having a slightly slower index is much less harmful
to an application than the index partition becoming full while the data
partition has free space.

This also introduces gratuitous overhead due to separate journals, multiple
cache flush commands for the same disk, etc.

Cheers, Andreas

> What might be even more interesting is that you might generally benefit
> from splitting the index/data file systems. The reason is that your data
> file and your index file filesystem might benefit from bigalloc if you
> split them, because you can set different cluster sizes on both file
> systems depending on the file sizes you would actually store there, since,
> as I understand it, the index and data files differ in size significantly.
>
> How much of the performance boost do you expect by doing this your way -
> modifying the file system? Note that dd will not tell you that, as I
> explained earlier. It surely would not match using an SSD for the index
> files, by far.
>
> What do you think?
>
> Thanks!
> -Lukas
>
>
>
>>
>> Regards,
>> Zheng
>>
>>>
>>> Anyway, you may try to come up with better experiment. Something which
>>> would actually show how much can we get from the more realistic workload
>>> rather than showing that contiguous serial writes are faster closely to
>>> the center of the disk platter, we know that.
>>>
>>> Thanks!
>>> -Lukas
>>
>
> --


Cheers, Andreas






2012-02-27 22:00:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On 2012-02-27, at 10:44 AM, Ted Ts'o wrote:
> On Mon, Feb 27, 2012 at 09:37:32AM -0600, Eric Sandeen wrote:
>>
>> Essentially this would move allocation decisions to userspace, and I don't
>> think that sounds like a good idea. If nothing else, the application shouldn't
>> assume that it "knows" anything at all about which regions of a filesystem may
>> be faster or slower...
>
> What I *can* imagine is passing hints to the file system:
>
> * This file will be accessed a lot --- vs --- this file will
> be written once and then will be mostly cold storage
>
> * This file won't be extended once originally written --- vs
> --- this file will be extended often (i.e., it is a log file
> or a unix mail directory file)
>
> * This file is mostly ephemeral --- vs --- this file will be
> sticking around for a long time.
>
> * This file will be read mostly sequentially --- vs --- this
> file will be read mostly via random access.

I definitely think that this is Zheng's real goal - to be able to give
application-level hints to the underlying filesystem. While Lukas and
Eric may disagree with the _mechanism_ that Zheng proposed, I definitely
think the _goal_ is useful.

Often when working at the filesystem level the kernel has to try and
guess the intent of the application instead of being told what the
application actually wants. A prime example is delalloc vs. fallocate(),
where the kernel is guessing (via delalloc) that the application may be
writing more data to the filesystem so it should delay flushing that
data to disk in the hope of making a better decision, while fallocate()
allows the application to specify exactly what file data will be written
and the kernel can make a good allocation decision immediately.

> Obviously, these can be combined in various interesting ways; consider
> for example an application journal file which is rarely read (except
> in recovery circumstances, after a system crash, where speed might not
> be the most important thing), and so even though the file is being
> appended to regularly, contiguous block allocations might not matter
> that much --- especially if the file is also being regularly fsync'ed,
> so it would be more important if the blocks are located close to the
> inode table. This isn't a hypothetical situation, by the way; I once
> saw a performance regression of ext4 vs. ext2 that was traced down to
> the fact that ext2 would greedily allocate the block closest to the
> inode table, whereas ext4 would optimize for reading the file later,
> and so allocating a large contiguous block far, far away from the
> inode table was what ext4 chose to do. However, in this particular
> case, optimizing for the frequent small write/fsync case would have
> been a better choice.
>
>
> In some cases the file system can infer some of these characteristics
> (e.g. if the file was opened O_APPEND, it's probably a file that will
> be extended often).
>
> In other cases it makes sense for this sort of thing to be declared
> via an fcntl or fadvise when the file is first opened. Indeed we have
> some of this already via fadvise's FADV_RANDOM vs. FADV_SEQUENTIAL,
> although currently the expectation of this interface is that it's
> mostly used for applications to declare how they plan to read a
> particular file from the perspective of enabling or disabling
> readahead, and not from the perspective of influencing how the file
> system should handle its allocation policy.

Yes, using FADV_* for files during write is exactly the kind of hint
that the kernel could use. I expect that the current FADV_* flags are
not rich enough, but at least could form a starting point for this.

> I definitely agree that we don't want to go down the path of having
> applications try to directly decide where blocks should be placed on
> the disk. That way lies madness. However, having some way of
> specifying the behaviour of how the file is going to be used can be
> very useful indeed.

>
> There are still some interesting policy/security questions, though.
> Do you trust any application or any user id to be able to declare that
> "this file is going to be used a lot"? After, all if everyone
> declares that their file is accessed a lot, and thus deserving of
> being in the beginning third of the HDD (which can be significantly
> faster than the rest of the disk), then the whole scheme falls apart.

In some sense, in the rare case where all applications are ill behaved,
it is no worse than not having any interface in the first place.
In general, however, I don't expect applications to abuse this any more
than they abuse fallocate() to reserve huge amounts of space that they
don't need to use.

> Do we simply not care? Do we reserve the ability to set certain file
> usage declarations only to root, or via some cgroup? The answers are
> not obvious.... For some parameters it probably won't matter if we
> let unprivileged users declare whether or not their file is mostly
> accessed sequentially or random access. But for others, it might
> matter a lot if you have bad actors, or worse, bad application writers
> who assume that their web browser or GUI file system navigator, or
> chat program should have the very best and highest priority blocks for
> their sqlite files.

Sure, and users can stop using badly-written applications, but that is
no reason to deny well-written applications the ability to help the
kernel make better decisions.

Cheers, Andreas






2012-02-28 03:29:17

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 04:24:24PM +0100, Lukas Czerner wrote:
[snip]
>
> You probably know all of that already, but just in case... for the
> sake of good performance, make sure that your partitions are properly
> aligned, because the drive might very well have a 4k sector size.

Thank you for reminding me. :-)

Regards,
Zheng

[snip]

2012-02-28 04:00:14

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 03:00:12PM -0700, Andreas Dilger wrote:
> On 2012-02-27, at 10:44 AM, Ted Ts'o wrote:
> > On Mon, Feb 27, 2012 at 09:37:32AM -0600, Eric Sandeen wrote:
> >>
> >> Essentially this would move allocation decisions to userspace, and I don't
> >> think that sounds like a good idea. If nothing else, the application shouldn't
> >> assume that it "knows" anything at all about which regions of a filesystem may
> >> be faster or slower...
> >
> > What I *can* imagine is passing hints to the file system:
> >
> > * This file will be accessed a lot --- vs --- this file will
> > be written once and then will be mostly cold storage
> >
> > * This file won't be extended once originally written --- vs
> > --- this file will be extended often (i.e., it is a log file
> > or a unix mail directory file)
> >
> > * This file is mostly ephemeral --- vs --- this file will be
> > sticking around for a long time.
> >
> > * This file will be read mostly sequentially --- vs --- this
> > file will be read mostly via random access.
>
> I definitely think that this is Zheng's real goal - to be able to give
> application-level hints to the underlying filesystem. While Lukas and
> Eric may disagree with the _mechanism_ that Zheng proposed, I definitely
> think the _goal_ is useful.
>
> Often when working at the filesystem level the kernel has to try and
> guess the intent of the application instead of being told what the
> application actually wants. A prime example is delalloc vs. fallocate(),
> where the kernel is guessing (via delalloc) that the application may be
> writing more data to the filesystem so it should delay flushing that
> data to disk in the hope of making a better decision, while fallocate()
> allows the application to specify exactly what file data will be written
> and the kernel can make a good allocation decision immediately.
>
> > Obviously, these can be combined in various interesting ways; consider
> > for example an application journal file which is rarely read (except
> > in recovery circumstances, after a system crash, where speed might not
> > be the most important thing), and so even though the file is being
> > appended to regularly, contiguous block allocations might not matter
> > that much --- especially if the file is also being regularly fsync'ed,
> > so it would be more important if the blocks are located close to the
> > inode table. This isn't a hypothetical situation, by the way; I once
> > saw a performance regression of ext4 vs. ext2 that was traced down to
> > the fact that ext2 would greedily allocate the block closest to the
> > inode table, whereas ext4 would optimize for reading the file later,
> > and so allocating a large contiguous block far, far away from the
> > inode table was what ext4 chose to do. However, in this particular
> > case, optimizing for the frequent small write/fsync case would have
> > been a better choice.
> >
> >
> > In some cases the file system can infer some of these characteristics
> > (e.g. if the file was opened O_APPEND, it's probably a file that will
> > be extended often).
> >
> > In other cases it makes sense for this sort of thing to be declared
> > via an fcntl or fadvise when the file is first opened. Indeed we have
> > some of this already via fadvise's FADV_RANDOM vs. FADV_SEQUENTIAL,
> > although currently the expectation of this interface is that it's
> > mostly used for applications to declare how they plan to read a
> > particular file from the perspective of enabling or disabling
> > readahead, and not from the perspective of influencing how the file
> > system should handle its allocation policy.
>
> Yes, using FADV_* for files during write is exactly the kind of hint
> that the kernel could use. I expect that the current FADV_* flags are
> not rich enough, but at least could form a starting point for this.
>

Hi Andreas,

I agree with you and Ted. Maybe we can provide more flags in fadvise(2)
to let the user help the kernel make better decisions.

I noticed this RFC[1] on the linux-kernel mailing list. It could be an
acceptable solution for us. Some flags could be added to fadvise(2).

e.g.
FADV_READ_HOT
FADV_READ_SEQ
FADV_READ_RANDOM
FADV_WRITE_ONCE
FADV_WRITE_APPEND
FADV_WRITE_FIX_FILELEN
...

Then each file system can pick a subset of these flags to implement.

1. https://lkml.org/lkml/2012/2/9/473

Regards,
Zheng

> > I definitely agree that we don't want to go down the path of having
> > applications try to directly decide where blocks should be placed on
> > the disk. That way lies madness. However, having some way of
> > specifying the behaviour of how the file is going to be used can be
> > very useful indeed.
>
> >
> > There are still some interesting policy/security questions, though.
> > Do you trust any application or any user id to be able to declare that
> > "this file is going to be used a lot"? After, all if everyone
> > declares that their file is accessed a lot, and thus deserving of
> > being in the beginning third of the HDD (which can be significantly
> > faster than the rest of the disk), then the whole scheme falls apart.
>
> In some sense, in the rare case where all applications are ill behaved,
> it is no worse than not having any interface in the first place.
> In general, however, I don't expect applications to abuse this any more
> than they abuse fallocate() to reserve huge amounts of space that they
> don't need to use.
>
> > Do we simply not care? Do we reserve the ability to set certain file
> > usage declarations only to root, or via some cgroup? The answers are
> > not obvious.... For some parameters it probably won't matter if we
> > let unprivileged users declare whether or not their file is mostly
> > accessed sequentially or random access. But for others, it might
> > matter a lot if you have bad actors, or worse, bad application writers
> > who assume that their web browser or GUI file system navigator, or
> > chat program should have the very best and highest priority blocks for
> > their sqlite files.
>
> Sure, and users can stop using badly-written applications, but that is
> no reason to deny well-written applications the ability to help the
> kernel make better decisions.
>
> Cheers, Andreas
>
>
>
>
>

2012-03-08 16:39:13

by Phillip Susi

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation

On 2/27/2012 4:09 AM, Zheng Liu wrote:
> Hi list,
>
> Now, in ext4, we have multi-block allocation and delay allocation. They work
> well for most scenarios. However, in some specific scenarios, they cannot help
> us to optimize block allocation. For example, the user may want to indicate some
> file set to be allocated at the beginning of the disk because its speed in this
> position is faster than its speed at the end of disk.

I thought this could be done with the new defrag API. I believe it has
a way to allocate a new donor file in a specific location so you could
then migrate the target file to those specific donor blocks.



2012-03-11 17:32:36

by Phillip Susi

[permalink] [raw]
Subject: Re: [RFC] ext4: block reservation allocation


On 03/10/2012 07:26 PM, Greg Freemyer wrote:
>> I thought this could be done with the new defrag api. I thought it had a
>> way to allocate a new donor file in a specific location so you could then
>> migrate the target file to those specific donor blocks.
>>
>>
> I'm not aware of that ability in the defrag api.
>
> There are ohsm patches to do things along those lines, but they haven't
> been updated since 2.6.30 days as far as I know.

It seems this ability would be vital to e4defrag. Without it, it would end up being no better than filesystem-agnostic defragmenters like shake. I'm almost certain there was some discussion in this direction a few years ago when e4defrag started being talked about; was it really not implemented?

