2021-07-13 06:54:00

by Shyam Prasad N

Subject: Regarding ext4 extent allocation strategy

Hi,

Our team at Microsoft, which works on the Linux SMB3 client kernel
filesystem, has recently been exploring the use of fscache on top of
ext4 for caching network filesystem data for some customer
workloads.

However, the maintainer of fscache (David Howells) recently warned us
that a few other extent-based filesystem developers had pointed out a
theoretical bug in the current implementation of fscache/cachefiles.
It does not maintain separate metadata for the cached data it holds,
but instead uses the sparseness of the underlying filesystem to track
which ranges of the data are cached.
The bug that has been pointed out is that the underlying filesystem
could bridge holes between data ranges with zeroes, or punch holes in
data ranges that contain zeroes. (@David please add if I missed
something).
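
To make the dependency concrete, what cachefiles effectively relies on
is the kind of hole/data probing sketched below (an illustration only,
not actual cachefiles code; the path is made up). Any range the
filesystem reports as data is treated as cached, and any hole as not
yet cached:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustration only: enumerate which byte ranges of a sparse cache
 * file hold data, using SEEK_DATA/SEEK_HOLE.  This is the implicit
 * "metadata" the current cachefiles design depends on. */
int main(void)
{
    int fd = open("/path/to/cachefile", O_RDONLY);  /* made-up path */
    off_t data = 0, hole;

    if (fd < 0)
        return 1;
    while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            break;
        printf("cached: %lld..%lld\n",
               (long long)data, (long long)hole - 1);
        data = hole;
    }
    close(fd);
    return 0;
}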

David has already begun working on a fix for this by maintaining the
metadata for the cached ranges in fscache itself.
However, since it could take some time for that fix to be approved and
then backported by various distros, I'd like to understand whether
there is a potential problem in using fscache on top of ext4 without
the fix. If ext4 doesn't do any such optimizations on the data ranges,
or has a way to disable them, I think we'll be okay using the older
versions of fscache even without the fix mentioned above.

Opinions?

--
Regards,
Shyam


2021-07-13 11:40:31

by Theodore Ts'o

Subject: Re: Regarding ext4 extent allocation strategy

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
>
> Our team in Microsoft, which works on the Linux SMB3 client kernel
> filesystem has recently been exploring the use of fscache on top of
> ext4 for caching the network filesystem data for some customer
> workloads.
>
> However, the maintainer of fscache (David Howells) recently warned us
> that a few other extent based filesystem developers pointed out a
> theoretical bug in the current implementation of fscache/cachefiles.
> It currently does not maintain a separate metadata for the cached data
> it holds, but instead uses the sparseness of the underlying filesystem
> to track the ranges of the data that is being cached.
> The bug that has been pointed out with this is that the underlying
> filesystems could bridge holes between data ranges with zeroes or
> punch hole in data ranges that contain zeroes. (@David please add if I
> missed something).
>
> David has already begun working on the fix to this by maintaining the
> metadata of the cached ranges in fscache itself.
> However, since it could take some time for this fix to be approved and
> then backported by various distros, I'd like to understand if there is
> a potential problem in using fscache on top of ext4 without the fix.
> If ext4 doesn't do any such optimizations on the data ranges, or has a
> way to disable such optimizations, I think we'll be okay to use the
> older versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What: /sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date: August 2012
Contact: "Theodore Ts'o" <[email protected]>
Description:
The maximum number of kilobytes which will be zeroed
out in preference to creating a new uninitialized
extent when manipulating an inode's extent tree. Note
that using a larger value will increase the
variability of time necessary to complete a random
write operation (since a 4k random write might turn
into a much larger write due to the zeroout
operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that with a random write workload on HDDs, the
cost of a 16k random write is not much more than that of a 4k random
write; that is, the cost of HDD seeks dominates. There is also a cost
in having many additional entries in the extent tree. So if we have a
fallocated region, e.g.:

+-------------+---+---+---+----------+---+---+---------+
... + Uninit (U) | W | U | W | Uninit | W | U | Written | ...
+-------------+---+---+---+----------+---+---+---------+

It's more efficient to have the extent tree look like this:

+-------------+-----------+----------+---+---+---------+
... + Uninit (U) | Written | Uninit | W | U | Written | ...
+-------------+-----------+----------+---+---+---------+

and simply write zeros to the first "U" in the above figure.

The default value of extent_max_zeroout_kb is 32 (i.e. 32KiB). This
optimization can be disabled by setting extent_max_zeroout_kb to 0.
The downside of doing so is a potential degradation of random write
performance (as measured, for example, with the fio benchmark program)
on that file system.
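
If you want to see the effect for yourself, here's a rough sketch (not
anything from the ext4 tree; the file name, offsets and sizes are
arbitrary) which fallocates a region, does a couple of small writes
into it, and then dumps the resulting extent layout via FIEMAP.
Running it once with the default setting and once after writing 0 to
/sys/fs/ext4/<disk>/extent_max_zeroout_kb should show a difference in
how many extents you end up with:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(void)
{
    size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    char data[4096];
    unsigned int i;
    int fd;

    fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || !fm)
        return 1;

    fallocate(fd, 0, 0, 1 << 20);               /* 1MiB uninitialized */
    memset(data, 0xaa, sizeof(data));
    pwrite(fd, data, sizeof(data), 64 * 1024);  /* two small writes   */
    pwrite(fd, data, sizeof(data), 80 * 1024);  /* inside that region */
    fsync(fd);

    fm->fm_length = FIEMAP_MAX_OFFSET;
    fm->fm_extent_count = 32;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("extent: logical %llu len %llu flags 0x%x\n",
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_length,
                   fm->fm_extents[i].fe_flags);
    close(fd);
    return 0;
}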

Cheers,

- Ted

2021-07-13 12:58:17

by Shyam Prasad N

Subject: Re: Regarding ext4 extent allocation strategy

On Tue, Jul 13, 2021 at 5:09 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> >
> > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > filesystem has recently been exploring the use of fscache on top of
> > ext4 for caching the network filesystem data for some customer
> > workloads.
> >
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent based filesystem developers pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It currently does not maintain a separate metadata for the cached data
> > it holds, but instead uses the sparseness of the underlying filesystem
> > to track the ranges of the data that is being cached.
> > The bug that has been pointed out with this is that the underlying
> > filesystems could bridge holes between data ranges with zeroes or
> > punch hole in data ranges that contain zeroes. (@David please add if I
> > missed something).
> >
> > David has already begun working on the fix to this by maintaining the
> > metadata of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by various distros, I'd like to understand if there is
> > a potential problem in using fscache on top of ext4 without the fix.
> > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > way to disable such optimizations, I think we'll be okay to use the
> > older versions of fscache even without the fix mentioned above.
>
> Yes, the tuning knob you are looking for is:
>
> What: /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date: August 2012
> Contact: "Theodore Ts'o" <[email protected]>
> Description:
> The maximum number of kilobytes which will be zeroed
> out in preference to creating a new uninitialized
> extent when manipulating an inode's extent tree. Note
> that using a larger value will increase the
> variability of time necessary to complete a random
> write operation (since a 4k random write might turn
> into a much larger write due to the zeroout
> operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random workload, with HDD's, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having a many additional entries in the extent
> tree. So if we have a fallocated region, e.g:
>
> +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U) | W | U | W | Uninit | W | U | Written | ...
> +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this
>
> +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U) | Written | Uninit | W | U | Written | ...
> +-------------+-----------+----------+---+---+---------+
>
> And just simply write zeros to the first "U" in the above figure.
>
> The default value of extent_max_zeroout_kb is 32k. This optimization
> can be disabled by setting extent_max_zeroout_kb to 0. The downside
> of this is a potential degredation of a random write workload (using
> for example the fio benchmark program) on that file system.
>
> Cheers,
>
> - Ted

Hi Ted,

Thanks for pointing this out. We'll look into the use of this option.

Also, is this parameter respected when a hole is punched in the
middle of an allocated data extent? I.e., is there still a possibility
that a punched hole does not translate into splitting the data extent,
even when extent_max_zeroout_kb is set to 0?

--
Regards,
Shyam

2021-07-13 20:19:14

by Theodore Ts'o

Subject: Re: Regarding ext4 extent allocation strategy

On Tue, Jul 13, 2021 at 06:27:37PM +0530, Shyam Prasad N wrote:
>
> Also, is this parameter also respected when a hole is punched in the
> middle of an allocated data extent? i.e. is there still a possibility
> that a punched hole does not translate to splitting the data extent,
> even when extent_max_zeroout_kb is set to 0?

Ext4 doesn't ever try to zero blocks as part of a punch operation.
It's true a file system is allowed to do it, but I would guess most
wouldn't, since the presumption is that userspace is actually trying
to free up disk space, and so you would want to release the disk
blocks in the punch hole case.

The more interesting one is the FALLOC_FL_ZERO_RANGE operation, which
*should* work by transitioning the extent to be uninitialized, but
there might be cases where writing a few zero blocks is faster. That
should use the same code path, which respects the
extent_max_zeroout_kb configuration parameter for ext4.
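
In fallocate() terms the two operations look like this (an illustrative
sketch only, not a test from the ext4 tree; the file name, offsets and
sizes are arbitrary):

#define _GNU_SOURCE
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>

int main(void)
{
    char buf[65536];
    int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return 1;
    memset(buf, 0xab, sizeof(buf));
    pwrite(fd, buf, sizeof(buf), 0);    /* one 64k written extent */

    /* Punch out the middle 16k: ext4 frees those blocks and splits
     * the extent; it never zeroes them in place. */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              16 * 1024, 16 * 1024);

    /* Zero another 16k without freeing it: normally this converts the
     * range to an unwritten extent, though small ranges may instead be
     * zeroed in place, subject to extent_max_zeroout_kb. */
    fallocate(fd, FALLOC_FL_ZERO_RANGE, 40 * 1024, 16 * 1024);

    close(fd);
    return 0;
}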

Cheers,

- Ted

2021-07-14 00:38:09

by Shyam Prasad N

Subject: Re: Regarding ext4 extent allocation strategy

On Wed, Jul 14, 2021 at 1:48 AM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Tue, Jul 13, 2021 at 06:27:37PM +0530, Shyam Prasad N wrote:
> >
> > Also, is this parameter also respected when a hole is punched in the
> > middle of an allocated data extent? i.e. is there still a possibility
> > that a punched hole does not translate to splitting the data extent,
> > even when extent_max_zeroout_kb is set to 0?
>
> Ext4 doesn't ever try to zero blocks as part of a punch operation.
> It's true a file system is allowed to do it, but I would guess most
> wouldn't, since the presumption is that userspace is actually trying
> to free up disk space, and so you would want to release the disk
> blocks in the punch hole case.
>
> The more interesting one is the FALLOC_FL_ZERO_RANGE_FL operation,
> which *should* work by transitioning the extent to be uninitialized,
> but there might be cases where writing a few zero blocks might be
> faster in some cases. That should use the same code path which
> resepects the max_zeroout configuration parameter for ext4.
>
> Cheers,
>
> - Ted

Thanks a lot for your replies, Ted. This was useful.

--
Regards,
Shyam

2022-02-18 03:20:25

by Gao Xiang

Subject: Re: Regarding ext4 extent allocation strategy

Hi Ted and David,

On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> >
> > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > filesystem has recently been exploring the use of fscache on top of
> > ext4 for caching the network filesystem data for some customer
> > workloads.
> >
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent based filesystem developers pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It currently does not maintain a separate metadata for the cached data
> > it holds, but instead uses the sparseness of the underlying filesystem
> > to track the ranges of the data that is being cached.
> > The bug that has been pointed out with this is that the underlying
> > filesystems could bridge holes between data ranges with zeroes or
> > punch hole in data ranges that contain zeroes. (@David please add if I
> > missed something).
> >
> > David has already begun working on the fix to this by maintaining the
> > metadata of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by various distros, I'd like to understand if there is
> > a potential problem in using fscache on top of ext4 without the fix.
> > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > way to disable such optimizations, I think we'll be okay to use the
> > older versions of fscache even without the fix mentioned above.
>
> Yes, the tuning knob you are looking for is:
>
> What: /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date: August 2012
> Contact: "Theodore Ts'o" <[email protected]>
> Description:
> The maximum number of kilobytes which will be zeroed
> out in preference to creating a new uninitialized
> extent when manipulating an inode's extent tree. Note
> that using a larger value will increase the
> variability of time necessary to complete a random
> write operation (since a 4k random write might turn
> into a much larger write due to the zeroout
> operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random workload, with HDD's, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having a many additional entries in the extent
> tree. So if we have a fallocated region, e.g:
>
> +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U) | W | U | W | Uninit | W | U | Written | ...
> +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this
>
> +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U) | Written | Uninit | W | U | Written | ...
> +-------------+-----------+----------+---+---+---------+
>
> And just simply write zeros to the first "U" in the above figure.
>
> The default value of extent_max_zeroout_kb is 32k. This optimization
> can be disabled by setting extent_max_zeroout_kb to 0. The downside
> of this is a potential degredation of a random write workload (using
> for example the fio benchmark program) on that file system.
>

As far as I understand what cachefiles does, it just truncates out a
sparse file with a big hole and then only ever does direct I/O to fill
in the holes.
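
In other words, something along these lines (just my rough
understanding sketched as code, not actual cachefiles code; the path,
alignment and sizes are made up): the cache file starts out as one big
hole, and pieces of it get filled with aligned O_DIRECT writes as data
arrives:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/path/to/cachefile",
                  O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;

    if (fd < 0)
        return 1;
    ftruncate(fd, 1 << 30);             /* 1GiB file, entirely a hole */

    if (posix_memalign(&buf, 4096, 256 * 1024)) /* O_DIRECT alignment */
        return 1;
    memset(buf, 0xcd, 256 * 1024);

    /* Fill one 256k range as data arrives from the network; the rest
     * is expected to stay a hole (i.e. "not cached"). */
    pwrite(fd, buf, 256 * 1024, 128 * 1024 * 1024);

    free(buf);
    close(fd);
    return 0;
}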

But the description above is all about (un)written extents, which
already have physical blocks allocated, just without data
initialization, so the middle extent can be zeroed out and the
extents merged into one bigger written extent.

However, IMO, that's not what the current cachefiles behavior is...
I think it would be rare for a local fs to allocate blocks over real
holes on a direct I/O write, then zero out and merge the surrounding
extents, since at the very least that touches disk quota.

David pointed me to this message yesterday, since we're building an
on-demand read feature on top of cachefiles as well. But I still fail
to understand why the current cachefiles behavior is wrong.

Could you kindly give some more hints about this? Many thanks!

Thanks,
Gao Xiang

> Cheers,
>
> - Ted

2022-02-22 05:19:05

by Gao Xiang

Subject: Re: Regarding ext4 extent allocation strategy

Hi Ted,

Sorry for pinging again so quickly, but this is quite important for
the container on-demand use cases (and maybe other on-demand
distribution use cases as well). We still prefer the cachefiles
approach since its data plane won't cross the kernel-userspace
boundary once the data is ready (and that's the common case after
the data has been fetched from the network).

Many thanks again!
Gao Xiang

On Fri, Feb 18, 2022 at 11:18:14AM +0800, Gao Xiang wrote:
> Hi Ted and David,
>
> On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> > On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> > >
> > > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > > filesystem has recently been exploring the use of fscache on top of
> > > ext4 for caching the network filesystem data for some customer
> > > workloads.
> > >
> > > However, the maintainer of fscache (David Howells) recently warned us
> > > that a few other extent based filesystem developers pointed out a
> > > theoretical bug in the current implementation of fscache/cachefiles.
> > > It currently does not maintain a separate metadata for the cached data
> > > it holds, but instead uses the sparseness of the underlying filesystem
> > > to track the ranges of the data that is being cached.
> > > The bug that has been pointed out with this is that the underlying
> > > filesystems could bridge holes between data ranges with zeroes or
> > > punch hole in data ranges that contain zeroes. (@David please add if I
> > > missed something).
> > >
> > > David has already begun working on the fix to this by maintaining the
> > > metadata of the cached ranges in fscache itself.
> > > However, since it could take some time for this fix to be approved and
> > > then backported by various distros, I'd like to understand if there is
> > > a potential problem in using fscache on top of ext4 without the fix.
> > > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > > way to disable such optimizations, I think we'll be okay to use the
> > > older versions of fscache even without the fix mentioned above.
> >
> > Yes, the tuning knob you are looking for is:
> >
> > What: /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> > Date: August 2012
> > Contact: "Theodore Ts'o" <[email protected]>
> > Description:
> > The maximum number of kilobytes which will be zeroed
> > out in preference to creating a new uninitialized
> > extent when manipulating an inode's extent tree. Note
> > that using a larger value will increase the
> > variability of time necessary to complete a random
> > write operation (since a 4k random write might turn
> > into a much larger write due to the zeroout
> > operation).
> >
> > (From Documentation/ABI/testing/sysfs-fs-ext4)
> >
> > The basic idea here is that with a random workload, with HDD's, the
> > cost of writing a 16k random write is not much more than the time to
> > write a 4k random write; that is, the cost of HDD seeks dominates.
> > There is also a cost in having a many additional entries in the extent
> > tree. So if we have a fallocated region, e.g:
> >
> > +-------------+---+---+---+----------+---+---+---------+
> > ... + Uninit (U) | W | U | W | Uninit | W | U | Written | ...
> > +-------------+---+---+---+----------+---+---+---------+
> >
> > It's more efficient to have the extent tree look like this
> >
> > +-------------+-----------+----------+---+---+---------+
> > ... + Uninit (U) | Written | Uninit | W | U | Written | ...
> > +-------------+-----------+----------+---+---+---------+
> >
> > And just simply write zeros to the first "U" in the above figure.
> >
> > The default value of extent_max_zeroout_kb is 32k. This optimization
> > can be disabled by setting extent_max_zeroout_kb to 0. The downside
> > of this is a potential degredation of a random write workload (using
> > for example the fio benchmark program) on that file system.
> >
>
> As far as I understand what cachefile does, it just truncates a sparse
> file with a big hole, and do direct IO _only_ all the time to fill the
> holes.
>
> But the description above is all around (un)written extents, which
> already have physical blocks allocated, but just without data
> initialization. So we could zero out the middle extent and merge
> these extents into one bigger written extent.
>
> However, IMO, it's not the case of what the current cachefiles
> behavior is... I think rare local fs allocates blocks with direct
> i/o due to real holes, zero out and merge extents since at least it
> touches disk quota.
>
> David pointed this message yesterday since we're doing on-demand read
> feature by using cachefiles as well. But I still fail to understand why
> the current cachefile behavior is wrong.
>
> Could you kindly leave more hints about this? Many thanks!
>
> Thanks,
> Gao Xiang
>
> > Cheers,
> >
> > - Ted