2011-04-19 02:59:49

by Theodore Y. Ts'o

Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger wrote:
> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should
> > only do anything if there is unwritten data, which is the only
> > case we are concerned with at this point. In any case, this is a
> > simple solution for coreutils until such a time that a more
> > complex solution is added in the kernel (if ever).

I would recommend that coreutils check i_blocks and i_size and only
try using fiemap (with FIEMAP_FLAG_SYNC) if the file appears to be
sparse. That's because FIEMAP_FLAG_SYNC will effectively do the
equivalent of an fsync() system call. Otherwise, in the case of a
freshly untar'ed directory hierarchy which is then copied using "cp
-r", cp would end up calling fsync() for each file in the directory,
with the disastrous performance result that one might expect.

If cp only tries the fiemap optimization on files that appear to be
sparse, it should avoid this problem.
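
A minimal sketch of the check described above, assuming the usual
stat() fields stand in for i_blocks and i_size: st_blocks is counted in
512-byte units, so a file looks sparse when st_blocks * 512 is smaller
than st_size. The helper names looks_sparse and fd_looks_sparse are
hypothetical and are not taken from coreutils.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: guess sparseness from stat() data alone.
 * st_blocks is in 512-byte units, whatever the filesystem block size. */
static bool looks_sparse(const struct stat *st)
{
    assert(st != NULL);
    if (!S_ISREG(st->st_mode))
        return false;               /* only regular files are candidates */
    return (long long)st->st_blocks * 512 < (long long)st->st_size;
}

/* Returns 1 if the file looks sparse, 0 if it does not, -1 on error. */
static int fd_looks_sparse(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return looks_sparse(&st) ? 1 : 0;
}
```

A caller would try the fiemap path only when fd_looks_sparse() returns
1 and fall back to a plain read/write copy otherwise. This is only a
heuristic: a filesystem that compresses data or stores it inline may
make a fully written file look sparse, which would cost an unneeded
sync but not correctness.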

> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
> >
> > I don't see how this will change the problem in any meaningful way. There
> > will still need to be code that is traversing the on-disk mapping, and also
> > keeping it coherent with unwritten data in the page cache.

The advantage of SEEK_HOLE/SEEK_DATA is that we don't need to force an
fsync() of the data.
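
For illustration, this is roughly how a copy loop could walk a file
with such an interface. The sketch assumes the lseek() whence values
SEEK_DATA and SEEK_HOLE as they were later added to Linux (3.1, as far
as I know); the callback type segment_fn and both function names are
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the libc headers do not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical callback type: called once per data region. */
typedef int (*segment_fn)(off_t start, off_t len, void *ctx);

/* Visit every data region of fd in order. Returns 0 on success and
 * -1 on error. Note that lseek() moves the file offset of fd. */
static int for_each_data_segment(int fd, segment_fn fn, void *ctx)
{
    off_t pos = 0;

    assert(fn != NULL);
    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            return errno == ENXIO ? 0 : -1;  /* ENXIO: no more data */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole <= data)
            return -1;   /* error, or the file changed underneath us */
        if (fn(data, hole - data, ctx) != 0)
            return -1;
        pos = hole;
    }
}

/* Example callback: add up the number of bytes reported as data. */
static int count_bytes(off_t start, off_t len, void *ctx)
{
    (void)start;
    *(off_t *)ctx += len;
    return 0;
}
```

A copy would open() the source, pass a callback that uses pread() and
pwrite() on each region, and leave the gaps as holes in the
destination. Nothing here asks the kernel to flush dirty pages first. A
filesystem without real hole tracking may simply report the whole file
as one data region, which is still correct.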

> It seems that we are being messed up by page cache and disk.
> Unwritten flag returned from FIEMAP indicates blocks on disk are not
> written, but it does not say if there is data in page cache. So
> FIEMAP itself just tells user the map on disk. However there is an
> exception for delayed allocation, FIEMAP tells users the data is in
> page cache.
>
> Maybe FIEMAP should return all known messages for unwritten extent, if
> unwritten data exists in page cache, FIEMAP should let users know that
> data is in page cache and space on disk has been preallocated, but
> data has not been flushed to disk. Actually, delayed allocation
> already works like this. Then user-space applications can decide what
> to do. Taking cp as an example, it would copy from the page cache
> rather than ignore it.
>
> We need a definite definition for FIEMAP, in other words, it tells
> users map on disk or both disk and page cache.
>
> If the former one is taken, then FIEMAP should not consider delayed
> allocation. otherwise, FIEMAP should return all known messages for
> unwritten case like delayed allocation.

The fact that the FIEMAP interface definition includes a delayed
allocation bit could be a strong indication that, unlike XFS's bmap
interface, this interface is supposed to return information taking
into account both on-disk and page cache state. If this is the case,
then even though there might be a single on-disk (uninitialized)
extent, if there are pages in the page cache that have not yet been
written out, but which are described by that on-disk extent, then
instead of returning a single struct fiemap_extent for that on-disk
extent, the fiemap ioctl would need to return multiple struct
fiemap_extents, where some would have the FIEMAP_UNWRITTEN bit, and
others would not (since data has been written to the page cache, even
if it hasn't been flushed to disk yet).
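
To make the flags under discussion concrete, here is a hedged sketch
of how userspace could list the extents the fiemap ioctl returns. It
assumes the definitions in <linux/fiemap.h> and <linux/fs.h>, where the
bits are spelled FIEMAP_EXTENT_UNWRITTEN and FIEMAP_EXTENT_DELALLOC;
the function name dump_extents and the batch size are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

#define SKETCH_BATCH 32   /* extents requested per ioctl (arbitrary) */

/* Print each extent reported for fd. `flags` is 0 or FIEMAP_FLAG_SYNC.
 * Returns the number of extents seen, or -1 if the ioctl fails (some
 * filesystems do not implement fiemap at all). */
static long dump_extents(int fd, uint32_t flags, FILE *out)
{
    const size_t sz = sizeof(struct fiemap) +
                      SKETCH_BATCH * sizeof(struct fiemap_extent);
    struct fiemap *fm = malloc(sz);
    uint64_t start = 0;
    long total = 0;
    int last = 0;

    assert(out != NULL);
    if (fm == NULL)
        return -1;
    while (!last) {
        memset(fm, 0, sz);
        fm->fm_start = start;
        fm->fm_length = FIEMAP_MAX_OFFSET - start;
        fm->fm_flags = flags;
        fm->fm_extent_count = SKETCH_BATCH;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            free(fm);
            return -1;
        }
        if (fm->fm_mapped_extents == 0)
            break;                      /* nothing mapped past start */
        for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *fe = &fm->fm_extents[i];

            fprintf(out, "logical=%llu length=%llu%s%s\n",
                    (unsigned long long)fe->fe_logical,
                    (unsigned long long)fe->fe_length,
                    (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? " unwritten" : "",
                    (fe->fe_flags & FIEMAP_EXTENT_DELALLOC) ? " delalloc" : "");
            start = fe->fe_logical + fe->fe_length;
            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                last = 1;
            total++;
        }
    }
    free(fm);
    return total;
}
```

Running this with flags 0 and then with FIEMAP_FLAG_SYNC on a
preallocated file that has dirty pages is one way to observe the
difference being argued about; what each call prints depends on the
filesystem.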

But yes, if we're going to make the case that the FIEMAP interface is
only intended to reflect the on-disk information, then the DELALLOC
bit shouldn't be returned at all, and we should deprecate it.
Anything else leads us to an inconsistent interface.

> > Since FIEMAP already exists for most Linux filesystems, it probably makes
> > sense to implement SEEK_{HOLE,DATA} by calling FIEMAP to get the disk
> > mapping in the first place.

Not if it means forcing an FIEMAP_FLAG_SYNC, which implies an fsync().
If the only way to get consistent data across ext4, btrfs, xfs,
etc. is to force userspace to issue a FIEMAP_FLAG_SYNC, then we need
to have a separate interface of SEEK_HOLE/SEEK_DATA that doesn't
require flushing data to the disk first.

Maybe coreutils will need to use FIEMAP_FLAG_SYNC initially, since
it's the only way to guarantee correct behaviour for XFS. But I would
really rather that not be the long-term way we leave things!

- Ted


2011-04-19 03:30:20

by Yongqiang Yang

Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 10:59 AM, Ted Ts'o wrote:
> On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
>> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger wrote:
>> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should
>> > only do anything if there is unwritten data, which is the only
>> > case we are concerned with at this point. In any case, this is a
>> > simple solution for coreutils until such a time that a more
>> > complex solution is added in the kernel (if ever).
>
> I would recommend that coreutils check i_blocks and i_size and only
> try using fiemap (with FIEMAP_FLAG_SYNC) if the file appears to be
> sparse. That's because FIEMAP_FLAG_SYNC will effectively do the
> equivalent of an fsync() system call. Otherwise, in the case of a
> freshly untar'ed directory hierarchy which is then copied using "cp
> -r", cp would end up calling fsync() for each file in the directory,
> with the disastrous performance result that one might expect.
>
> If cp only tries the fiemap optimization on files that appear to be
> sparse, it should avoid this problem.
>
>> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
>> >
>> > I don't see how this will change the problem in any meaningful way. There
>> > will still need to be code that is traversing the on-disk mapping, and also
>> > keeping it coherent with unwritten data in the page cache.
>
> The advantage of SEEK_HOLE/SEEK_DATA is that we don't need to force an
> fsync() of the data.
>
>> It seems that we are being messed up by page cache and disk.
>> Unwritten flag returned from FIEMAP indicates blocks on disk are not
>> written, but it does not say if there is data in page cache. So
>> FIEMAP itself just tells user the map on disk. However there is an
>> exception for delayed allocation, FIEMAP tells users the data is in
>> page cache.
>>
>> Maybe FIEMAP should return all known messages for unwritten extent, if
>> unwritten data exists in page cache, FIEMAP should let users know that
>> data is in page cache and space on disk has been preallocated, but
>> data has not been flushed to disk. Actually, delayed allocation
>> already works like this. Then user-space applications can decide what
>> to do. Taking cp as an example, it would copy from the page cache
>> rather than ignore it.
>>
>> We need a definite definition for FIEMAP, in other words, it tells
>> users map on disk or both disk and page cache.
>>
>> If the former one is taken, then FIEMAP should not consider delayed
>> allocation. otherwise, FIEMAP should return all known messages for
>> unwritten case like delayed allocation.
>
> The fact that the FIEMAP interface definition includes a delayed
> allocation bit could be a strong indication that, unlike XFS's bmap
> interface, this interface is supposed to return information taking
> into account both on-disk and page cache state. If this is the case,
> then even though there might be a single on-disk (uninitialized)
> extent, if there are pages in the page cache that have not yet been
> written out, but which are described by that on-disk extent, then
> instead of returning a single struct fiemap_extent for that on-disk
> extent, the fiemap ioctl would need to return multiple struct
> fiemap_extents, where some would have the FIEMAP_UNWRITTEN bit, and
> others would not (since data has been written to the page cache, even
> if it hasn't been flushed to disk yet).
Maybe we can add a SPLIT flag like MERGE for ext3, which is set if
there are pages in page cache that have not been written out, but
which are described by unwritten extent on disk, and which does not
cover the whole extent.

Thus, an extent returned by FIEMAP may have UNWRITTEN, NOBYPASS and SPLIT flags.

I noticed that there was a NOBYPASS flag in the initial FIEMAP, which
indicated that data had not been written out to disk. But it no longer
exists in the current implementation.

>
> But yes, if we're going to make the case that the FIEMAP interface is
> only intended to reflect the on-disk information, then the DELALLOC
> bit shouldn't be returned at all, and we should deprecate it.
> Anything else leads us to a inconsistent interface.
>
>> > Since FIEMAP already exists for most Linux filesystems, it probably makes
>> > sense to implement SEEK_{HOLE,DATA} by calling FIEMAP to get the disk
>> > mapping in the first place.
>
> Not if it means forcing an FIEMAP_FLAG_SYNC, which implies an fsync().
> If the only way to get consistent data across ext4, btrfs, xfs,
> etc. is to force userspace to issue a FIEMAP_FLAG_SYNC, then we need
> to have a separate interface of SEEK_HOLE/SEEK_DATA that doesn't
> require flushing data to the disk first.
>
> Maybe coreutils will need to use FIEMAP_FLAG_SYNC initially, since
> it's the only way to guarantee correct behaviour for XFS. But I would
> really rather that not be the long-term way we leave things!
>
> - Ted
>



--
Best Wishes
Yongqiang Yang

2011-04-19 04:14:12

by Dave Chinner

Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Mon, Apr 18, 2011 at 10:59:49PM -0400, Ted Ts'o wrote:
> On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
> > On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger wrote:
> > > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should
> > > only do anything if there is unwritten data, which is the only
> > > case we are concerned with at this point. In any case, this is a
> > > simple solution for coreutils until such a time that a more
> > > complex solution is added in the kernel (if ever).
>
> I would recommend that coreutils check i_blocks and i_size and only
> try using fiemap (with FIEMAP_FLAG_SYNC) if the file appears to be
> sparse. That's because FIEMAP_FLAG_SYNC will effectively do the
> equivalent of an fsync() system call. Otherwise, in the case of a
> freshly untar'ed directory hierarchy which is then copied using "cp
> -r", cp would end up calling fsync() for each file in the directory,
> with the disastrous performance result that one might expect.
>
> If cp only tries the fiemap optimization on files that appear to be
> sparse, it should avoid this problem.
>
> > > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
> > >
> > > I don't see how this will change the problem in any meaningful way. There
> > > will still need to be code that is traversing the on-disk mapping, and also
> > > keeping it coherent with unwritten data in the page cache.
>
> The advantage of SEEK_HOLE/SEEK_DATA is that we don't need to force an
> fsync() of the data.
>
> > It seems that we are being messed up by page cache and disk.
> > Unwritten flag returned from FIEMAP indicates blocks on disk are not
> > written, but it does not say if there is data in page cache. So
> > FIEMAP itself just tells user the map on disk. However there is an
> > exception for delayed allocation, FIEMAP tells users the data is in
> > page cache.
> >
> > Maybe FIEMAP should return all known messages for unwritten extent, if
> > unwritten data exists in page cache, FIEMAP should let users know that
> > data is in page cache and space on disk has been preallocated, but
> > data has not been flushed to disk. Actually, delayed allocation
> > already works like this. Then user-space applications can decide what
> > to do. Taking cp as an example, it would copy from the page cache
> > rather than ignore it.
> >
> > We need a definite definition for FIEMAP, in other words, it tells
> > users map on disk or both disk and page cache.
> >
> > If the former one is taken, then FIEMAP should not consider delayed
> > allocation. otherwise, FIEMAP should return all known messages for
> > unwritten case like delayed allocation.
>
> The fact that the FIEMAP interface definition includes a delayed
> allocation bit could be a strong indication that, unlike XFS's bmap
> interface, this interface is supposed to return information
> taking into account both on-disk and page cache information.

As I said in a previous email, XFS uses delalloc as a first class
extent and reporting them does not require looking at the page
cache. Therefore whatever historical behaviour xfs_bmap used is
irrelevant - supporting delalloc extents was a 2 or 3 line change
and in no way was intended to report anything other than the current
extents. Even at that time, "dirty page cache ranges" != delalloc
extents, and this appears to be the way ext4 has _implemented_
reporting of delalloc extents.

Indeed, I was the one that suggested it be supported because it is
useful to know the delalloc state _for debugging purposes_. Now you
are trying to redefine what a delalloc extent is to match the ext4
implementation, and then extend that same reasoning to change what
an unwritten extent means to match how _you think_ the ext4
implementation works(*).

And besides, if I use your same logical progression you've
applied to FIEMAP via the ext4 delalloc extent implementation, using
the XFS delalloc extent implementation in no way implies page cache
coherency for FIEMAP. :)

FIEMAP is for reporting extent state. What that means is filesystem
specific, and requires knowledge of the filesystem to use
effectively. If you want to report coherent state for working out
what ranges to copy, implement SEEK_HOLE/SEEK_DATA (which would use
much of the FIEMAP infrastructure). Redefining the FIEMAP API will
not solve the problem of different filesystems behaving in a manner
that is not useful for coreutils....

> Maybe coreutils will need to use FIEMAP_FLAG_SYNC initially, since
> it's the only way to guarantee correct behaviour for XFS. But I would
> really rather that not be the long-term way we leave things!

(*) It's not XFS specific - ext4 behaves exactly the same way (as
Eric kindly pointed out). IOWs, it's likely that all filesystems
need the SYNC flag for one reason or another and that indicates to
me that FIEMAP is simply not the right interface for coreutils to be
using for their intended purpose.

Cheers,

Dave.
--
Dave Chinner

2011-04-19 05:27:45

by Christoph Hellwig

Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Mon, Apr 18, 2011 at 10:59:49PM -0400, Ted Ts'o wrote:
> Maybe coreutils will need to use FIEMAP_FLAG_SYNC initially, since
> it's the only way to guarantee correct behaviour for XFS. But I would
> really rather that not be the long-term way we leave things!

As Eric pointed out, both ext4 and XFS have the same behaviour when
writing into an unwritten extent. I think you are a bit confused because
ext4 also got basic handling of delalloc extents wrong before commit
6d9c85eb700bd3ac59e63bb9de463dea1aca084c, which never was a problem with
XFS. It would be nice if ext4 developers had sent the included
regression test for xfs so that everyone could verify this behaviour,
btw.

To properly report unwritten extents that have been written to but
not yet synced, we'd need to move fiemap away from the on-disk state
reporting done so far and do something that is purely in-memory. It
would be doable by walking the pagecache and checking for the buffer
unwritten flag in a loop over the pages, but I'm honestly not sure it's
going to help much. In fact, given that unwritten extents were
specifically allocated beforehand, it doesn't seem like an overly
smart idea to skip them in a copy - yes, it will save space, but it
also undoes the previous explicit preallocation. If people want that,
they should rather add a new option to cp to turn zeroes into holes.
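
As a rough illustration of turning zeroes into holes during a copy
(GNU cp documents a --sparse=always mode with this effect), the sketch
below reads fixed-size blocks and seeks over the all-zero ones instead
of writing them. The function names and the 4096-byte block size are
assumptions made for the example, and EINTR handling is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLOCK 4096   /* arbitrary block size for the example */

static bool is_all_zero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return false;
    return true;
}

/* Copy src to dst, seeking over all-zero blocks instead of writing
 * them. Assumes dst is empty and both offsets start at 0.
 * Returns 0 on success, -1 on error. */
static int copy_punching_zeros(int src, int dst)
{
    unsigned char buf[SKETCH_BLOCK];
    off_t total = 0;
    ssize_t n;

    while ((n = read(src, buf, sizeof buf)) > 0) {
        if (is_all_zero(buf, (size_t)n)) {
            if (lseek(dst, n, SEEK_CUR) < 0)   /* leave a hole */
                return -1;
        } else {
            for (ssize_t off = 0; off < n; ) {
                ssize_t w = write(dst, buf + off, (size_t)(n - off));
                if (w <= 0)
                    return -1;
                off += w;
            }
        }
        total += n;
    }
    if (n < 0)
        return -1;
    /* A trailing hole is not part of the file until the size is set. */
    return ftruncate(dst, total);
}

/* Hypothetical convenience wrapper working on path names. */
static int copy_file_sparse(const char *src_path, const char *dst_path)
{
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dst < 0) {
        close(src);
        return -1;
    }
    int rc = copy_punching_zeros(src, dst);
    if (close(dst) != 0)
        rc = -1;
    close(src);
    return rc;
}
```

Because this works on the data as read, it needs neither FIEMAP nor a
sync. It also discards any preallocation in the source, so it fits
best as an explicit opt-in.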