2011-04-14 15:50:10

by Eric Sandeen

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 4/14/11 9:59 AM, P?draig Brady wrote:
> On 14/04/11 15:02, Markus Trippelsdorf wrote:
>>>> Hi P?draig,
>>>>
>>>> here you go:
>>>> + filefrag -v unwritten.withdata
>>>> Filesystem type is: ef53
>>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
>>>> ext logical physical expected length flags
>>>> 0 0 274432 2560 unwritten,eof
>>>> unwritten.withdata: 1 extent found
>>>>
>>>> Please notice that this also happens with ext4 on the same kernel.
>>>> Btrfs is fine.
>>>
>> `filefrag -vs` fixes the issue on both xfs and ext4.
>
> So in summary, currently on (2.6.39-rc3), the following
> will (usually?) report a single unwritten extent,
> on both ext4 and xfs
>
> fallocate -l 10MiB -n k
> dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
> filefrag -v k # grep for an extent without unwritten || fail

right, that's what I see too in testing.

But would the coreutils install have done a preallocation of the destination file?

Otherwise this looks like a different bug...

> This particular issue has been discussed so far at:
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
> Note there it was stated there that ext4 had this
> fixed as of 2.6.39-rc1, so maybe there is something lurking?

ext4 got a fix, but not xfs, I guess. My poor brain can't remember, I think I started looking into it, but it's clearly still broken.

Still, I don't know for sure what happened to Markus - did something preallocate, in his case?

-Eric

> cheers,
> P?draig.
>
> _______________________________________________
> xfs mailing list
> [email protected]
> http://oss.sgi.com/mailman/listinfo/xfs
>


2011-04-14 22:59:04

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Thu, Apr 14, 2011 at 10:50:10AM -0500, Eric Sandeen wrote:
> On 4/14/11 9:59 AM, P?draig Brady wrote:
> > On 14/04/11 15:02, Markus Trippelsdorf wrote:
> >>>> Hi P?draig,
> >>>>
> >>>> here you go:
> >>>> + filefrag -v unwritten.withdata
> >>>> Filesystem type is: ef53
> >>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
> >>>> ext logical physical expected length flags
> >>>> 0 0 274432 2560 unwritten,eof
> >>>> unwritten.withdata: 1 extent found
> >>>>
> >>>> Please notice that this also happens with ext4 on the same kernel.
> >>>> Btrfs is fine.
> >>>
> >> `filefrag -vs` fixes the issue on both xfs and ext4.
> >
> > So in summary, currently on (2.6.39-rc3), the following
> > will (usually?) report a single unwritten extent,
> > on both ext4 and xfs
> >
> > fallocate -l 10MiB -n k
> > dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
> > filefrag -v k # grep for an extent without unwritten || fail
>
> right, that's what I see too in testing.
>
> But would the coreutils install have done a preallocation of the destination file?
>
> Otherwise this looks like a different bug...
>
> > This particular issue has been discussed so far at:
> > http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
> > Note there it was stated there that ext4 had this
> > fixed as of 2.6.39-rc1, so maybe there is something lurking?
>
> ext4 got a fix, but not xfs, I guess. My poor brain can't remember, I think I started looking into it, but it's clearly still broken.
>
> Still, I don't know for sure what happened to Markus - did something preallocate, in his case?

Unwritten extent mapping behaves in an unexpected way due to
buffered writeback not occurring immediately. Extent conversion
doesn't occur until the data is on disk, and for buffered IO you
need an fdatasync to ensure that has occurred.

That is:

$ xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c "bmap -vp" /mnt/test/foo
wrote 5120/5120 bytes at offset 0
5 KiB, 2 ops; 0.0000 sec (62.600 MiB/sec and 25641.0256 ops/sec)
/mnt/test/foo:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..20479]: 268984..289463 0 (268984..289463) 20480 10000

Data has not been written yet, so it is still unwritten. The same
test with a fsync shows:

$ sudo xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c fsync -c "bmap -vp" /mnt/test/foo
wrote 5120/5120 bytes at offset 0
5 KiB, 2 ops; 0.0000 sec (87.193 MiB/sec and 35714.2857 ops/sec)
/mnt/test/foo:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..15]: 268984..268999 0 (268984..268999) 16 00000
1: [16..20479]: 269000..289463 0 (269000..289463) 20464 10000

Everything is fine.

So this seems like an application error to me. If you are going to
use fiemap to determine what ranges to copy, then you have to
fdatasync the source file first to guarantee that preallocated
extents have been converted to written state before mapping the
file....

Cheers,

Dave.
--
Dave Chinner
[email protected]

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2011-04-14 23:31:29

by Pádraig Brady

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 14/04/11 23:59, Dave Chinner wrote:
> On Thu, Apr 14, 2011 at 10:50:10AM -0500, Eric Sandeen wrote:
>> On 4/14/11 9:59 AM, P?draig Brady wrote:
>>> On 14/04/11 15:02, Markus Trippelsdorf wrote:
>>>>>> Hi P?draig,
>>>>>>
>>>>>> here you go:
>>>>>> + filefrag -v unwritten.withdata
>>>>>> Filesystem type is: ef53
>>>>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
>>>>>> ext logical physical expected length flags
>>>>>> 0 0 274432 2560 unwritten,eof
>>>>>> unwritten.withdata: 1 extent found
>>>>>>
>>>>>> Please notice that this also happens with ext4 on the same kernel.
>>>>>> Btrfs is fine.
>>>>>
>>>> `filefrag -vs` fixes the issue on both xfs and ext4.
>>>
>>> So in summary, currently on (2.6.39-rc3), the following
>>> will (usually?) report a single unwritten extent,
>>> on both ext4 and xfs
>>>
>>> fallocate -l 10MiB -n k
>>> dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
>>> filefrag -v k # grep for an extent without unwritten || fail
>>
>> right, that's what I see too in testing.
>>
>> But would the coreutils install have done a preallocation of the destination file?
>>
>> Otherwise this looks like a different bug...
>>
>>> This particular issue has been discussed so far at:
>>> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
>>> Note there it was stated there that ext4 had this
>>> fixed as of 2.6.39-rc1, so maybe there is something lurking?
>>
>> ext4 got a fix, but not xfs, I guess. My poor brain can't remember, I think I started looking into it, but it's clearly still broken.
>>
>> Still, I don't know for sure what happened to Markus - did something preallocate, in his case?
>
> Unwritten extent mapping behaves in an unexpected way due to
> buffered writeback not occurring immediately. Extent conversion
> doesn't occur until the data is on disk, and for buffered IO you
> need an fdatasync to ensure that has occurred.
>
> That is:
>
> $ xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c "bmap -vp" /mnt/test/foo
> wrote 5120/5120 bytes at offset 0
> 5 KiB, 2 ops; 0.0000 sec (62.600 MiB/sec and 25641.0256 ops/sec)
> /mnt/test/foo:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..20479]: 268984..289463 0 (268984..289463) 20480 10000
>
> Data has not been written yet, so it is still unwritten. The same
> test with a fsync shows:
>
> $ sudo xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c fsync -c "bmap -vp" /mnt/test/foo
> wrote 5120/5120 bytes at offset 0
> 5 KiB, 2 ops; 0.0000 sec (87.193 MiB/sec and 35714.2857 ops/sec)
> /mnt/test/foo:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..15]: 268984..268999 0 (268984..268999) 16 00000
> 1: [16..20479]: 269000..289463 0 (269000..289463) 20464 10000
>
> Everything is fine.
>
> So this seems like an application error to me. If you are going to
> use fiemap to determine what ranges to copy, then you have to
> fdatasync the source file first to guarantee that preallocated
> extents have been converted to written state before mapping the
> file....

Well IMHO there should be a difference between
knowing where you are going to write, and actually writing to disk.
I.E. one shouldn't need to write the whole way to the device
before returning a valid fiemap. If a particular file system
implementation needs to sync to return a valid fiemap,
then it should be implicit.

cheers,
P?draig.

2011-04-15 00:09:49

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Fri, Apr 15, 2011 at 12:29:46AM +0100, P?draig Brady wrote:
> On 14/04/11 23:59, Dave Chinner wrote:
> > On Thu, Apr 14, 2011 at 10:50:10AM -0500, Eric Sandeen wrote:
> >> On 4/14/11 9:59 AM, P?draig Brady wrote:
> >>> On 14/04/11 15:02, Markus Trippelsdorf wrote:
> >>>>>> Hi P?draig,
> >>>>>>
> >>>>>> here you go:
> >>>>>> + filefrag -v unwritten.withdata
> >>>>>> Filesystem type is: ef53
> >>>>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
> >>>>>> ext logical physical expected length flags
> >>>>>> 0 0 274432 2560 unwritten,eof
> >>>>>> unwritten.withdata: 1 extent found
> >>>>>>
> >>>>>> Please notice that this also happens with ext4 on the same kernel.
> >>>>>> Btrfs is fine.
> >>>>>
> >>>> `filefrag -vs` fixes the issue on both xfs and ext4.
> >>>
> >>> So in summary, currently on (2.6.39-rc3), the following
> >>> will (usually?) report a single unwritten extent,
> >>> on both ext4 and xfs
> >>>
> >>> fallocate -l 10MiB -n k
> >>> dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
> >>> filefrag -v k # grep for an extent without unwritten || fail
> >>
> >> right, that's what I see too in testing.
> >>
> >> But would the coreutils install have done a preallocation of the destination file?
> >>
> >> Otherwise this looks like a different bug...
> >>
> >>> This particular issue has been discussed so far at:
> >>> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
> >>> Note there it was stated there that ext4 had this
> >>> fixed as of 2.6.39-rc1, so maybe there is something lurking?
> >>
> >> ext4 got a fix, but not xfs, I guess. My poor brain can't remember, I think I started looking into it, but it's clearly still broken.
> >>
> >> Still, I don't know for sure what happened to Markus - did something preallocate, in his case?
> >
> > Unwritten extent mapping behaves in an unexpected way due to
> > buffered writeback not occurring immediately. Extent conversion
> > doesn't occur until the data is on disk, and for buffered IO you
> > need an fdatasync to ensure that has occurred.
> >
> > That is:
> >
> > $ xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c "bmap -vp" /mnt/test/foo
> > wrote 5120/5120 bytes at offset 0
> > 5 KiB, 2 ops; 0.0000 sec (62.600 MiB/sec and 25641.0256 ops/sec)
> > /mnt/test/foo:
> > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..20479]: 268984..289463 0 (268984..289463) 20480 10000
> >
> > Data has not been written yet, so it is still unwritten. The same
> > test with a fsync shows:
> >
> > $ sudo xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c fsync -c "bmap -vp" /mnt/test/foo
> > wrote 5120/5120 bytes at offset 0
> > 5 KiB, 2 ops; 0.0000 sec (87.193 MiB/sec and 35714.2857 ops/sec)
> > /mnt/test/foo:
> > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..15]: 268984..268999 0 (268984..268999) 16 00000
> > 1: [16..20479]: 269000..289463 0 (269000..289463) 20464 10000
> >
> > Everything is fine.
> >
> > So this seems like an application error to me. If you are going to
> > use fiemap to determine what ranges to copy, then you have to
> > fdatasync the source file first to guarantee that preallocated
> > extents have been converted to written state before mapping the
> > file....
>
> Well IMHO there should be a difference between
> knowing where you are going to write, and actually writing to disk.
> I.E. one shouldn't need to write the whole way to the device
> before returning a valid fiemap. If a particular file system
> implementation needs to sync to return a valid fiemap,
> then it should be implicit.

No, this was explicitly laid out in the fiemap interface discussions
- it's up to the applicaiton to decide if it needs to do a sync
first. That's what the FIEMAP_FLAG_SYNC control flag is for.
This forces the fiemap call to do a fsync _before_ getting the
mapping. If you want to know the exact layout of the file is, then
you must use this flag.

Even so, it is recognised that this is racy - any use of the block
map has a time-of-read-to-time-of-use race condition that means you
have to _verify_ the copy after it completes. FYI, that's what
xfs_fsr does when copying based on extent maps - if the inode has
changed in _any way_ during the copy, it aborts the copy of that
file.

i.e. using fiemap for copying is at best a *hint* about the regions
that need copying, and it is in no way a guarantee that you'll get
all the information you need to make accurate copy even if you do
use the synchronous variant.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-15 05:01:09

by Andreas Dilger

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 2011-04-14, at 6:09 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Apr 15, 2011 at 12:29:46AM +0100, Pádraig Brady wrote:
>> On 14/04/11 23:59, Dave Chinner wrote:
>>> On Thu, Apr 14, 2011 at 10:50:10AM -0500, Eric Sandeen wrote:
>>>> On 4/14/11 9:59 AM, Pádraig Brady wrote:
>>>>> On 14/04/11 15:02, Markus Trippelsdorf wrote:
>>>>>>>> Hi Pádraig,
>>>>>>>>
>>>>>>>> here you go:
>>>>>>>> + filefrag -v unwritten.withdata
>>>>>>>> Filesystem type is: ef53
>>>>>>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
>>>>>>>> ext logical physical expected length flags
>>>>>>>> 0 0 274432 2560 unwritten,eof
>>>>>>>> unwritten.withdata: 1 extent found
>>>>>>>>
>>>>>>>> Please notice that this also happens with ext4 on the same kernel.
>>>>>>>> Btrfs is fine.
>>>>>>>
>>>>>> `filefrag -vs` fixes the issue on both xfs and ext4.
>>>>>
>>>>> So in summary, currently on (2.6.39-rc3), the following
>>>>> will (usually?) report a single unwritten extent,
>>>>> on both ext4 and xfs
>>>>>
>>>>> fallocate -l 10MiB -n k
>>>>> dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
>>>>> filefrag -v k # grep for an extent without unwritten || fail
>>>>
>>>> right, that's what I see too in testing.
>>>>
>>>> But would the coreutils install have done a preallocation of the destination file?
>>>>
>>>> Otherwise this looks like a different bug...
>>>>
>>>>> This particular issue has been discussed so far at:
>>>>> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
>>>>> Note there it was stated there that ext4 had this
>>>>> fixed as of 2.6.39-rc1, so maybe there is something lurking?
>>>>
>>>> ext4 got a fix, but not xfs, I guess. My poor brain can't remember, I think I started looking into it, but it's clearly still broken.
>>>>
>>>> Still, I don't know for sure what happened to Markus - did something preallocate, in his case?
>>>
>>> Unwritten extent mapping behaves in an unexpected way due to
>>> buffered writeback not occurring immediately. Extent conversion
>>> doesn't occur until the data is on disk, and for buffered IO you
>>> need an fdatasync to ensure that has occurred.
>>>
>>> That is:
>>>
>>> $ xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c "bmap -vp" /mnt/test/foo
>>> wrote 5120/5120 bytes at offset 0
>>> 5 KiB, 2 ops; 0.0000 sec (62.600 MiB/sec and 25641.0256 ops/sec)
>>> /mnt/test/foo:
>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>>> 0: [0..20479]: 268984..289463 0 (268984..289463) 20480 10000
>>>
>>> Data has not been written yet, so it is still unwritten. The same
>>> test with a fsync shows:
>>>
>>> $ sudo xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c fsync -c "bmap -vp" /mnt/test/foo
>>> wrote 5120/5120 bytes at offset 0
>>> 5 KiB, 2 ops; 0.0000 sec (87.193 MiB/sec and 35714.2857 ops/sec)
>>> /mnt/test/foo:
>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>>> 0: [0..15]: 268984..268999 0 (268984..268999) 16 00000
>>> 1: [16..20479]: 269000..289463 0 (269000..289463) 20464 10000
>>>
>>> Everything is fine.
>>>
>>> So this seems like an application error to me. If you are going to
>>> use fiemap to determine what ranges to copy, then you have to
>>> fdatasync the source file first to guarantee that preallocated
>>> extents have been converted to written state before mapping the
>>> file....
>>
>> Well IMHO there should be a difference between
>> knowing where you are going to write, and actually writing to disk.
>> I.E. one shouldn't need to write the whole way to the device
>> before returning a valid fiemap. If a particular file system
>> implementation needs to sync to return a valid fiemap,
>> then it should be implicit.
>
> No, this was explicitly laid out in the fiemap interface discussions
> - it's up to the applicaiton to decide if it needs to do a sync
> first. That's what the FIEMAP_FLAG_SYNC control flag is for.
> This forces the fiemap call to do a fsync _before_ getting the
> mapping. If you want to know the exact layout of the file is, then
> you must use this flag.
>
> Even so, it is recognised that this is racy - any use of the block
> map has a time-of-read-to-time-of-use race condition that means you
> have to _verify_ the copy after it completes. FYI, that's what
> xfs_fsr does when copying based on extent maps - if the inode has
> changed in _any way_ during the copy, it aborts the copy of that
> file.
>
> i.e. using fiemap for copying is at best a *hint* about the regions
> that need copying, and it is in no way a guarantee that you'll get
> all the information you need to make accurate copy even if you do
> use the synchronous variant.

I would tend to agree with Pádraig. If there is data in the mapping (regardless of whether it is on disk or not), the FIEMAP should return this to the caller. The SYNC flag is only intended to flush the data to disk for tools that are doing direct-to-disk operations on the data.

Otherwise the UNMAPPED flag is useless, since even with "check, copy, check" there is no guarantee that the inode is changed _during_ the copy operation. It could have been written into the cache _before_ the FIEMAP and remain unchanged and in your case there would be no way to know any data was ever written to the file without SYNC on ever single file before FIEMAP.

Cheers, Andreas-

2011-04-15 08:53:48

by Jim Meyering

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

Dave Chinner wrote:
> On Fri, Apr 15, 2011 at 12:29:46AM +0100, P?draig Brady wrote:
...
>> Well IMHO there should be a difference between
>> knowing where you are going to write, and actually writing to disk.
>> I.E. one shouldn't need to write the whole way to the device
>> before returning a valid fiemap. If a particular file system
>> implementation needs to sync to return a valid fiemap,
>> then it should be implicit.
>
> No, this was explicitly laid out in the fiemap interface discussions
> - it's up to the applicaiton to decide if it needs to do a sync
> first. That's what the FIEMAP_FLAG_SYNC control flag is for.
> This forces the fiemap call to do a fsync _before_ getting the
> mapping. If you want to know the exact layout of the file is, then
> you must use this flag.
>
> Even so, it is recognised that this is racy - any use of the block
> map has a time-of-read-to-time-of-use race condition that means you
> have to _verify_ the copy after it completes. FYI, that's what
> xfs_fsr does when copying based on extent maps - if the inode has
> changed in _any way_ during the copy, it aborts the copy of that
> file.
>
> i.e. using fiemap for copying is at best a *hint* about the regions
> that need copying, and it is in no way a guarantee that you'll get
> all the information you need to make accurate copy even if you do
> use the synchronous variant.

Hi Dave,

Can you or anyone else point to authoritative documentation
(or even a summary of those "discussions") of FIEMAP semantics?
I'm hoping the semantics are the same for all file system types.

I had understood that cp's use of FIEMAP_FLAG_SYNC was not only
unnecessary, but even undesirable, given a new-enough kernel.
That's why coreutils-8.11 resorts to using the workaround of
FIEMAP_FLAG_SYNC only when uname says the kernel is 2.6.[0..38].

2011-04-15 17:16:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Fri, Apr 15, 2011 at 10:53:48AM +0200, Jim Meyering wrote:
> Can you or anyone else point to authoritative documentation
> (or even a summary of those "discussions") of FIEMAP semantics?

A large part of the problem is that there's none and the semantics
is a hode-podge of random flags that mostly make little sense
outside of fs developers debugging the layout that we created. The
closest to a description is Documentation/filesystems/fiemap.txt
in the kernel tree, but that doesn't document most of the delicate
points.


2011-04-16 00:50:43

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Thu, Apr 14, 2011 at 11:01:04PM -0600, Andreas Dilger wrote:
> On 2011-04-14, at 6:09 PM, Dave Chinner <[email protected]>
> wrote:
> > No, this was explicitly laid out in the fiemap interface
> > discussions - it's up to the applicaiton to decide if it needs
> > to do a sync first. That's what the FIEMAP_FLAG_SYNC control
> > flag is for. This forces the fiemap call to do a fsync _before_
> > getting the mapping. If you want to know the exact layout of the
> > file is, then you must use this flag.
> >
> > Even so, it is recognised that this is racy - any use of the
> > block map has a time-of-read-to-time-of-use race condition that
> > means you have to _verify_ the copy after it completes. FYI,
> > that's what xfs_fsr does when copying based on extent maps - if
> > the inode has changed in _any way_ during the copy, it aborts
> > the copy of that file.
> >
> > i.e. using fiemap for copying is at best a *hint* about the
> > regions that need copying, and it is in no way a guarantee that
> > you'll get all the information you need to make accurate copy
> > even if you do use the synchronous variant.
>
> I would tend to agree with P?draig. If there is data in the
> mapping (regardless of whether it is on disk or not), the FIEMAP
> should return this to the caller. The SYNC flag is only intended
> to flush the data to disk for tools that are doing
> direct-to-disk operations on the data.

What you are suggesting is that FIEMAP needs to be page cache
coherent, and that is far, far away from the intended use of the
interface. Even consiering that you need to looking for active pages
in the page cache when mapping extents say to me that you are
doing something very wrong.

Unwritten extents remain unwritten until the data is physically
written to them. Therefore, to change their state, you need to sync
the data covering the range. _Lying_ about whether an extent is in
the unwritten state is a really bad precedence to set, especially as
it is then guaranteed to change state when a crash occurs (Why did
recovery zero out my file? FIEMAP said it contained data before my
system crashed!).

Don't try to mangle the API semantics every time someone doesn't
understand how to use FIEMAP reliably. If you need the extent list
returned by FIEMAP to match what is in the page cache *regardless of

> Otherwise the UNMAPPED flag is useless, since even with "check,
> copy, check" there is no guarantee that the inode is changed
> _during_ the copy operation. It could have been written into the
> cache _before_ the FIEMAP and remain unchanged and in your case
> there would be no way to know any data was ever written to the
> file without SYNC on ever single file before FIEMAP.

I can't find any UNMAPPED flag in the FIEMAP interface, so I have no
idea what you are refering to here.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-16 05:11:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 2011-04-15, at 6:50 PM, Dave Chinner <[email protected]> wrote:
> What you are suggesting is that FIEMAP needs to be page cache
> coherent, and that is far, far away from the intended use of the
> interface. Even consiering that you need to looking for active pages
> in the page cache when mapping extents say to me that you are
> doing something very wrong.
>
> Unwritten extents remain unwritten until the data is physically
> written to them. Therefore, to change their state, you need to sync
> the data covering the range.

In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is fine.

Cheers, Andreas

2011-04-16 06:05:51

by Yongqiang Yang

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Sat, Apr 16, 2011 at 8:50 AM, Dave Chinner <[email protected]> wrote:
> On Thu, Apr 14, 2011 at 11:01:04PM -0600, Andreas Dilger wrote:
>> On 2011-04-14, at 6:09 PM, Dave Chinner <[email protected]>
>> wrote:
>> > No, this was explicitly laid out in the fiemap interface
>> > discussions - it's up to the applicaiton to decide if it needs
>> > to do a sync first. That's what the FIEMAP_FLAG_SYNC control
>> > flag is for. ?This forces the fiemap call to do a fsync _before_
>> > getting the mapping. If you want to know the exact layout of the
>> > file is, then you must use this flag.
>> >
>> > Even so, it is recognised that this is racy - any use of the
>> > block map has a time-of-read-to-time-of-use race condition that
>> > means you have to _verify_ the copy after it completes. FYI,
>> > that's what xfs_fsr does when copying based on extent maps - if
>> > the inode has changed in _any way_ during the copy, it aborts
>> > the copy of that file.
>> >
>> > i.e. using fiemap for copying is at best a *hint* about the
>> > regions that need copying, and it is in no way a guarantee that
>> > you'll get all the information you need to make accurate copy
>> > even if you do use the synchronous variant.
>>
>> I would tend to agree with P?draig. If there is data in the
>> mapping (regardless of whether it is on disk or not), the FIEMAP
>> should return this to the caller. ?The SYNC flag is only intended
>> to flush the data to disk for tools that are doing
>> direct-to-disk operations on the data.
>
> What you are suggesting is that FIEMAP needs to be page cache
> coherent, and that is far, far away from the intended use of the
> interface. Even consiering that you need to looking for active pages
> in the page cache when mapping extents say to me that you are
> doing something very wrong.
>
> Unwritten extents remain unwritten until the data is physically
> written to them. Therefore, to change their state, you need to sync
No, buffered writes change their state without sync.

> the data covering the range. ?_Lying_ about whether an extent is in
> the unwritten state is a really bad precedence to set, especially as
> it is then guaranteed to change state when a crash occurs (Why did
> recovery zero out my file? FIEMAP said it contained data before my
> system crashed!).

All filesystems have metadata in memory which is not flushed to
permanent storage. e.g. if a extent exists in memory, but itself and
corresponding data are not flushed to permanent storage. So you said
above can only be achieved by sync before FIEMAP. Otherwise if a
crash occurs, FIEMAP can not find data before system crashed.

Without delayed allocation, there is no difference between
preallocation case(fallocate) and normal cases.

> Don't try to mangle the API semantics every time someone doesn't
> understand how to use FIEMAP reliably. If you need the extent list
> returned by FIEMAP to match what is in the page cache *regardless of
>
>> Otherwise the UNMAPPED flag is useless, since even with "check,
>> copy, check" there is no guarantee that the inode is changed
>> _during_ the copy operation. It could have been written into the
>> cache _before_ the FIEMAP and remain unchanged and in your case
>> there would be no way to know any data was ever written to the
>> file without SYNC on ever single file before FIEMAP.
>
> I can't find any UNMAPPED flag in the FIEMAP interface, so I have no
> idea what you are refering to here.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>



--
Best Wishes
Yongqiang Yang

2011-04-16 12:21:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)


On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:

> In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is fine.

Except that if someone is copying a large delay allocated file, it will cause
the file to immediately snapped to disk, which might not be the greatest
thing in the world. Christoph is write, SEEK_HOLE and SEEK_DATA are
a much better API for what cp woulld lke to do. Unfortunately it hasn't
been implemented yet in the VFS...

- Ted

2011-04-18 00:36:00

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Sat, Apr 16, 2011 at 02:05:51PM +0800, Yongqiang Yang wrote:
> On Sat, Apr 16, 2011 at 8:50 AM, Dave Chinner <[email protected]> wrote:
> > On Thu, Apr 14, 2011 at 11:01:04PM -0600, Andreas Dilger wrote:
> >> On 2011-04-14, at 6:09 PM, Dave Chinner <[email protected]>
> >> wrote:
> >> > No, this was explicitly laid out in the fiemap interface
> >> > discussions - it's up to the applicaiton to decide if it needs
> >> > to do a sync first. That's what the FIEMAP_FLAG_SYNC control
> >> > flag is for. ?This forces the fiemap call to do a fsync _before_
> >> > getting the mapping. If you want to know the exact layout of the
> >> > file is, then you must use this flag.
> >> >
> >> > Even so, it is recognised that this is racy - any use of the
> >> > block map has a time-of-read-to-time-of-use race condition that
> >> > means you have to _verify_ the copy after it completes. FYI,
> >> > that's what xfs_fsr does when copying based on extent maps - if
> >> > the inode has changed in _any way_ during the copy, it aborts
> >> > the copy of that file.
> >> >
> >> > i.e. using fiemap for copying is at best a *hint* about the
> >> > regions that need copying, and it is in no way a guarantee that
> >> > you'll get all the information you need to make accurate copy
> >> > even if you do use the synchronous variant.
> >>
> >> I would tend to agree with P?draig. If there is data in the
> >> mapping (regardless of whether it is on disk or not), the FIEMAP
> >> should return this to the caller. ?The SYNC flag is only intended
> >> to flush the data to disk for tools that are doing
> >> direct-to-disk operations on the data.
> >
> > What you are suggesting is that FIEMAP needs to be page cache
> > coherent, and that is far, far away from the intended use of the
> > interface. Even consiering that you need to looking for active pages
> > in the page cache when mapping extents say to me that you are
> > doing something very wrong.
> >
> > Unwritten extents remain unwritten until the data is physically
> > written to them. Therefore, to change their state, you need to sync
> No, buffered writes change their state without sync.

They shouldn't.

> > the data covering the range. ?_Lying_ about whether an extent is in
> > the unwritten state is a really bad precedence to set, especially as
> > it is then guaranteed to change state when a crash occurs (Why did
> > recovery zero out my file? FIEMAP said it contained data before my
> > system crashed!).
>
> All filesystems have metadata in memory which is not flushed to
> permanent storage. e.g. if a extent exists in memory, but itself and
> corresponding data are not flushed to permanent storage.

Sure, but in the case of unwritten extents, XFS does not change the
metadata state in memory until *after the physical IO is completed*.
I'm pretty sure that btrfs is the same.

IOWs, despite the fact that a buffered write has occurred, no
metadata has changed state in memory, and the extents are still
unwritten in both memory and on disk....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-18 00:40:45

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
>
> On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
>
> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is fine.
>
> Except that if someone is copying a large delay allocated file, it will cause
> the file to immediately snapped to disk, which might not be the greatest
> thing in the world.

Obvious workaround - if the initial fiemap call shows unwritten
extents, redo it with the sync flag set. Though that assumeѕ that
you can trust things like delalloc extents to only cover the range
that valid data exists in. Which, of course, you can't assume,
either. :/

> Christoph is write, SEEK_HOLE and SEEK_DATA are
> a much better API for what cp woulld lke to do. Unfortunately it hasn't
> been implemented yet in the VFS...

Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-18 02:45:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
> On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
>>
>> On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
>>> In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is fine.
>>
>> Except that if someone is copying a large delay allocated file, it will cause
>> the file to immediately snapped to disk, which might not be the greatest
>> thing in the world.
>
> Obvious workaround - if the initial fiemap call shows unwritten
> extents, redo it with the sync flag set. Though that assumeѕ that
> you can trust things like delalloc extents to only cover the range
> that valid data exists in. Which, of course, you can't assume,
> either. :/

Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do anything if there is unwritten data, which is the only case we are concerned with at this point. In any case, this is a simple solution for coreutils until such a time that a more complex solution is added in the kernel (if ever).

>> Christoph is write, SEEK_HOLE and SEEK_DATA are
>> a much better API for what cp woulld lke to do. Unfortunately it hasn't
>> been implemented yet in the VFS...
>
> Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.

I don't see how this will change the problem in any meaningful way. There will still need to be code that is traversing the on-disk mapping, and also keeping it coherent with unwritten data in the page cache.

Since FIEMAP already exists for most Linux filesystems, it probably makes sense to implement SEEK_{HOLE,DATA} by calling FIEMAP to get the disk mapping in the first place.

I agree that SEEK_{HOLE,DATA} is an easier programming interface, and probably what cp, tar, etc should use, once it is implemented.

Cheers, Andreas

2011-04-19 01:58:22

by Yongqiang Yang

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <[email protected]> wrote:
> On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
>
> On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
>
> On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
>
> In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
> fine.
>
> Except that if someone is copying a large delay allocated file, it will
> cause
>
> the file to immediately snapped to disk, which might not be the greatest
>
> thing in the world.
>
> Obvious workaround - if the initial fiemap call shows unwritten
> extents, redo it with the sync flag set. Though that assumeѕ that
> you can trust things like delalloc extents to only cover the range
> that valid data exists in. Which, of course, you can't assume,
> either. :/
>
> Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
> anything if there is unwritten data, which is the only case we are concerned
> with at this point.  In any case, this is a simple solution for coreutils
> until such a time that a more complex solution is added in the kernel (if
> ever).
>
> Christoph is write, SEEK_HOLE and SEEK_DATA are
>
> a much better API for what cp woulld lke to do.  Unfortunately it hasn't
>
> been implemented yet in the VFS...
>
> Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
>
> I don't see how this will change the problem in any meaningful way. There
> will still need to be code that is traversing the on-disk mapping, and also
> keeping it coherent with unwritten data in the page cache.

It seems that we are being messed up by page cache and disk.
Unwritten flag returned from FIEMAP indicates blocks on disk are not
written, but it does not say if there is data in page cache. So
FIEMAP itself just tells user the map on disk. However there is an
exception for delayed allocation, FIEMAP tells users the data is in
page cache.

Maybe FIEMAP should return all known messages for unwritten extent, if
unwritten data exists in page cache, FIEMAP should let users know that
data is in page cache and space on disk has been preallocated, but
data has not been flushed into disk. Actually, delayed allocation has
done like this. Then user-space applications can determine how to do.
Taking cp as an example, it will copy from page cache rather ignore
it.


We need a definite definition for FIEMAP, in other words, it tells
users map on disk or both disk and page cache.

If the former one is taken, then FIEMAP should not consider delayed allocation.
otherwise, FIEMAP should return all known messages for unwritten case
like delayed allocation.

> Since FIEMAP already exists for most Linux filesystems, it probably makes
> sense to implement SEEK_{HOLE,DATA} by calling FIEMAP to get the disk
> mapping in the first place.
> I agree that SEEK_{HOLE,DATA} is an easier programming interface, and
> probably what cp, tar, etc should use, once it is implemented.
> Cheers, Andreas
> _______________________________________________
> xfs mailing list
> [email protected]
> http://oss.sgi.com/mailman/listinfo/xfs
>
>



--
Best Wishes
Yongqiang Yang

2011-04-19 03:45:00

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <[email protected]> wrote:
> > On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
> >
> > On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
> >
> > On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
> >
> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
> > fine.
> >
> > Except that if someone is copying a large delay allocated file, it will
> > cause
> >
> > the file to immediately snapped to disk, which might not be the greatest
> >
> > thing in the world.
> >
> > Obvious workaround - if the initial fiemap call shows unwritten
> > extents, redo it with the sync flag set. Though that assumeѕ that
> > you can trust things like delalloc extents to only cover the range
> > that valid data exists in. Which, of course, you can't assume,
> > either. :/
> >
> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
> > anything if there is unwritten data, which is the only case we are concerned
> > with at this point.  In any case, this is a simple solution for coreutils
> > until such a time that a more complex solution is added in the kernel (if
> > ever).
> >
> > Christoph is write, SEEK_HOLE and SEEK_DATA are
> >
> > a much better API for what cp woulld lke to do.  Unfortunately it hasn't
> >
> > been implemented yet in the VFS...
> >
> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
> >
> > I don't see how this will change the problem in any meaningful way. There
> > will still need to be code that is traversing the on-disk mapping, and also
> > keeping it coherent with unwritten data in the page cache.
>
> It seems that we are being messed up by page cache and disk.
> Unwritten flag returned from FIEMAP indicates blocks on disk are not
> written, but it does not say if there is data in page cache. So
> FIEMAP itself just tells user the map on disk. However there is an
> exception for delayed allocation, FIEMAP tells users the data is in
> page cache.

No, FIEMAP does not tell the user there is data in the page cache.
It tells there user there is a delayed allocation extent. For XFS, a
delayed allocation extent can cover a range _greater_ than there is
data in the page cache - we do allocation allignment, speculative
allocation and other tricks to avoid fragmentation via
delayed allocation. When XFSs says there is a delalloc extent, it is
simply showing the in-memory representation of the extent. if you
want to know where the data in the page cache actually is, you need
to sync the file to disk to get those ranges converted to real
extents. This is how xfs_bmap has worked for 15 years....

> Maybe FIEMAP should return all known messages for unwritten extent, if
> unwritten data exists in page cache, FIEMAP should let users know that
> data is in page cache and space on disk has been preallocated, but
> data has not been flushed into disk. Actually, delayed allocation has
> done like this. Then user-space applications can determine how to do.
> Taking cp as an example, it will copy from page cache rather ignore
> it.

Once again, FIEMAP is for showing the filesystem's current extent
state, not the page cache state. Ext4 may implement FIEMAP by doing
page cache walks, but that is a filesystem specific implementation
detail.

> We need a definite definition for FIEMAP, in other words, it tells
> users map on disk or both disk and page cache.

We already have a definition - and it has nothing to do with the
page cache state.

> If the former one is taken, then FIEMAP should not consider
> delayed allocation.

Not at all. the delayed allocation extent is a first class extent
type in XFS and it is reported directly from the extent list. Your
viewpoint is very ext4-specific and ignores the fact that other
filesystems were doing this sort of mapping long before even ext3
(let alone ext4) was a glint in the designer's eye....

> otherwise, FIEMAP should return all known messages for unwritten case
> like delayed allocation.

See my previous comments about extents being unwritten until data is
physically written to them.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-19 06:53:21

by Yongqiang Yang

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 11:44 AM, Dave Chinner <[email protected]> wrote:
> On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
>> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <[email protected]> wrote:
>> > On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
>> >
>> > On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
>> >
>> > On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
>> >
>> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
>> > fine.
>> >
>> > Except that if someone is copying a large delay allocated file, it will
>> > cause
>> >
>> > the file to immediately snapped to disk, which might not be the greatest
>> >
>> > thing in the world.
>> >
>> > Obvious workaround - if the initial fiemap call shows unwritten
>> > extents, redo it with the sync flag set. Though that assumeѕ that
>> > you can trust things like delalloc extents to only cover the range
>> > that valid data exists in. Which, of course, you can't assume,
>> > either. :/
>> >
>> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
>> > anything if there is unwritten data, which is the only case we are concerned
>> > with at this point.  In any case, this is a simple solution for coreutils
>> > until such a time that a more complex solution is added in the kernel (if
>> > ever).
>> >
>> > Christoph is write, SEEK_HOLE and SEEK_DATA are
>> >
>> > a much better API for what cp woulld lke to do.  Unfortunately it hasn't
>> >
>> > been implemented yet in the VFS...
>> >
>> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
>> >
>> > I don't see how this will change the problem in any meaningful way. There
>> > will still need to be code that is traversing the on-disk mapping, and also
>> > keeping it coherent with unwritten data in the page cache.
>>
>> It seems that we are being messed up by page cache and disk.
>> Unwritten flag returned from FIEMAP indicates blocks on disk are not
>> written, but it does not say if there is data in page cache.  So
>> FIEMAP itself just tells user the map on disk.  However there is an
>> exception for delayed allocation,  FIEMAP tells users the data is in
>> page cache.
>
> No, FIEMAP does not tell the user there is data in the page cache.
> It tells there user there is a delayed allocation extent. For XFS, a
> delayed allocation extent can cover a range _greater_ than there is
> data in the page cache - we do allocation allignment, speculative
> allocation and other tricks to avoid fragmentation via
> delayed allocation. When XFSs says there is a delalloc extent, it is
> simply showing the in-memory representation of the extent. if you
> want to know where the data in the page cache actually is, you need
> to sync the file to disk to get those ranges converted to real
> extents. This is how xfs_bmap has worked for 15 years....
>
>> Maybe FIEMAP should return all known messages for unwritten extent, if
>> unwritten data exists in page cache, FIEMAP should let users know that
>> data is in page cache and space on disk has been preallocated, but
>> data has not been flushed into disk.  Actually, delayed allocation has
>> done like this. Then user-space applications can determine how to do.
>> Taking cp as an example, it will copy from page cache rather ignore
>> it.
>
> Once again, FIEMAP is for showing the filesystem's current extent
> state, not the page cache state. Ext4 may implement FIEMAP by doing
> page cache walks, but that is a filesystem specific implementation
> detail.
>
>> We need a definite definition for FIEMAP, in other words, it tells
>> users map on disk or both disk and page cache.
>
> We already have a definition - and it has nothing to do with the
> page cache state.
>
>> If the former one is taken, then FIEMAP should not consider
>> delayed allocation.
>
> Not at all. the delayed allocation extent is a first class extent
> type in XFS and it is reported directly from the extent list. Your
> viewpoint is very ext4-specific and ignores the fact that other
> filesystems were doing this sort of mapping long before even ext3
> (let alone ext4) was a glint in the designer's eye....
>
>> otherwise, FIEMAP should return all known messages for unwritten case
>> like delayed allocation.
>
> See my previous comments about extents being unwritten until data is
> physically written to them.
Understood, thank you for your explanation.

Ok. Let's look at it from a higher view. What you described about
extent state is specific to xfs.

I think there are 2 ways to provide a definite definition for FIEMAP
for all filesystems:

1. FIEMAP returns extent state on disk.
2. FIEMAP returns extent both in memory and on disk.

Now, the question comes in case 2. How to define extent's state in
memory? Every filesystem has its own implementation regarding extent
in memory, especially for delayed and unwritten extents, and I think
they are all reasonable. For example, ext4 without delayed allocation
change unwritten extents to written ones immediately in memory, while
the changing is delayed until flush time in delayed allocation case.

It seems that there are only 1 way to provide a definite definition of
in-memory-extent - FIEMAP should return what an user has written.
this works for all filesystems.

Yongqiang.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>



--
Best Wishes
Yongqiang Yang

2011-04-19 07:45:51

by Dave Chinner

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 02:53:20PM +0800, Yongqiang Yang wrote:
> On Tue, Apr 19, 2011 at 11:44 AM, Dave Chinner <[email protected]> wrote:
> > On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
> >> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <[email protected]> wrote:
> >> > On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
> >> >
> >> > On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
> >> >
> >> > On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
> >> >
> >> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
> >> > fine.
> >> >
> >> > Except that if someone is copying a large delay allocated file, it will
> >> > cause
> >> >
> >> > the file to immediately snapped to disk, which might not be the greatest
> >> >
> >> > thing in the world.
> >> >
> >> > Obvious workaround - if the initial fiemap call shows unwritten
> >> > extents, redo it with the sync flag set. Though that assumeѕ that
> >> > you can trust things like delalloc extents to only cover the range
> >> > that valid data exists in. Which, of course, you can't assume,
> >> > either. :/
> >> >
> >> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
> >> > anything if there is unwritten data, which is the only case we are concerned
> >> > with at this point.  In any case, this is a simple solution for coreutils
> >> > until such a time that a more complex solution is added in the kernel (if
> >> > ever).
> >> >
> >> > Christoph is write, SEEK_HOLE and SEEK_DATA are
> >> >
> >> > a much better API for what cp woulld lke to do.  Unfortunately it hasn't
> >> >
> >> > been implemented yet in the VFS...
> >> >
> >> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
> >> >
> >> > I don't see how this will change the problem in any meaningful way. There
> >> > will still need to be code that is traversing the on-disk mapping, and also
> >> > keeping it coherent with unwritten data in the page cache.
> >>
> >> It seems that we are being messed up by page cache and disk.
> >> Unwritten flag returned from FIEMAP indicates blocks on disk are not
> >> written, but it does not say if there is data in page cache.  So
> >> FIEMAP itself just tells user the map on disk.  However there is an
> >> exception for delayed allocation,  FIEMAP tells users the data is in
> >> page cache.
> >
> > No, FIEMAP does not tell the user there is data in the page cache.
> > It tells there user there is a delayed allocation extent. For XFS, a
> > delayed allocation extent can cover a range _greater_ than there is
> > data in the page cache - we do allocation allignment, speculative
> > allocation and other tricks to avoid fragmentation via
> > delayed allocation. When XFSs says there is a delalloc extent, it is
> > simply showing the in-memory representation of the extent. if you
> > want to know where the data in the page cache actually is, you need
> > to sync the file to disk to get those ranges converted to real
> > extents. This is how xfs_bmap has worked for 15 years....
> >
> >> Maybe FIEMAP should return all known messages for unwritten extent, if
> >> unwritten data exists in page cache, FIEMAP should let users know that
> >> data is in page cache and space on disk has been preallocated, but
> >> data has not been flushed into disk.  Actually, delayed allocation has
> >> done like this. Then user-space applications can determine how to do.
> >> Taking cp as an example, it will copy from page cache rather ignore
> >> it.
> >
> > Once again, FIEMAP is for showing the filesystem's current extent
> > state, not the page cache state. Ext4 may implement FIEMAP by doing
> > page cache walks, but that is a filesystem specific implementation
> > detail.
> >
> >> We need a definite definition for FIEMAP, in other words, it tells
> >> users map on disk or both disk and page cache.
> >
> > We already have a definition - and it has nothing to do with the
> > page cache state.
> >
> >> If the former one is taken, then FIEMAP should not consider
> >> delayed allocation.
> >
> > Not at all. the delayed allocation extent is a first class extent
> > type in XFS and it is reported directly from the extent list. Your
> > viewpoint is very ext4-specific and ignores the fact that other
> > filesystems were doing this sort of mapping long before even ext3
> > (let alone ext4) was a glint in the designer's eye....
> >
> >> otherwise, FIEMAP should return all known messages for unwritten case
> >> like delayed allocation.
> >
> > See my previous comments about extents being unwritten until data is
> > physically written to them.
> Understood, thank you for your explanation.
>
> Ok. Let's look at it from a higher view. What you described about
> extent state is specific to xfs.
>
> I think there are 2 ways to provide a definite definition for FIEMAP
> for all filesystems:
>
> 1. FIEMAP returns extent state on disk.
> 2. FIEMAP returns extent both in memory and on disk.

You are *not listening*. There is no #2. FIEMAP returns the extent
state _on disk_ at the time of the call. If you want it to reflect
the in-memory state at the time of the call (for data or metadata),
you *must* use the the SYNC flag to convert that in-memory state to
on-disk state, which FIEMAP then reports just fine.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-19 08:11:37

by Yongqiang Yang

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 3:45 PM, Dave Chinner <[email protected]> wrote:
> On Tue, Apr 19, 2011 at 02:53:20PM +0800, Yongqiang Yang wrote:
>> On Tue, Apr 19, 2011 at 11:44 AM, Dave Chinner <[email protected]> wrote:
>> > On Tue, Apr 19, 2011 at 09:58:15AM +0800, Yongqiang Yang wrote:
>> >> On Mon, Apr 18, 2011 at 10:45 AM, Andreas Dilger <[email protected]> wrote:
>> >> > On 2011-04-17, at 6:40 PM, Dave Chinner <[email protected]> wrote:
>> >> >
>> >> > On Sat, Apr 16, 2011 at 08:21:28AM -0400, Theodore Tso wrote:
>> >> >
>> >> > On Apr 16, 2011, at 1:11 AM, Andreas Dilger wrote:
>> >> >
>> >> > In that case, it means cp should just always use FIEMAP_FLAG_SYNC, which is
>> >> > fine.
>> >> >
>> >> > Except that if someone is copying a large delay allocated file, it will
>> >> > cause
>> >> >
>> >> > the file to immediately snapped to disk, which might not be the greatest
>> >> >
>> >> > thing in the world.
>> >> >
>> >> > Obvious workaround - if the initial fiemap call shows unwritten
>> >> > extents, redo it with the sync flag set. Though that assumeѕ that
>> >> > you can trust things like delalloc extents to only cover the range
>> >> > that valid data exists in. Which, of course, you can't assume,
>> >> > either. :/
>> >> >
>> >> > Always passing FIEMAP_FLAG_SYNC is fine in this case. It should only do
>> >> > anything if there is unwritten data, which is the only case we are concerned
>> >> > with at this point.  In any case, this is a simple solution for coreutils
>> >> > until such a time that a more complex solution is added in the kernel (if
>> >> > ever).
>> >> >
>> >> > Christoph is write, SEEK_HOLE and SEEK_DATA are
>> >> >
>> >> > a much better API for what cp woulld lke to do.  Unfortunately it hasn't
>> >> >
>> >> > been implemented yet in the VFS...
>> >> >
>> >> > Agreed, SEEK_HOLE/SEEK_DATA is the right way to solve this problem.
>> >> >
>> >> > I don't see how this will change the problem in any meaningful way. There
>> >> > will still need to be code that is traversing the on-disk mapping, and also
>> >> > keeping it coherent with unwritten data in the page cache.
>> >>
>> >> It seems that we are being messed up by page cache and disk.
>> >> Unwritten flag returned from FIEMAP indicates blocks on disk are not
>> >> written, but it does not say if there is data in page cache.  So
>> >> FIEMAP itself just tells user the map on disk.  However there is an
>> >> exception for delayed allocation,  FIEMAP tells users the data is in
>> >> page cache.
>> >
>> > No, FIEMAP does not tell the user there is data in the page cache.
>> > It tells there user there is a delayed allocation extent. For XFS, a
>> > delayed allocation extent can cover a range _greater_ than there is
>> > data in the page cache - we do allocation allignment, speculative
>> > allocation and other tricks to avoid fragmentation via
>> > delayed allocation. When XFSs says there is a delalloc extent, it is
>> > simply showing the in-memory representation of the extent. if you
>> > want to know where the data in the page cache actually is, you need
>> > to sync the file to disk to get those ranges converted to real
>> > extents. This is how xfs_bmap has worked for 15 years....
>> >
>> >> Maybe FIEMAP should return all known messages for unwritten extent, if
>> >> unwritten data exists in page cache, FIEMAP should let users know that
>> >> data is in page cache and space on disk has been preallocated, but
>> >> data has not been flushed into disk.  Actually, delayed allocation has
>> >> done like this. Then user-space applications can determine how to do.
>> >> Taking cp as an example, it will copy from page cache rather ignore
>> >> it.
>> >
>> > Once again, FIEMAP is for showing the filesystem's current extent
>> > state, not the page cache state. Ext4 may implement FIEMAP by doing
>> > page cache walks, but that is a filesystem specific implementation
>> > detail.
>> >
>> >> We need a definite definition for FIEMAP, in other words, it tells
>> >> users map on disk or both disk and page cache.
>> >
>> > We already have a definition - and it has nothing to do with the
>> > page cache state.
>> >
>> >> If the former one is taken, then FIEMAP should not consider
>> >> delayed allocation.
>> >
>> > Not at all. the delayed allocation extent is a first class extent
>> > type in XFS and it is reported directly from the extent list. Your
>> > viewpoint is very ext4-specific and ignores the fact that other
>> > filesystems were doing this sort of mapping long before even ext3
>> > (let alone ext4) was a glint in the designer's eye....
>> >
>> >> otherwise, FIEMAP should return all known messages for unwritten case
>> >> like delayed allocation.
>> >
>> > See my previous comments about extents being unwritten until data is
>> > physically written to them.
>> Understood, thank you for your explanation.
>>
>> Ok.  Let's look at it from a higher view.  What you described about
>> extent state is specific to xfs.
>>
>> I think there are 2 ways to provide a definite definition for FIEMAP
>> for all filesystems:
>>
>> 1. FIEMAP returns extent state on disk.
>> 2. FIEMAP returns extent both in memory and on disk.
>
> You are *not listening*. There is no #2. FIEMAP returns the extent
> state _on disk_ at the time of the call. If you want it to reflect
> the in-memory state at the time of the call (for data or metadata),
> you *must* use the the SYNC flag to convert that in-memory state to
> on-disk state, which FIEMAP then reports just fine.

Sorry for being dense.

I think delayed extent is an exception. because it is not on the disk.

Yongqiang.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>



--
Best Wishes
Yongqiang Yang

2011-04-19 14:05:11

by Eric Sandeen

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 4/19/11 3:11 AM, Yongqiang Yang wrote:
> On Tue, Apr 19, 2011 at 3:45 PM, Dave Chinner <[email protected]> wrote:
>> On Tue, Apr 19, 2011 at 02:53:20PM +0800, Yongqiang Yang wrote:


...

>>> I think there are 2 ways to provide a definite definition for FIEMAP
>>> for all filesystems:
>>>
>>> 1. FIEMAP returns extent state on disk.
>>> 2. FIEMAP returns extent both in memory and on disk.
>>
>> You are *not listening*. There is no #2. FIEMAP returns the extent
>> state _on disk_ at the time of the call. If you want it to reflect
>> the in-memory state at the time of the call (for data or metadata),
>> you *must* use the the SYNC flag to convert that in-memory state to
>> on-disk state, which FIEMAP then reports just fine.
>
> Sorry for being dense.
>
> I think delayed extent is an exception. because it is not on the disk.
>
> Yongqiang.

I don't think you're being dense, I think that the interface specification is just messed up.

By including flags for both unwritten and delalloc in the interface, we have hopelessly intertwined on-disk and in-memory state.

If you preallocate 1M and then do a buffered IO to that same range without a sync, and then a fiemap, what on earth should the interface return?

Writing the first part of that testcase is simple but I have no idea what the correct behavior is.

(FIEMAP_EXTENT_UNWRITTEN|FIEMAP_EXTENT_DELALLOC) ? that makes no sense. But no other combination of flags makes sense to me either, unless we get into tortured redefinitions of what "delalloc" means.

And if we can't say what it -should- return.... well, too bad for coreutils. :(

-Eric

2011-04-19 14:09:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 05:45:38PM +1000, Dave Chinner wrote:
> You are *not listening*. There is no #2. FIEMAP returns the extent
> state _on disk_ at the time of the call.

Dave, you're being rather strident about your insistence about what
FIEMAP's semantics are. Part of the problem here is that it's *not*
clear or settled.

If it really is the state _on_ _disk_, does XFS really have a DELALLOC
bit _on_ _disk_?

- Ted

2011-04-19 14:13:06

by Eric Sandeen

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 4/19/11 9:09 AM, Ted Ts'o wrote:
> On Tue, Apr 19, 2011 at 05:45:38PM +1000, Dave Chinner wrote:
>> You are *not listening*. There is no #2. FIEMAP returns the extent
>> state _on disk_ at the time of the call.
>
> Dave, you're being rather strident about your insistence about what
> FIEMAP's semantics are. Part of the problem here is that it's *not*
> clear or settled.
>
> If it really is the state _on_ _disk_, does XFS really have a DELALLOC
> bit _on_ _disk_?
>
> - Ted
>

no of course it doesn't....

But I too am confused about Dave's assertion that it only reflects ondisk state when we have that pesky delalloc flag.

Whose idea was that, anyway? ;)

I'd certainly buy the argument that it -should- only reflect ondisk state, and we should nuke the delalloc flag from orbit, if we could, though.

-Eric

2011-04-19 16:01:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 09:13:19AM -0500, Eric Sandeen wrote:
>
> But I too am confused about Dave's assertion that it only reflects ondisk state when we have that pesky delalloc flag.
>
> Whose idea was that, anyway? ;)
>
> I'd certainly buy the argument that it -should- only reflect ondisk state, and we should nuke the delalloc flag from orbit, if we could, though.

I see three options of how we can clarify FIEMAP's semantics:

1) We define it as only reflecting ondisk state, and nuke the delalloc
flag from orbit.

2) We state that if the file is currently has unflushed pages in the
page cache, and FIEMAP_FLAG_SYNC is not passed, whether or not extents
return the DELALLOC flag or how they handle the UNWRITTEN flag is
undefined.

3) We state that FIEMAP is supposed to return information which
reflects the union of the on-disk and page cache state, with all that
this implies.

All of these are internally consistent definitions --- we need to
chose one, document, and then tell the shellutils folks what they
should do.

In the case of #1 and #2, we really need to implement support for
SEEK_HOLE/SEEK_DATA for userspace programs like cp who want to know
this information.

Do we all agree on the problem statement, at least? If so, then we
can try to come consensus on what is the appropriate solution.

- Ted

2011-04-20 01:53:05

by Yongqiang Yang

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Wed, Apr 20, 2011 at 12:01 AM, Ted Ts'o <[email protected]> wrote:
> On Tue, Apr 19, 2011 at 09:13:19AM -0500, Eric Sandeen wrote:
>>
>> But I too am confused about Dave's assertion that it only reflects ondisk state when we have that pesky delalloc flag.
>>
>> Whose idea was that, anyway? ;)
>>
>> I'd certainly buy the argument that it -should- only reflect ondisk state, and we should nuke the delalloc flag from orbit, if we could, though.
>
> I see three options of how we can clarify FIEMAP's semantics:
>
> 1) We define it as only reflecting ondisk state, and nuke the delalloc
> flag from orbit.
>
> 2) We state that if the file is currently has unflushed pages in the
> page cache, and FIEMAP_FLAG_SYNC is not passed, whether or not extents
> return the DELALLOC flag or how they handle the UNWRITTEN flag is
> undefined.
>
> 3) We state that FIEMAP is supposed to return information which
> reflects the union of the on-disk and page cache state, with all that
> this implies.
>
> All of these are internally consistent definitions --- we need to
> chose one, document, and then tell the shellutils folks what they
> should do.
>
> In the case of #1 and #2, we really need to implement support for
> SEEK_HOLE/SEEK_DATA for userspace programs like cp who want to know
> this information.
>
> Do we all agree on the problem statement, at least? ?If so, then we
> can try to come consensus on what is the appropriate solution.

I agree on the problem statement. Users need to know what FIEMAP
returns definitely.

It seems that Dave is looking at the problem from a different view.

Dave thinks that FIEMAP returns where data exists on disk finally.
Then there are 2 possibilities: unknown and known. delayed extent is
unknown and others are known. Although we know where data in
unwritten extent exist finally on disk, we cannot know whether or not
it is being in memory. However we know data in delayed extent is in
memory. it sounds strange.

>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? - Ted
>



--
Best Wishes
Yongqiang Yang

2011-04-20 15:21:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Tue, Apr 19, 2011 at 12:01:14PM -0400, Ted Ts'o wrote:
> 1) We define it as only reflecting ondisk state, and nuke the delalloc
> flag from orbit.
>
> 2) We state that if the file is currently has unflushed pages in the
> page cache, and FIEMAP_FLAG_SYNC is not passed, whether or not extents
> return the DELALLOC flag or how they handle the UNWRITTEN flag is
> undefined.

That seems like a weird option, as the pagecache state really has
nothing to do at all with the extent layout, and the existence of dirty
pages really has nothing to do with the unwritten flag.

> 3) We state that FIEMAP is supposed to return information which
> reflects the union of the on-disk and page cache state, with all that
> this implies.

How do you want to union the existance of an extent with a state
on disk, with a pending modification to it that is still in-memory
and not flushed out to disk yet? This is looking into an uncertain
future, as the extent map might change in various other ways before
the transaction to conver the unwritten extents goes to disk.

And if we do this it would need to be a new option to FIEMAP, as
it changes the semantics from the existing one that returns the
actual state on disk (plus the magic delalloc bit).

And even if you find semantics that take pending unwrittent extent
conversions into account and still make sense how do you plan to
implement them? For buffered writes into unwritten extents it could
be done by walking the pagecache and buffers after adding a new
flag for an already converted unwritten extent to the buffer head
state. But there's no easy way to do that for direct I/O.

> In the case of #1 and #2, we really need to implement support for
> SEEK_HOLE/SEEK_DATA for userspace programs like cp who want to know
> this information.

We need to do that anyway, as fiemap is a horrible interface for
tools that just want to skip holes.


2011-04-20 17:21:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On Wed, Apr 20, 2011 at 11:21:31AM -0400, Christoph Hellwig wrote:
>
> How do you want to union the existance of an extent with a state
> on disk, with a pending modification to it that is still in-memory
> and not flushed out to disk yet? This is looking into an uncertain
> future, as the extent map might change in various other ways before
> the transaction to conver the unwritten extents goes to disk.

So for example, suppose you have a single unwritten extent on disk,
but there are 3 regions within that extent range's that have unwritten
pages, you return 3 or 4 fiemap_extent structures, reflecting the
state if the unwritten pages were pushed out to disk at the time of
the fiemap ioctl --- but without actually doing the expensive sync
operation. The one case where you can't do that is in the case of
delayed allocation blocks, since you won't know where on disk they
would be going, necessarily --- but hey, conveniently we have a
DELALLOC bit already defined....

> And if we do this it would need to be a new option to FIEMAP, as
> it changes the semantics from the existing one that returns the
> actual state on disk (plus the magic delalloc bit).

Well, we seem to have inconsistent semantics right now, because we
never defined the semantics clearly enough from the beginning. So no
matter which choice we choose, including "the on-disk extent state
only, and nuke the delalloc bit", we will be changing semantics. I'm
not sure we can get around that.

> And even if you find semantics that take pending unwrittent extent
> conversions into account and still make sense how do you plan to
> implement them? For buffered writes into unwritten extents it could
> be done by walking the pagecache and buffers after adding a new
> flag for an already converted unwritten extent to the buffer head
> state. But there's no easy way to do that for direct I/O.

If the file is being actively modified (for example with direct I/O),
there will be inevitably race conditions. If only some of the pending
conversions have been taken into account, that seems like it's
reasonable result. If a file is actively being modified by many DIO
writes, even using FIEMAP_FLAG_SYNC isn't going to help you get a
coherent view of the file, so this seems to be a previously unsolved
problem....

> > In the case of #1 and #2, we really need to implement support for
> > SEEK_HOLE/SEEK_DATA for userspace programs like cp who want to know
> > this information.
>
> We need to do that anyway, as fiemap is a horrible interface for
> tools that just want to skip holes.

I agree that implementing SEEK_HOLE/SEEK_DATA is a good thing
regardless of which choice we end up choosing.

- Ted