2007-04-12 11:05:52

by Andreas Dilger

Subject: [RFC] add FIEMAP ioctl to efficiently map file allocation

I'm interested in getting input for implementing an ioctl to efficiently
map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
times. We already have customers with single files in the 10TB range and
we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.
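
To put a number on "a billion times": FIBMAP maps one logical block per call, so the call count is roughly the file size divided by the block size. The sketch below is only an illustration. FIBMAP and FIGETBSZ are the existing Linux ioctls from <linux/fs.h> (FIBMAP normally needs CAP_SYS_RAWIO), while fibmap_calls_needed() and map_with_fibmap() are hypothetical helper names, and the 4 KiB block size in the note that follows is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIBMAP, FIGETBSZ */

/* Hypothetical helper: number of FIBMAP calls needed to map a whole file. */
uint64_t fibmap_calls_needed(uint64_t file_size, uint64_t block_size)
{
    assert(block_size > 0);
    return file_size / block_size + (file_size % block_size != 0);
}

/* Hypothetical helper: the per-block loop that FIEMAP is meant to replace.
 * Returns 0 on success, -1 if an ioctl fails. */
int map_with_fibmap(int fd, uint64_t file_size)
{
    int bsz;

    if (ioctl(fd, FIGETBSZ, &bsz) < 0 || bsz <= 0)
        return -1;

    uint64_t nblocks = fibmap_calls_needed(file_size, (uint64_t)bsz);
    for (uint64_t i = 0; i < nblocks; i++) {
        /* FIBMAP takes an int: logical block in, physical block out.
         * The int itself limits how large a file can be mapped this way. */
        int blk = (int)i;

        if (ioctl(fd, FIBMAP, &blk) < 0)
            return -1;
        printf("%llu -> %d%s\n", (unsigned long long)i, blk,
               blk == 0 ? " (hole)" : "");
    }
    return 0;
}
```

Under those assumptions a 10 TiB file with 4 KiB blocks needs 2,684,354,560 separate ioctl calls, one per block.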

I had come up with a plan independently and was also steered toward
XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
plan, though I think the XFS structs used there are a bit bloated.

There was also recent discussion about SEEK_HOLE and SEEK_DATA as
implemented by Sun, but even if we could skip the holes we still might
need to do millions of FIBMAPs to see how large files are allocated
on disk. Conversely, having filesystems implement an efficient FIEMAP
ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
and SEEK_DATA instead of looping over ->bmap() inside the kernel,
as I saw in one patch.


struct fibmap_extent {
    __u64 fe_start;     /* starting offset in bytes */
    __u64 fe_len;       /* length in bytes */
};

struct fibmap {
    struct fibmap_extent fm_start;  /* offset, length of desired mapping */
    __u32 fm_extent_count;          /* number of extents in array */
    __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
    __u64 unused;
    struct fibmap_extent fm_extents[0];
};

#define FIEMAP_LEN_MASK 0xff000000000000
#define FIEMAP_LEN_HOLE 0x01000000000000
#define FIEMAP_LEN_UNWRITTEN 0x02000000000000

All offsets are in bytes to allow cases where filesystems are not doing
block-aligned/sized allocations (e.g. tail packing). The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

The ->fm_extents[] array includes all of the holes in addition to
allocated extents because this avoids the need to return both the logical
and physical address for every extent and does not make processing any
harder.

One feature that XFS_IOC_GETBMAPX has that may be desirable is the
ability to return unwritten extent information. In order to do this XFS
required expanding the per-extent struct from 32 to 48 bytes per extent,
but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
and keep 8 bytes or so for input/output flags per extent (would need to
be masked before use).
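
As a rough illustration of the packing described above, here is a minimal sketch that keeps the length in the low 56 bits of a 64-bit value and the flags in the top 8 bits. The EX_* names and the ex_pack()/ex_len()/ex_flags() helpers are hypothetical and are not the defines from this proposal; the 56/8 split is taken from the "2^56 bytes" figure in the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 56 bits = length in bytes, high 8 bits = flags.
 * All EX_ names are hypothetical. */
#define EX_LEN_BITS        56
#define EX_FLAG_MASK       (0xffULL << EX_LEN_BITS)
#define EX_FLAG_HOLE       (0x01ULL << EX_LEN_BITS)
#define EX_FLAG_UNWRITTEN  (0x02ULL << EX_LEN_BITS)

/* Combine a length and flag bits into one 64-bit field. */
uint64_t ex_pack(uint64_t len, uint64_t flags)
{
    assert((len & EX_FLAG_MASK) == 0);      /* length must fit in 56 bits */
    assert((flags & ~EX_FLAG_MASK) == 0);   /* flags must stay in the top byte */
    return len | flags;
}

/* Strip the flag bits to recover the length. */
uint64_t ex_len(uint64_t packed)
{
    return packed & ~EX_FLAG_MASK;
}

/* Strip the length bits to recover the flags. */
uint64_t ex_flags(uint64_t packed)
{
    return packed & EX_FLAG_MASK;
}
```

The cost of this layout is that every consumer has to remember to mask before using the length; forgetting to do so turns a flagged extent into an absurdly long one.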


Caller works something like:

    char buf[4096];
    struct fibmap *fm = (struct fibmap *)buf;
    int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm->fm_extents[0]);
    int fd, rc, i;

    fm->fm_start.fe_start = 0;      /* start of file */
    fm->fm_start.fe_len = -1;       /* end of file */
    fm->fm_extent_count = count;    /* max extents in fm_extents[] array */
    fm->fm_flags = 0;               /* maybe "no DMAPI", etc like XFS */

    fd = open(path, O_RDONLY);
    printf("logical\t\tphysical\t\tbytes\n");

    /* The last call will return fewer extents than the maximum */
    while (fm->fm_extent_count == count) {
        rc = ioctl(fd, FIEMAP, fm);
        if (rc)
            break;

        /* kernel filled in fm_extents[] array, set fm_extent_count
         * to be actual number of extents returned, leaves fm_start
         * alone (unlike XFS_IOC_GETBMAP). */

        for (i = 0; i < fm->fm_extent_count; i++) {
            /* flags live in the high bits of fe_len; mask them off */
            __u64 len = fm->fm_extents[i].fe_len & ~FIEMAP_LEN_MASK;
            __u64 fm_next = fm->fm_start.fe_start + len;
            /* compare against 0 so the high flag bit survives the int */
            int hole = (fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE) != 0;
            int unwr = (fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN) != 0;

            printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                   fm->fm_start.fe_start, fm_next - 1,
                   hole ? 0 : fm->fm_extents[i].fe_start,
                   hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
                   len, hole ? "(hole) " : "",
                   unwr ? "(unwritten) " : "");

            /* get ready for printing next extent, or next ioctl */
            fm->fm_start.fe_start = fm_next;
        }
    }

I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
I'm quite open to suggestions at this point, both in terms of how usable
the fibmap data structures are by the caller, and if we need to add anything
to make them more flexible for the future.

In terms of implementing this in the kernel, there was originally code for
this during the development of the ext3 extent patches and it was done via
a callback in the extent tree iterator so it is very efficient. I believe
it implements all that is needed to allow this interface to be mapped
onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped
filesystems, they can at least improve over the ->bmap() case by skipping
holes in files that cover [dt]indirect blocks (saving thousands of calls).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2007-04-12 11:22:55

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Hi Andreas,

On 12 Apr 2007, at 12:05, Andreas Dilger wrote:

> I'm interested in getting input for implementing an ioctl to
> efficiently
> map file extents & holes (FIEMAP) instead of looping over FIBMAP a
> billion
> times. We already have customers with single files in the 10TB
> range and
> we additionally need to get the mapping over the network so it
> needs to
> be efficient in terms of how data is passed, and how easily it can be
> extracted from the filesystem.
>
> I had come up with a plan independently and was also steered toward
> XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
> plan, though I think the XFS structs used there are a bit bloated.
>
> There was also recent discussion about SEEK_HOLE and SEEK_DATA as
> implemented by Sun, but even if we could skip the holes we still might
> need to do millions of FIBMAPs to see how large files are allocated
> on disk. Conversely, having filesystems implement an efficient FIBMAP
> ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
> and SEEK_DATA instead of doing looping over ->bmap() inside the kernel
> as I saw one patch.
>
>
> struct fibmap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> }
>
> struct fibmap {
> struct fibmap_extent fm_start; /* offset, length of desired
> mapping */
> __u32 fm_extent_count; /* number of extents in array */
> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> __u64 unused;
> struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK 0xff000000000000
> #define FIEMAP_LEN_HOLE 0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000

Sounds good but I would add:

#define FIEMAP_LEN_NO_DIRECT_ACCESS

This would say that the offset on disk can move at any time or that
the data is compressed or encrypted on disk thus the data is not
useful for direct disk access.

On NTFS small files can be inside the inode and there direct access
is not possible because the metadata on disk is protected with fixups
which need to be removed when the inode is read into memory. If you
access the data directly on disk, you would see corrupt data on reads
and cause corruption on writes...

Similarly both for compressed and encrypted files doing direct access
to the on-disk data is totally nonsensical as you would see random
junk on read and cause fatal data corruption on writes.

Also why are you not using 0xff00000000000000, i.e. two more zeroes
at the end? Seems unnecessary to drop an extra 8 bits of
significance from the byte size... May not matter today but it
almost certainly will do in the future (just remember what people
said about the 640k limit in MSDOS when it first came out!)...

Finally please make sure that the file system can return in one way
or another errors for example when it fails to determine the extents
because the system ran out of memory, there was an i/o error,
whatever... It may even be useful to be able to say "here is an
extent of size X bytes but we do not know where it is on disk because
there was an error determining this particular extent's on-disk
location for some reason or other"...

> All offsets are in bytes to allow cases where filesystems are not
> going

Excellent!

> block-aligned/sized allocations (e.g. tail packing). The
> fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

Why the fe_start == 0? Surely just the flag is sufficient... On
NTFS it is perfectly valid to have fe_start == 0 and to have that not
be sparse (normally the $Boot system file is stored in the first 8
sectors of the volume)...

Best regards,

Anton

> The ->fm_extents[] array includes all of the holes in addition to
> allocated extents because this avoids the need to return both the
> logical
> and physical address for every extent and does not make processing any
> harder.
>
> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information. In order to do
> this XFS
> required expanding the per-extent struct from 32 to 48 bytes per
> extent,
> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what
> hardship)
> and keep 8 bytes or so for input/output flags per extent (would
> need to
> be masked before use).
>
>
> Caller works something like:
>
> char buf[4096];
> struct fibmap *fm = (struct fibmap *)buf;
> int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
>
> fm->fm_extent.fe_start = 0; /* start of file */
> fm->fm_extent.fe_len = -1; /* end of file */
> fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
>
> fd = open(path, O_RDONLY);
> printf("logical\t\tphysical\t\tbytes\n");
>
> /* The last entry will have less extents than the maximum */
> while (fm->fm_extent_count == count) {
> rc = ioctl(fd, FIEMAP, fm);
> if (rc)
> break;
>
> /* kernel filled in fm_extents[] array, set fm_extent_count
> * to be actual number of extents returned, leaves fm_start
> * alone (unlike XFS_IOC_GETBMAP). */
>
> for (i = 0; i < fm->fm_extent_count; i++) {
> __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> __u64 fm_next = fm->fm_start + len;
> int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
>
> printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> fm->fm_start, fm_next - 1,
> hole ? 0 : fm->fm_extents[i].fe_start,
> hole ? 0 : fm->fm_extents[i].fe_start +
> fm->fm_extents[i].fe_len - 1,
> len, hole ? "(hole) " : "",
> unwr ? "(unwritten) " : "");
>
> /* get ready for printing next extent, or next ioctl */
> fm->fm_start = fm_next;
> }
> }
>
> I'm not wedded to an ioctl interface, but it seems consistent with
> FIBMAP.
> I'm quite open to suggestions at this point, both in terms of how
> usable
> the fibmap data structures are by the caller, and if we need to add
> anything
> to make them more flexible for the future.
>
> In terms of implementing this in the kernel, there was originally
> code for
> this during the development of the ext3 extent patches and it was
> done via
> a callback in the extent tree iterator so it is very efficient. I
> believe
> it implements all that is needed to allow this interface to be mapped
> onto XFS_IOC_BMAP internally (or vice versa). Even for block-mapped
> filesystems, they can at least improve over the ->bmap() case by
> skipping
> holes in files that cover [dt]indirect blocks (saving thousands of
> calls).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-04-13 01:33:03

by Nicholas Miell

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Thu, 2007-04-12 at 05:05 -0600, Andreas Dilger wrote:
> I'm interested in getting input for implementing an ioctl to efficiently
> map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
> times. We already have customers with single files in the 10TB range and
> we additionally need to get the mapping over the network so it needs to
> be efficient in terms of how data is passed, and how easily it can be
> extracted from the filesystem.
>
> I had come up with a plan independently and was also steered toward
> XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
> plan, though I think the XFS structs used there are a bit bloated.
>
> There was also recent discussion about SEEK_HOLE and SEEK_DATA as
> implemented by Sun, but even if we could skip the holes we still might
> need to do millions of FIBMAPs to see how large files are allocated
> on disk. Conversely, having filesystems implement an efficient FIBMAP
> ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
> and SEEK_DATA instead of doing looping over ->bmap() inside the kernel
> as I saw one patch.
>

I certainly hope not. SEEK_HOLE/SEEK_DATA is a poor interface and
doesn't deserve to spread.

OTOH, this is nicely done.

>
> struct fibmap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> }
>
> struct fibmap {
> struct fibmap_extent fm_start; /* offset, length of desired mapping */
> __u32 fm_extent_count; /* number of extents in array */
> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> __u64 unused;
> struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK 0xff000000000000
> #define FIEMAP_LEN_HOLE 0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>


--
Nicholas Miell <[email protected]>

2007-04-13 04:01:56

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote:
> On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
> >I'm interested in getting input for implementing an ioctl to
> >efficiently map file extents & holes (FIEMAP) instead of looping
> >over FIBMAP a billion times. We already have customers with single
> >files in the 10TB range and we additionally need to get the mapping
> >over the network so it needs to be efficient in terms of how data
> >is passed, and how easily it can be extracted from the filesystem.
> >
> >struct fibmap_extent {
> > __u64 fe_start; /* starting offset in bytes */
> > __u64 fe_len; /* length in bytes */
> >}
> >
> >struct fibmap {
> > struct fibmap_extent fm_start; /* offset, length of desired mapping */
> > __u32 fm_extent_count; /* number of extents in array */
> > __u32 fm_flags; /* flags for input request */
> > __u64 unused;
> > struct fibmap_extent fm_extents[0];
> >}
> >
> >#define FIEMAP_LEN_MASK 0xff000000000000
> >#define FIEMAP_LEN_HOLE 0x01000000000000
> >#define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>
> Sound good but I would add:
>
> #define FIEMAP_LEN_NO_DIRECT_ACCESS
>
> This would say that the offset on disk can move at any time or that
> the data is compressed or encrypted on disk thus the data is not
> useful for direct disk access.

This makes sense. Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

> Also why are you not using 0xff00000000000000, i.e. two more zeroes
> at the end? Seems unnecessary to drop an extra 8 bits of
> significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage). I'd meant
for it to be 2^56 bytes for the file size as I wrote later in the email.
That said, I think that 2^48 bytes is probably sufficient for most uses,
so that we get 16 bits for flags. As it is this email already discusses
5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practically
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with > 2^48 bytes of contiguous
space). Since the API is that it will return the extent that contains the
requested "start" byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent. In reality, I expect any file to have some discontinuities
and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.
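
The 65536 figure can be checked with a small sketch that splits one contiguous run into pieces no longer than a per-extent cap. split_extent() and struct chunk are hypothetical names, and the 2^48-byte cap is the value assumed in the paragraphs above. Since 2^64 itself does not fit in a 64-bit integer, the largest representable length (2^64 - 1) stands in for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output record for one returned piece. */
struct chunk {
    uint64_t start;   /* starting byte of this piece */
    uint64_t len;     /* length of this piece, at most max_len */
};

/* Split [start, start + len) into pieces of at most max_len bytes.
 * Stores up to cap pieces in out and returns how many pieces the
 * whole range needs, which may be more than cap. */
uint64_t split_extent(uint64_t start, uint64_t len, uint64_t max_len,
                      struct chunk *out, uint64_t cap)
{
    uint64_t n = 0;

    assert(max_len > 0);
    while (len > 0) {
        uint64_t piece = len < max_len ? len : max_len;

        if (n < cap) {
            out[n].start = start;
            out[n].len = piece;
        }
        start += piece;
        len -= piece;
        n++;
    }
    return n;
}
```

With a buffer that holds only one extent per call, the return value is also the number of ioctl() calls needed, which is the worst case described above.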

> Finally please make sure that the file system can return in one way
> or another errors for example when it fails to determine the extents
> because the system ran out of memory, there was an i/o error,
> whatever... It may even be useful to be able to say "here is an
> extent of size X bytes but we do not know where it is on disk because
> there was an error determining this particular extent's on-disk
> location for some reason or other"...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem. We
want to return some indication that there is actual file data and not
just a hole, but at the same time we don't want this to actually return
the file from tape just to generate block mappings for it.

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
but this needs to be specified on input to prevent the file being mapped
and I'd rather the opposite (not getting file from tape) be the default,
by principle of least surprise.


> >block-aligned/sized allocations (e.g. tail packing). The
> >fm_extents array
> >returned contains the packed list of allocation extents for the file,
> >including entries for holes (which have fe_start == 0, and a flag).
>
> Why the fe_start == 0? Surely just the flag is sufficient... On
> NTFS it is perfectly valid to have fe_start == 0 and to have that not
> be sparse (normally the $Boot system file is stored in the first 8
> sectors of the volume)...

I thought fe_start = 0 was pretty standard for a hole. It should be
something and I'd rather 0 than anything else. The _HOLE flag is enough
as you say though.

PS - I'd thought about adding you to the CC list for this, because I know
you've had opinions on FIBMAP in the past, but I didn't have
your email handy and it was late, and I know you saw the NTFS kmap
patch on fsdevel so I figured you would see this too...
Thanks for your input.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2007-04-13 07:46:34

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Hi Andreas,

On 13 Apr 2007, at 05:01, Andreas Dilger wrote:
> On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote:
>> On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
>>> I'm interested in getting input for implementing an ioctl to
>>> efficiently map file extents & holes (FIEMAP) instead of looping
>>> over FIBMAP a billion times. We already have customers with single
>>> files in the 10TB range and we additionally need to get the mapping
>>> over the network so it needs to be efficient in terms of how data
>>> is passed, and how easily it can be extracted from the filesystem.
>>>
>>> struct fibmap_extent {
>>> __u64 fe_start; /* starting offset in bytes */
>>> __u64 fe_len; /* length in bytes */
>>> }
>>>
>>> struct fibmap {
>>> struct fibmap_extent fm_start; /* offset, length of desired
>>> mapping */
>>> __u32 fm_extent_count; /* number of extents in array */
>>> __u32 fm_flags; /* flags for input request */
>>> __u64 unused;
>>> struct fibmap_extent fm_extents[0];
>>> }
>>>
>>> #define FIEMAP_LEN_MASK 0xff000000000000
>>> #define FIEMAP_LEN_HOLE 0x01000000000000
>>> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>>
>> Sound good but I would add:
>>
>> #define FIEMAP_LEN_NO_DIRECT_ACCESS
>>
>> This would say that the offset on disk can move at any time or that
>> the data is compressed or encrypted on disk thus the data is not
>> useful for direct disk access.
>
> This makes sense. Even for Reiserfs the same is true with packed
> tails,
> and I believe if FIBMAP is called on a tail it will migrate the
> tail into
> a block because this is might be a sign that the file is a kernel that
> LILO wants to boot.
>
> I'd rather not have any such feature in FIEMAP, and just return the
> on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
> My main reason for FIEMAP is being able to investigate allocation
> patterns
> of files.
>
> By no means is my flag list exhaustive, just the ones that I
> thought would
> be needed to implement this for ext4 and Lustre.

Sure, hence why I made my comment for NTFS. (-: And yes, ReiserFS
and even ext* could use such a flag. I believe there is a compression
patch for ext somewhere isn't there? (Or at least there was one at
some point I think...)

>> Also why are you not using 0xff00000000000000, i.e. two more zeroes
>> at the end? Seems unnecessary to drop an extra 8 bits of
>> significance from the byte size...
>
> It was actually just a typo (this was the first time I'd written the
> structs and flags down, it is just at the discussion stage). I'd
> meant
> for it to be 2^56 bytes for the file size as I wrote later in the
> email.

Ok. (-:

> That said, I think that 2^48 bytes is probably sufficient for most
> uses,
> so that we get 16 bits for flags. As it is this email already
> discusses
> 5 flags, and that would give little room for expansion in the future.
>
> Remember, this is the mapping for a single file (which can't
> practially
> be beyond 2^64 bytes as yet) so it wouldn't be hard for the
> filesystem to
> return a few separate extents which are actually contiguous
> (assuming that
> there will actually be files in filesystems with > 2^48 bytes of
> contiguous
> space). Since the API is that it will return the extent that
> contains the
> requested "start" byte, the kernel will be able to detect this case
> also,
> since it won't be able to specify a length for the extent that
> contains the
> start byte.

Valid point. As long as the "on-disk location" is maintained as full
64 bits then you are right we could just return multiple extents if
the space does not fit. A bit of a kludge but it would certainly
work. An alternative would be to have the flags in a separate field
but that would add 8-bytes to the structure size if you want to
maintain 8-byte alignment so that would not be great...

> At most we'd have to call the ioctl() 65536 times for a completely
> contiguous 2^64 byte file if the buffer was only large enough for a
> single extent. In reality, I expect any file to have some
> discontinuities
> and the buffer to be large enough for a thousand or more entries so
> the
> corner case is not very bad.
>
>> Finally please make sure that the file system can return in one way
>> or another errors for example when it fails to determine the extents
>> because the system ran out of memory, there was an i/o error,
>> whatever... It may even be useful to be able to say "here is an
>> extent of size X bytes but we do not know where it is on disk because
>> there was an error determining this particular extent's on-disk
>> location for some reason or other"...
>
> Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
> FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated
> to tape and currently has no blocks allocated in the filesystem. We
> want to return some indication that there is actual file data and not
> just a hole, but at the same time we don't want this to actually
> return
> the file from tape just to generate block mappings for it.

Yes, NTFS also has off line storage (DFS - the Distributed File
System I think it is called) but we don't support any of that.
Perhaps one day...

> This concept is also present in XFS_IOC_GETBMAPX -
> BMV_IF_NO_DMAPI_READ,
> but this needs to be specified on input to prevent the file being
> mapped
> and I'd rather the opposite (not getting file from tape) be the
> default,
> by principle of least surprise.
>
>>> block-aligned/sized allocations (e.g. tail packing). The
>>> fm_extents array
>>> returned contains the packed list of allocation extents for the
>>> file,
>>> including entries for holes (which have fe_start == 0, and a flag).
>>
>> Why the fe_start == 0? Surely just the flag is sufficient... On
>> NTFS it is perfectly valid to have fe_start == 0 and to have that not
>> be sparse (normally the $Boot system file is stored in the first 8
>> sectors of the volume)...
>
> I thought fe_start = 0 was pretty standard for a hole. It should be
> something and I'd rather 0 than anything else. The _HOLE flag is
> enough
> as you say though.

It is standard on Unix. I am trying to fight this standard because
of NTFS... On NTFS a hole is -1 not 0 and zero is a valid block.
But on NTFS device locations are "s64" not "u64" so the -1 is logical
to use...

As long as it is made clear that people MUST check the flag when
fe_start == 0 rather than assume that fe_start == 0 means a hole I am
happy with that. Hopefully not too many programmers will be lazy
gits who will ignore this and just check fe_start == 0 or they will
fail on NTFS and assume $Boot is sparse when it is not...
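
A minimal sketch of the check being asked for here, assuming the hole flag sits in a high bit of fe_len as in the proposal. The EX_FLAG_HOLE value, struct ex and ex_is_hole() are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_FLAG_HOLE (0x01ULL << 56)   /* hypothetical flag bit in fe_len */

/* Hypothetical stand-in for the per-extent record. */
struct ex {
    uint64_t fe_start;   /* on-disk offset in bytes; 0 can be a valid location */
    uint64_t fe_len;     /* length in bytes plus flag bits */
};

/* Decide "hole or not" from the flag alone, never from fe_start == 0. */
bool ex_is_hole(const struct ex *e)
{
    assert(e != NULL);
    return (e->fe_len & EX_FLAG_HOLE) != 0;
}
```

An extent that starts at disk offset 0 without the flag set, such as $Boot on NTFS, is then reported as allocated data rather than as a hole.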

> PS - I'd thought about adding you to the CC list for this, because
> I know
> you've had opinions on FIBMAP in the past, but I didn't have
> your email handy and it was late, and I know you saw the NTFS
> kmap
> patch on fsdevel so I figured you would see this too...

Thanks. Yes, I try to follow fsdevel closely and LKML not so closely
(I often read it with "select all new, delete")...

> Thanks for your input.

You are welcome.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-04-13 10:15:07

by Christoph Hellwig

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> struct fibmap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> }
>
> struct fibmap {
> struct fibmap_extent fm_start; /* offset, length of desired mapping */
> __u32 fm_extent_count; /* number of extents in array */
> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> __u64 unused;
> struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK 0xff000000000000
> #define FIEMAP_LEN_HOLE 0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>
> All offsets are in bytes to allow cases where filesystems are not going
> block-aligned/sized allocations (e.g. tail packing). The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information. In order to do this XFS
> required expanding the per-extent struct from 32 to 48 bytes per extent,
> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
> and keep 8 bytes or so for input/output flags per extent (would need to
> be masked before use).

I'd be much happier to have the separate per-extent flags value.
For one thing this allows much nicer representations of unwritten
extents or holes without taking away bits from the len value. It also
allows interesting uses in the future, e.g. telling
about an offline extent for use in HSM applications. Also for
this kernel<->user interface the wasted space shouldn't matter too
much - if you want to pass the above condensed structure over the
wire in lustre that shouldn't be a problem, you'd have to convert
to an endian-neutral on the wire format anyway. Not doing the
masking also makes the interface quite a bit simpler to use.
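
For comparison, a per-extent record with a separate flags field might look like the sketch below. This layout is hypothetical and was not posted in the thread; the field names, the EXF_* flag values and the exf_has() helper are all assumptions. It grows the record from 16 to 24 bytes, with a reserved word keeping it 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values. */
#define EXF_HOLE       0x1u
#define EXF_UNWRITTEN  0x2u
#define EXF_OFFLINE    0x4u   /* e.g. data migrated away by an HSM */

/* Hypothetical per-extent record with its own flags word. */
struct ex_flagged {
    uint64_t start;      /* on-disk offset in bytes */
    uint64_t len;        /* full 64-bit length, no bits taken for flags */
    uint32_t flags;      /* EXF_* bits */
    uint32_t reserved;   /* padding to keep 8-byte alignment */
};

/* Test a flag without any masking of the length field. */
int exf_has(const struct ex_flagged *e, uint32_t flag)
{
    assert(e != NULL);
    return (e->flags & flag) != 0;
}
```

The caller reads len directly and tests flags separately, so there is no mask to forget.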

One additional feature from the XFS getbmapx interface we should
provide is the ability to query layout of xattrs. While other
filesystems might not have the exact xattr fork XFS has it fits
nicely into the interface. Especially when we have Anton's suggested
flag for inline data.


2007-04-13 11:38:58

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 13 Apr 2007, at 11:15, Christoph Hellwig wrote:
> On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
>> struct fibmap_extent {
>> __u64 fe_start; /* starting offset in bytes */
>> __u64 fe_len; /* length in bytes */
>> }
>>
>> struct fibmap {
>> struct fibmap_extent fm_start; /* offset, length of desired
>> mapping */
>> __u32 fm_extent_count; /* number of extents in array */
>> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
>> __u64 unused;
>> struct fibmap_extent fm_extents[0];
>> }
>>
>> #define FIEMAP_LEN_MASK 0xff000000000000
>> #define FIEMAP_LEN_HOLE 0x01000000000000
>> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>>
>> All offsets are in bytes to allow cases where filesystems are not
>> going
>> block-aligned/sized allocations (e.g. tail packing). The
>> fm_extents array
>> returned contains the packed list of allocation extents for the file,
>> including entries for holes (which have fe_start == 0, and a flag).
>
>> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
>> ability to return unwritten extent information. In order to do
>> this XFS
>> required expanding the per-extent struct from 32 to 48 bytes per
>> extent,
>> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what
>> hardship)
>> and keep 8 bytes or so for input/output flags per extent (would
>> need to
>> be masked before use).
>
> I'd be much happier to have the separate per-extent flags value.
> For one thing this allows much nicer representations of unwritten
> extents or holes without taking away bits from the len value. It also
> allows to make interesting use of this in the future, e.g. telling
> about an offline exttent for use in HSM applications. Also for
> this kernel<->user interface the wasted space shouldn't matter too
> much - if you want to pass the above condensed structure over the
> wire in lustre that shouldn't a problem, you'd have to convert
> to an endian-neutral on the wire format anyway. Not doing the
> masking also make the interface quite a bit simpler to use.
>
> One addition freature from the XFS getbmapx interface we should
> provide is the ability to query layout of xattrs. While other
> filesystems might not have the exact xattr fork XFS has it fits
> nicely into the interface. Especially when we have Anton's suggested
> flag for inline data.

Would it not be better to allow people to get a file descriptor on
the xattr fork and then just run the normal FIEMAP ioctl on that file
descriptor?

I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR
or whatever... An alternative API would be to provide a
"getxattrfd()/fgetxattrfd()" call or similar that would, instead of
returning the value of an xattr, return an fd to it. Then you do not need to modify
openat() at all... Interface doesn't bother me, just some ideas...

And for XFS you would define a magic streamname or xattrname (or
whatever you want to call it) of say
"com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and
then XFS would intercept that and know what to do with it...

Such an interface could then be used by NTFS named streams and other
file systems providing such things...

(Yes I know I will now totally get flamed about named streams not
being wanted in Linux and crap like that but that is exactly what you
are asking for except you want to special case a particular stream
using a flag instead of calling it for what it really is and once you
start doing that you might as well allow full named streams...)

You can just see named streams as an alternative, non-atomic API to
xattrs if you like, i.e. you can either use the atomic xattr API
provided in Linux already or you can get a file descriptor to an
xattr and then use the normal system calls to access it
non-atomically thus you can use the FIEMAP ioctl also. (-:

FWIW this two-API approach to xattrs/named streams is the direction
OSX is heading towards also so it is not without precedent and
Windows has had both APIs for many years. And Solaris has the
"openat(O_XATTR)" interface so that is not without precedent either.

Best regards,

Anton

PS. to all flamers: I am going to delete any non-technical flames
without replying so please do us all a favour and don't bother...
Thanks.

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-04-13 14:53:50

by Jeff Mahoney

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Andreas Dilger wrote:
> On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote:
>> This would say that the offset on disk can move at any time or that
>> the data is compressed or encrypted on disk thus the data is not
>> useful for direct disk access.
>
> This makes sense. Even for Reiserfs the same is true with packed tails,
> and I believe if FIBMAP is called on a tail it will migrate the tail into
> a block because this might be a sign that the file is a kernel that
> LILO wants to boot.

Actually, reiserfs_aop_bmap() returns 0 when the requested block is in a
tail. There's a separate ioctl for unpacking them.

-Jeff

--
Jeff Mahoney
SUSE Labs

2007-04-13 18:55:50

by Nicholas Miell

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Fri, 2007-04-13 at 12:38 +0100, Anton Altaparmakov wrote:
> > One additional feature from the XFS getbmapx interface we should
> > provide is the ability to query layout of xattrs. While other
> > filesystems might not have the exact xattr fork XFS has it fits
> > nicely into the interface. Especially when we have Anton's suggested
> > flag for inline data.
>
> Would it not be better to allow people to get a file descriptor on
> the xattr fork and then just run the normal FIEMAP ioctl on that file
> descriptor?
>
> I.e. "openat(base file descriptor, O_STREAM, streamname)" or O_XATTR
> or whatever... An alternative API would be to provide a "getxattrfd
> ()/fgetxattrfd()" call or similar that would instead of returning the
> value of an xattr return an fd to it. Then you do not need to modify
> openat() at all... Interface doesn't bother me, just some ideas...
>
> And for XFS you would define a magic streamname or xattrname (or
> whatever you want to call it) of say
> "com.sgi.filesystem.xfs.xattrstream" (or .xattrfork) or something and
> then XFS would intercept that and know what to do with it...
>
> Such an interface could then be used by NTFS named streams and other
> file systems providing such things...
>
> (Yes I know I will now totally get flamed about named streams not
> being wanted in Linux and crap like that but that is exactly what you
> are asking for except you want to special case a particular stream
> using a flag instead of calling it for what it really is and once you
> start doing that you might as well allow full named streams...)
>
> You can just see named streams as an alternative, non-atomic API to
> xattrs if you like, i.e. you can either use the atomic xattr API
> provided in Linux already or you can get a file descriptor to an
> xattr and then use the normal system calls to access it non-
> atomically thus you can use the FIEMAP ioctl also. (-:
>
> FWIW this two-API approach to xattrs/named streams is the direction
> OSX is heading towards also so it is not without precedent and
> Windows has had both APIs for many years. And Solaris has the "openat
> (O_XATTR)" interface so that is not without precedent either.

Except that xattrs in Linux aren't streams, and providing a stream-like
interface to them would be a weird abuse of the xattr concept.

In essence, Linux xattrs are named extensions to struct stat, with
getxattr() being in the same category as stat() and setxattr() being in
the same category as chmod()/chown()/utime()/etc.

The system namespace exists to provide a better interface than ioctl()
to weird FS-specific features (DOS attribute bits, HFS+ creator/type,
ext2/3/reiserfs/etc. immutable/append-only/secure-delete/etc. attributes
and so on). The uptake of this feature isn't as high as I'd like, but
that's what it's there for.

The security namespace is there for all the neat LSM modules that need
to attach metadata to files in order to function.

Finally, the user namespace exists to allow users to attach small bits
of information to their own files, since the API was already there and
hey!, metadata is useful.

Now, Solaris came along and totally confused the issue by using the same
name for a completely different feature, but that isn't any real reason
to mess up the existing Linux xattr concept just to graft named streams
support into the kernel.

(Not that I'm opposed to named streams in Linux, you just have to
realize that xattrs aren't named streams, can't live in the same
namespace as named streams, and certainly don't serve the same purpose
as named streams.)

--
Nicholas Miell <[email protected]>

2007-04-16 08:01:17

by Timothy Shimmin

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Hi Andreas,

--On 12 April 2007 5:05:50 AM -0600 Andreas Dilger <[email protected]> wrote:

> I'm interested in getting input for implementing an ioctl to efficiently
> map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
> times.
...
>
> I had come up with a plan independently and was also steered toward
> XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
> plan, though I think the XFS structs used there are a bit bloated.

They certainly seem to be (combining entries and header).


> struct fibmap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> }
>
> struct fibmap {
> struct fibmap_extent fm_start; /* offset, length of desired mapping */
> __u32 fm_extent_count; /* number of extents in array */
> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> __u64 unused;
> struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK 0xff000000000000
> #define FIEMAP_LEN_HOLE 0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>
> All offsets are in bytes to allow cases where filesystems are not doing
> block-aligned/sized allocations (e.g. tail packing). The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).
>
> The ->fm_extents[] array includes all of the holes in addition to
> allocated extents because this avoids the need to return both the logical
> and physical address for every extent and does not make processing any
> harder.

Well, that's what stood out for me. I was wondering where the "fe_block" field
had gone - the "physical address".
So is your "fe_start; /* starting offset */" actually the disk location
(not a logical file offset), _except_ in the header (fibmap), where it is the
desired logical offset?
Okay, looking at your example use below that's what it looks like.
And when you refer to fm_start below, you mean fm_start.fe_start?
Sorry, I realise this is just an approximation but this part confused me.
So you get rid of all the logical file offsets in the extents because we
report holes explicitly (and we know everything is contiguous if you
include the holes).

--Tim

>
> Caller works something like:
>
> char buf[4096];
> struct fibmap *fm = (struct fibmap *)buf;
> int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
>
> fm->fm_extent.fe_start = 0; /* start of file */
> fm->fm_extent.fe_len = -1; /* end of file */
> fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
>
> fd = open(path, O_RDONLY);
> printf("logical\t\tphysical\t\tbytes\n");
>
> /* The last entry will have less extents than the maximum */
> while (fm->fm_extent_count == count) {
> rc = ioctl(fd, FIEMAP, fm);
> if (rc)
> break;
>
> /* kernel filled in fm_extents[] array, set fm_extent_count
> * to be actual number of extents returned, leaves fm_start
> * alone (unlike XFS_IOC_GETBMAP). */
>
> for (i = 0; i < fm->fm_extent_count; i++) {
> __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> __u64 fm_next = fm->fm_start + len;
> int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
>
> printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> fm->fm_start, fm_next - 1,
> hole ? 0 : fm->fm_extents[i].fe_start,
> hole ? 0 : fm->fm_extents[i].fe_start +
> fm->fm_extents[i].fe_len - 1,
> len, hole ? "(hole) " : "",
> unwr ? "(unwritten) " : "");
>
> /* get ready for printing next extent, or next ioctl */
> fm->fm_start = fm_next;
> }
> }
>


2007-04-16 11:23:07

by David Chinner

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> I'm interested in getting input for implementing an ioctl to efficiently
> map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion
> times. We already have customers with single files in the 10TB range and
> we additionally need to get the mapping over the network so it needs to
> be efficient in terms of how data is passed, and how easily it can be
> extracted from the filesystem.
>
> I had come up with a plan independently and was also steered toward
> XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original
> plan, though I think the XFS structs used there are a bit bloated.

Yeah, they were designed with having a long term stable ABI
that limited expandability. Hence the "future" fields that never
got used ;)

> There was also recent discussion about SEEK_HOLE and SEEK_DATA as
> implemented by Sun, but even if we could skip the holes we still might
> need to do millions of FIBMAPs to see how large files are allocated
> on disk. Conversely, having filesystems implement an efficient FIBMAP
> ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE
> and SEEK_DATA instead of doing looping over ->bmap() inside the kernel
> as I saw one patch.

Yup.

> struct fibmap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> }
>
> struct fibmap {
> struct fibmap_extent fm_start; /* offset, length of desired mapping */
> __u32 fm_extent_count; /* number of extents in array */
> __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> __u64 unused;
> struct fibmap_extent fm_extents[0];
> }
>
> #define FIEMAP_LEN_MASK 0xff000000000000
> #define FIEMAP_LEN_HOLE 0x01000000000000
> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000

I'm not sure I like stealing bits from the length to use as flags -
I'd prefer an explicit field per fibmap_extent for this.

Note that xfs_bmap uses extra information from the filesystem
(geometry) to display extra (and frequently used) information
about the alignment of extents, i.e.:

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET  BLOCK-RANGE           AG  AG-OFFSET           TOTAL  FLAGS
   0: [0..151]:    288444888..288445039   8  (1696536..1696687)    152  00010
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width

This information could be easily passed up in the flags fields if the
filesystem has geometry information (there go 4 more flags ;).

Also - what are the explicit sync semantics of this ioctl? The
XFS ioctl causes a fsync of the file first to convert delalloc
extents to real extents before returning the bmap. Is this functionality
going to be the same? If not, then we need a DELALLOC flag to indicate
extents that haven't been allocated yet. This might be handy to
have, anyway....

> All offsets are in bytes to allow cases where filesystems are not doing
> block-aligned/sized allocations (e.g. tail packing).

So it'll be ok for a few years yet ;)

> The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

Internally in XFS, we pass these around as:

#define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
#define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL)

And the offset passed out through XFS_IOC_GETBMAP[X] is a block
number of -1 for the start of a hole. Hence we don't need a
flag for this. We can expose delalloc extents like this as well
without needing flags...

> The ->fm_extents[] array includes all of the holes in addition to
> allocated extents because this avoids the need to return both the logical
> and physical address for every extent and does not make processing any
> harder.

Doesn't really make it any easier to map to disk, either.

> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information.

You got that with the unwritten flag above.....

> required expanding the per-extent struct from 32 to 48 bytes per extent,

not sure I follow your maths here?

> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
> and keep 8 bytes or so for input/output flags per extent (would need to
             ^^^^^ bits?
> be masked before use).
>
>
> Caller works something like:
>
> char buf[4096];
> struct fibmap *fm = (struct fibmap *)buf;
> int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
>
> fm->fm_extent.fe_start = 0; /* start of file */
> fm->fm_extent.fe_len = -1; /* end of file */
> fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
>
> fd = open(path, O_RDONLY);
> printf("logical\t\tphysical\t\tbytes\n");
>
> /* The last entry will have less extents than the maximum */
> while (fm->fm_extent_count == count) {

fm_extent_count is an in/out parameter?

> rc = ioctl(fd, FIEMAP, fm);
> if (rc)
> break;
>
> /* kernel filled in fm_extents[] array, set fm_extent_count
> * to be actual number of extents returned, leaves fm_start
> * alone (unlike XFS_IOC_GETBMAP). */

Ok, it is.

> for (i = 0; i < fm->fm_extent_count; i++) {
> __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> __u64 fm_next = fm->fm_start + len;
> int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
>
> printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> fm->fm_start, fm_next - 1,
> hole ? 0 : fm->fm_extents[i].fe_start,
> hole ? 0 : fm->fm_extents[i].fe_start +
> fm->fm_extents[i].fe_len - 1,
> len, hole ? "(hole) " : "",
> unwr ? "(unwritten) " : "");
>
> /* get ready for printing next extent, or next ioctl */
> fm->fm_start = fm_next;

Ok, so the only way you can determine where you are in the file
is by adding up the length of each extent. What happens if the file
is changing underneath you e.g. someone punches out a hole
in the file, or truncates and extends it again between ioctl()
calls?

Also, what happens if you ask for an offset/len that doesn't map to
any extent boundaries - are you truncating the extents returned to
the off/len passed in?

xfs_bmap gets around this by finding out how many extents there are in the
file and allocating a buffer that big to hold all the extents so they
are gathered in a single atomic call (think sparse matrix files)....

> I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
> I'm quite open to suggestions at this point, both in terms of how usable
> the fibmap data structures are by the caller, and if we need to add anything
> to make them more flexible for the future.

ioctl is fine by me. perhaps a version number in the structure header
would be handy so we can modify the interface easily in the future
without having to worry about breaking userspace....

> In terms of implementing this in the kernel, there was originally code for
> this during the development of the ext3 extent patches and it was done via
> a callback in the extent tree iterator so it is very efficient. I believe
> it implements all that is needed to allow this interface to be mapped
> onto XFS_IOC_BMAP internally (or vice versa).

I wouldn't map the ioctls - I'd just write another interface to
xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP
interface. Is there any code yet?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-04-18 23:03:54

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Apr 16, 2007 18:01 +1000, Timothy Shimmin wrote:
> --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger <[email protected]>
> wrote:
> >struct fiemap_extent {
> > __u64 fe_start; /* starting offset in bytes */
> > __u64 fe_len; /* length in bytes */
> >}
> >
> >struct fiemap {
> > struct fiemap_extent fm_start; /* offset, length of desired mapping */
> > __u32 fm_extent_count; /* number of extents in array */
> > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > __u64 unused;
> > struct fiemap_extent fm_extents[0];
> >}
> >
> >#define FIEMAP_LEN_MASK 0xff000000000000
> >#define FIEMAP_LEN_HOLE 0x01000000000000
> >#define FIEMAP_LEN_UNWRITTEN 0x02000000000000
> >
> >All offsets are in bytes to allow cases where filesystems are not doing
> >block-aligned/sized allocations (e.g. tail packing). The fm_extents array
> >returned contains the packed list of allocation extents for the file,
> >including entries for holes (which have fe_start == 0, and a flag).
> >
> >The ->fm_extents[] array includes all of the holes in addition to
> >allocated extents because this avoids the need to return both the logical
> >and physical address for every extent and does not make processing any
> >harder.
>
> Well, that's what stood out for me. I was wondering where the "fe_block"
> field had gone - the "physical address".
> So is your "fe_start; /* starting offset */" actually the disk location
> (not a logical file offset)
> _except_ in the header (fiemap) where it is the desired logical offset.

Correct. The fm_start extent in the request contains the logical start offset
and length in bytes of the requested fiemap region. In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or include
the "virtual hole" at the end of the file. I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent. fm_extents[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

> Okay, looking at your example use below that's what it looks like.
> And when you refer to fm_start below, you mean fm_start.fe_start?
> Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

> So you get rid of all the logical file offsets in the extents because we
> report holes explicitly (and we know everything is contiguous if you
> include the holes).

Correct. It saves space in the common case.

> >Caller works something like:
> >
> > char buf[4096];
> > struct fiemap *fm = (struct fiemap *)buf;
> > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
> >
> > fm->fm_start.fe_start = 0; /* start of file */
> > fm->fm_start.fe_len = -1; /* end of file */
> > fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
> >
> > fd = open(path, O_RDONLY);
> > printf("logical\t\tphysical\t\tbytes\n");
> >
> > /* The last entry will have less extents than the maximum */
> > while (fm->fm_extent_count == count) {
> > rc = ioctl(fd, FIEMAP, fm);
> > if (rc)
> > break;
> >
> > /* kernel filled in fm_extents[] array, set fm_extent_count
> > * to be actual number of extents returned, leaves
> > * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */
> >
> > for (i = 0; i < fm->fm_extent_count; i++) {
> > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> > __u64 fm_next = fm->fm_start.fe_start + len;
> > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> >
> > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> > fm->fm_start.fe_start, fm_next - 1,
> > hole ? 0 : fm->fm_extents[i].fe_start,
> > hole ? 0 : fm->fm_extents[i].fe_start +
> > fm->fm_extents[i].fe_len - 1,
> > len, hole ? "(hole) " : "",
> > unwr ? "(unwritten) " : "");
> >
> > /* get ready for printing next extent, or next ioctl */
> > fm->fm_start.fe_start = fm_next;
> > }
> > }
> >

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-04-19 00:21:43

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Apr 16, 2007 21:22 +1000, David Chinner wrote:
> On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> > struct fiemap_extent {
> > __u64 fe_start; /* starting offset in bytes */
> > __u64 fe_len; /* length in bytes */
> > }
> >
> > struct fiemap {
> > struct fiemap_extent fm_start; /* offset, length of desired mapping */
> > __u32 fm_extent_count; /* number of extents in array */
> > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > __u64 unused;
> > struct fiemap_extent fm_extents[0];
> > }
> >
> > #define FIEMAP_LEN_MASK 0xff000000000000
> > #define FIEMAP_LEN_HOLE 0x01000000000000
> > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>
> I'm not sure I like stealing bits from the length to use as flags -
> I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern. I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len; /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun; /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len; /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
#define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
(though they are laid out identically)

- separate flags word for fm_flags:
- FIEMAP_FLAG_SYNC = range should be synced to disk before returning
mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
- FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
(this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
- FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
if there is agreement on whether that is desirable to have or if it is
better to call ioctl(FIEMAP) on an XATTR fd.
- FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up any extra space on all power-of-two buffer
sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
- FIEMAP_EXTENT_HOLE = extent has no space allocation
- FIEMAP_EXTENT_UNWRITTEN = extent has space allocated but contains no data
- FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
(e.g. HSM, delalloc awaiting sync, etc)
- FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno?
- FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
(e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused.
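
To make the revised layout concrete, here is a rough sketch of how a caller
might walk the returned fm_extents[] array once flags live in fe_flags rather
than in the top bits of fe_len. It only covers the walk over an
already-filled array, not the ioctl itself. The struct and flag values are
copied from the proposal above; the helper name fiemap_walk() and the output
format are made up for illustration and nothing here comes from an existing
kernel header.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Local copy of the proposed per-extent layout (assumption: no padding). */
struct fiemap_extent {
	uint64_t fe_start;	/* physical start offset in bytes */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags for this extent */
	uint32_t fe_lun;	/* logical storage device number */
};

#define FIEMAP_EXTENT_HOLE	0x00000001u
#define FIEMAP_EXTENT_UNWRITTEN	0x00000002u
#define FIEMAP_EXTENT_UNKNOWN	0x00000004u

/*
 * Hypothetical helper: print one line per extent. Logical offsets are not
 * stored per extent, so they are rebuilt by summing fe_len from fm_start.
 * Returns the logical offset just past the last extent, which is what the
 * caller would pass as fm_start for the next request.
 */
static uint64_t fiemap_walk(uint64_t fm_start,
			    const struct fiemap_extent *ext, uint32_t count)
{
	uint64_t logical = fm_start;
	uint32_t i;

	for (i = 0; i < count; i++) {
		uint32_t fl = ext[i].fe_flags;

		printf("logical %" PRIu64 " len %" PRIu64,
		       logical, ext[i].fe_len);
		if (fl & FIEMAP_EXTENT_HOLE)
			printf(" (hole)");
		else if (fl & FIEMAP_EXTENT_UNKNOWN)
			printf(" (location unknown)");
		else
			printf(" physical %" PRIu64 " lun %" PRIu32,
			       ext[i].fe_start, ext[i].fe_lun);
		if (fl & FIEMAP_EXTENT_UNWRITTEN)
			printf(" (unwritten)");
		printf("\n");

		logical += ext[i].fe_len;
	}
	return logical;
}
```

Note that fe_len needs no masking here, and fe_start is simply ignored for
holes and unknown-location extents.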


> Note that xfs_bmap uses extra information from the filesystem
> (geometry) to display extra (and frequently used) information
> about the alignment of extents, i.e.:
>
> chook 681% xfs_bmap -vv fred
> fred:
>  EXT: FILE-OFFSET  BLOCK-RANGE           AG  AG-OFFSET           TOTAL  FLAGS
>    0: [0..151]:    288444888..288445039   8  (1696536..1696687)    152  00010
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end on stripe width

Can you clarify the terminology here? What is a "stripe unit" and what is
a "stripe width"? Are there "N * stripe_unit = stripe_width" in e.g. a
RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?

I don't mind adding this, as long as it's clear that some filesystems don't
have this kind of information.

> This information could be easily passed up in the flags fields if the
> filesystem has geometry information (there go 4 more flags ;).

Got lots of flag bits now.

> Also - what are the explicit sync semantics of this ioctl? The
> XFS ioctl causes a fsync of the file first to convert delalloc
> extents to real extents before returning the bmap. Is this functionality
> going to be the same? If not, then we need a DELALLOC flag to indicate
> extents that haven't been allocated yet. This might be handy to
> have, anyway....

Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care,
and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc.

> > The fm_extents array
> > returned contains the packed list of allocation extents for the file,
> > including entries for holes (which have fe_start == 0, and a flag).
>
> Internally in XFS, we pass these around as:
>
> #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
> #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL)

We could do this too, instead of having flags, but many of the proposed
flags are orthogonal so we'd end up needing a lot of separate values here
and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested.

> > required expanding the per-extent struct from 32 to 48 bytes per extent,
>
> not sure I follow your maths here?

That was the case for XFS getbmap vs. getbmapx. For FIEMAP it increases the
extent size from 16 to 24 bytes.
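
For a feel of what the 16 -> 24 byte change costs, the arithmetic can be
sketched as below. The helper names are made up, and the 32-byte header size
is an assumption based on the struct fiemap layout in this mail (8+8+4+4+8
bytes, no padding). With the 4096-byte buffer from the example caller, that
works out to 254 extents per call at 16 bytes each versus 169 at 24 bytes,
i.e. roughly the 50% more ioctls mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole extent records that fit after the header in one buffer. */
static size_t extents_per_buffer(size_t buf_size, size_t hdr_size,
				 size_t extent_size)
{
	assert(extent_size > 0 && buf_size >= hdr_size);
	return (buf_size - hdr_size) / extent_size;
}

/* Number of ioctl calls needed to fetch n_extents, rounding up. */
static size_t calls_needed(size_t n_extents, size_t per_call)
{
	assert(per_call > 0);
	return (n_extents + per_call - 1) / per_call;
}
```

For a file with a million extents this comes to 3938 calls with 16-byte
extents and 5918 calls with 24-byte extents, using the same 4096-byte buffer.
A larger buffer reduces both numbers proportionally.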

> > Caller works something like:
> >
> > char buf[4096];
> > struct fiemap *fm = (struct fiemap *)buf;
> > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
> >
> > fm->fm_start.fe_start = 0; /* start of file */
> > fm->fm_start.fe_len = -1; /* end of file */
> > fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> > fm->fm_flags = 0; /* maybe "no DMAPI", etc like XFS */
> >
> > fd = open(path, O_RDONLY);
> > printf("logical\t\tphysical\t\tbytes\n");
> >
> > /* The last entry will have less extents than the maximum */
> > while (fm->fm_extent_count == count) {
>
> fm_extent_count is an in/out parameter?

Correct.

>
> > rc = ioctl(fd, FIEMAP, fm);
> > if (rc)
> > break;
> >
> > /* kernel filled in fm_extents[] array, set fm_extent_count
> > * to be actual number of extents returned, leaves fm_start
> > * alone (unlike XFS_IOC_GETBMAP). */
>
> Ok, it is.
>
> > for (i = 0; i < fm->fm_extent_count; i++) {
> > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> > __u64 fm_next = fm->fm_start.fe_start + len;
> > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> >
> > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> > fm->fm_start.fe_start, fm_next - 1,
> > hole ? 0 : fm->fm_extents[i].fe_start,
> > hole ? 0 : fm->fm_extents[i].fe_start +
> > fm->fm_extents[i].fe_len - 1,
> > len, hole ? "(hole) " : "",
> > unwr ? "(unwritten) " : "");
> >
> > /* get ready for printing next extent, or next ioctl */
> > fm->fm_start.fe_start = fm_next;
>
> Ok, so the only way you can determine where you are in the file
> is by adding up the length of each extent. What happens if the file
> is changing underneath you e.g. someone punches out a hole
> in the file, or truncates and extends it again between ioctl()
> calls?

Well, that is always true with data once it is out of the caller.

> Also, what happens if you ask for an offset/len that doesn't map to
> any extent boundaries - are you truncating the extents returned to
> the off/len passed in?

The request offset will be returned as the start of the actual extent that
it falls inside. And the returned extents will end with the extent that
ends at or after the requested fm_start + fm_len.
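
The rounding described here can be sketched as follows. This is only an
illustration of the intended semantics, not kernel code: it assumes the
file's extents (holes included) are given as a packed array of lengths
starting at logical offset 0, and the names struct fiemap_range and
fiemap_round_request() are invented for the example. The returned range
stops at EOF rather than adding a virtual hole.

```c
#include <stddef.h>
#include <stdint.h>

struct fiemap_range {
	uint64_t start;	/* logical start of first returned extent */
	uint64_t len;	/* logical length of all returned extents */
};

/*
 * Hypothetical helper: expand [req_start, req_start + req_len) outward to
 * extent boundaries. lens[] holds the lengths of contiguous extents that
 * cover the file from offset 0. Returns {0, 0} if req_start is past EOF.
 */
static struct fiemap_range fiemap_round_request(const uint64_t *lens, size_t n,
						uint64_t req_start,
						uint64_t req_len)
{
	struct fiemap_range out = { 0, 0 };
	uint64_t pos = 0;
	uint64_t req_end;
	int started = 0;
	size_t i;

	/* req_len of -1 means "to end of file"; avoid overflow */
	if (req_len > UINT64_MAX - req_start)
		req_end = UINT64_MAX;
	else
		req_end = req_start + req_len;

	for (i = 0; i < n; i++) {
		uint64_t end = pos + lens[i];

		if (!started && req_start < end) {
			out.start = pos;	/* round down to extent start */
			started = 1;
		}
		if (started) {
			out.len = end - out.start;
			if (end >= req_end)	/* extent ends at/after request */
				break;
		}
		pos = end;
	}
	return out;
}
```

For example, with extents of 4096, 8192 and 4096 bytes, a request for 1000
bytes at offset 5000 would come back as start 4096, length 8192.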

> xfs_bmap gets around this by finding out how many extents there are in the
> file and allocating a buffer that big to hold all the extents so they
> are gathered in a single atomic call (think sparse matrix files)....

Yeah, except this might be persistent for a long time if it isn't fully
read with a single ioctl and the app never continues reading but doesn't
close the fd.

> > I'm not wedded to an ioctl interface, but it seems consistent with FIBMAP.
> > I'm quite open to suggestions at this point, both in terms of how usable
> > the fiemap data structures are by the caller, and if we need to add anything
> > to make them more flexible for the future.
>
> ioctl is fine by me. perhaps a version number in the structure header
> would be handy so we can modify the interface easily in the future
> without having to worry about breaking userspace....

Yeah, but premature optimization and such. Would rather have INCOMPAT
flags instead of version numbers.
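
Roughly, the INCOMPAT check on the kernel side could look like the sketch
below. The flag values are the ones proposed earlier in this mail; the set
of "known" flags and the function name fiemap_check_flags() are assumptions
for the example. Unknown bits outside the INCOMPAT mask are silently ignored,
while unknown bits inside it fail the request, so a caller asking for a
future must-understand flag on an old kernel gets an error instead of a
mapping it would misinterpret.

```c
#include <errno.h>
#include <stdint.h>

#define FIEMAP_FLAG_SYNC	0x00000001u	/* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ	0x00000002u	/* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT	0xff000000u	/* must understand these flags */

/* Flags this hypothetical implementation understands. */
#define FIEMAP_FLAGS_KNOWN	(FIEMAP_FLAG_SYNC | FIEMAP_FLAG_HSM_READ)

/* Returns 0 if the request flags are acceptable, -EOPNOTSUPP otherwise. */
static int fiemap_check_flags(uint32_t fm_flags)
{
	uint32_t unknown = fm_flags & ~FIEMAP_FLAGS_KNOWN;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EOPNOTSUPP;
	return 0;
}
```

New must-understand features would then take bits from the INCOMPAT range,
and purely advisory ones from the low bits, without a version bump.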

> > In terms of implementing this in the kernel, there was originally code for
> > this during the development of the ext3 extent patches and it was done via
> > a callback in the extent tree iterator so it is very efficient. I believe
> > it implements all that is needed to allow this interface to be mapped
> > onto XFS_IOC_BMAP internally (or vice versa).
>
> I wouldn't map the ioctls - I'd just write another interface to
> xfs_getbmap(). That way we could eventually get rid of the XFS_IOC_BMAP
> interface. is there any code yet?

Up to you, I was just suggesting "mapping" in the generic sense. The
flags and values would all have to be changed anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-04-19 01:54:38

by David Chinner

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Wed, Apr 18, 2007 at 06:21:39PM -0600, Andreas Dilger wrote:
> On Apr 16, 2007 21:22 +1000, David Chinner wrote:
> > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> > > struct fiemap_extent {
> > > __u64 fe_start; /* starting offset in bytes */
> > > __u64 fe_len; /* length in bytes */
> > > }
> > >
> > > struct fiemap {
> > > struct fiemap_extent fm_start; /* offset, length of desired mapping */
> > > __u32 fm_extent_count; /* number of extents in array */
> > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > > __u64 unused;
> > > struct fiemap_extent fm_extents[0];
> > > }
> > >
> > > #define FIEMAP_LEN_MASK 0xff000000000000
> > > #define FIEMAP_LEN_HOLE 0x01000000000000
> > > #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
> >
> > I'm not sure I like stealing bits from the length to use as flags -
> > I'd prefer an explicit field per fiemap_extent for this.
>
> Christoph expressed the same concern. I'm not dead set against having an
> extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
> mean the need for 50% more ioctls if the file is large.

I don't think this overhead is a huge problem - just pass in a
larger buffer (e.g. xfs_bmap can ask for thousands of extents in
a single ioctl call as we can extract the number of extents in
an inode via XFS_IOC_FSGETXATTRA).
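To put numbers on the buffer-size argument, here is a minimal sketch that sizes a request buffer for the 24-byte extent layout aggregated in this thread. The structs mirror that proposal (with uint64_t/uint32_t standing in for __u64/__u32 in userspace); the helpers fiemap_buf_size() and fiemap_extents_that_fit() are hypothetical names, not part of any proposed header.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as proposed in this thread; not a final kernel ABI. */
struct fiemap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes */
	uint32_t fe_flags;	/* FIEMAP_EXTENT_* flags */
	uint32_t fe_lun;	/* logical storage device number */
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_len;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint64_t fm_unused;
	struct fiemap_extent fm_extents[];
};

/* Hypothetical helper: bytes needed for a request holding nr extents. */
static size_t fiemap_buf_size(uint32_t nr)
{
	return sizeof(struct fiemap) + (size_t)nr * sizeof(struct fiemap_extent);
}

/* Hypothetical helper: how many extents fit in a buffer of buf_size bytes. */
static uint32_t fiemap_extents_that_fit(size_t buf_size)
{
	if (buf_size < sizeof(struct fiemap))
		return 0;
	return (uint32_t)((buf_size - sizeof(struct fiemap)) /
			  sizeof(struct fiemap_extent));
}
```

On a typical LP64 system the header is 32 bytes and each extent 24 bytes, so a caller that already knows the extent count can size one buffer for the whole file instead of looping.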

> Below is an aggregation of the comments in this thread:
>
> struct fiemap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
> __u32 fe_lun; /* logical storage device number in array */
> }

Oh, I missed the bit about the fe_lun - I was thinking something like
that might be useful in future....

> struct fiemap {
> __u64 fm_start; /* logical start offset of mapping (in/out) */
> __u64 fm_len; /* logical length of mapping (in/out) */
> __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
> __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
> __u64 fm_unused;
> struct fiemap_extent fm_extents[0];
> }
>
> /* flags for the fiemap request */
> #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/

No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

> /* flags for the returned extents */
> #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */
> #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */
> #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */
> #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */
> #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */

So, there's a HSM_READ flag above. If we are going to make this interface
useful for filesystems that have HSMs interacting with their extents, the
HSM needs to be able to query whether the extent is online (on disk),
has been migrated offline (on tape) or in dual-state (i.e. both online and
offline).

> SUMMARY OF CHANGES
> ==================
> - use fm_* fields directly in request instead of making it a fiemap_extent
> (though they are layed out identically)
>
> - separate flags word for fm_flags:
> - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
> mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
> - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
> (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
> - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
> if there is agreement on whether that is desirable to have or if it is
> better to call ioctl(FIEMAP) on an XATTR fd.
> - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
> must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
> don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it
>
> - __u64 fm_unused does not take up an extra space on all power-of-two buffer
> sizes (would otherwise be at end of buffer), and may be handy in the future.
>
> - add separate fe_flags word with flags from various suggestions:
> - FIEMAP_EXTENT_HOLE = extent has no space allocation
> - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
> - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
> (e.g. HSM, delalloc awaiting sync, etc)

I'd like an explicit delalloc flag, not lumping it in with "unknown".
We *know* the extent is delalloc ;)

> - FIEMAP_EXTENT_ERROR = error mapping extent. Should fe_lun == errno?
> - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
> encrypted, compressed, etc), may want separate flags for these?
>
> - add new fe_lun word per extent for filesystems that manage multiple devices
> (e.g. OCFS, GFS, ZFS, Lustre). This would otherwise have been unused.
>
>
> > Given that xfs_bmap uses extra information from the filesystem
> > (geometry) to display extra (and frequently used) information
> > about the alignment of extents. ie:
> >
> > chook 681% xfs_bmap -vv fred
> > fred:
> > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010
> > FLAG Values:
> > 010000 Unwritten preallocated extent
> > 001000 Doesn't begin on stripe unit
> > 000100 Doesn't end on stripe unit
> > 000010 Doesn't begin on stripe width
> > 000001 Doesn't end on stripe width
>
> Can you clarify the terminology here? What is a "stripe unit" and what is
> a "stripe width"?

Stripe unit is the equivalent of the chunk size in an MD RAID. It's the amount
of data that is written to each lun in a stripe before moving onto the
next stripe element.

> Are there "N * stripe_unit = stripe_width" in e.g. a
> RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?

Yes, on simple configurations. In more complex HW RAID
configurations, we'll typically set the stripe unit to the width of
the RAID5 lun (N * segment size) and the stripe width to the number
of luns we've striped across.
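For illustration only, the four alignment flags in the xfs_bmap output above can be derived from an extent plus the stripe unit and stripe width, all expressed in the same units. This is a sketch under assumptions: the ALIGN_* names and bit values are made up here, and an extent is treated as "ending on" a boundary when its exclusive end offset is a multiple of that boundary.

```c
#include <stdint.h>

/* Hypothetical flag names and values, chosen only for this sketch. */
#define ALIGN_NOT_BEGIN_SUNIT	0x8
#define ALIGN_NOT_END_SUNIT	0x4
#define ALIGN_NOT_BEGIN_SWIDTH	0x2
#define ALIGN_NOT_END_SWIDTH	0x1

/*
 * Report which stripe boundaries an extent misses.  start, len, sunit
 * and swidth must all use the same unit (e.g. filesystem blocks).
 */
static unsigned extent_align_flags(uint64_t start, uint64_t len,
				   uint64_t sunit, uint64_t swidth)
{
	uint64_t end = start + len;	/* exclusive end of the extent */
	unsigned flags = 0;

	if (sunit == 0 || swidth == 0)
		return 0;		/* no geometry information available */
	if (start % sunit)
		flags |= ALIGN_NOT_BEGIN_SUNIT;
	if (end % sunit)
		flags |= ALIGN_NOT_END_SUNIT;
	if (start % swidth)
		flags |= ALIGN_NOT_BEGIN_SWIDTH;
	if (end % swidth)
		flags |= ALIGN_NOT_END_SWIDTH;
	return flags;
}
```

With sunit = 16 and swidth = 64, an extent starting at block 80 with length 48 begins and ends on a stripe unit and ends on a stripe width, so only the "doesn't begin on stripe width" bit is set.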

The reason I want this to come out of the filesystem is that one of
the driving factors for multi-device support in XFS is to allow
multiple devices of different geometries to co-exist efficiently in
the one namespace (another reason I'm happy about the fe_lun
addition). Passing this information out with the extent is far
simpler than trying to find what device it lies on from userspace,
then querying for the geometry of that device and then converting
it. Especially when extents could lie on different devices with
differing geometries....


> I don't mind adding this, as long as it's clear that some filesystems don't
> have this kind of information.

Sure.

> > This information could be easily passed up in the flags fields if the
> > filesystem has geometry information (there go 4 more flags ;).
>
> Got lots of flag bits now.

Time to start using them all up ;)

> > Also - what are the explicit sync semantics of this ioctl? The
> > XFS ioctl causes a fsync of the file first to convert delalloc
> > extents to real extents before returning the bmap. Is this functionality
> > going to be the same? If not, then we need a DELALLOC flag to indicate
> > extents that haven't been allocated yet. This might be handy to
> > have, anyway....
>
> Have added a FIEMAP_FLAG_SYNC on the request to sync if applications care,

OK.

> and FIEMAP_EXTENT_UNKNOWN can handle unmapped extents for delalloc.

I'd prefer explicit enumeration of them, as I said before...

> > > The fm_extents array
> > > returned contains the packed list of allocation extents for the file,
> > > including entries for holes (which have fe_start == 0, and a flag).
> >
> > Internally in XFS, we pass these around as:
> >
> > #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL)
> > #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL)
>
> We could do this too, instead of having flags, but many of the proposed
> flags are orthogonal so we'd end up needing a lot of separate values here
> and it would just degenerate into the FIEMAP_LEN_MASK I previously suggested.

Yeah, fair enough.

> > > for (i = 0; i < fm->fm_extent_count; i++) {
> > > __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> > > __u64 fm_next = fm->fm_start.fe_start + len;
> > > int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> > > int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> > >
> > > printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> > > fm->fm_start.fe_start, fm_next - 1,
> > > hole ? 0 : fm->fm_extents[i].fe_start,
> > > hole ? 0 : fm->fm_extents[i].fe_start +
> > > fm->fm_extents[i].fe_len - 1,
> > > len, hole ? "(hole) " : "",
> > > unwr ? "(unwritten) " : "");
> > >
> > > /* get ready for printing next extent, or next ioctl */
> > > fm->fm_start.fe_start = fm_next;
> >
> > Ok, so the only way you can determine where you are in the file
> > is by adding up the length of each extent. What happens if the file
> > is changing underneath you e.g. someone punches out a hole
> > in the file, or truncates and extends it again between ioctl()
> > calls?
>
> Well, that is always true with data once it is out of the caller.

Sure, but this interface requires iterative calls where the n+1
call is reliant on nothing changing since the first call to be
accurate. My question is how do you use this interface to reliably
and accurately get all the extents if you are using iterative summing
like this?

> > Also, what happens if you ask for an offset/len that doesn't map to
> > any extent boundaries - are you truncating the extents returned to
> > the off/len passed in?
>
> The request offset will be returned as the start of the actual extent that
> it falls inside. And the returned extents will end with the extent that
> ends at or after the requested fm_start + fm_len.

Ok, so you round the start inwards and round the end outwards. Can
you ensure that this is documented in the header file that describes
this interface?

> > xfs_bmap gets around this by finding out how many extents there are in the
> > file and allocating a buffer that big to hold all the extents so they
> > are gathered in a single atomic call (think sparse matrix files)....
>
> Yeah, except this might be persistent for a long time if it isn't fully
> read with a single ioctl and the app never continues reading but doesn't
> close the fd.

Not sure I follow you here...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-04-19 06:21:21

by Timothy Shimmin

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger <[email protected]> wrote:

> Below is an aggregation of the comments in this thread:
>
> struct fiemap_extent {
> __u64 fe_start; /* starting offset in bytes */
> __u64 fe_len; /* length in bytes */
> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
> __u32 fe_lun; /* logical storage device number in array */
> }
>
> struct fiemap {
> __u64 fm_start; /* logical start offset of mapping (in/out) */
> __u64 fm_len; /* logical length of mapping (in/out) */
> __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
> __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
> __u64 fm_unused;
> struct fiemap_extent fm_extents[0];
> }
>
> /* flags for the fiemap request */
> #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/
>
> /* flags for the returned extents */
> #define FIEMAP_EXTENT_HOLE 0x00000001 /* no space allocated */
> #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* uninitialized space */
> #define FIEMAP_EXTENT_UNKNOWN 0x00000004 /* in use, location unknown */
> #define FIEMAP_EXTENT_ERROR 0x00000008 /* error mapping space */
> #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* no direct data access */
>
>
>
> SUMMARY OF CHANGES
> ==================
> - use fm_* fields directly in request instead of making it a fiemap_extent
> (though they are layed out identically)

I much prefer that - it makes it a lot clearer to me to have fiemap_extent
just for fm_extents (no different meanings now).
(Don't like the word "offset" in comment without "physical" or some such but whatever;-)
I also prefer the flags as separate fields too :)

--Tim

2007-04-30 22:44:01

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Apr 19, 2007 11:54 +1000, David Chinner wrote:
> > struct fiemap {
> > __u64 fm_start; /* logical start offset of mapping (in/out) */
> > __u64 fm_len; /* logical length of mapping (in/out) */
> > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
> > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
> > __u64 fm_unused;
> > struct fiemap_extent fm_extents[0];
> > }
> >
> > /* flags for the fiemap request */
> > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/
>
> No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

This is actually for future use. Any flags that are added into this range
must be understood by both sides or it should be considered an error. Flags
outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported.
If it turns out that 8 bits is too small a range for INCOMPAT flags, then
we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are
incompat flags also.

I'm assuming that all flags that will be in the original FIEMAP proposal
will be understood by the implementations. Most filesystems can safely
ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
that matter FLAG_SYNC is probably moot for most filesystems also because
they do block allocation at preprw time.

> SO, there's a HSM_READ flag above. If we are going to make this interface
> useful for filesystems that have HSMs interacting with their extents, the
> HSM needs to be able to query whether the extent is online (on disk),
> has been migrated offline (on tape) or in dual-state (i.e. both online and
> offline).

Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
consider files that are both on disk and on secondary storage (which is
no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE,
but that has a confusing connotation that the extent is inaccessible, instead
of just saying it is also on offline storage. What about
FIEMAP_EXTENT_SECONDARY? Other proposals welcome.

FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped.
That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN,
while a dual-location file would be EXTENT_SECONDARY only.


> > SUMMARY OF CHANGES
> > ==================
> > - add separate fe_flags word with flags from various suggestions:
> > - FIEMAP_EXTENT_HOLE = extent has no space allocation
> > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
> > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
> > (e.g. HSM, delalloc awaiting sync, etc)
>
> I'd like an explicit delalloc flag, not lumping it in with "unknown".
> we *know* the extent is delalloc ;)

Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with
EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in
addition to UNKNOWN). I'd like to keep a generic "UNKNOWN" flag that can
be used by applications that don't really care about why it is unmapped
and in case there are other reasons in the future that an extent might
be unmapped (e.g. fsck or storage layer reporting corruption or loss of
that part of the file).

> > > chook 681% xfs_bmap -vv fred
> > > fred:
> > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010
> > > FLAG Values:
> > > 010000 Unwritten preallocated extent
> > > 001000 Doesn't begin on stripe unit
> > > 000100 Doesn't end on stripe unit
> > > 000010 Doesn't begin on stripe width
> > > 000001 Doesn't end on stripe width
> >
> > Can you clarify the terminology here? What is a "stripe unit" and what is
> > a "stripe width"?
>
> Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount
> of data that is written to each lun in a stripe before moving onto the
> next stripe element.
>
> > Are there "N * stripe_unit = stripe_width" in e.g. a
> > RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?
>
> Yes, on simple configurations. In more complex HW RAID
> configurations, we'll typically set the stripe unit to the width of
> the RAID5 lun (N * segment size) and the stripe width to the number
> of luns we've striped across.

Can you propose reasonable flag names for these (I can't think of anything
very good) and a clear explanation of what they mean. I suspect it will
only be XFS that uses them initially. In mke2fs and ext4+mballoc there is
the concept of stripe unit and stripe width, but as yet they are not
communicated between the two very well. I'd be much happier if this info
could be queried in a standard way from the block layer instead of the
user having to specify it and the filesystem having to track it.

> > > Ok, so the only way you can determine where you are in the file
> > > is by adding up the length of each extent. What happens if the file
> > > is changing underneath you e.g. someone punches out a hole
> > > in the file, or truncates and extends it again between ioctl()
> > > calls?
> >
> > Well, that is always true with data once it is out of the caller.
>
> Sure, but this interface requires iterative calls where the n+1
> call is reliant on nothing changing since the first call to be
> accurate. My question is how do you use this interface to reliably
> and accurately get all the extents if you using iterative summing
> like this?

Maybe it wasn't clear, but the semantics of the ioctl are that it will
return the first extent that contains the requested byte offset in fm_start.
If the file has changed since the last call to FIEMAP then it will restart
with the extent that covers this byte and continue on. In most cases the
file mapping should be returnable in a single ioctl (assuming a reasonable
extent count).

> > > Also, what happens if you ask for an offset/len that doesn't map to
> > > any extent boundaries - are you truncating the extents returned to
> > > the off/len passed in?
> >
> > The request offset will be returned as the start of the actual extent that
> > it falls inside. And the returned extents will end with the extent that
> > ends at or after the requested fm_start + fm_len.
>
> Ok, so you round the start inwards and the round end outwards. Can
> you ensure that this is documented in the header file that describes
> this interface?

Sure.

> > > xfs_bmap gets around this by finding out how many extents there are in the
> > > file and allocating a buffer that big to hold all the extents so they
> > > are gathered in a single atomic call (think sparse matrix files)....
> >
> > Yeah, except this might be persistent for a long time if it isn't fully
> > read with a single ioctl and the app never continues reading but doesn't
> > close the fd.
>
> Not sure I follow you here...

Ah, I was thinking that XFS was keeping a copy of the whole extent
mapping in the kernel to handle getting the data with separate calls.
It does make sense to specify zero for the fm_extent_count array and a
new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
extent data itself, for the non-verbose mode of filefrag, and for
pre-allocating a buffer large enough to hold the file if that is important.

I'm also going to add a FIEMAP_FLAG_LAST to mark the last extent in the file,
so that iterators using a small buffer don't need to retry to get the last
extent, and it is possible in case of e.g. EINTR (or whatever) to return a
short list without signalling EOF. I think this is cleaner than returning
a HOLE extent from EOF to ~0ULL.
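To make the iteration semantics concrete, here is a self-contained sketch that walks a file with a small buffer. It does not call any real ioctl: mock_fiemap() is a hypothetical stand-in that returns extents beginning with the one covering the requested byte offset, and the "last" field stands in for the proposed last-extent marker. All names here are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical extent record used only for this sketch. */
struct lext {
	uint64_t start;	/* logical byte offset */
	uint64_t len;	/* length in bytes */
	int last;	/* stand-in for the proposed last-extent marker */
};

/*
 * Stand-in for the ioctl: copy up to max extents, beginning with the
 * one that covers byte offset 'start', and mark the file's final extent.
 */
static size_t mock_fiemap(const struct lext *file, size_t n, uint64_t start,
			  struct lext *out, size_t max)
{
	size_t i, cnt = 0;

	for (i = 0; i < n && cnt < max; i++) {
		if (file[i].start + file[i].len <= start)
			continue;	/* entirely before the requested offset */
		out[cnt] = file[i];
		out[cnt].last = (i == n - 1);
		cnt++;
	}
	return cnt;
}

/* Walk the whole file with a buffer of bufsz (1..8) extents.
 * Returns the number of calls made; *total gets the extent count. */
static unsigned walk_file(const struct lext *file, size_t n,
			  size_t bufsz, size_t *total)
{
	struct lext buf[8];
	uint64_t start = 0;
	unsigned calls = 0;
	int done = 0;

	*total = 0;
	if (bufsz == 0 || bufsz > 8)
		return 0;
	while (!done) {
		size_t i, cnt = mock_fiemap(file, n, start, buf, bufsz);

		calls++;
		if (cnt == 0)
			break;		/* nothing mapped at or after start */
		for (i = 0; i < cnt; i++) {
			/* the next request resumes after this extent */
			start = buf[i].start + buf[i].len;
			done = buf[i].last;
		}
		*total += cnt;
	}
	return calls;
}
```

Because each request restarts from a byte offset rather than an extent index, a file that changes between calls still yields whatever extent covers that byte. The marker lets the loop stop as soon as the final extent arrives, even when the buffer came back exactly full.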

Another question about semantics -
- does XFS return an extent for the metadata parts of the file (e.g. btree)?
- does XFS return preallocated extents beyond EOF?
- does XFS allow non-root users to call xfs_bmap on files they don't own, or
use by non-root users at all? The FIBMAP ioctl is for privileged users
only, and I wonder if FIEMAP should be the same, or at least disallow
mapping files that the user can't access especially with FLAG_SYNC and/or
FLAG_HSM_READ.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2007-05-01 04:39:06

by Nicholas Miell

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:
> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > On Apr 19, 2007 11:54 +1000, David Chinner wrote:
> > > > struct fiemap {
> > > > __u64 fm_start; /* logical start offset of mapping (in/out) */
> > > > __u64 fm_len; /* logical length of mapping (in/out) */
> > > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
> > > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
> > > > __u64 fm_unused;
> > > > struct fiemap_extent fm_extents[0];
> > > > }
> > > >
> > > > /* flags for the fiemap request */
> > > > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> > > > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> > > > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/
> > >
> > > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?
> >
> > This is actually for future use. Any flags that are added into this range
> > must be understood by both sides or it should be considered an error. Flags
> > outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported.
> > If it turns out that 8 bits is too small a range for INCOMPAT flags, then
> > we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are also
> > incompat flags also.
>
> Ah, ok. So it's not really a set of "compatibility" flags,
> it's more a "compulsory" set. Under those terms, i don't really
> see why this is necessary - either the filesystem will understand
> the flags or it will return EINVAL or ignore them...
>
> > I'm assuming that all flags that will be in the original FIEMAP proposal
> > will be understood by the implementations. Most filesystems can safely
> > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
> > that matter FLAG_SYNC is probably moot for most filesystems also because
> > they do block allocation at preprw time.
>
> Exactly my point - so why do we really need to encode a compulsory set
> of flags in the API?
>

Because flags have meaning, independent of whether or not the filesystem
understands them. And if the filesystem chooses to ignore critically
important flags (instead of returning EINVAL), bad things may happen.

So, either the filesystem will understand the flag
or iff the unknown flag is in the incompat set, it will return EINVAL
or else the unknown flag will be safely ignored.
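A minimal sketch of that three-way rule, assuming each filesystem carries a mask of the request flags it implements. The flag values are the ones proposed earlier in the thread; the helper name fiemap_check_flags() and the choice of EINVAL (rather than the EOPNOTSUPP mentioned earlier) are assumptions for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Request flags as proposed in this thread. */
#define FIEMAP_FLAG_SYNC	0x00000001
#define FIEMAP_FLAG_HSM_READ	0x00000002
#define FIEMAP_FLAG_INCOMPAT	0xff000000

/*
 * Hypothetical helper: 'supported' is the set of flags this filesystem
 * understands.  Unknown flags inside the INCOMPAT range fail the call;
 * unknown flags outside it are silently ignored.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
	uint32_t unknown = requested & ~supported;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;
	return 0;
}
```

A filesystem without HSM support could therefore accept FIEMAP_FLAG_HSM_READ and ignore it, while a future flag such as 0x01000000 would be rejected until the filesystem learns about it.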

--
Nicholas Miell <[email protected]>

2007-05-01 04:23:08

by David Chinner

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> On Apr 19, 2007 11:54 +1000, David Chinner wrote:
> > > struct fiemap {
> > > __u64 fm_start; /* logical start offset of mapping (in/out) */
> > > __u64 fm_len; /* logical length of mapping (in/out) */
> > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
> > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
> > > __u64 fm_unused;
> > > struct fiemap_extent fm_extents[0];
> > > }
> > >
> > > /* flags for the fiemap request */
> > > #define FIEMAP_FLAG_SYNC 0x00000001 /* flush delalloc data to disk*/
> > > #define FIEMAP_FLAG_HSM_READ 0x00000002 /* retrieve data from HSM */
> > > #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* must understand these flags*/
> >
> > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?
>
> This is actually for future use. Any flags that are added into this range
> must be understood by both sides or it should be considered an error. Flags
> outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported.
> If it turns out that 8 bits is too small a range for INCOMPAT flags, then
> we can make 0x01000000 an incompat flag that means e.g. 0x00ff0000 are also
> incompat flags also.

Ah, ok. So it's not really a set of "compatibility" flags,
it's more a "compulsory" set. Under those terms, I don't really
see why this is necessary - either the filesystem will understand
the flags or it will return EINVAL or ignore them...

> I'm assuming that all flags that will be in the original FIEMAP proposal
> will be understood by the implementations. Most filesystems can safely
> ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
> that matter FLAG_SYNC is probably moot for most filesystems also because
> they do block allocation at preprw time.

Exactly my point - so why do we really need to encode a compulsory set
of flags in the API?

> > SO, there's a HSM_READ flag above. If we are going to make this interface
> > useful for filesystems that have HSMs interacting with their extents, the
> > HSM needs to be able to query whether the extent is online (on disk),
> > has been migrated offline (on tape) or in dual-state (i.e. both online and
> > offline).
>
> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't

I disagree - why would you want to indicate the state is unknown when we know
very well that it is offline?

> consider files that are both on disk and on secondary storage (which is
> no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE,
> but that has a confusing connotation that the extent is inaccessible, instead
> of just saying it is also on offline storage. What about
> FIEMAP_EXTENT_SECONDARY? Other proposals welcome.

Effectively, when your extent is offline in the HSM, it is inaccessible, and
you have to bring it back from tape so it becomes accessible again. i.e. some
action is necessary on behalf of the user to make it accessible. So I think
that OFFLINE is a good name for this state because it really is inaccessible.

Also, I don't think "secondary" is a good term because most large systems
have more than one tier of storage. One possibility is "HSM_RESIDENT"
which indicates the extent is current and resident with a HSM's archive....

> FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped.
> That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN,
> while a dual-location file would be EXTENT_SECONDARY only.

I much prefer OFFLINE|HSM_RESIDENT and HSM_RESIDENT as it is far more
descriptive as to what the state is (which certainly isn't unknown).
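A sketch of how a caller might decode the OFFLINE / HSM_RESIDENT pair into the three states asked for earlier (online, offline, dual-state). The bit values are hypothetical, since none were assigned in this thread, and the treatment of OFFLINE without HSM_RESIDENT as inconsistent is an assumption of this sketch, not something anyone here specified.

```c
#include <stdint.h>

/* Hypothetical bit values; the thread does not assign numbers to these. */
#define FIEMAP_EXTENT_OFFLINE		0x00000020
#define FIEMAP_EXTENT_HSM_RESIDENT	0x00000040

enum hsm_state {
	HSM_ONLINE_ONLY,	/* on disk, no copy in the archive */
	HSM_DUAL_STATE,		/* on disk and in the archive */
	HSM_OFFLINE_ONLY,	/* migrated out, must be recalled first */
	HSM_INCONSISTENT	/* offline but no archive copy (assumed invalid) */
};

static enum hsm_state classify_hsm(uint32_t fe_flags)
{
	int offline = !!(fe_flags & FIEMAP_EXTENT_OFFLINE);
	int resident = !!(fe_flags & FIEMAP_EXTENT_HSM_RESIDENT);

	if (offline && resident)
		return HSM_OFFLINE_ONLY;
	if (resident)
		return HSM_DUAL_STATE;
	if (offline)
		return HSM_INCONSISTENT;
	return HSM_ONLINE_ONLY;
}
```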

> > > SUMMARY OF CHANGES
> > > ==================
> > > - add separate fe_flags word with flags from various suggestions:
> > > - FIEMAP_EXTENT_HOLE = extent has no space allocation
> > > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
> > > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
> > > (e.g. HSM, delalloc awaiting sync, etc)
> >
> > I'd like an explicit delalloc flag, not lumping it in with "unknown".
> > we *know* the extent is delalloc ;)
>
> Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with
> EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in
> addition to UNKNOWN).

I disagree that it is redundant - in case you hadn't already noticed I
dislike the idea of "unknown" meaning one of several possible known
states ;)

> I'd like to keep a generic "UNKNOWN" flag that can
> be used by applications that don't really care about why it is unmapped
> and in case there are other reasons in the future that an extent might
> be unmapped (e.g. fsck or storage layer reporting corruption or loss of
> that part of the file).

Sure.

> > > > chook 681% xfs_bmap -vv fred
> > > > fred:
> > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > > > 0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010
> > > > FLAG Values:
> > > > 010000 Unwritten preallocated extent
> > > > 001000 Doesn't begin on stripe unit
> > > > 000100 Doesn't end on stripe unit
> > > > 000010 Doesn't begin on stripe width
> > > > 000001 Doesn't end on stripe width
> > >
> > > Can you clarify the terminology here? What is a "stripe unit" and what is
> > > a "stripe width"?
> >
> > Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount
> > of data that is written to each lun in a stripe before moving onto the
> > next stripe element.
> >
> > > Are there "N * stripe_unit = stripe_width" in e.g. a
> > > RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?
> >
> > Yes, on simple configurations. In more complex HW RAID
> > configurations, we'll typically set the stripe unit to the width of
> > the RAID5 lun (N * segment size) and the stripe width to the number
> > of luns we've striped across.
>
> Can you propose reasonable flag names for these (I can't think of anything
> very good) and a clear explanation of what they mean. I suspect it will
> only be XFS that uses them initially. In mke2fs and ext4+mballoc there is
> the concept of stripe unit and stripe width, but as yet they are not
> communicated between the two very well. I'd be much happier if this info
> could be queried in a standard way from the block layer instead of the
> user having to specify it and the filesystem having to track it.

My preference is definitely for a separate ioctl to grab the
filesystem geometry so this stuff can be calculated in userspace.
i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
bother trying to define names until we decide which approach we take
to implement this.

The problem with the block layer is that there is no standard way of
getting this information - every volume manager has a different
method. In XFS, mkfs.xfs does the work of getting this information
and storing it in the filesystem superblock. Here's the code for getting
sunit/swidth from the underlying block device:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/

Not much in common there ;)

The other problem is that there is no guarantee the filesystem and
the block layer are using the same values as they can be overridden
by mount options. e.g the block layer may have a stripe unit of
512k, but it's perfectly valid for the filesystem to use a multiple
of this for its sunit, or to not use a sunit at all. Hence you
really do need to track this in the filesystem and query it from
there...

> > > > xfs_bmap gets around this by finding out how many extents there are in the
> > > > file and allocating a buffer that big to hold all the extents so they
> > > > are gathered in a single atomic call (think sparse matrix files)....
> > >
> > > Yeah, except this might be persistent for a long time if it isn't fully
> > > read with a single ioctl and the app never continues reading but doesn't
> > > close the fd.
> >
> > Not sure I follow you here...
>
> Ah, I was thinking that XFS was keeping a copy of the whole extent
> mapping in the kernel to handle getting the data with separate calls.

Actually, it keeps the whole mapping in the kernel to make lookups
fast and relatively simple.

> It does make sense to specify zero for the fm_extent_count array and a
> new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> extent data itself, for the non-verbose mode of filefrag, and for
> pre-allocating a buffer large enough to hold the file if that is important.

Rather than rely on implicit behaviour of "pass in extent count of
zero and don't try to return any extents" to return the number of
extents on the file, why not just explicitly define this as a valid
input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS

> I'm also going to add a FIEMAP_FLAG_LAST to mark the last extent in the file,
> so that iterators using a small buffer don't need to retry to get the last
> extent, and it is possible in case of e.g. EINTR (or whatever) to return a
> short list without signalling EOF. I think this is cleaner than returning
> a HOLE extent from EOF to ~0ULL.

Yes, good idea.
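
A rough sketch of how an iterator with a small buffer could use such a
marker. Everything here is made up for illustration: the struct, the
flag value and the stand-in function that plays the role of the ioctl
are assumptions, not the proposed ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and values, for illustration only. */
#define SKETCH_EXTENT_LAST 0x1u  /* set on the final extent of the file */

struct sketch_extent {
    uint64_t start;   /* logical offset in bytes */
    uint64_t len;     /* length in bytes */
    uint32_t flags;
};

/*
 * Stand-in for the ioctl: copy up to max extents that end after offset
 * from the file's full map into out, and return how many were copied.
 * The final extent of the file is tagged with SKETCH_EXTENT_LAST.
 */
static size_t sketch_get_extents(const struct sketch_extent *map, size_t nmap,
                                 uint64_t offset,
                                 struct sketch_extent *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < nmap && n < max; i++) {
        if (map[i].start + map[i].len <= offset)
            continue;
        out[n] = map[i];
        if (i == nmap - 1)
            out[n].flags |= SKETCH_EXTENT_LAST;
        n++;
    }
    return n;
}

/* Walk the whole file with a two-entry buffer; return the number of calls. */
static size_t sketch_count_calls(const struct sketch_extent *map, size_t nmap,
                                 size_t *total_extents)
{
    struct sketch_extent buf[2];
    uint64_t offset = 0;
    size_t calls = 0;

    *total_extents = 0;
    for (;;) {
        size_t n = sketch_get_extents(map, nmap, offset, buf, 2);

        calls++;
        if (n == 0)
            break;      /* empty file, or nothing past offset */
        *total_extents += n;
        if (buf[n - 1].flags & SKETCH_EXTENT_LAST)
            break;      /* no extra call needed to detect EOF */
        offset = buf[n - 1].start + buf[n - 1].len;
    }
    return calls;
}
```

For a file with four extents this loop finishes in two calls. Without
the marker it would need a third call that returns nothing just to
learn that it had already seen the end of the file.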

> Another question about semantics -
> - does XFS return an extent for the metadata parts of the file (e.g. btree)?

No, but we can return the extent map for the attribute fork (i.e.
extended attrs) if asked for (XFS_IOC_GETBMAPA).

> - does XFS return preallocated extents beyond EOF?

Yes - they are part of the extent map for the file.

> - does XFS allow non-root users to call xfs_bmap on files they don't own, or
> use by non-root users at all?

Users can run xfs_bmap on any file they have permission to
open(O_RDONLY).

> The FIBMAP ioctl is for privileged users
> only, and I wonder if FIEMAP should be the same, or at least disallow
> mapping files that the user can't access especially with FLAG_SYNC and/or
> FLAG_HSM_READ.

I see little reason for restricting FI[BE]MAP to privileged users -
anyone should be able to determine if files they have permission to
access are fragmented.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-01 14:20:49

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote:
> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:
> > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > > This is actually for future use. Any flags that are added into this
> > > range must be understood by both sides or it should be considered an
> > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need
> > > to be supported. If it turns out that 8 bits is too small a range for
> > > INCOMPAT flags, then we can make 0x01000000 an incompat flag that means
> > > e.g. 0x00ff0000 are also incompat flags also.
> >
> > Ah, ok. So it's not really a set of "compatibility" flags, it's more a
> > "compulsory" set. Under those terms, i don't really see why this is
> > necessary - either the filesystem will understand the flags or it will
> > return EINVAL or ignore them...
> >
> > > I'm assuming that all flags that will be in the original FIEMAP proposal
> > > will be understood by the implementations. Most filesystems can safely
> > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
> > > that matter FLAG_SYNC is probably moot for most filesystems also because
> > > they do block allocation at preprw time.
> >
> > Exactly my point - so why do we really need to encode a compulsory set of
> >
>
> Because flags have meaning, independent of whether or not the filesystem
> understands them. And if the filesystem chooses to ignore critically
> important flags (instead of returning EINVAL), bad things may happen.
>
> So, either the filesystem will understand the flag or iff the unknown flag
> is in the incompat set, it will return EINVAL or else the unknown flag will
> be safely ignored.

My point was that there is a difference between specification and
implementation - if the specification says something is compulsory,
then it must be implemented in the filesystem. This is easy
enough to ensure by code review - we don't need additional interface
complexity for this....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-01 18:38:22

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 1 May 2007, at 05:22, David Chinner wrote:
> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>> The FIBMAP ioctl is for privileged users
>> only, and I wonder if FIEMAP should be the same, or at least
>> disallow
>> mapping files that the user can't access especially with
>> FLAG_SYNC and/or
>> FLAG_HSM_READ.
>
> I see little reason for restricting FI[BE]MAP to privileged users -
> anyone should be able to determine if files they have permission to
> access are fragmented.

Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
machine. Perhaps for non-privileged users FIEMAP has to be read-
only? As soon as any of the FLAG_* flags come into play you make it
privileged. For example fancy any user being able to fill up your
file system by calling FIEMAP with FLAG_HSM_READ on all files
recursively? This should certainly not be simply dismissed as a non-
issue without thinking about it first...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-05-01 18:46:53

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 1 May 2007, at 15:20, David Chinner wrote:
> On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote:
>> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> This is actually for future use. Any flags that are added into
>>>> this
>>>> range must be understood by both sides or it should be
>>>> considered an
>>>> error. Flags outside the FIEMAP_FLAG_INCOMPAT do not
>>>> necessarily need
>>>> to be supported. If it turns out that 8 bits is too small a
>>>> range for
>>>> INCOMPAT flags, then we can make 0x01000000 an incompat flag
>>>> that means
>>>> e.g. 0x00ff0000 are also incompat flags also.
>>>
>>> Ah, ok. So it's not really a set of "compatibility" flags, it's
>>> more a
>>> "compulsory" set. Under those terms, i don't really see why this is
>>> necessary - either the filesystem will understand the flags or it
>>> will
>>> return EINVAL or ignore them...
>>>
>>>> I'm assuming that all flags that will be in the original FIEMAP
>>>> proposal
>>>> will be understood by the implementations. Most filesystems can
>>>> safely
>>>> ignore FLAG_HSM_READ, for example, since they don't support HSM,
>>>> and for
>>>> that matter FLAG_SYNC is probably moot for most filesystems also
>>>> because
>>>> they do block allocation at preprw time.
>>>
>>> Exactly my point - so why do we really need to encode a
>>> compulsory set of
>>
>> Because flags have meaning, independent of whether or not the
>> filesystem
>> understands them. And if the filesystem chooses to ignore critically
>> important flags (instead of returning EINVAL), bad things may happen.
>>
>> So, either the filesystem will understand the flag or iff the
>> unknown flag
>> is in the incompat set, it will return EINVAL or else the unknown
>> flag will
>> be safely ignored.
>
> My point was that there is a difference between specification and
> implementation - if the specification says something is compulsory,
> then they must be implemented in the filesystem. This is easy
> enough to ensure by code review - we don't need additional interface
> complexity for this....

You are wrong about this because you are missing the point that you
have no code to review. The users that will use those flags are
going to be applications that run in user space. Chances are you
will never see their code. Heck, they might not even be open source
applications... And all applications will run against a multitude of
kernels. So version X of the application will run on kernel 2.4.*,
2.6.*, a.b.*, etc... For future expandability of the interface I
think it is important to have both compulsory and non-compulsory flags.

For example there is no reason why FIEMAP_HSM_READ needs to be
compulsory. Most filesystems do not support HSM so can safely ignore
it. And applications that want to read/write the data locations that
are obtained with the FIEMAP call will likely always supply
FIEMAP_HSM_READ because they want to ensure the file is brought in if
it is off line so they definitely want file systems that do not
support this flag to ignore it.

And vice versa, an application might specify some weird and funky yet
to be developed feature that it expects the FS to perform and if the
FS cannot do it (either because it does not support it or because it
failed to perform the operation) the application expects the FS to
return an error and not to ignore the flag. An example could be the
asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
ignores it it will return the extent map for the file data instead of
the XATTR_FORK! Not what the application wanted at all. Ouch! So
this is definitely a compulsory flag if I ever saw one.

So as you see you must support both voluntary and compulsory flags...

Also consider what I said above about different kernels. A new
feature is implemented in kernel 2.8.13 say that was not there before
and an application is updated to use that feature. There will be
lots of instances where that application will still be run on older
kernels where this feature does not exist. Depending on the feature
it may be quite sensible to simply ignore in the kernel that the
application set an unknown flag whilst for a different feature it may
be the opposite.
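
The rule being argued for can be sketched in a few lines. The flag
names, the bit values and the 0xff000000 compulsory range below are
assumptions chosen for illustration; the thread has not fixed any of
them.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative values only; the bit layout is not settled in this thread. */
#define IC_FLAG_SYNC        0x00000001u  /* optional: safe to ignore */
#define IC_FLAG_HSM_READ    0x00000002u  /* optional: safe to ignore */
#define IC_FLAG_XATTR_FORK  0x02000000u  /* compulsory: changes the result */
#define IC_FLAG_INCOMPAT    0xff000000u  /* assumed compulsory range */

/*
 * Return 0 if a filesystem that understands `supported` may service
 * `requested`, or -EINVAL if the request contains a compulsory flag the
 * filesystem does not understand. Unknown flags outside the compulsory
 * range are dropped silently; *effective receives the flags acted on.
 */
static int ic_check_flags(uint32_t requested, uint32_t supported,
                          uint32_t *effective)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & IC_FLAG_INCOMPAT)
        return -EINVAL;
    *effective = requested & supported;
    return 0;
}
```

An old kernel that only knows IC_FLAG_SYNC would accept a request
carrying IC_FLAG_HSM_READ and ignore that bit, but would refuse one
carrying IC_FLAG_XATTR_FORK instead of returning the wrong map.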

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-05-01 22:30:49

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On May 01, 2007 14:22 +1000, David Chinner wrote:
> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
>
> I disagree - why would you want to indicate the state is unknown when we know
> very well that it is offline?

If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
catch-all flag that indicates "this extent contains data but there is
nothing sensible to be returned for the extent mapping."

> Effectively, when your extent is offline in the HSM, it is inaccessable, and
> you have to bring it back from tape so it becomes accessible again. i.e. some
> action is necessary on behalf of the user to make it accessible. So I think
> that OFFLINE is a good name for this state because it really is inaccessible.

What you are calling OFFLINE I would prefer to call UNMAPPED, since that
can be used by applications as a catch-all for "no mapping". There can
be further flags that give refinements to UNMAPPED that some applications
might care about (e.g. HSM_RESIDENT), but many users/apps will not
if they just want the number of fragments in a given file.
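
As a sketch of what "catch-all plus refinements" could look like to an
application (the struct layout and flag values are hypothetical, not
part of the proposal): a simple tool tests only the UNMAPPED bit, while
an HSM-aware tool can additionally look at the refinement bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define UM_EXTENT_UNMAPPED      0x1u  /* data exists, no usable mapping */
#define UM_EXTENT_HSM_RESIDENT  0x2u  /* refinement: data lives in the HSM */

struct um_extent {
    uint64_t logical;   /* offset in the file, bytes */
    uint64_t physical;  /* offset on the device; meaningless if UNMAPPED */
    uint64_t len;       /* length in bytes */
    uint32_t flags;
};

/* Extents whose physical offset can be trusted: test the catch-all bit. */
static size_t um_count_mapped(const struct um_extent *ext, size_t n)
{
    size_t mapped = 0;

    for (size_t i = 0; i < n; i++)
        if (!(ext[i].flags & UM_EXTENT_UNMAPPED))
            mapped++;
    return mapped;
}

/* An HSM-aware tool can also look at the refinement bit. */
static size_t um_count_hsm_resident(const struct um_extent *ext, size_t n)
{
    const uint32_t want = UM_EXTENT_UNMAPPED | UM_EXTENT_HSM_RESIDENT;
    size_t resident = 0;

    for (size_t i = 0; i < n; i++)
        if ((ext[i].flags & want) == want)
            resident++;
    return resident;
}
```

A fragment counter needs neither function: it just uses the number of
extents returned, whatever their flags.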

> Also, I don't think "secondary" is a good term because most large systems
> have more than one tier of storage. One possibility is "HSM_RESIDENT"
> which indicates the extent is current and resident with a HSM's archive....

Sure.

> > Can you propose reasonable flag names for these (I can't think of anything
> > very good) and a clear explanation of what they mean. I suspect it will
> > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is
> > the concept of stripe unit and stripe width, but as yet they are not
> > communicated between the two very well. I'd be much happier if this info
> > could be queried in a standard way from the block layer instead of the
> > user having to specify it and the filesystem having to track it.
>
> My preference is definitely for a separate ioctl to grab the
> filesystem geometry so this stuff can be calculated in userspace.
> i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
> bother trying to define names until we decide which appraoch we take
> to implement this.

Hmm, previously you wrote "This information could be easily passed up in the
flags fields if the filesystem has geometry information". So, I _think_
what you are saying is that you want 4 flags to convey this start/end
alignment information, but the exact semantics of what a "stripe unit" and
a "stripe width" is filesystem specific?

I definitely do NOT want to get into any issues of querying the block
device geometry here. I was just making a passing comment that ext4+mballoc
can already do RAID-specific allocation alignment, but it depends on the
admin to specify this information and it would be nice if there was some
easy way to get this from userspace/kernel interfaces.

Having an API that can request "tell me the number of blocks from this
offset until the next physical disk boundary" or similar would be useful
to any allocator, and the block layer already needs to know this when
submitting IO.
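
Assuming the stripe unit is known and is a simple fixed size, the
arithmetic behind such a query is small. This is only a sketch with
made-up function names; it works in bytes rather than blocks and says
nothing about how the stripe unit would actually be obtained.

```c
#include <stdint.h>

/*
 * Bytes from offset to the next stripe-unit boundary. sunit is in bytes
 * and must be nonzero. An offset already on a boundary returns a full
 * stripe unit, i.e. the distance to the following boundary.
 */
static uint64_t bd_bytes_to_boundary(uint64_t offset, uint64_t sunit)
{
    return sunit - (offset % sunit);
}

/* 1 if offset sits exactly on a boundary of the given unit, else 0. */
static int bd_is_aligned(uint64_t offset, uint64_t unit)
{
    return unit != 0 && offset % unit == 0;
}
```

For a 512k stripe unit, an offset 4k past a boundary is 508k away from
the next one.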

> In XFS, mkfs.xfs does the work of getting this information
> to see in the filesystem superblock. Here's the code for getting
> sunit/swidth from the underlying block device:
>
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
>
> Not much in common there ;)

It looks like this might be just what e2fsprogs needs also.

> > It does make sense to specify zero for the fm_extent_count array and a
> > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> > extent data itself, for the non-verbose mode of filefrag, and for
> > pre-allocating a buffer large enough to hold the file if that is important.
>
> Rather than rely on implicit behaviour of "pass in extent count of
> zero and a don't try to return any extents" to return the number of
> extents on the file, why not just explicitly define this as a valid
> input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS

That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my
clever-clever for "return no extents" and "return number of extents"
is wasted :-/.

> > - does XFS return an extent for the metadata parts of the file (e.g. btree)?
>
> No, but we can return the extent map for the attribute fork (i.e.
> extended attrs) if asked for (XFS_IOC_GETBMAPA).

This seems like it would be a useful addition to the interface also, having
FIEMAP_FLAG_METADATA request the return of metadata allocations too.

> > - does XFS return preallocated extents beyond EOF?
>
> Yes - they are part of the extent map for the file.

OK.

> > - does XFS allow non-root users to call xfs_bmap on files they don't own, or
> > use by non-root users at all?
>
> Users can run xfs_bmap on any file they have permission to
> open(O_RDONLY).
>
> > The FIBMAP ioctl is for privileged users
> > only, and I wonder if FIEMAP should be the same, or at least disallow
> > mapping files that the user can't access especially with FLAG_SYNC and/or
> > FLAG_HSM_READ.
>
> I see little reason for restricting FI[BE]MAP to privileged users -
> anyone should be able to determine if files they have permission to
> access are fragmented.

I think I agree with Anton that allowing some of the flags for non-privileged
users seems dangerous. I think this needs to be determined on a flag-by-flag
basis, and -EPERM should be returned in some cases.
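
Something along these lines, as a sketch only. The flag values are
invented, and which flags (if any) end up in the privileged set is
exactly what is still being debated; FLAG_HSM_READ is used here purely
as a placeholder.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define PM_FLAG_SYNC      0x1u
#define PM_FLAG_HSM_READ  0x2u

/* Placeholder policy: the set of privileged flags is an open question. */
#define PM_PRIVILEGED_FLAGS (PM_FLAG_HSM_READ)

/* Return -EPERM if an unprivileged caller asks for a privileged flag. */
static int pm_check_perm(uint32_t flags, int is_privileged)
{
    if (!is_privileged && (flags & PM_PRIVILEGED_FLAGS))
        return -EPERM;
    return 0;
}
```

A plain mapping request with no flags always passes this check, so
unprivileged users could still see how their own files are laid out.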

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-01 22:32:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On May 02, 2007 00:20 +1000, David Chinner wrote:
> My point was that there is a difference between specification and
> implementation - if the specification says something is compulsory,
> then they must be implemented in the filesystem. This is easy
> enough to ensure by code review - we don't need additional interface
> complexity for this....

What you seem to be missing about my proposal is that the FLAG_INCOMPAT
is for future use by that part of the specification we haven't thought
of yet... Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4,
and is much better than having version numbers for the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-02 00:07:16

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 05:22, David Chinner wrote:
> >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> >> The FIBMAP ioctl is for privileged users
> >> only, and I wonder if FIEMAP should be the same, or at least
> >>disallow
> >> mapping files that the user can't access especially with
> >>FLAG_SYNC and/or
> >> FLAG_HSM_READ.
> >
> >I see little reason for restricting FI[BE]MAP to privileged users -
> >anyone should be able to determine if files they have permission to
> >access are fragmented.
>
> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
> machine. Perhaps for non-privileged users FIEMAP has to be read-
> only? As soon as any of the FLAG_* flags come into play you make it
> privileged. For example fancy any user being able to fill up your
> file system by calling FIEMAP with FLAG_HSM_READ on all files
> recursively?

By that reasoning, users should not be allowed to recall any files
without root privileges. HSMs don't work that way, though - any user
is allowed to recall any files they have permission to access either
by manual command or by trying to read the file data.

If that runs the filesystem out of space, then the HSM either hasn't
been configured properly or it's failed to manage the space
correctly. Either way, that's not the fault of the user for
recalling their own files.

Hence allowing FIEMAP to be executed by the user does not open up
any DOS conditions that don't already exist in a normal HSM-managed
filesystem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-02 02:26:58

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote:
> On May 01, 2007 14:22 +1000, David Chinner wrote:
> > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
> >
> > I disagree - why would you want to indicate the state is unknown when we know
> > very well that it is offline?
>
> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
> catch-all flag that indicates "this extent contains data but there is
> nothing sensible to be returned for the extent mapping."

Yes, I like that much more. Good suggestion. ;)

> > Effectively, when your extent is offline in the HSM, it is inaccessable, and
> > you have to bring it back from tape so it becomes accessible again. i.e. some
> > action is necessary on behalf of the user to make it accessible. So I think
> > that OFFLINE is a good name for this state because it really is inaccessible.
>
> What you are calling OFFLINE I would prefer to call UNMAPPED, since that
> can be used by applications as a catch-all for "no mapping". There can
> be further flags that give refinements to UNMAPPED that some applications
> might care about them (e.g. HSM_RESIDENT), but many users/apps will not
> if they just want the number of fragments in a given file.

Agreed - UNMAPPED does make a lot more sense in this case.

> > > Can you propose reasonable flag names for these (I can't think of anything
> > > very good) and a clear explanation of what they mean. I suspect it will
> > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is
> > > the concept of stripe unit and stripe width, but as yet they are not
> > > communicated between the two very well. I'd be much happier if this info
> > > could be queried in a standard way from the block layer instead of the
> > > user having to specify it and the filesystem having to track it.
> >
> > My preference is definitely for a separate ioctl to grab the
> > filesystem geometry so this stuff can be calculated in userspace.
> > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
> > bother trying to define names until we decide which appraoch we take
> > to implement this.
>
> Hmm, previously you wrote "This information could be easily passed up in the
> flags fields if the filesystem has geometry information". So, I _think_
> what you are saying is that you want 4 flags to convey this start/end
> alignment information, but the exact semantics of what a "stripe unit" and
> a "stripe width" is filesystem specific?

Right.

> I definitely do NOT want to get into any issues of querying the block
> device geometry here. I was just making a passing comment that ext4+mballoc
> can already do RAID-specific allocation alignment, but it depends on the
> admin to specify this information and it would be nice if there was some
> easy way to get this from userspace/kernel interfaces.
>
> Having an API that can request "tell me the number of blocks from this
> offset until the next physical disk boundary" or similar would be useful
> to any allocator, and the block layer already needs to know this when
> submitting IO.

The block layer knows this once you get inside the volume manager. I
think the issue is that there is no common export interface for this
information.

> > In XFS, mkfs.xfs does the work of getting this information
> > to see in the filesystem superblock. Here's the code for getting
> > sunit/swidth from the underlying block device:
> >
> > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
> >
> > Not much in common there ;)
>
> It looks like this might be just what e2fsprogs needs also.

More than likely.

> > > It does make sense to specify zero for the fm_extent_count array and a
> > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> > > extent data itself, for the non-verbose mode of filefrag, and for
> > > pre-allocating a buffer large enough to hold the file if that is important.
> >
> > Rather than rely on implicit behaviour of "pass in extent count of
> > zero and a don't try to return any extents" to return the number of
> > extents on the file, why not just explicitly define this as a valid
> > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS
>
> That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my
> clever-clever for "return no extents" and "return number of extents"
> is wasted :-/.

Too clever for an API, I think. ;)

My point is mainly that if you are going to use an API for a
specific function (e.g. query the number of extents) I think that
the API should have an obvious method for executing that specific
function. Using a command of "get no extents" to provide the query
of "how many extents in this file" is kind of obscure. When you read
the code it doesn't make a lot of sense, as opposed to seeing a
clear statement of intent from the code itself.

i.e. FIEMAP_FLAG_GET_NUMEXTENTS is self-documenting in both the API
and the code that uses it...
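
A sketch of what the explicit form might look like, using a stand-in
for the filesystem side. The request struct, the flag value and the
in/out use of the count field are assumptions for illustration, not
the proposed interface.

```c
#include <stddef.h>
#include <stdint.h>

#define NX_FLAG_GET_NUMEXTENTS 0x1u  /* hypothetical value */

struct nx_extent {
    uint64_t start;  /* logical offset in bytes */
    uint64_t len;    /* length in bytes */
};

struct nx_request {
    uint32_t flags;
    uint32_t extent_count;  /* in: capacity of out[]; out: result count */
};

/* Stand-in for the filesystem: map[] is the file's full extent list. */
static int nx_map(const struct nx_extent *map, uint32_t nmap,
                  struct nx_request *req, struct nx_extent *out)
{
    if (req->flags & NX_FLAG_GET_NUMEXTENTS) {
        req->extent_count = nmap;  /* count only; out[] is not touched */
        return 0;
    }

    uint32_t n = req->extent_count < nmap ? req->extent_count : nmap;

    for (uint32_t i = 0; i < n; i++)
        out[i] = map[i];
    req->extent_count = n;
    return 0;
}
```

The caller's intent is then visible at the call site: a request with
NX_FLAG_GET_NUMEXTENTS set is a count query, and a request without it
is a mapping query, regardless of the buffer size passed in.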

> > > - does XFS return an extent for the metadata parts of the file (e.g. btree)?
> >
> > No, but we can return the extent map for the attribute fork (i.e.
> > extended attrs) if asked for (XFS_IOC_GETBMAPA).
>
> This seems like it would be a useful addition to the interface also, having
> FIEMAP_FLAG_METADATA request the return of metadata allocations too.

Agreed. The different types of requests need to be mutually
exclusive, though - returning the map of the attribute fork mixed
with the map of the data fork is going to be confusing....
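
One way to enforce that, sketched with invented flag names and values.
Returning -EINVAL for a request that selects more than one map is an
assumption; the thread does not say what the error should be.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical selector flags, for illustration only. */
#define FK_FLAG_METADATA    0x1u
#define FK_FLAG_XATTR_FORK  0x2u
#define FK_FORK_MASK        (FK_FLAG_METADATA | FK_FLAG_XATTR_FORK)

/*
 * Reject requests that select more than one map. No selector at all
 * means the ordinary data map.
 */
static int fk_check_fork(uint32_t flags)
{
    uint32_t sel = flags & FK_FORK_MASK;

    /* sel & (sel - 1) is nonzero only when more than one bit is set */
    if (sel & (sel - 1))
        return -EINVAL;
    return 0;
}
```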

> > > - does XFS allow non-root users to call xfs_bmap on files they don't own, or
> > > use by non-root users at all?
> >
> > Users can run xfs_bmap on any file they have permission to
> > open(O_RDONLY).
> >
> > > The FIBMAP ioctl is for privileged users
> > > only, and I wonder if FIEMAP should be the same, or at least disallow
> > > mapping files that the user can't access especially with FLAG_SYNC and/or
> > > FLAG_HSM_READ.
> >
> > I see little reason for restricting FI[BE]MAP to privileged users -
> > anyone should be able to determine if files they have permission to
> > access are fragmented.
>
> I think I agree with Anton that allowing some of the flags for non-privileged
> users seems dangerous. I think this needs to be determined on a flag-by-flag
> basis, and -EPERM should be returned in some cases.

Agreed, but I'm yet to see any flags where I think that is necessary.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-02 08:16:04

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 2 May 2007, at 01:06, David Chinner wrote:
> On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
>> On 1 May 2007, at 05:22, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> The FIBMAP ioctl is for privileged users
>>>> only, and I wonder if FIEMAP should be the same, or at least
>>>> disallow
>>>> mapping files that the user can't access especially with
>>>> FLAG_SYNC and/or
>>>> FLAG_HSM_READ.
>>>
>>> I see little reason for restricting FI[BE]MAP to privileged users -
>>> anyone should be able to determine if files they have permission to
>>> access are fragmented.
>>
>> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the
>> machine. Perhaps for non-privileged users FIEMAP has to be read-
>> only? As soon as any of the FLAG_* flags come into play you make it
>> privileged. For example fancy any user being able to fill up your
>> file system by calling FIEMAP with FLAG_HSM_READ on all files
>> recursively?
>
> By that reasoning, users should not be allowed to recall any files
> without root privileges. HSMs don't work that way, though - any user
> is allowed to recall any files they have permission to access either
> by manual command or by trying to read the file daata.
>
> If that runs the filesytem out of space, then the HSM either hasn't
> been configured properly or it's failed to manage the space
> correctly. Either way, that's not the fault of the user for
> recalling their own files.
>
> Hence allowing FIEMAP to be executed by the user does not open up
> any DOS conditions that don't already exist in normal HSM-managed
> filesystem.

Sorry, it was not a great example. But the point still stands that
there are/may be created flags that you do not want to allow everyone
to use.

I completely agree with Andreas that those can simply return -EPERM
and the rest can be allowed through.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-05-02 08:25:05

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 1 May 2007, at 23:30, Andreas Dilger wrote:

> On May 01, 2007 14:22 +1000, David Chinner wrote:
>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I
>>> didn't
>>
>> I disagree - why would you want to indicate the state is unknown
>> when we know
>> very well that it is offline?
>
> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
> catch-all flag that indicates "this extent contains data but there is
> nothing sensible to be returned for the extent mapping."

I like UNMAPPED. I even use it in NTFS internally for extent maps
that have not been read into memory yet. (-:

On a different issue, do you think it would be worth adding an optional
flag like FIEMAP_DONT_RELOCATE or something similar that would be a
compulsory flag and, if set, the FS is not allowed to move the file
around/change the block allocation of the file?

My thinking is that the extent map is not terribly useful if the FS
goes and relocates the file to somewhere else just after you have
done the ioctl. For example HFS on OSX automatically defragments
files whilst it is running... Linux file systems may one day do
similar things.

Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
the FS we want to access the actual raw blocks so the FS can make
sure the data is on block aligned boundaries and if the FS does not
support this (e.g. ZFS or a compressed or encrypted NTFS file) then
it can return -ENOTSUP.

Perhaps this is totally the wrong interface and such a "prepare file
for direct access" API should be a different ioctl() or syscall or
whatever. It just seems very simple and appropriate to combine it
here as people who use FIEMAP are at least sometimes going to be
wanting to access those blocks directly as well and it feels right to
be able to communicate this to the FS in the same call, kind of like
an "open intent" of "I want to use the data directly on disk"...

What do you think?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-05-02 08:30:17

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 2 May 2007, at 09:23, Anton Altaparmakov wrote:
> On 1 May 2007, at 23:30, Andreas Dilger wrote:
>
>> On May 01, 2007 14:22 +1000, David Chinner wrote:
>>> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
>>>> Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but
>>>> I didn't
>>>
>>> I disagree - why would you want to indicate the state is unknown
>>> when we know
>>> very well that it is offline?
>>
>> If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a
>> catch-all flag that indicates "this extent contains data but there is
>> nothing sensible to be returned for the extent mapping."
>
> I like UNMAPPED. I even use it in NTFS internally for extents maps
> that have not been read into memory yet. (-:

Oops, I use NOT_MAPPED in NTFS rather than UNMAPPED but I still like
UNMAPPED, too. (-:

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-05-02 09:15:26

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 15:20, David Chinner wrote:
> >>
> >>So, either the filesystem will understand the flag or iff the
> >>unknown flag
> >>is in the incompat set, it will return EINVAL or else the unknown
> >>flag will
> >>be safely ignored.
> >
> >My point was that there is a difference between specification and
> >implementation - if the specification says something is compulsory,
> >then they must be implemented in the filesystem. This is easy
> >enough to ensure by code review - we don't need additional interface
> >complexity for this....
>
> You are wrong about this because you are missing the point that you
> have no code to review. The users that will use those flags are
> going to be applications that run in user space. Chances are you
> will never see their code. Heck, they might not even be open source
> applications...

Ummm - the specification defines what is compulsory for *filesystems*
to implement, not what applications can use. We don't need to see
what the applications do - what we care about is that all filesystems
implement the compulsory part of the specification. That's the code
we review, and that's what I was referring to.

> And all applications will run against a multitude of
> kernels. So version X of the application will run on kernel 2.4.*,
> 2.6.*, a.b.*, etc... For future expandability of the interface I
> think it is important to have both compulsory and non-compulsory flags.

Ah, so that's what you want - a mutable interface. i.e. versioning.

So how do compulsory flags help here? What happens if a voluntary
flag now becomes compulsory? Or vice versa? How is the application
supposed to deal with this dynamically?

I suggested a version number for this right back at the start of
this discussion and got told that we don't want versioned interfaces
because we should make the effort to get it right the first time.
I don't think this can be called "getting it right".

> For example there is no reason why FIEMAP_HSM_READ needs to be
> compulsory. Most filesystems do not support HSM so can safely ignore
> it.

They might be able to safely ignore it, but in reality it should
be saying "I don't understand this". If the application *needs* to
use a flag like this, then it should be told that the filesystem is
not capable of doing what it was asked!

OTOH if the application does not need to use the flag, then it
shouldn't be using it and we shouldn't be silently ignoring
incorrect usage of the provided API.

What you are effectively saying about these "voluntary" flags
is that their behaviour is _undefined_. That is, if you use
these flags what you get on a successful call is undefined;
it may or may not contain what you asked for but you can't
tell if it really did what you want or returned the information
you asked for.

This is a really bad semantic to encode into an API.

> And vice versa, an application might specify some weird and funky yet
> to be developed feature that it expects the FS to perform and if the
> FS cannot do it (either because it does not support it or because it
> failed to perform the operation) the application expects the FS to
> return an error and not to ignore the flag. An example could be the
> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> ignores it it will return the extent map for the file data instead of
> the XATTR_FORK! Not what the application wanted at all. Ouch! So
> this is definitely a compulsory flag if I ever saw one.

Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
we don't need a flag defined in the user visible API to tell us
that we need to return an error here.

> So as you see you must support both voluntary and compulsory flags...

No, you've managed to convince me that they are not necessary and
they are in fact a Bad Idea... ;)

> Also consider what I said above about different kernels. A new
> feature is implemented in kernel 2.8.13 say that was not there before
> and an application is updated to use that feature. There will be
> lots of instances where that application will still be run on older
> kernels where this feature does not exist.

This is *exactly* where silently ignoring flags really falls down.
On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
something and it returns different structure contents for the same
state. Now how does the application writer know which is correct or
how to tell the difference? They have to guess or write detection
code which is exactly what we want to avoid.

I objected to the UNKNOWN flag because it wasn't explicit
in its meaning - I'm doing the same thing here. An interface
needs to be explicitly defined and should not have any undefined
behaviour in it....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-02 09:38:24

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation


On 2 May 2007, at 10:15, David Chinner wrote:

> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
>> On 1 May 2007, at 15:20, David Chinner wrote:
>>>>
>>>> So, either the filesystem will understand the flag or iff the
>>>> unknown flag
>>>> is in the incompat set, it will return EINVAL or else the unknown
>>>> flag will
>>>> be safely ignored.
>>>
>>> My point was that there is a difference between specification and
>>> implementation - if the specification says something is compulsory,
>>> then they must be implemented in the filesystem. This is easy
>>> enough to ensure by code review - we don't need additional interface
>>> complexity for this....
>>
>> You are wrong about this because you are missing the point that you
>> have no code to review. The users that will use those flags are
>> going to be applications that run in user space. Chances are you
>> will never see their code. Heck, they might not even be open source
>> applications...
>
> Ummm - the specification defines what is compulsory for *filesystems*
> to implement, not what applications can use. We don't need to see
> what the applications do - what we care about is that all filesystems
> implement the compulsory part of the specification. That's the code
> we review, and that's what I was referring to.
>
>> And all applications will run against a multitude of
>> kernels. So version X of the application will run on kernel 2.4.*,
>> 2.6.*, a.b.*, etc... For future expandability of the interface I
>> think it is important to have both compulsory and non-compulsory
>> flags.
>
> Ah, so that's what you want - a mutable interface. i.e. versioning.
>
> So how do compulsory flags help here? What happens if a voluntary
> flag now becomes compulsory? Or vice versa? How is the application
> supposed to deal with this dynamically?
>
> I suggested a version number for this right back at the start of
> this discussion and got told that we don't want versioned interfaces
> because we should make the effort to get it right the first time.
> I don't think this can be called "getting it right".

Look at ext2/3/4. They do it that way and it works well. No
versioning just compatible and incompatible flags... The proposal is
to do the same here.

>> For example there is no reason why FIEMAP_HSM_READ needs to be
>> compulsory. Most filesystems do not support HSM so can safely ignore
>> it.
>
> They might be able to safely ignore it, but in reality it should
> be saying "I don't understand this". If the application *needs* to
> use a flag like this, then it should be told that the filesystem is
> not capable of doing what it was asked!

That is where you are completely wrong! (-: Or rather you are wrong
for my example, i.e. you are wrong/right depending on the type of
flag in question. HSM_READ is definitely _NOT_ required because all
it means is "if the file is OFFLINE, bring it ONLINE and then return
the extent map". Clearly all file systems that do not support HSM
can 100% ignore this flag as all files will ALWAYS be ONLINE so they
will return the correct data ALWAYS so no need to do anything for
HSM_READ.

> OTOH if the application does not need to use the flag, then it
> shouldn't be using it and we shouldn't be silently ignoring
> incorrect usage of the provided API.
>
> What you are effectively saying about these "voluntary" flags
> is that their behaviour is _undefined_. That is, if you use
> these flags what you get on a successful call is undefined;
> it may or may not contain what you asked for but you can't
> tell if it really did what you want or returned the information
> you asked for.
>
> This is a really bad semantic to encode into an API.

That is your opinion. There is nothing undefined in the API at all.
You just fail to understand it...

>> And vice versa, an application might specify some weird and funky yet
>> to be developed feature that it expects the FS to perform and if the
>> FS cannot do it (either because it does not support it or because it
>> failed to perform the operation) the application expects the FS to
>> return an error and not to ignore the flag. An example could be the
>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
>> ignores it it will return the extent map for the file data instead of
>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>> this is definitely a compulsory flag if I ever saw one.
>
> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> we don't need a flag defined in the user visible API to tell us
> that we need to return an error here.

Heh? What are you talking about? You need a flag to specify that you
want XATTR_FORK. If not how the hell does the application specify
that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
of the opinion that FIEMAP should definitely not support XATTR_FORK?
If the latter I fully agree. This should be a separate API with
named streams and the FD of the named stream should be passed to
FIEMAP without the silly XATTR_FORK flag...

>> So as you see you must support both voluntary and compulsory flags...
>
> No, you've managed to convince me that they are not necessary and
> they are in fact a Bad Idea... ;)

We agree to disagree then. I think they are a very Good Idea(TM). (-;

>> Also consider what I said above about different kernels. A new
>> feature is implemented in kernel 2.8.13 say that was not there before
>> and an application is updated to use that feature. There will be
>> lots of instances where that application will still be run on older
>> kernels where this feature does not exist.
>
> This is *exactly* where silently ignoring flags really falls down.

It does not!

> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
> something and it returns different structure contents for the same

No it does not. You do NOT understand at all what we are talking
about do you?!?

If a flag would do something weird like returning different data then
OBVIOUSLY you would make this a mandatory flag and it will NOT be
ignored!

You should know better than arguing with fallacies. Seriously...

> state. Now how does the application writer know which is correct or
> how to tell the difference? They have to guess or write detection
> code which is exactly what we want to avoid.

No they don't. It is then a compulsory flag so your argument is
totally moot.

> I objected to the UNKNOWN flag because it wasn't explicit
> in its meaning - I'm doing the same thing here. An interface
> needs to be explicitly defined and should not have any undefined
> behaviour in it....

That is exactly the point. It is explicitly defined and has NO
undefined behaviour in it. (-:

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-05-02 09:48:12

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 2 May 2007, at 10:15, David Chinner wrote:
> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
>> And all applications will run against a multitude of
>> kernels. So version X of the application will run on kernel 2.4.*,
>> 2.6.*, a.b.*, etc... For future expandability of the interface I
>> think it is important to have both compulsory and non-compulsory
>> flags.
>
> Ah, so that's what you want - a mutable interface. i.e. versioning.
>
> So how do compulsory flags help here?

A concrete example:

Let's say that the FIEMAP interface goes live as is, without any flags
at all, with just the bits defined as "these are optional and those are
compulsory".

Then the next kernel adds support for optional flag HSM_READ and
compulsory flag XATTR_READ.

FSes that do not support XATTR_READ will return -ENOTSUP as they cannot
return the wanted data.

FSes that do not support HSM_READ will still return the correct data in
the majority of cases. The exception is when the FS supports HSM and the
data is actually OFFLINE, which the application needs to be able to cope
with anyway in case the FS fails to bring the file ONLINE even though it
supports the HSM_READ flag, so handling this case adds no complexity.
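
The optional/compulsory split described above could be sketched roughly
as follows. This is only an illustration: the mask, the bit values and
the names prefixed with EX_/ex_ are assumptions made up for this
example, and the classification of HSM_READ as optional follows this
message only.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical flag layout (not taken from the thread): the top 8 bits
 * of the flags word are "compulsory", everything else is "optional".
 */
#define EX_FIEMAP_COMPULSORY_MASK 0xff000000u
#define EX_FIEMAP_HSM_READ        0x00000001u /* optional in this example */
#define EX_FIEMAP_XATTR_READ      0x01000000u /* compulsory in this example */

/*
 * Check the flags an application requested against the flags a
 * filesystem supports. An unsupported compulsory flag is an error;
 * an unsupported optional flag is silently ignored.
 */
static int ex_fiemap_check_flags(uint32_t requested, uint32_t supported)
{
    uint32_t unknown = requested & ~supported;

    if (unknown & EX_FIEMAP_COMPULSORY_MASK)
        return -ENOTSUP;
    return 0;
}
```

With this layout a filesystem that knows neither flag would accept a
request carrying EX_FIEMAP_HSM_READ and reject one carrying
EX_FIEMAP_XATTR_READ, without either side needing a version number.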

> What happens if a voluntary flag now becomes compulsory? Or vice
> versa? How is the application supposed to deal with this dynamically?

Forgot to answer this bit: This cannot happen. There cannot be
flags that move from compulsory to non-compulsory or anything stupid
like that. It would have to be a totally new flag otherwise it
breaks backwards compatibility and hence this interface becomes
useless crap.

> I suggested a version number for this right back at the start of
> this discussion and got told that we don't want versioned interfaces
> because we should make the effort to get it right the first time.
> I don't think this can be called "getting it right".

So all applications end up doing:

if (version X, do blah)
else if (version Y, do blob)
else if (version Z, do foo)
else if (version A, do bar)
else exit(1);

Every time a new version is added? And abort for unknown versions?
Now that is a great interface! Not.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-05-02 09:48:51

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote:
> On a different issue, do you think it would be worth adding an option
> flag like FIEMAP_DONT_RELOCATE or something similar that would be a
> compulsory flag and if set the FS is not allowed to move the file
> around/change the block allocation of the file?

We already have an inode flag in XFS to say this - the defrag
tool checks it and ignores the file if it is set.

> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
> the FS we want to access the actual raw blocks so the FS can make
> sure the data is on block aligned boundaries and if the FS does not
> support this (e.g. ZFS or a compressed or encrypted NTFS file) then
> it can return -ENOTSUP.
>
> Perhaps this is totally the wrong interface and such a "prepare file
> for direct access" API should be a different ioctl() or syscall or
> whatever. It just seems very simple and appropriate to combine it
> here as people who use FIEMAP are at least sometimes going to be
> wanting to access those blocks directly as well and it feels right to
> be able to communicate this to the FS in the same call, kind of like
> an "open intent" of "I want to use the data directly on disk"...

I think this is the wrong interface for this. Sure, use it to get the
mappings (that's what it's for) but what you do with the mappings
after that is not part of FIEMAP....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-02 09:57:49

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On 2 May 2007, at 10:48, David Chinner wrote:
> On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote:
>> On a different issue, do you think it would be worth adding an option
>> flag like FIEMAP_DONT_RELOCATE or something similar that would be a
>> compulsory flag and if set the FS is not allowed to move the file
>> around/change the block allocation of the file?
>
> We already have an inode flag in XFS to say this - the defrag
> tool checks it and ignores the file if it is set.

That is great for XFS but you control the metadata. NTFS, HFS, etc
are cases where we cannot add such a flag because we cannot modify
the metadata format (ok we could in some kludgy manner like storing
an EA with an inode to say "com.linux.ntfs.immutable" or something
but I would rather not if I can avoid it).

>> Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell
>> the FS we want to access the actual raw blocks so the FS can make
>> sure the data is on block aligned boundaries and if the FS does not
>> support this (e.g. ZFS or a compressed or encrypted NTFS file) then
>> it can return -ENOTSUP.
>>
>> Perhaps this is totally the wrong interface and such a "prepare file
>> for direct access" API should be a different ioctl() or syscall or
>> whatever. It just seems very simple and appropriate to combine it
>> here as people who use FIEMAP are at least sometimes going to be
>> wanting to access those blocks directly as well and it feels right to
>> be able to communicate this to the FS in the same call, kind of like
>> an "open intent" of "I want to use the data directly on disk"...
>
> I think this is the wrong interface for this. Sure, use it to get the
> mappings (that's what it's for) but what you do with the mappings
> after that is not part of FIEMAP....

Thanks for the comments. I am not sure it is a good idea either,
just thought it would be worth discussing in case people thought it a
good idea.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-05-02 10:58:12

by David Chinner

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
> On 2 May 2007, at 10:15, David Chinner wrote:
> >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
> >>And all applications will run against a multitude of
> >>kernels. So version X of the application will run on kernel 2.4.*,
> >>2.6.*, a.b.*, etc... For future expandability of the interface I
> >>think it is important to have both compulsory and non-compulsory
> >>flags.
> >
> >Ah, so that's what you want - a mutable interface. i.e. versioning.
> >
> >So how do compulsory flags help here? What happens if a voluntary
> >flag now becomes compulsory? Or vice versa? How is the application
> >supposed to deal with this dynamically?
> >
> >I suggested a version number for this right back at the start of
> >this discussion and got told that we don't want versioned interfaces
> >because we should make the effort to get it right the first time.
> >I don't think this can be called "getting it right".
>
> Look at ext2/3/4. They do it that way and it works well. No
> versioning just compatible and incompatible flags... The proposal is
> to do the same here.

Just because it works for extN doesn't make it right for this
interface.

> >>For example there is no reason why FIEMAP_HSM_READ needs to be
> >>compulsory. Most filesystems do not support HSM so can safely ignore
> >>it.
> >
> >They might be able to safely ignore it, but in reality it should
> >be saying "I don't understand this". If the application *needs* to
> >use a flag like this, then it should be told that the filesystem is
> >not capable of doing what it was asked!
>
> That is where you are completely wrong! (-: Or rather you are wrong
> for my example, i.e. you are wrong/right depending on the type of
> flag in question.

And that is the crux of the argument.

My point is that *any* flag returns an error if the filesystem
does not support it.

> HSM_READ is definitely _NOT_ required because all
> it means is "if the file is OFFLINE, bring it ONLINE and then return
> the extent map".

You've got the definition of HSM_READ wrong. If the flag is *not*
set, then we bring everything back online and return the full extent
map.

Specifying the flag indicates that we do *not* want the offline
extents brought back online. i.e. it is a HSM or a datamover
(e.g. backup program) that is querying the extents and we want to
know *exactly* what the current state of the file is right now.

So, if the HSM_READ flag is set, then the application is
expecting the filesystem to be part of a HSM. Hence if it's not,
it should return an error because somebody has done something wrong.
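
As a rough sketch of the HSM_READ semantics described above (all names
here are hypothetical and exist only for this illustration; the choice
of -EOPNOTSUPP as the error is also an assumption):

```c
#include <errno.h>
#include <stdbool.h>

/* What the filesystem does before returning the extent map. */
enum ex_hsm_action {
    EX_HSM_RECALL_THEN_MAP,   /* bring offline extents online first */
    EX_HSM_MAP_CURRENT_STATE, /* report extents exactly as they are now */
};

/*
 * Decide how to handle a mapping request. Without HSM_READ, offline
 * data is recalled (a no-op on a filesystem without HSM). With
 * HSM_READ, the caller wants the current state untouched, which only
 * makes sense on a filesystem that is part of a HSM.
 */
static int ex_hsm_read_policy(bool fs_has_hsm, bool hsm_read_set,
                              enum ex_hsm_action *action)
{
    if (!hsm_read_set) {
        *action = EX_HSM_RECALL_THEN_MAP;
        return 0;
    }
    if (!fs_has_hsm)
        return -EOPNOTSUPP; /* flag used on a non-HSM filesystem */
    *action = EX_HSM_MAP_CURRENT_STATE;
    return 0;
}
```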

> >OTOH if the application does not need to use the flag, then it
> >shouldn't be using it and we shouldn't be silently ignoring
> >incorrect usage of the provided API.
> >
> >What you are effectively saying about these "voluntary" flags
> >is that their behaviour is _undefined_. That is, if you use
> >these flags what you get on a successful call is undefined;
> >it may or may not contain what you asked for but you can't
> >tell if it really did what you want or returned the information
> >you asked for.
> >
> >This is a really bad semantic to encode into an API.
>
> That is your opinion. There is nothing undefined in the API at all.
> You just fail to understand it...

FIEMAP returned success. Did it do what I asked? I don't
know because it's allowed to return success when it ignored me.

This is as silly an interface definition as saying you can
implement fsync() with { return 0; }. So, when fsync() succeeded
did it write my data to disk? I don't know; it's allowed to return
success when it ignored me.

It's crazy, isn't it? It makes writing applications portable
across operating systems a real PITA (ask the MySQL folk ;)
because POSIX really does allow fsync() to be implemented like this.

I use this example because the "allow some filesystems to silently
ignore flags they don't understand" is a portability problem for
applications - rather than a cross-OS issue it is a cross-filesystem
issue. That is, if different filesystems behave differently to
the same request they will have to be handled specifically by
the application. Every filesystem should behave in *exactly* the
same way to the FIEMAP ioctls - if they don't support something
they throw an error, if they do then they return the correct
data.

> >>And vice versa, an application might specify some weird and funky yet
> >>to be developed feature that it expects the FS to perform and if the
> >>FS cannot do it (either because it does not support it or because it
> >>failed to perform the operation) the application expects the FS to
> >>return an error and not to ignore the flag. An example could be the
> >>asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS
> >>ignores it it will return the extent map for the file data instead of
> >>the XATTR_FORK! Not what the application wanted at all. Ouch! So
> >>this is definitely a compulsory flag if I ever saw one.
> >
> >Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
> >we don't need a flag defined in the user visible API to tell us
> >that we need to return an error here.
>
> Heh? What are you talking about? You need a flag to specify that you
> want XATTR_FORK. If not how the hell does the application specify
> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
> of the opinion that FIEMAP should definitely not support XATTR_FORK.
> If the latter I fully agree. This should be a separate API with
> named streams and the FD of the named stream should be passed to
> FIEMAP without the silly XATTR_FORK flag...

Ummmm - I think you misunderstood what I was saying. I was agreeing
with you that if a FS does not support FIEMAP_XATTR_FORK "the correct
answer is -EOPNOTSUPP or -EINVAL".

What I was saying is that we don't need a COMPAT flag bit to tell
us the obvious error return if the filesystem does not support this
functionality....

> >>Also consider what I said above about different kernels. A new
> >>feature is implemented in kernel 2.8.13 say that was not there before
> >>and an application is updated to use that feature. There will be
> >>lots of instances where that application will still be run on older
> >>kernels where this feature does not exist.
> >
> >This is *exactly* where silently ignoring flags really falls down.
>
> It does not!
>
> >On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
> >something and it returns different structure contents for the same
>
> No it does not. You do NOT understand at all what we are talking
> about do you?!?
>
> If a flag would do something weird like returning different data then
> OBVIOUSLY you would make this a mandatory flag and it will NOT be
> ignored!

You've just successfully argued my case for me.

By your reasoning, if we have voluntary flags 1, 2 and 3 and
filesystems A, B and C and filesystem A is the only filesystem to
implement 1, when B implements 1 it must become a compulsory flag
and hence C must now return an error despite being unchanged.

Likewise when C implements 3, 3 must become a compulsory flag and
A and B must now return an error despite being unchanged.

IOWs, whenever *any* filesystem implements a voluntary feature that
it didn't previously support, we have to make that a mandatory
feature and all other filesystems that don't support it now
must return an error. You're guaranteeing the application sees
changes in behaviour with this interface, not preventing them.

Can we simply mandate that filesystems return an error
to commands they don't support or don't understand and
drop this silly interface mutation thing?
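
A minimal sketch of that rule, for comparison with the optional and
compulsory split (the function name is hypothetical, and -EOPNOTSUPP is
picked here only because it is one of the two error codes mentioned
earlier in this message):

```c
#include <errno.h>
#include <stdint.h>

/*
 * Strict flag checking: any requested flag the filesystem does not
 * support is an error, so a successful call always means every flag
 * was honoured.
 */
static int ex_fiemap_check_flags_strict(uint32_t requested,
                                        uint32_t supported)
{
    if (requested & ~supported)
        return -EOPNOTSUPP;
    return 0;
}
```

Under this rule there is no class of flags that can be ignored, so the
application never has to guess whether a flag took effect.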

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-02 11:17:32

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation


On 2 May 2007, at 11:57, David Chinner wrote:

> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
>> On 2 May 2007, at 10:15, David Chinner wrote:
>>> On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote:
>>>> And all applications will run against a multitude of
>>>> kernels. So version X of the application will run on kernel 2.4.*,
>>>> 2.6.*, a.b.*, etc... For future expandability of the interface I
>>>> think it is important to have both compulsory and non-compulsory
>>>> flags.
>>>
>>> Ah, so that's what you want - a mutable interface. i.e. versioning.
>>>
>>> So how do compulsory flags help here? What happens if a voluntary
>>> flag now becomes compulsory? Or vice versa? How is the application
>>> supposed to deal with this dynamically?
>>>
>>> I suggested a version number for this right back at the start of
>>> this discussion and got told that we don't want versioned interfaces
>>> because we should make the effort to get it right the first time.
>>> I don't think this can be called "getting it right".
>>
>> Look at ext2/3/4. They do it that way and it works well. No
>> versioning just compatible and incompatible flags... The proposal is
>> to do the same here.
>
> Just because it works for extN doesn't make it right for this
> interface.
>
>>>> For example there is no reason why FIEMAP_HSM_READ needs to be
>>>> compulsory. Most filesystems do not support HSM so can safely
>>>> ignore
>>>> it.
>>>
>>> They might be able to safely ignore it, but in reality it should
>>> be saying "I don't understand this". If the application *needs* to
>>> use a flag like this, then it should be told that the filesystem is
>>> not capable of doing what it was asked!
>>
>> That is where you are completely wrong! (-: Or rather you are wrong
>> for my example, i.e. you are wrong/right depending on the type of
>> flag in question.
>
> And that is the crux of the argument.
>
> My point is that *any* flag returns an error if the filesystem
> does not support it.

Yes and my point is that it should not do so as there are flags where
it is not necessary.

>> HSM_READ is definitely _NOT_ required because all
>> it means is "if the file is OFFLINE, bring it ONLINE and then return
>> the extent map".
>
> You've got the definition of HSM_READ wrong. If the flag is *not*
> set, then we bring everything back online and return the full extent
> map.

Ah, sorry, I did indeed misunderstand what it was meant to mean.

>>> OTOH if the application does not need to use the flag, then it
>>> shouldn't be using it and we shouldn't be silently ignoring
>>> incorrect usage of the provided API.
>>>
>>> What you are effectively saying about these "voluntary" flags
>>> is that their behaviour is _undefined_. That is, if you use
>>> these flags what you get on a successful call is undefined;
>>> it may or may not contain what you asked for but you can't
>>> tell if it really did what you want or returned the information
>>> you asked for.
>>>
>>> This is a really bad semantic to encode into an API.
>>
>> That is your opinion. There is nothing undefined in the API at all.
>> You just fail to understand it...
>
> FIEMAP returned success. Did it do what I asked? I don't
> know because it's allowed to return success when it ignored me.

So what?

> This is as silly an interface definition as saying you can
> implement fsync() with { return 0; }. So, when fsync() succeeded
> did it write my data to disk? I don't know; it's allowed to return
> success when it ignored me.

No it is not silly at all. There can be flags that fail but still
the operation is a success.

Example from an admittedly unrelated area: when truncating a file to
a smaller size, if the freeing of the allocated blocks fails it does not
cause the truncate to fail, it just means some space is wasted/marked
used when it is unused on the volume and running fsck fixes this. At
least that is how I have implemented it for NTFS and I think this is
the most sensible way to do it. The user does not care if some
blocks could not be freed. All they care about is that the file is
now truncated. The volume is then marked dirty thus running fsck/
chkdsk will reclaim the lost space.

> It's crazy, isn't it? It makes writing applications portable
> across operating systems a real PITA (ask the MySQL folk ;)
> because POSIX really does allow fsync() to be implemented like this.
>
> I use this example because the "allow some filesystems to silently
> ignore flags they don't understand" is a portability problem for
> applications - rather than a cross-OS issue it is a cross-filesystem
> issue. That is, if different filesystems behave differently to
> the same request they will have to be handled specifically by
> the application. Every filesystem should behave in *exactly* the
> same way to the FIEMAP ioctls - if they don't support something
> they throw an error, if they do then they return the correct
> data.

It is only a problem if you do not choose wisely which flags may be
ignored silently...

>>>> And vice versa, an application might specify some weird and
>>>> funky yet
>>>> to be developed feature that it expects the FS to perform and if
>>>> the
>>>> FS cannot do it (either because it does not support it or
>>>> because it
>>>> failed to perform the operation) the application expects the FS to
>>>> return an error and not to ignore the flag. An example could be
>>>> the
>>>> asked for FIEMAP_XATTR_FORK flag. If that is implemented, and
>>>> the FS
>>>> ignores it it will return the extent map for the file data
>>>> instead of
>>>> the XATTR_FORK! Not what the application wanted at all. Ouch! So
>>>> this is definitely a compulsory flag if I ever saw one.
>>>
>>> Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But
>>> we don't need a flag defined in the user visible API to tell us
>>> that we need to return an error here.
>>
>> Heh? What are you talking about? You need a flag to specify that you
>> want XATTR_FORK. If not how the hell does the application specify
>> that it wants XATTR_FORK instead of DATA_FORK (default)? Or are you
>> of the opinion that FIEMAP should definitely not support XATTR_FORK.
>> If the latter I fully agree. This should be a separate API with
>> named streams and the FD of the named stream should be passed to
>> FIEMAP without the silly XATTR_FORK flag...
>
> Ummmm - I think you misunderstood what I was saying. I was agreeing
> with you that if a FS does not support FIEMAP_XATTR_FORK "the correct
> answer is -EOPNOTSUPP or -EINVAL".
>
> What I was saying is that we don't need a COMPAT flag bit to tell
> us the obvious error return if the filesystem does not support this
> functionality....

But there is no COMPAT bit. I don't understand what you are saying...

>>>> Also consider what I said above about different kernels. A new
>>>> feature is implemented in kernel 2.8.13 say that was not there
>>>> before
>>>> and an application is updated to use that feature. There will be
>>>> lots of instances where that application will still be run on older
>>>> kernels where this feature does not exist.
>>>
>>> This is *exactly* where silently ignoring flags really falls down.
>>
>> It does not!
>>
>>> On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does
>>> something and it returns different structure contents for the same
>>
>> No it does not. You do NOT understand at all what we are talking
>> about do you?!?
>>
>> If a flag would do something weird like returning different data then
>> OBVIOUSLY you would make this a mandatory flag and it will NOT be
>> ignored!
>
> You've just successfully argued my case for me.

No I have not at all.

> By your reasoning, if we have voluntary flags 1, 2 and 3 and
> filesystems A, B and C and filesystem A is the only filesystem to
> implement 1, when B implements 1 it must become a compulsory flag

WHY? It does not at all. Flags CANNOT move from voluntary to
compulsory. Read my argument again...

> and hence C must now return an error despite being unchanged.

Nope.

> Likewise when C implements 3, 3 must become a compulsory flag and
> A and B must now return an error despite being unchanged.

Again no.

> IOWs, whenever *any* filesystem implements a voluntary feature that
> it didn't previously support, we have to make that a mandatory
> feature and all other filesystems that don't support it now

This is total crap.

> must return an error. You're guaranteeing the application sees
> changes in behaviour with this interface, not preventing.
>
> Can we simply mandate that filesystems return an error
> to commands they don't support or don't understand and
> drop this silly interface mutation thing?

Can we simply not and drop this silly argument?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



2007-05-03 07:49:11

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On May 02, 2007 20:57 +1000, David Chinner wrote:
> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
> > HSM_READ is definitely _NOT_ required because all
> > it means is "if the file is OFFLINE, bring it ONLINE and then return
> > the extent map".
>
> You've got the definition of HSM_READ wrong. If the flag is *not*
> set, then we bring everything back online and return the full extent
> map.
>
> Specifying the flag indicates that we do *not* want the offline
> extents brought back online. i.e. it is a HSM or a datamover
> (e.g. backup program) that is querying the extents and we want to
> know *exactly* what the current state of the file is right now.
>
> So, if the HSM_READ flag is set, then the application is
> expecting the filesystem to be part of a HSM. Hence if it's not,
> it should return an error because somebody has done something wrong.

In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the
HSM_READ flag is set. That's why the flag is called "HSM_READ" instead
of "HSM_NO_READ".

The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag. It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise). It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.



For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface? I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like "where does the HSM
file live" and "how many copies are there".

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.

I don't think it makes the operation significantly more efficient than
say "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually
needs the data to be present instead of just returning mapping info that
includes "UNMAPPED".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-05-03 08:24:22

by Anton Altaparmakov

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation


On 3 May 2007, at 08:49, Andreas Dilger wrote:

> On May 02, 2007 20:57 +1000, David Chinner wrote:
>> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
>>> HSM_READ is definitely _NOT_ required because all
>>> it means is "if the file is OFFLINE, bring it ONLINE and then return
>>> the extent map".
>>
>> You've got the definition of HSM_READ wrong. If the flag is *not*
>> set, then we bring everything back online and return the full extent
>> map.
>>
>> Specifying the flag indicates that we do *not* want the offline
>> extents brought back online. i.e. it is a HSM or a datamover
>> (e.g. backup program) that is querying the extents and we want to
>> know *exactly* what the current state of the file is right now.
>>
>> So, if the HSM_READ flag is set, then the application is
>> expecting the filesystem to be part of a HSM. Hence if it's not,
>> it should return an error because somebody has done something wrong.
>
> In my original proposal I specifically pointed out that the
> FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
> BMV_IF_NO_DMAPI_READ flag. Data is retrieved from HSM only if the
> HSM_READ flag is set. That's why the flag is called "HSM_READ" instead
> of "HSM_NO_READ".

Cool. I did not misunderstand after all then. (-:

> The reason is that it seems bad if the default behaviour for calling
> ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
> only disabled by specifying a flag. It makes a lot more sense to just
> leave the data as it is and return the extent mapping by default (i.e.
> this is the principle of least surprise). It would probably be equally
> surprising and undesirable if the default behaviour was to force all
> data out to HSM.
>
> For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
> even be a part of this interface? I have no problem with returning a
> flag that reports if the data is migrated to HSM and whether it is UNMAPPED.
>
> Having FIEMAP force the retrieval of data from HSM strikes me as something
> that should be a part of a separate HSM interface, which also needs to be
> able to do things like push specific files or parts thereof out to HSM,
> set the aging policy, and return information like "where does the HSM
> file live" and "how many copies are there".

That would seem sensible to me also. Just like David argued that
causing the data to be in a fixed location should be a separate
interface rather than part of FIEMAP so by analogy the same should
apply to touching HSM.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-10-29 19:45:10

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4. There was an ext4 patch posted on August 29
to linux-ext4 entitled "[PATCH] FIEMAP ioctl". I've asked Kalpak to post
an updated version of that patch along with the changes to the "filefrag"
tool to use FIEMAP.

======================== FIEMAP_1.0.txt ==================================

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be "filefrag [options]
{filename ...}" and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

..........
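
The one-line summary format above can be sketched in C as follows. This is only an illustration of the output shape described in the requirements; the helper name filefrag_format_summary() is hypothetical and is not taken from the filefrag sources.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format the per-file summary line described above,
 * "<path>: extents=<n> optimal=<m>", into buf.  Returns the value of
 * snprintf(), i.e. the number of characters the full line needs.
 */
static int filefrag_format_summary(char *buf, size_t size, const char *path,
                                   unsigned long long extents,
                                   unsigned long long optimal)
{
        return snprintf(buf, size, "%s: extents=%llu optimal=%llu",
                        path, extents, optimal);
}
```

A caller would pass the extent count obtained from the kernel and its own estimate of the optimal count, then print the buffer followed by a newline.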

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
ioctl block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
        __u64 fe_offset; /* offset in bytes for the start of the extent */
        __u64 fe_length; /* length in bytes for the extent */
        __u32 fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32 fe_lun;    /* logical device number for extent (starting at 0) */
};



struct fiemap {
        __u64 fm_start;        /* logical byte offset (in/out) */
        __u64 fm_length;       /* logical length of map (in/out) */
        __u32 fm_flags;        /* FIEMAP_FLAG_* flags (in/out) */
        __u32 fm_extent_count; /* extents in fm_extents (in/out) */
        __u64 fm_unused;

        struct fiemap_extent fm_extents[0];
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from FIEMAP_FLAG_* array, if needed};

ioctl(fd, FIEMAP, &fiemap);
{verify fiemap flags are understood}

for (i = 0; i < fiemap.fm_extent_count; i++) {
        {process extent fiemap.fm_extents[i]};
}
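
Because fm_extents is a variable-length trailing array, the request buffer has to be sized for the number of extents the caller wants back. The following is a minimal userspace sketch of that allocation, assuming the draft structure layout given in this spec (redeclared here with fixed-width types so the block stands alone). The helper name fiemap_request_alloc() is hypothetical, and the ioctl call itself is left out since the FIEMAP request number is not defined in this document.

```c
#include <stdint.h>
#include <stdlib.h>

/* Draft layouts copied from the spec above, using <stdint.h> types. */
struct fiemap_extent {
        uint64_t fe_offset;
        uint64_t fe_length;
        uint32_t fe_flags;
        uint32_t fe_lun;
};

struct fiemap {
        uint64_t fm_start;
        uint64_t fm_length;
        uint32_t fm_flags;
        uint32_t fm_extent_count;
        uint64_t fm_unused;
        struct fiemap_extent fm_extents[];
};

/*
 * Hypothetical helper: allocate a zeroed whole-file request with room
 * for extent_count returned extents.  The caller frees it with free().
 */
static struct fiemap *fiemap_request_alloc(uint32_t extent_count,
                                           uint32_t flags)
{
        size_t bytes = sizeof(struct fiemap) +
                       (size_t)extent_count * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, bytes);

        if (fm == NULL)
                return NULL;
        fm->fm_start = 0;        /* start of file */
        fm->fm_length = ~0ULL;   /* whole file */
        fm->fm_flags = flags;
        fm->fm_extent_count = extent_count;
        return fm;
}
```

The returned pointer is what would be handed to the ioctl in place of &fiemap in the pseudocode above.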


The logic for the filefrag would be similar to above. The size of the
extent array will be extrapolated from the filesize and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x01000000 to indicate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ 0x00000002 /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x00000004 /* return only number of extents */
#define FIEMAP_FLAG_INCOMPAT 0xff000000 /* error for unknown flags in here */
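
The INCOMPAT rule described above can be sketched as a small check: flag bits the filesystem does not implement are ignored unless they fall inside the FIEMAP_FLAG_INCOMPAT mask, in which case the request fails. This is an illustration only. The helper name fiemap_check_flags() and its "supported" parameter are hypothetical, and -EINVAL is chosen arbitrarily from the two error codes mentioned in this thread.

```c
#include <errno.h>
#include <stdint.h>

#define FIEMAP_FLAG_SYNC        0x00000001u /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ    0x00000002u /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x00000004u /* return only number of extents */
#define FIEMAP_FLAG_INCOMPAT    0xff000000u /* error for unknown flags in here */

/*
 * Hypothetical helper: "supported" is the set of flag bits a given
 * filesystem implements.  Returns 0 if the request may proceed, or
 * -EINVAL if it carries an unimplemented flag from the INCOMPAT range.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
        uint32_t unknown = requested & ~supported;

        if (unknown & FIEMAP_FLAG_INCOMPAT)
                return -EINVAL;
        return 0; /* unknown bits outside the INCOMPAT range are ignored */
}
```

With this split, an application can tell from the bit position alone whether an unimplemented flag will be rejected or silently skipped.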

The returned data from the FIEMAP ioctl is an array of fiemap_extent
elements, one per extent in the file. The first extent will contain the
byte specified by fm_start and the last extent will contain the byte
specified by fm_start + fm_len, unless there are more than the passed-in
fm_extent_count extents in the file, or this is beyond the EOF in which
case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent
returned has a set of flags associated with it that provide additional
information about the extent. Not all filesystems will support all flags.

FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
the file. It will be used by default for filefrag since the specific
extent information is not required in many cases.

#define FIEMAP_EXTENT_HOLE 0x00000001 /* has no data or space allocation */
#define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* space allocated, but no data */
#define FIEMAP_EXTENT_UNMAPPED 0x00000004 /* has data but no space allocated */
#define FIEMAP_EXTENT_ERROR 0x00000008 /* map error, errno in fe_offset. */
#define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* cannot access data directly */
#define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
#define FIEMAP_EXTENT_DELALLOC 0x00000040 /* has data but not yet written */
#define FIEMAP_EXTENT_SECONDARY 0x00000080 /* data in secondary storage */
#define FIEMAP_EXTENT_EOF 0x00000100 /* fm_start + fm_len beyond EOF */
#define FIEMAP_EXTENT_UNKNOWN 0x00000200 /* in use but location is unknown */


FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
encrypted, compressed, etc.)

FIEMAP_EXTENT_ERROR and FIEMAP_EXTENT_DELALLOC flags should always
be returned with FIEMAP_EXTENT_UNMAPPED also set. So some flags are a
superset of other flags. FIEMAP_EXTENT_SECONDARY may optionally include
FIEMAP_EXTENT_UNMAPPED.
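
The superset rule above (ERROR and DELALLOC always imply UNMAPPED) lends itself to a simple consistency check that a test program could run over returned extents. A minimal sketch follows; the function name fiemap_extent_flags_consistent() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIEMAP_EXTENT_UNMAPPED 0x00000004u /* has data but no space allocated */
#define FIEMAP_EXTENT_ERROR    0x00000008u /* map error, errno in fe_offset */
#define FIEMAP_EXTENT_DELALLOC 0x00000040u /* has data but not yet written */

/*
 * Hypothetical checker: returns false if an extent carries ERROR or
 * DELALLOC without UNMAPPED, which the text above says must not happen.
 * SECONDARY is not checked because UNMAPPED is optional for it.
 */
static bool fiemap_extent_flags_consistent(uint32_t fe_flags)
{
        uint32_t implies_unmapped = FIEMAP_EXTENT_ERROR |
                                    FIEMAP_EXTENT_DELALLOC;

        if ((fe_flags & implies_unmapped) &&
            !(fe_flags & FIEMAP_EXTENT_UNMAPPED))
                return false;
        return true;
}
```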

Inside ext4, this can be implemented for extent-mapped files by
calling something similar to the existing ext4_ext_ioctl() for
EXT4_IOC_GET_EXTENTS but with a different callback function. Or the
ext4_fiemap() function can be called directly from the ioctl code if
the latest extents patches do not have ext4_ext_ioctl().

3 Use cases

1) Files containing holes including an all-hole file.

2) File having an extent which is not yet allocated.

3) Proper working with fm_start + fm_len beyond EOF.

4) Test proper reporting of preallocated extents.

5) Have non-zero fm_start and non-~0ULL fm_end. This can be
tested by having fm_count = 1 and forcing many ioctls.

6) If there is an error mapping an in-between extent then the
later extents should be returned.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-10-29 20:58:05

by Mark Fasheh

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Hi Andreas,

Thanks for posting this. I believe that an interface such as FIEMAP
would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

My comments below are generally geared towards understanding the ioctl
interface.

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:

> 2 Functional specification
>
> The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> ioctl block device ioctl used for mapping an individual logical block
> address in a file to a physical block address in the block device. The
> FIEMAP ioctl will return the logical to physical mapping for the extent
> that contains the specified logical byte address.
>
> struct fiemap_extent {
> __u64 fe_offset;/* offset in bytes for the start of the extent */

I'm a little bit confused by fe_offset. Is it a physical offset, or a
logical offset? The reason I ask is that your description above says "FIEMAP
ioctl will return the logical to physical mapping for the extent that
contains the specified logical byte address." Which seems to imply physical,
but your math to get to the next logical start in a very fragmented file,
implies that fe_offset is a logical offset:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;


> The logic for the filefrag would be similar to above. The size of the
> extent array will be extrapolated from the filesize and multiple ioctls
> of increasing extent count may be called for very large files. filefrag
> can easily call the FIEMAP ioctls repeatedly using the end of the last
> extent as the start offset for the next ioctl:
>
> fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> fm_extents[fm_extent_count - 1].fe_length + 1;
>
> We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
> will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

I think you meant 'fm_length' instead of 'fm_end' there.


> The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
> given then the fm_extents array is not filled, and only fm_extent_count is
> returned with the total number of extents in the file. Any new flags that
> introduce and/or require an incompatible behaviour in an application or
> in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
> (e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
> range if they were not part of the original specification). This is
> currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
> is not large enough then it is possible to use the last INCOMPAT flag
> 0x01000000 to indicate that more of the flag range contains incompatible
> flags.
>
> #define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */
> #define FIEMAP_FLAG_HSM_READ 0x00000002 /* get data from HSM before map */
> #define FIEMAP_FLAG_NUM_EXTENTS 0x00000004 /* return only number of extents */
> #define FIEMAP_FLAG_INCOMPAT 0xff000000 /* error for unknown flags in here */
>
> The returned data from the FIEMAP ioctl is an array of fiemap_extent
> elements, one per extent in the file. The first extent will contain the
> byte specified by fm_start and the last extent will contain the byte
> specified by fm_start + fm_len, unless there are more than the passed-in
> fm_extent_count extents in the file, or this is beyond the EOF in which
> case the last extent will be marked with FIEMAP_EXTENT_LAST. Each extent
> returned has a set of flags associated with it that provide additional
> information about the extent. Not all filesystems will support all flags.
>
> FIEMAP_FLAG_NUM_EXTENTS will return only the number of extents used by
> the file. It will be used by default for filefrag since the specific
> extent information is not required in many cases.
>
> #define FIEMAP_EXTENT_HOLE 0x00000001 /* has no data or space allocation */

Btw, I really like that holes are explicitly marked.


> #define FIEMAP_EXTENT_UNWRITTEN 0x00000002 /* space allocated, but no data */
> #define FIEMAP_EXTENT_UNMAPPED 0x00000004 /* has data but no space allocated */
> #define FIEMAP_EXTENT_ERROR 0x00000008 /* map error, errno in fe_offset. */
> #define FIEMAP_EXTENT_NO_DIRECT 0x00000010 /* cannot access data directly */
> #define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
> #define FIEMAP_EXTENT_DELALLOC 0x00000040 /* has data but not yet written */
> #define FIEMAP_EXTENT_SECONDARY 0x00000080 /* data in secondary storage */
> #define FIEMAP_EXTENT_EOF 0x00000100 /* fm_start + fm_len beyond EOF */

Is "EOF" here considering "beyond i_size" or "beyond allocation"?


> #define FIEMAP_EXTENT_UNKNOWN 0x00000200 /* in use but location is unknown */
>
>
> FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> encrypted, compressed, etc.)

Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
blocks.

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle

2007-10-29 22:13:02

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> Thanks for posting this. I believe that an interface such as FIEMAP
> would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

> On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> > The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> > ioctl block device ioctl used for mapping an individual logical block
> > address in a file to a physical block address in the block device. The
> > FIEMAP ioctl will return the logical to physical mapping for the extent
> > that contains the specified logical byte address.
> >
> > struct fiemap_extent {
> > __u64 fe_offset;/* offset in bytes for the start of the extent */
>
> I'm a little bit confused by fe_offset. Is it a physical offset, or a
> logical offset? The reason I ask is that your description above says "FIEMAP
> ioctl will return the logical to physical mapping for the extent that
> contains the specified logical byte address." Which seems to imply physical,
> but your math to get to the next logical start in a very fragmented file,
> implies that fe_offset is a logical offset:
>
> fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> fm_extents[fm_extent_count - 1].fe_length + 1;

Note the distinction between "fe_offset" (which is a physical offset for
a single extent) and "fm_offset" (which is a logical offset for that file).

> > We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
> > will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
>
> I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

> > #define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
> > #define FIEMAP_EXTENT_EOF 0x00000100 /* fm_start + fm_len beyond EOF*/
>
> Is "EOF" here considering "beyond i_size" or "beyond allocation"?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the
EOF flag may be set on one or more earlier extents.

> > FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> > encrypted, compressed, etc.)
>
> Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
> Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given? One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought. The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2007-10-29 22:25:33

by David Chinner

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> By request on #linuxfs, here is the FIEMAP spec that we used to implement
> the FIEMAP support for ext4. There was an ext4 patch posted on August 29
> to linux-ext4 entitled "[PATCH] FIEMAP ioctl".

Link:

http://marc.info/?l=linux-ext4&m=118838241209683&w=2

That's a very ext4 specific ioctl interface. Can we get this made
generic like the FIBMAP interface so we don't have to replicate all
the copyin/copyout handling and interface definitions everywhere?
i.e. a ->extent_map aops callout to the filesystem in generic code
just like ->bmap?

> I've asked Kalpak to post
> an updated version of that patch along with the changes to the "filefrag"
> tool to use FIEMAP.

Where can I find the test program that validates the implementation?
Also, following the fallocate model, can we get the interface definition
turned into a man page before anything is submitted upstream?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-29 22:29:09

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote:
> On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > I'm a little bit confused by fe_offset. Is it a physical offset, or a
> > logical offset? The reason I ask is that your description above says "FIEMAP
> > ioctl will return the logical to physical mapping for the extent that
> > contains the specified logical byte address." Which seems to imply physical,
> > but your math to get to the next logical start in a very fragmented file,
> > implies that fe_offset is a logical offset:
> >
> > fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> > fm_extents[fm_extent_count - 1].fe_length + 1;
>
> Note the distinction between "fe_offset" (which is a physical offset for
> a single extent) and "fm_offset" (which is a logical offset for that file).

Actually, that is completely bunk. What it should say is something like:
"filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;
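
A minimal sketch of this restart arithmetic, assuming fm_start is a byte offset and fm_length a byte count, so that the range just mapped is [fm_start, fm_start + fm_length). Under that assumption the first byte not yet mapped is fm_start + fm_length, and the extra "+ 1" would step one byte past it. The hypothetical helper fiemap_next_start() below therefore omits the "+ 1", and it saturates instead of wrapping when the length is as large as ~0ULL.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the fm_start and fm_length returned by one
 * call, compute the fm_start to use for the next call.  Assumes the
 * mapped range is half-open, [fm_start, fm_start + fm_length).
 */
static uint64_t fiemap_next_start(uint64_t fm_start, uint64_t fm_length)
{
        /* Saturate rather than wrap around for whole-file lengths. */
        if (fm_length > UINT64_MAX - fm_start)
                return UINT64_MAX;
        return fm_start + fm_length;
}
```

If the interface instead defines the range as inclusive of its end byte, the "+ 1" from the message would be the right form and this helper would need adjusting.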

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-10-29 22:40:11

by Mark Fasheh

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Mon, Oct 29, 2007 at 04:29:07PM -0600, Andreas Dilger wrote:
> On Oct 29, 2007 16:13 -0600, Andreas Dilger wrote:
> > On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > > I'm a little bit confused by fe_offset. Is it a physical offset, or a
> > > logical offset? The reason I ask is that your description above says "FIEMAP
> > > ioctl will return the logical to physical mapping for the extent that
> > > contains the specified logical byte address." Which seems to imply physical,
> > > but your math to get to the next logical start in a very fragmented file,
> > > implies that fe_offset is a logical offset:
> > >
> > > fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> > > fm_extents[fm_extent_count - 1].fe_length + 1;
> >
> > Note the distinction between "fe_offset" (which is a physical offset for
> > a single extent) and "fm_offset" (which is a logical offset for that file).
>
> Actually, that is completely bunk. What it should say is something like:
> "filefrag can easily call the FIEMAP ioctls repeatedly using the returned
> fm_start and fm_length as the start offset for the next ioctl:
>
> fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Yeah - that's where I was going with my question. This is much more clear
now, thanks.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle

2007-10-30 00:11:26

by Mark Fasheh

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
> On Oct 29, 2007 13:57 -0700, Mark Fasheh wrote:
> > Thanks for posting this. I believe that an interface such as FIEMAP
> > would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)
>
> I tried to make it as Lustre-agnostic as possible...

IMHO, your description succeeded at that. I'm hoping that the final patch
can have mostly generic code, like FIBMAP does today.


> > > #define FIEMAP_EXTENT_LAST 0x00000020 /* last extent in the file */
> > > #define FIEMAP_EXTENT_EOF 0x00000100 /* fm_start + fm_len beyond EOF*/
> >
> > Is "EOF" here considering "beyond i_size" or "beyond allocation"?
>
> _EOF == beyond i_size.
> _LAST == last extent in the file.
>
> In most cases FIEMAP_EXTENT_EOF will be set at the same time as
> FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the
> EOF flag may be set on one or more earlier extents.

Oh, ok great - I was primarily looking for a way to say "there's allocation
past i_size" and it looks like we have it.


> > > FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> > > encrypted, compressed, etc.)
> >
> > Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
> > Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> > blocks.
>
> Hmm, but part of the issue would be how to request the extra data, and
> what offset it would be given? One could, for example, use negative
> offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
> or similar, I hadn't given that much thought.

Well, fe_offset and fe_length are already expressed in bytes, so we could
just put the byte offset to where the inline data starts in there. fe_length
is just used as the length allocated for inline-data.

If fe_offset is required to be block aligned, then we could add a field to
express an offset within the block where data would be found - say
'fe_data_start_offset'. In the non-inline case, we could guarantee that
fe_data_start_offset is zero. That way software which doesn't want to care
whether something is inline-data (for example, a backup program) or not
could just blindly add it to fe_offset before looking at the data.

Regardless, I think we also want to explicitly flag this:

#define FIEMAP_EXTENT_DATA_IN_INODE 0x00000400 /* extent data is stored in inode block */
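
To make the proposal concrete, here is a rough sketch of what the extra field could look like. Everything in it is hypothetical: the struct name fiemap_extent_inline, the width chosen for fe_data_start_offset, and the helper fiemap_extent_data_start() are illustrations of the idea above, not part of the posted spec.

```c
#include <stdint.h>

#define FIEMAP_EXTENT_DATA_IN_INODE 0x00000400u /* data stored in inode block */

/* Hypothetical variant of fiemap_extent carrying the proposed field. */
struct fiemap_extent_inline {
        uint64_t fe_offset;            /* block-aligned byte offset */
        uint64_t fe_length;            /* length in bytes */
        uint32_t fe_flags;             /* FIEMAP_EXTENT_* flags */
        uint32_t fe_lun;               /* logical device number */
        uint32_t fe_data_start_offset; /* bytes into the block; 0 if not inline */
};

/*
 * Hypothetical helper: byte address where the data actually begins.
 * Callers that do not care about inline data can use this for every
 * extent, since the added offset is zero in the non-inline case.
 */
static uint64_t fiemap_extent_data_start(const struct fiemap_extent_inline *fe)
{
        return fe->fe_offset + fe->fe_data_start_offset;
}
```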


I'm going to pretend that I completely understand reiserfs tail-packing and
say that my approaches above look like they could work for that case too.
We'd want to add a separate flag for tail packed data though.


> The other issue is that I'd like to get the basics of the API in place
> before it gets too complex. We can always add functionality with more
> FIEMAP_FLAG_* (whether in the INCOMPAT range or not, depending on what is
> being done).

Sure, but I think whatever goes upstream should be able to handle this case
- there are file systems in use _today_ which put data in inode blocks and
pack file tails.

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle

2007-10-30 00:26:01

by Andreas Dilger

Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

On Oct 29, 2007 17:11 -0700, Mark Fasheh wrote:
> On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
> > > Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> > > blocks.
> >
> > Hmm, but part of the issue would be how to request the extra data, and
> > what offset it would be given? One could, for example, use negative
> > offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
> > or similar, I hadn't given that much thought.
>
> Well, fe_offset and fe_length are already expressed in bytes, so we could
> just put the byte offset to where the inline data starts in there. fe_length
> is just used as the length allocated for inline-data.
>
> If fe_offset is required to be block aligned, then we could add a field to
> express an offset within the block where data would be found - say
> 'fe_data_start_offset'. In the non-inline case, we could guarantee that
> fe_data_start_offset is zero. That way software which doesn't want to care
> whether something is inline-data (for example, a backup program) or not
> could just blindly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking. Mapping in-inode data is
just fine using the existing interface. The byte offset of the data is
given, and the "FIEMAP_EXTENT_NO_DIRECT" flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.