by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fiemap, an extent mapping ioctl

On Sep 10, 2008 11:46 -0700, Andrew Morton wrote:
> On Wed, 10 Sep 2008 05:49:34 -0700 Mark Fasheh <[email protected]> wrote:
> > The following patches are the latest attempt at implementing a
> > fiemap ioctl, which can be used by userspace software to get extent
> > information for an inode in an efficient manner.
>
> As these are applicable to all filesystems, Cc:ing only linux-ext4 is
> not sufficient. All filesystem developers (at least) need an
> opportunity to review and understand these changes.

That was an oversight, partly caused because I launched the thread for
this on linux-ext4 after an ext4 concall. All previous patches have
gone to linux-fsdevel and been through numerous discussions.

> > The fiemap ioctl is an efficient method for userspace to get file
> > extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
> > returns a list of extents.
>
> The above is, afacit, the only offered rationale for the addition of
> these new feature. I don't recall seeing anyone complain about bmap()
> inefficiency. In fact I rarely hear of anyone using bmap() at all.
>
> This rationale needs expanding, please. A lot.

There are several reasons for this new API:
- it avoids tools like "filefrag" (which currently use FIBMAP) having to
do an ioctl for every block in a file, have the kernel map that block
from an on-disk extent (in most newer filesystems), then re-assemble the
extents in userspace.
- it works with filesystems that are not block based (e.g. NTFS, btrfs, etc)
that may align file data on boundaries other than $blocksize boundaries
- it provides a much more rich API for finding out about on-disk allocation,
such as whether allocated blocks are unwritten (e.g. fallocate), if they
are packed along with other data, if the data is in the inode, etc.
- it can share existing XFS-specific functionality (which FIEMAP was designed
to provide a superset of functionality for) with other filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-09-10 20:34:35

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH 0/3] Fiemap, an extent mapping ioctl

On Sep 10, 2008 10:11 -0400, Christoph Hellwig wrote:
> On Wed, Sep 10, 2008 at 05:49:34AM -0700, Mark Fasheh wrote:
> > * FIEMAP_FLAG_XATTR
> > If this flag is set, the extents returned will describe the inodes
> > extended attribute lookup tree, instead of it's data tree.
>
> So does this actually make sense for any filesystem but XFS? Still
> seems like a not too useful option for a highlevel generic interface.

You are being mislead by the use of "tree" in the description here.
Both ext3 and ext4 can already have 2 places to store xattr data (in
the inode and in a separate block). It is likely that ext4 will add
the ability to store more xattr data in the future. Would it be better
to just remove the use of "tree" and write:

* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute data, instead of the regular file data.

> > __u32 fe_device; /* device number for extent */
>
> As sayd before please kill thise one. It doesn't make any sense at all
> for any merged or near mainline FS. It'd be an utterly stupid
> lustre-specific hack that still doesn't make much sense.

I think this is short sighted. Lustre is only one of several filesystems
that aggregate multiple underlying devices into a single filesystem.
Others include XFS, NFSv4, ZFS, chunkfs. It would even be possible to
have FIEMAP return the underlying mapping for "normal" filesystems on
MD/LVM devices, but there are already endless roadblocks to the current
basic functionality that I didn't want to get that mixed in as well.

Even for "normal" filesystems that don't map multiple block devices,
having the fe_device field returned with FIEMAP makes sense for
applications like LILO that need it.

Just because we don't have (m)any filesystems using 64-bit inodes does
that mean we shouldn't have a stat() interface that supports it?

It's true that we started with FIEMAP for Lustre, and thought it would
be helpful to Linux as a whole (better than Solaris SEEK_HOLE/SEEK_DATA,
IMHO) and made sure we got input from many of the filesystem developers
during the design, and now it seems you want to rip out everything
that makes it useful to Lustre and any "future" filesystems that aren't
plain-old block-based filesystems.

> > * FIEMAP_EXTENT_NO_BYPASS
> > Direct access to the data in this extent is illegal or will have
> > undefined results.
>
> Huh? What is direct access? Direct access as in bypassing the
> filesystem and writing to the blockdev directly always has undefined
> results.

Sure, but LILO still does it, as does dump(8). There were a bunch of
people commenting on previous FIEMAP discussions that want to do this,
because it is "safe" according to their environment/usage/definition
of safe. Just because you don't need to do it (nor do I, for that matter)
it doesn't mean that other people do not.

> > * FIEMAP_EXTENT_SECONDARY
> > The data for this extent is in secondary storage (e.g. HSM). If the
> > data is not also in the filesystem, then FIEMAP_EXTENT_NO_BYPASS
> > should also be set.
>
> No HSM in mainline, so please drop it. We can add it once we get HSM
> support.

This was added because XFS_BMAP had it, and XFS is using HSM in some
environments as you well know. #defining a flag that isn't used by the
core Linux code isn't going to kill anyone, and makes life a tiny bit
easier for Dave at zero cost to anyone else.

> > * FIEMAP_EXTENT_NET
> > - This will also set FIEMAP_EXTENT_NO_BYPASS
> > The data for this extent is not stored in a locally-accessible device.
>
> Doesn't make any sense currently, please drop.

Again, it doesn't hurt anyone to include this flag. Filesystems like
CRFS will likely want to use this, in addition to Lustre of course.
It will likel also be useful for NFSv4.

You seem to think that the lack of any current in-kernel users means that
the API should be dumbed down until there is no room to grow, while I'd
prefer to design it in a way that ensures it won't be obsolete as soon
as someone thinks outside the "single local block device" box.

> > * FIEMAP_EXTENT_DATA_COMPRESSED
> > - This will also set FIEMAP_EXTENT_NO_BYPASS
> > The data in this extent has been compressed by the file system.
>
> Add once we have users for it.

A bunch of the cramfs/logfs/etc filesystems have compression support.
There were ext2/3 patches for compressed filesystems, and may be in
the future for btrfs as well. It doesn't seem at all unreasonable to
define this.

> On Wed, Sep 10, 2008 at 10:07:19AM -0400, Theodore Tso wrote:
> > Interface minimalism to allow for flexibility later is one thing; but
> > taken to extremes, it it's just stupid. For example, if we didn't
> > have any filesystems that needed 64-bit fields in mainline, would that
> > be a justification for making data structures to use 32-bit fields and
> > forcing a flag data interface change in the future. Sometimes we can
> > look ahead just a little bit and design interfaces which are foward
> > compatible. And it's pretty clear that btrfs will need something like
> > fe_device, although whether it should be a 32-bit index or something
> > else like a 128-bit UUID admittedly might not be clear at the moment.
> > But leaving room in the data structure for future growth is just a
> > intelligent thing to do.
>
> No, if you look at btrfs or the not public bit for mirrored allocations
> groups in XFS you'll notice that an extent can be happily on more than
> once device. So to support this we'd need a list of device plus a field
> for the algorithm how to balance it over the devices.

Actually, you misunderstand how the API is designed to be used.

Instead of having to return a list of devices for a single extent,
it would return an extent per device. That is especially important
because the logical->physical mapping may be different for every extent.

Having simple flags for "RAID1", "RAID5", "RAID6" would cover most of
the cases, and in all cases setting NO_BYPASS for complex layouts
would avoid non-broken applications trying to access the data directly.
The majority of FIEMAP users will only be filefrag and other tools that
want to know relative locations of file data and some information about
the block layout (e.g. filesystem developers, performance tuning, etc),
they won't be trying to read from the raw block device.

To be honest, I don't know why there is such a huge objection to this
patch. First it was ext4/Lustre specific, then people said "this is
generic stuff that should go into the VFS" and now people say "this
is too complex to be in the VFS as-is, let's make it dumb". People
didn't present such an uproar when blktrace was added to the kernel,
people didn't complain about the EXT2_GETFLAGS ioctl that migrated
over to the VFS, even though it has flags not applicable to all
filesystems (or even applicable to ext* like NOTAIL, which was added
to e2fsprogs just for Reiserfs).

It's just an ioctl, folks. We've spent 1000's of collective hours arguing
about minor changes in semantics, as if we were rewriting the whole VFS.
Christoph, if you really hate FIEMAP that much, it can just lose the
VFS changes entirely, and be implemented in ext4 as we like it and other
filesystem maintainers that want e2fsprogs filefrag to provide the improved
information can follow the implementation there if they want.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.