2007-10-26 23:39:57

by Mike Waychison

[permalink] [raw]
Subject: [patch 0/6][RFC] Cleanup FIBMAP

The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data.

I'm sending this as an RFC as I've only tested this on a x86_64 kernel with a 32bit binary on ext2 and I've noticed a couple ext2_warnings already.

I'm unsure of the locking in [4/6] fix_race_with_truncate.patch. Any help here would greatly be appreciated.

The last patch, [6/6] drop_cap_sys_rawio_for_fibmap.patch, is of course, not to be applied until any remaining issues are fixed :)

Thanks,

Mike Waychison
--


2007-10-27 17:57:21

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Hi,

->bmap is ugly and horrible! If you have to do this at the very least
please cause ->bmap64 to be able to return error values in case the
file system failed to get the information or indeed such information
does not exist as is the case for compressed and encrypted files for
example and also for small files that are inside the on-disk inode
(NTFS resident files and reiserfs packed tails are examples of this).

And another of my pet peeves with ->bmap is that it uses 0 to mean
"sparse" which causes a conflict on NTFS at least as block zero is
part of the $Boot system file so it is a real, valid block... NTFS
uses -1 to denote sparse blocks internally.

Best regards,

Anton

On 27 Oct 2007, at 00:37, Mike Waychison wrote:

> The following series is meant to clean up FIBMAP paths with the
> eventual goal of allowing users to be able to FIBMAP their data.
>
> I'm sending this as an RFC as I've only tested this on a x86_64
> kernel with a 32bit binary on ext2 and I've noticed a couple
> ext2_warnings already.
>
> I'm unsure of the locking in [4/6] fix_race_with_truncate.patch.
> Any help here would greatly be appreciated.
>
> The last patch, [6/6] drop_cap_sys_rawio_for_fibmap.patch, is of
> course, not to be applied until any remaining issues are fixed :)
>
> Thanks,
>
> Mike Waychison


--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2007-10-27 21:58:03

by Szabolcs Szakacsits

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP


On Sat, 27 Oct 2007, Anton Altaparmakov wrote:

> And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse"
> which causes a conflict on NTFS at least as block zero is part of the $Boot
> system file so it is a real, valid block... NTFS uses -1 to denote sparse
> blocks internally.

In practice, the meaning of 0 is file system [driver] dependent. For
example in case of NTFS-3G it means that the block is sparse or the file is
encrypted or compressed, or resident, or it's the $Boot file, or an error
happened.

Thankfully the widely used FIBMAP users (swapon and the ever less used
lilo) are only interested in the non-zero values and they report an error
if the driver returns 0 for some reason. Which is perfectly ok since both
swaping and Linux booting would fail using a sparse, encrypted, compressed,
resident, or the NTFS $Boot file.

But in real, both swap files and lilo work fine with NTFS if the needed
files were created the way these softwares expect. If not then swapon or
lilo will catch and report the file creation error.

Afair, somebody is doing (has done?) an indeed much needed, better
alternative. Bmap is legacy, thank you Mike for maintaining it.

Szaka

--
NTFS-3G Lead Developer: http://ntfs-3g.org

2007-10-28 00:44:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Mike Waychison wrote:
> The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data.

Keep in mind FIBMAP is currently extremely expensive on some
filesystems, e.g. ext3. Therefore, additional filesystem-level work
would have to be done in order for this not become a DoS issue.

-hpa

2007-10-29 14:12:20

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

On Sat, 27 Oct 2007 18:57:06 +0100
Anton Altaparmakov <[email protected]> wrote:

> Hi,
>
> ->bmap is ugly and horrible! If you have to do this at the very
> least please cause ->bmap64 to be able to return error values in case
> the file system failed to get the information or indeed such
> information does not exist as is the case for compressed and
> encrypted files for example and also for small files that are inside
> the on-disk inode (NTFS resident files and reiserfs packed tails are
> examples of this).
>
> And another of my pet peeves with ->bmap is that it uses 0 to mean
> "sparse" which causes a conflict on NTFS at least as block zero is
> part of the $Boot system file so it is a real, valid block... NTFS
> uses -1 to denote sparse blocks internally.

Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone. But that's
more of a feature for the FIEMAP stuff.

-chris

2007-10-29 16:30:46

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP


>> And another of my pet peeves with ->bmap is that it uses 0 to mean
>> "sparse" which causes a conflict on NTFS at least as block zero is
>> part of the $Boot system file so it is a real, valid block... NTFS
>> uses -1 to denote sparse blocks internally.
>
> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
> was a way to indicate your-data-is-here-but-isn't-alone. But that's
> more of a feature for the FIEMAP stuff.

And maybe we can step back and see what the callers of FIBMAP are doing
with the results they're getting.

One use is to discover the order in which to read file data that will
result in efficient IO.

If we had an interface specifically for this use case then perhaps a
sparse block would be better reported as the position of the inode
relative to other data blocks. Maybe the inode block number in ext* land.

This use is interesting to me because it can be useful for any file
system, particularly networked file systems.

- z

2007-10-29 19:17:24

by Mike Waychison

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Chris Mason wrote:
> On Sat, 27 Oct 2007 18:57:06 +0100
> Anton Altaparmakov <[email protected]> wrote:
>
>> Hi,
>>
>> ->bmap is ugly and horrible! If you have to do this at the very
>> least please cause ->bmap64 to be able to return error values in case
>> the file system failed to get the information or indeed such
>> information does not exist as is the case for compressed and
>> encrypted files for example and also for small files that are inside
>> the on-disk inode (NTFS resident files and reiserfs packed tails are
>> examples of this).
>>
>> And another of my pet peeves with ->bmap is that it uses 0 to mean
>> "sparse" which causes a conflict on NTFS at least as block zero is
>> part of the $Boot system file so it is a real, valid block... NTFS
>> uses -1 to denote sparse blocks internally.
>
> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
> was a way to indicate your-data-is-here-but-isn't-alone. But that's
> more of a feature for the FIEMAP stuff.
>

I hadn't heard of FIEMAP, so I went back and read the thread from
April/May. It seems that this is a much better approach than
introducing a FIBMAP64.

What ever happened with this proposal?

Mike Waychison

2007-10-29 19:22:04

by Mike Waychison

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Zach Brown wrote:
>>> And another of my pet peeves with ->bmap is that it uses 0 to mean
>>> "sparse" which causes a conflict on NTFS at least as block zero is
>>> part of the $Boot system file so it is a real, valid block... NTFS
>>> uses -1 to denote sparse blocks internally.
>> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
>> was a way to indicate your-data-is-here-but-isn't-alone. But that's
>> more of a feature for the FIEMAP stuff.
>
> And maybe we can step back and see what the callers of FIBMAP are doing
> with the results they're getting.
>
> One use is to discover the order in which to read file data that will
> result in efficient IO.
>
> If we had an interface specifically for this use case then perhaps a
> sparse block would be better reported as the position of the inode
> relative to other data blocks. Maybe the inode block number in ext* land.
>

Can you clarify what you mean above with an example? I don't really follow.

Thanks,

Mike Waychison


2007-10-29 19:48:36

by Andreas Dilger

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

On Oct 29, 2007 12:16 -0700, Mike Waychison wrote:
> Chris Mason wrote:
>> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there
>> was a way to indicate your-data-is-here-but-isn't-alone. But that's
>> more of a feature for the FIEMAP stuff.
>
> I hadn't heard of FIEMAP, so I went back and read the thread from
> April/May. It seems that this is a much better approach than introducing a
> FIBMAP64.
>
> What ever happened with this proposal?

We made a patch for ext4 that we need to update and push upstream. I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-10-29 19:48:55

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

On Mon, 29 Oct 2007 12:18:22 -0700
Mike Waychison <[email protected]> wrote:

> Zach Brown wrote:
> >>> And another of my pet peeves with ->bmap is that it uses 0 to
> >>> mean "sparse" which causes a conflict on NTFS at least as block
> >>> zero is part of the $Boot system file so it is a real, valid
> >>> block... NTFS uses -1 to denote sparse blocks internally.
> >> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if
> >> there was a way to indicate your-data-is-here-but-isn't-alone.
> >> But that's more of a feature for the FIEMAP stuff.
> >
> > And maybe we can step back and see what the callers of FIBMAP are
> > doing with the results they're getting.
> >
> > One use is to discover the order in which to read file data that
> > will result in efficient IO.
> >
> > If we had an interface specifically for this use case then perhaps a
> > sparse block would be better reported as the position of the inode
> > relative to other data blocks. Maybe the inode block number in
> > ext* land.
> >
>
> Can you clarify what you mean above with an example? I don't really
> follow.

This is a larger topic of helping userland optimize access to groups of
files. For example, during a readdir if we knew the next step was to
delete all the files found, we could do one top of readahead (or even
ordering the returned values).

If we knew the next step would be to read all the files found, a
different type of readahead would be useful.

But, we shouldn't inflict all of this on fibmap/fiemap....we'll get
lost trying to make the one true interface for all operations.

For grouping operations on files, I think a read_tree syscall with
hints for what userland will do (read, stat, delete, list
filenames), and a better cookie than readdir should do it.

-chris

2007-10-29 20:00:49

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP


> Can you clarify what you mean above with an example? I don't really
> follow.

Sure, take 'tar' as an example. It'll read files in the order that
their names are returned from directory listing. This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk. (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves. One example of this is Chris Mason's
'acp'.

http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of file
blocks on disk and then doing IO on blocks in sorted order. It'd take
work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are. It cares about their ordering. File systems would be able
to chose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the right
IO patterns.

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch. I'm just sharing the observation.

- z

2007-10-29 20:01:55

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP


> But, we shouldn't inflict all of this on fibmap/fiemap....we'll get
> lost trying to make the one true interface for all operations.
>
> For grouping operations on files, I think a read_tree syscall with
> hints for what userland will do (read, stat, delete, list
> filenames), and a better cookie than readdir should do it.

Agreed, on both points.

- z

2007-10-31 11:07:11

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Zach Brown wrote:
>> Can you clarify what you mean above with an example? I don't really
>> follow.
>
> Sure, take 'tar' as an example. It'll read files in the order that
> their names are returned from directory listing. This can produce bad
> IO patterns because the order in which the file names are returned
> doesn't match the order of the file's blocks on disk. (htree, I'm
> looking at you!)
>
> People have noticed that tar-like loads can be sped up greatly just by
> sorting the files by their inode number as returned by stat(), never
> mind the file blocks themselves. One example of this is Chris Mason's
> 'acp'.
>
> http://oss.oracle.com/~mason/acp/
>
> The logical extension of that is to use FIBMAP to find the order of file
> blocks on disk and then doing IO on blocks in sorted order. It'd take
> work to write an app that does this reliably, sure.
>
> In this use the application doesn't actually care what the absolute
> numbers are. It cares about their ordering. File systems would be able
> to chose whatever scheme they wanted for the actual values of the
> results from a FIBMAP-alike as long as the sorting resulted in the right
> IO patterns.
>
> Arguing that this use is significant enough to justify an addition to
> the file system API is a stretch. I'm just sharing the observation.
>
> - z

I use FIBMAP support for a few different things.

The first is to exactly the case that you describe above where we can
use the first block of a file extracted by FIBMAP to produce an optimal
sorting for the read order. My testing showed that the cost of the
extra fibmap was not too high compared to the speedup, but it was not a
huge gain over the speedup gained when the read was done in inode sorted
order.

The second use case is to look at the physical layout of blocks on disk
for a specific file, use Mark Lord's write_long patches to inject a disk
error and then read that file to make sure that we are handling disk IO
errors correctly. A bit obscure, but really quite useful.

We have also used FIBMAP a few times to try and map an observed IO error
back to a file. Really slow and painful to do, but should work on any
file system when a better method is not supported.


ric

2007-10-31 16:16:19

by Zach Brown

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP


> The second use case is to look at the physical layout of blocks on disk
> for a specific file, use Mark Lord's write_long patches to inject a disk
> error and then read that file to make sure that we are handling disk IO
> errors correctly. A bit obscure, but really quite useful.

Hmm, yeah, that's interesting.

> We have also used FIBMAP a few times to try and map an observed IO error
> back to a file. Really slow and painful to do, but should work on any
> file system when a better method is not supported.

We're getting off of this FIBMAP topic, but this interests me. Can we
explore this a little? How did you find out about the error without
having a file to associate with it? Drive scrubbing, or some such?

- z

2007-10-31 17:19:19

by Ric Wheeler

[permalink] [raw]
Subject: Re: [patch 0/6][RFC] Cleanup FIBMAP

Zach Brown wrote:
>> The second use case is to look at the physical layout of blocks on disk
>> for a specific file, use Mark Lord's write_long patches to inject a disk
>> error and then read that file to make sure that we are handling disk IO
>> errors correctly. A bit obscure, but really quite useful.
>
> Hmm, yeah, that's interesting.

It would be even better if we could poke holes in metadata, etc, but
this gives us a reasonable test case.

>
>> We have also used FIBMAP a few times to try and map an observed IO error
>> back to a file. Really slow and painful to do, but should work on any
>> file system when a better method is not supported.
>
> We're getting off of this FIBMAP topic, but this interests me. Can we
> explore this a little? How did you find out about the error without
> having a file to associate with it? Drive scrubbing, or some such?
>
> - z

Vladimir extended debugreiserfs to do an optimized reverse mapping scan
of the disk sector to file/metadata/etc. Definitely worth having that
ability for any file system.

We also do drive scrubbing, looking for bad sectors. The list of those
sectors is fed into the reverse mapping code to enable us to gage the
impact of the IO errors, start recovering the user files, etc.

The scrub code we use takes advantage of the read-verify command (to
avoid data transfer from the drive to the page cache).

ric