From: Namjae Jeon <[email protected]>
This patch series is in response of the following post:
http://lwn.net/Articles/556136/
"ext4: introduce two new ioctls"
Dave chinner suggested that truncate_block_range
(which was one of the ioctls name) should be an fallocate operation
and not any fs specific ioctl, hence we add this functionality to fallocate.
This patch series introduces new flag FALLOC_FL_COLLAPSE_RANGE for fallocate
and implements it for XFS and Ext4.
The semantics of this flag are following:
1) It collapses the range lying between offset and length by removing any data
blocks which are present in this range and than updates all the logical
offsets of extents beyond "offset + len" to nullify the hole created by
removing blocks. In short, it does not leave a hole.
2) It should be used exclusively. No other fallocate flag in combination.
3) Offset and length supplied to fallocate should be fs block size aligned
in case of xfs and ext4.
4) Collaspe range does not work beyond i_size.
This new functionality of collapsing range could be used by media editing tools
which does non linear editing to quickly purge and edit parts of a media file.
This will immensely improve the performance of these operations.
The limitation of fs block size aligned offsets can be easily handled
by media codecs which are encapsulated in a conatiner as they have to
just change the offset to next keyframe value to match the proper alignment.
Namjae Jeon (10):
fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
xfsprog: xfsio: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
xfstest: shared/001: Standard collapse range tests
xfstest: shared/002: Delayed allocation collapse range
xfstest: shared/003: Multi collapse range tests
xfstest: shared/004: Delayed allocation multi collapse
xfstest: shared/005: Test multiple fallocate collapse
manpage: update FALLOC_FL_COLLAPSE_RANGE flag in fallocate
--
1.7.11-rc0
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Wed, Feb 19, 2014 at 01:37:16AM +0900, Namjae Jeon wrote:
> From: Namjae Jeon <[email protected]>
>
> This patch series is in response of the following post:
> http://lwn.net/Articles/556136/
> "ext4: introduce two new ioctls"
>
> Dave chinner suggested that truncate_block_range
> (which was one of the ioctls name) should be an fallocate operation
> and not any fs specific ioctl, hence we add this functionality to fallocate.
>
> This patch series introduces new flag FALLOC_FL_COLLAPSE_RANGE for fallocate
> and implements it for XFS and Ext4.
>
> The semantics of this flag are following:
> 1) It collapses the range lying between offset and length by removing any data
> blocks which are present in this range and than updates all the logical
> offsets of extents beyond "offset + len" to nullify the hole created by
> removing blocks. In short, it does not leave a hole.
> 2) It should be used exclusively. No other fallocate flag in combination.
> 3) Offset and length supplied to fallocate should be fs block size aligned
> in case of xfs and ext4.
> 4) Collaspe range does not work beyond i_size.
>
> This new functionality of collapsing range could be used by media editing tools
> which does non linear editing to quickly purge and edit parts of a media file.
> This will immensely improve the performance of these operations.
> The limitation of fs block size aligned offsets can be easily handled
> by media codecs which are encapsulated in a conatiner as they have to
> just change the offset to next keyframe value to match the proper alignment.
>
> Namjae Jeon (10):
> fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
I've pushed these to the following branch:
git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
And so they'll be in tomorrow's linux-next tree.
> ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
I've left this one alone for the ext4 guys to sort out.
> xfsprog: xfsio: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
That's already in a current xfstests tree.
> xfstest: shared/001: Standard collapse range tests
> xfstest: shared/002: Delayed allocation collapse range
> xfstest: shared/003: Multi collapse range tests
> xfstest: shared/004: Delayed allocation multi collapse
> xfstest: shared/005: Test multiple fallocate collapse
These are now in the xfstests git tree.
> manpage: update FALLOC_FL_COLLAPSE_RANGE flag in fallocate
And Michael will need to review and commit that to the kernel
manpages tree.
Cheers,
Dave.
--
Dave Chinner
[email protected]
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
2014-02-24 9:57 GMT+09:00, Dave Chinner <[email protected]>:
> On Wed, Feb 19, 2014 at 01:37:16AM +0900, Namjae Jeon wrote:
>> From: Namjae Jeon <[email protected]>
>>
>> This patch series is in response of the following post:
>> http://lwn.net/Articles/556136/
>> "ext4: introduce two new ioctls"
>>
>> Dave chinner suggested that truncate_block_range
>> (which was one of the ioctls name) should be an fallocate operation
>> and not any fs specific ioctl, hence we add this functionality to
>> fallocate.
>>
>> This patch series introduces new flag FALLOC_FL_COLLAPSE_RANGE for
>> fallocate
>> and implements it for XFS and Ext4.
>>
>> The semantics of this flag are following:
>> 1) It collapses the range lying between offset and length by removing any
>> data
>> blocks which are present in this range and than updates all the
>> logical
>> offsets of extents beyond "offset + len" to nullify the hole created
>> by
>> removing blocks. In short, it does not leave a hole.
>> 2) It should be used exclusively. No other fallocate flag in combination.
>> 3) Offset and length supplied to fallocate should be fs block size
>> aligned
>> in case of xfs and ext4.
>> 4) Collaspe range does not work beyond i_size.
>>
>> This new functionality of collapsing range could be used by media editing
>> tools
>> which does non linear editing to quickly purge and edit parts of a media
>> file.
>> This will immensely improve the performance of these operations.
>> The limitation of fs block size aligned offsets can be easily handled
>> by media codecs which are encapsulated in a conatiner as they have to
>> just change the offset to next keyframe value to match the proper
>> alignment.
>>
>> Namjae Jeon (10):
>> fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
>> xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
>
> I've pushed these to the following branch:
>
> git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
>
> And so they'll be in tomorrow's linux-next tree.
Okay.
>
>> ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
>
> I've left this one alone for the ext4 guys to sort out.
I will try to follow up continously.
>
>> xfsprog: xfsio: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
>
> That's already in a current xfstests tree.
Okay.
>
>> xfstest: shared/001: Standard collapse range tests
>> xfstest: shared/002: Delayed allocation collapse range
>> xfstest: shared/003: Multi collapse range tests
>> xfstest: shared/004: Delayed allocation multi collapse
>> xfstest: shared/005: Test multiple fallocate collapse
>
> These are now in the xfstests git tree.
Okay.
>
>> manpage: update FALLOC_FL_COLLAPSE_RANGE flag in fallocate
>
> And Michael will need to review and commit that to the kernel
> manpages tree.
Okay,
Hi Micheal.
Could you please review manpage patch ?
Thanks :)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
Hi Dave,
On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
>
> > Namjae Jeon (10):
> > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
>
> I've pushed these to the following branch:
>
> git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
>
> And so they'll be in tomorrow's linux-next tree.
>
> > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
>
> I've left this one alone for the ext4 guys to sort out.
So presumably that xfs tree branch is now completely stable and so Ted
could just merge that branch into the ext4 tree as well and put the ext4
part on top of that in his tree.
--
Cheers,
Stephen Rothwell [email protected]
On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> Hi Dave,
>
> On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> >
> > > Namjae Jeon (10):
> > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> >
> > I've pushed these to the following branch:
> >
> > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
> >
> > And so they'll be in tomorrow's linux-next tree.
> >
> > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> >
> > I've left this one alone for the ext4 guys to sort out.
>
> So presumably that xfs tree branch is now completely stable and so Ted
> could just merge that branch into the ext4 tree as well and put the ext4
> part on top of that in his tree.
Well, for some definition of stable. Right now it's just a topic
branch that is merged into the for-next branch, so in theory it is
still just a set of pending changes in a branch in a repo that has
been pushed to linux-next for testing.
That said, I don't see that branch changing unless we find bugs in
the code or a problem with the API needs fixing, at which point I
would add more commits to it and rebase the for-next branch that you
are pulling into the linux-next tree.
Realistically, I'm waiting for Lukas to repost his other pending
fallocate changes (the zero range changes) so I can pull the VFS and
XFS bits of that into the XFS tree and I can test them together
before I'll call the xfs-collapse-range stable and ready to be
merged into some other tree...
Cheers,
Dave.
--
Dave Chinner
[email protected]
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Tue, 25 Feb 2014, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > >
> > > > Namjae Jeon (10):
> > > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > >
> > > I've pushed these to the following branch:
> > >
> > > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
> > >
> > > And so they'll be in tomorrow's linux-next tree.
> > >
> > > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > >
> > > I've left this one alone for the ext4 guys to sort out.
> >
> > So presumably that xfs tree branch is now completely stable and so Ted
> > could just merge that branch into the ext4 tree as well and put the ext4
> > part on top of that in his tree.
>
> Well, for some definition of stable. Right now it's just a topic
> branch that is merged into the for-next branch, so in theory it is
> still just a set of pending changes in a branch in a repo that has
> been pushed to linux-next for testing.
>
> That said, I don't see that branch changing unless we find bugs in
> the code or a problem with the API needs fixing, at which point I
> would add more commits to it and rebase the for-next branch that you
> are pulling into the linux-next tree.
>
> Realistically, I'm waiting for Lukas to repost his other pending
> fallocate changes (the zero range changes) so I can pull the VFS and
> XFS bits of that into the XFS tree and I can test them together
> before I'll call the xfs-collapse-range stable and ready to be
> merged into some other tree...
Thank you, Namjae and Dave, for driving this; and thank you, Ted and
Matthew, for raising appropriate mmap concerns (2013-7-31 and 2014-2-2).
I was aware of this work in progress, but only now found time to look.
I've not studied the implementation, knowing too little of ext4 and
xfs; but it sounds like the approach you've taken, writing out dirties
and truncating all pagecache from the critical offset onwards, is the
sensible approach for now - lame, and leaves me wondering whether an
offline tool wouldn't be more appropriate; but a safe place to start
if we suppose it will be extended to handle pagecache better in future.
Of course I'm interested in the possibility of extending it to tmpfs;
which may not be a worthwhile exercise in itself, except that it would
force us to face and solve any pagecache/radixtree issues, if possible,
thereby enhancing the support for disk-based filesystems.
I doubt we should look into that before Jan Kara's range locking mods
arrive, or are rejected. As I understand it, you're going ahead with
this, knowing that there can be awkward races with concurrent faults -
more likely to cause trinity fuzzer reports, than interfere with daily
usage (trinity seems to be good at faulting into holes being punched).
That's probably the right pragmatic decision; but I'm a little worried
that it's justfied by saying we already have such races in hole-punch.
Collapse is significantly more challenging than either hole-punch or
truncation: the shifting of pages from one offset to another is new,
and might present nastier bugs.
Emphasis on "might": I expect it's impossible, given your current
approach, but something to be on guard against is unmap_mapping_range()
failing to find and unmap a pte, because the page is mapped at the
"wrong" place in the vma, resulting in BUG_ON(page_mapped(page))
in __delete_from_page_cache().
One thing that is slightly wrong in what you have right now, is your
use of truncate_pagecache_range(): you'll need to add an "even_cows"
arg to that (or make it a wrapper to a __truncate_pagecache_range()
taking additional "even_cows" arg). That arg governs what is done
with anonymous COW'ed pages in a MAP_PRIVATE mmap of the file.
Truncation is required by spec to unmap them; hole punching was up
to us to spec, did not unmap them originally (because there was no
preliminary call to unmap_mapping_range(), so it happened to rely
on the inefficient fallback within truncate_inode_page()), and that
seemed fine because, why discard a user's data unnecessarily?
But your case is different: collapse is much closer to truncation,
and if you do not unmap the private COW'ed pages, then pages left
behind beyond the EOF will break the spec that requires SIGBUS when
touching there, and pages within EOF will be confusingly derived
from file data now belonging to another offset or none (move these
pages within the user address space? no, I don't think anon_vmas
would allow that, and there may be no right place to move them).
It's clear that the right and easy thing to do is just to unmap
them (all of them, from critical offset to EOF), in the rare case
of there being any such pages. Whether this detail needs to be
mentioned in the man page (I don't like throwing away a user's
data without warning) I'm not sure, Michael can judge.
FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
but probably seven months too late to object. It surprises me that
you're doing all this work to deflate a part of the file, without
the obvious complementary work to inflate it - presumably all those
advertisers whose ads you're cutting out, will come back to us soon
to ask for inflation, so that they have somewhere to reinsert them ;)
But you have the good precedent of "truncate" being used to extend
files, so I suppose "collapse" can one day be enhanced to inflate a
file when given negative len. Or perhaps, like truncate, that would
lead to too much "if (bigger) do_one_thing() else something_else()",
and a separate FALLOC_FL_ would prove better. Certainly there's no
requirement that you should implement this, I was just a little
surprised that you had not.
I should mention that when "we" implemented this thirty years ago,
we had a strong conviction that the system call should be idempotent:
that is, the len argument should indicate the final i_size, not the
amount being removed from it. Now, I don't remember the grounds for
that conviction: maybe it was just an idealistic preference for how
to design a good system call. I can certainly see that defining it
that way round would surprise many app programmers. Just mentioning
this in case anyone on these lists sees a practical advantage to
doing it that way instead.
I see you've included xfstests and xfs_io updates, nice. Did you
realize that util-linux has a /usr/bin/fallocate? I hope someone
will update that too.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, 25 Feb 2014 15:23:35 -0800 (PST) Hugh Dickins <[email protected]> wrote:
> On Tue, 25 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > > >
> > > > > Namjae Jeon (10):
> > > > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > > > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > >
> > > > I've pushed these to the following branch:
> > > >
> > > > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
> > > >
> > > > And so they'll be in tomorrow's linux-next tree.
> > > >
> > > > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > >
> > > > I've left this one alone for the ext4 guys to sort out.
> > >
> > > So presumably that xfs tree branch is now completely stable and so Ted
> > > could just merge that branch into the ext4 tree as well and put the ext4
> > > part on top of that in his tree.
> >
> > Well, for some definition of stable. Right now it's just a topic
> > branch that is merged into the for-next branch, so in theory it is
> > still just a set of pending changes in a branch in a repo that has
> > been pushed to linux-next for testing.
> >
> > That said, I don't see that branch changing unless we find bugs in
> > the code or a problem with the API needs fixing, at which point I
> > would add more commits to it and rebase the for-next branch that you
> > are pulling into the linux-next tree.
> >
> > Realistically, I'm waiting for Lukas to repost his other pending
> > fallocate changes (the zero range changes) so I can pull the VFS and
> > XFS bits of that into the XFS tree and I can test them together
> > before I'll call the xfs-collapse-range stable and ready to be
> > merged into some other tree...
>
> ...
>
> Emphasis on "might": I expect it's impossible, given your current
> approach, but something to be on guard against is unmap_mapping_range()
> failing to find and unmap a pte, because the page is mapped at the
> "wrong" place in the vma, resulting in BUG_ON(page_mapped(page))
> in __delete_from_page_cache().
It should be well tested with non-linear mappings please. It *should*
be OK, but...
>
> ...
>
> FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> but probably seven months too late to object. It surprises me that
> you're doing all this work to deflate a part of the file, without
> the obvious complementary work to inflate it - presumably all those
> advertisers whose ads you're cutting out, will come back to us soon
> to ask for inflation, so that they have somewhere to reinsert them ;)
Yes, I was wondering that. Why not simply "move these blocks from here
to there".
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> On Tue, 25 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > > >
> > > > > Namjae Jeon (10):
> > > > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > > > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > >
> > > > I've pushed these to the following branch:
> > > >
> > > > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
> > > >
> > > > And so they'll be in tomorrow's linux-next tree.
> > > >
> > > > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > >
> > > > I've left this one alone for the ext4 guys to sort out.
> > >
> > > So presumably that xfs tree branch is now completely stable and so Ted
> > > could just merge that branch into the ext4 tree as well and put the ext4
> > > part on top of that in his tree.
> >
> > Well, for some definition of stable. Right now it's just a topic
> > branch that is merged into the for-next branch, so in theory it is
> > still just a set of pending changes in a branch in a repo that has
> > been pushed to linux-next for testing.
> >
> > That said, I don't see that branch changing unless we find bugs in
> > the code or a problem with the API needs fixing, at which point I
> > would add more commits to it and rebase the for-next branch that you
> > are pulling into the linux-next tree.
> >
> > Realistically, I'm waiting for Lukas to repost his other pending
> > fallocate changes (the zero range changes) so I can pull the VFS and
> > XFS bits of that into the XFS tree and I can test them together
> > before I'll call the xfs-collapse-range stable and ready to be
> > merged into some other tree...
>
> Thank you, Namjae and Dave, for driving this; and thank you, Ted and
> Matthew, for raising appropriate mmap concerns (2013-7-31 and 2014-2-2).
> I was aware of this work in progress, but only now found time to look.
>
> I've not studied the implementation, knowing too little of ext4 and
> xfs; but it sounds like the approach you've taken, writing out dirties
> and truncating all pagecache from the critical offset onwards, is the
> sensible approach for now - lame, and leaves me wondering whether an
> offline tool wouldn't be more appropriate; but a safe place to start
> if we suppose it will be extended to handle pagecache better in future.
Offline? You mean with the filesystem unmounted, or something else?
> Of course I'm interested in the possibility of extending it to tmpfs;
> which may not be a worthwhile exercise in itself, except that it would
> force us to face and solve any pagecache/radixtree issues, if possible,
> thereby enhancing the support for disk-based filesystems.
>
> I doubt we should look into that before Jan Kara's range locking mods
> arrive, or are rejected. As I understand it, you're going ahead with
> this, knowing that there can be awkward races with concurrent faults -
> more likely to cause trinity fuzzer reports, than interfere with daily
> usage (trinity seems to be good at faulting into holes being punched).
Yes, the caveat is that the applications that use it (TVs, DVRs, NLE
applications, etc) typically don't use mmap for accessing the data
stream being modified. Further, it's much less generally useful than
holepunching, so when these two are combined, the likely exposure to
issues resulting from mmap deficiencies are pretty damn low.
> That's probably the right pragmatic decision; but I'm a little worried
> that it's justfied by saying we already have such races in hole-punch.
> Collapse is significantly more challenging than either hole-punch or
> truncation: the shifting of pages from one offset to another is new,
> and might present nastier bugs.
Symptoms might be different, but it's exactly the same problem. i.e.
mmap_sem locking inversions preventing the filesystem from
serialising IO path operations like hole punch, truncate and other
extent manipulation operations against concurrent page faults
that enter the IO path.
> Emphasis on "might": I expect it's impossible, given your current
> approach, but something to be on guard against is unmap_mapping_range()
> failing to find and unmap a pte, because the page is mapped at the
> "wrong" place in the vma, resulting in BUG_ON(page_mapped(page))
> in __delete_from_page_cache().
Unmapping occurs before anything is shifted. And even if a fault
does occur before the file size changes at the end of a collapse
range operation (via the truncate path), the page in the page cache
won't be moved about so I don't see how the above problem could
occur. All that will happen is that you get the wrong data in the
mmap()d page, just like you will with hole_punch issues.
> One thing that is slightly wrong in what you have right now, is your
> use of truncate_pagecache_range(): you'll need to add an "even_cows"
> arg to that (or make it a wrapper to a __truncate_pagecache_range()
> taking additional "even_cows" arg). That arg governs what is done
> with anonymous COW'ed pages in a MAP_PRIVATE mmap of the file.
>
> Truncation is required by spec to unmap them; hole punching was up
> to us to spec, did not unmap them originally (because there was no
> preliminary call to unmap_mapping_range(), so it happened to rely
> on the inefficient fallback within truncate_inode_page()), and that
> seemed fine because, why discard a user's data unnecessarily?
I don't see there being a problem here - it's the same case as hole
punch. And when we change the file size after shifting extents we
are doing a truncate so that will behave as expected w.r.t.
anonymous COW'ed pages in a MAP_PRIVATE mmap of the file.
> But your case is different: collapse is much closer to truncation,
> and if you do not unmap the private COW'ed pages, then pages left
> behind beyond the EOF will break the spec that requires SIGBUS when
> touching there, and pages within EOF will be confusingly derived
> from file data now belonging to another offset or none (move these
> pages within the user address space? no, I don't think anon_vmas
> would allow that, and there may be no right place to move them).
See above - we never leave pages beyond the new EOF because setting
the new EOF is a truncate operation that calls
truncate_setsize(inode, newsize).
> It's clear that the right and easy thing to do is just to unmap
> them (all of them, from critical offset to EOF), in the rare case
> of there being any such pages. Whether this detail needs to be
> mentioned in the man page (I don't like throwing away a user's
> data without warning) I'm not sure, Michael can judge.
>
> FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> but probably seven months too late to object. It surprises me that
> you're doing all this work to deflate a part of the file, without
> the obvious complementary work to inflate it - presumably all those
> advertisers whose ads you're cutting out, will come back to us soon
> to ask for inflation, so that they have somewhere to reinsert them ;)
The name makes no difference to me - it's a filesystem offload
function of a very specific nature. If we require the opposite
behaviour - inserting unwritten extents after shifting the data up
out of the way - then we can just add a new FALLOC_FL_INSERT_RANGE
command to do that.
But in the absence of anyone needing such functionality, the
complexity of implementing it is not worth the effort.
> I should mention that when "we" implemented this thirty years ago,
> we had a strong conviction that the system call should be idempotent:
> that is, the len argument should indicate the final i_size, not the
> amount being removed from it. Now, I don't remember the grounds for
> that conviction: maybe it was just an idealistic preference for how
> to design a good system call. I can certainly see that defining it
> that way round would surprise many app programmers. Just mentioning
> this in case anyone on these lists sees a practical advantage to
> doing it that way instead.
I don't see how specifying the end file size as an improvement. What
happens if you are collapse a range in a file that is still being
appended to by the application and so you race with a file size
update? IOWs, with such an API the range to be collapsed is
completely unpredictable, and IMO that's a fundamentally broken API.
> I see you've included xfstests and xfs_io updates, nice. Did you
> realize that util-linux has a /usr/bin/fallocate? I hope someone
> will update that too.
I don't care about /usr/bin/fallocate - I've never used it in my
life because xfs_io exists on all my systems and is way more
powerful and useful to me than /usr/bin/fallocate. Regardless, I
think someone posted patches for it yesterday.
Cheers,
Dave.
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, Feb 25, 2014 at 03:41:28PM -0800, Andrew Morton wrote:
> On Tue, 25 Feb 2014 15:23:35 -0800 (PST) Hugh Dickins <[email protected]> wrote:
> > On Tue, 25 Feb 2014, Dave Chinner wrote:
> > > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> > but probably seven months too late to object. It surprises me that
> > you're doing all this work to deflate a part of the file, without
> > the obvious complementary work to inflate it - presumably all those
> > advertisers whose ads you're cutting out, will come back to us soon
> > to ask for inflation, so that they have somewhere to reinsert them ;)
>
> Yes, I was wondering that. Why not simply "move these blocks from here
> to there".
And open a completely unnecessary can of worms to do with
behavioural and implementation corner cases?
Do you allow it to destroy data by default? Or only allow moves into
holes?
What do you do with range the data is moved out of? Does it just
become a hole? What happens if the range overlaps EOF - does that
change the file size?
What if you want to move the range beyond EOF?
What if the source and destination ranges overlap?
What happens when you move the block at EOF into the middle of a
file - do you end up with zeros padding the block and the file size
having to be adjusted accordingly? Or do we have to *copy* all the
data in high blocks down to fill the hole in the block?
What behaviour should we expect if the filesystem can't implement
the entire move atomically and we crash in the middle of the move?
I can keep going, but I'll stop here - you get the idea.
In comparison, collapse range as a file data manipulation has very
specific requirements and from that we can define a simple, specific
API that allows filesystems to accelerate that operation by extent
manipulation rather than read/memcpy/write that the applications are
currently doing for this operation.... IOWs, collapse range is a
simple operation, "move arbitrary blocks from here to there" is a
nightmare both from the specification and the implementation points
of view.
Cheers,
Dave.
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, 26 Feb 2014 12:34:26 +1100 Dave Chinner <[email protected]> wrote:
> On Tue, Feb 25, 2014 at 03:41:28PM -0800, Andrew Morton wrote:
> > On Tue, 25 Feb 2014 15:23:35 -0800 (PST) Hugh Dickins <[email protected]> wrote:
> > > On Tue, 25 Feb 2014, Dave Chinner wrote:
> > > > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > > FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> > > but probably seven months too late to object. It surprises me that
> > > you're doing all this work to deflate a part of the file, without
> > > the obvious complementary work to inflate it - presumably all those
> > > advertisers whose ads you're cutting out, will come back to us soon
> > > to ask for inflation, so that they have somewhere to reinsert them ;)
> >
> > Yes, I was wondering that. Why not simply "move these blocks from here
> > to there".
>
> And open a completely unnecessary can of worms to do with
> behavioural and implementation corner cases?
But it's general.
> Do you allow it to destroy data by default? Or only allow moves into
> holes?
Overwrite.
> What do you do with range the data is moved out of? Does it just
> become a hole? What happens if the range overlaps EOF - does that
> change the file size?
Truncate.
> What if you want to move the range beyond EOF?
Extend.
> What if the source and destination ranges overlap?
Don't screw it up.
> What happens when you move the block at EOF into the middle of a
> file - do you end up with zeros padding the block and the file size
> having to be adjusted accordingly? Or do we have to *copy* all the
> data in high blocks down to fill the hole in the block?
I don't understand that. Move the block(s) and truncate to the new
length.
> What behaviour should we expect if the filesystem can't implement
> the entire move atomically and we crash in the middle of the move?
What does collapse_range do now?
If it's a journaled filesystem, it shouldn't screw up. If it isn't, fsck.
> I can keep going, but I'll stop here - you get the idea.
None of this seems like rocket science.
> In comparison, collapse range as a file data manipulation has very
> specific requirements and from that we can define a simple, specific
> API that allows filesystems to accelerate that operation by extent
> manipulation rather than read/memcpy/write that the applications are
> currently doing for this operation.... IOWs, collapse range is a
> simple operation, "move arbitrary blocks from here to there" is a
> nightmare both from the specification and the implementation points
> of view.
collapse_range seems weird, arbitrary and half-assed. "Why didn't they
go all the way and do it properly".
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Tue, Feb 25, 2014 at 05:52:16PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 12:34:26 +1100 Dave Chinner <[email protected]> wrote:
>
> > On Tue, Feb 25, 2014 at 03:41:28PM -0800, Andrew Morton wrote:
> > > On Tue, 25 Feb 2014 15:23:35 -0800 (PST) Hugh Dickins <[email protected]> wrote:
> > > > On Tue, 25 Feb 2014, Dave Chinner wrote:
> > > > > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > > > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > > > FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> > > > but probably seven months too late to object. It surprises me that
> > > > you're doing all this work to deflate a part of the file, without
> > > > the obvious complementary work to inflate it - presumably all those
> > > > advertisers whose ads you're cutting out, will come back to us soon
> > > > to ask for inflation, so that they have somewhere to reinsert them ;)
> > >
> > > Yes, I was wondering that. Why not simply "move these blocks from here
> > > to there".
> >
> > And open a completely unnecessary can of worms to do with
> > behavioural and implementation corner cases?
>
> But it's general.
Exactly. And because it's general, you can't make arbitrary
decisions about the behaviour.
> > Do you allow it to destroy data by default? Or only allow moves into
> > holes?
>
> Overwrite.
Application dev says: "I don't want it to overwrite data - I only
want to it succeed if it's moving into a hole that I've already
prepared".
> > What do you do with range the data is moved out of? Does it just
> > become a hole? What happens if the range overlaps EOF - does that
> > change the file size?
>
> Truncate.
A.D. says: "But I need FALLOC_FL_KEEP_SIZE semantics"
> > What if you want to move the range beyond EOF?
>
> Extend.
Filesystem developer says: "Ok, so what happens to the range between
the old EOF and destintation offset? What do you do with blocks
beyond EOF that fall within that range? punch, zero, preallocate the
entire range? Do users need to be able to specify this behaviour?
Hell, do we even know of an application that requires this
behaviour?"
> > What if the source and destination ranges overlap?
>
> Don't screw it up.
Exactly my point - it's a complex behaviour that is difficult to
verify that it is correct.
> > What happens when you move the block at EOF into the middle of a
> > file - do you end up with zeros padding the block and the file size
> > having to be adjusted accordingly? Or do we have to *copy* all the
> > data in high blocks down to fill the hole in the block?
>
> I don't understand that. Move the block(s) and truncate to the new
> length.
So, you are saying this (move from s to d):
+-----------------------------------------------------+
+sssssssssssss+
+ddddddddddddd+
should result in:
+--------+ddddddddddddd+
A.D. says: "That's not what I asked for! What happened to all the
rest of my data in the file between d and s? I didn't ask for them
to be removed. And I want a hole where the source was!"
> > What behaviour should we expect if the filesystem can't implement
> > the entire move atomically and we crash in the middle of the move?
>
> What does collapse_range do now?
>
> If it's a journaled filesystem, it shouldn't screw up. If it isn't, fsck.
Define "screw up". For journalled filesystems "don't screw up" means
the filesystem will be consistent after a crash, not that a change
made in a syscall is completed atomicly.
Indeed, collapse range isn't implemented atomically in XFS, and I
doubt it is in ext4. Why? Because the extent tree being manipulated
can be *much* larger than the journal and so the changes can't
easily be done atomically from a crash recovery perspective. The
result is that collapse range will end up with a hole somewhere in
the file the size of the range being collapsed. This was pointed out
during review some time in the past 6 months and, IIRC, the response
was "that's fine, just so long as the filesystem is not corrupted".
I have plans to fix this issue in XFS, but it isn't critical to the
correct functioning of devices using collapse range.
This just illustrates my point is that behaviour needs to be
specified so that we can get all filesystems with the same minimum
crash guarantees....
> > I can keep going, but I'll stop here - you get the idea.
>
> None of this seems like rocket science.
It's not rocket science, but the devil is in the details. There's no
requirements or specification to work from, let alone an application
that needs such generic functionality. Until these exist and there's
someone willing to put the effort into specifying, implementing and
testing such an interface, it's just not going to happen.
> > In comparison, collapse range as a file data manipulation has very
> > specific requirements and from that we can define a simple, specific
> > API that allows filesystems to accelerate that operation by extent
> > manipulation rather than read/memcpy/write that the applications are
> > currently doing for this operation.... IOWs, collapse range is a
> > simple operation, "move arbitrary blocks from here to there" is a
> > nightmare both from the specification and the implementation points
> > of view.
>
> collapse_range seems weird, arbitrary and half-assed. "Why didn't they
> go all the way and do it properly".
Yup, I can apply exactly the same argument to FALLOC_FL_PREALLOC and
FALLOC_FL_PUNCH_HOLE. We should have done them as part of a generic
block movement API as they are simply degenerate cases of NULL
source/destination targets for the block movement API....
Cheers,
Dave.
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, 26 Feb 2014, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > On Tue, 25 Feb 2014, Dave Chinner wrote:
> > > On Tue, Feb 25, 2014 at 02:16:01PM +1100, Stephen Rothwell wrote:
> > > > On Mon, 24 Feb 2014 11:57:10 +1100 Dave Chinner <[email protected]> wrote:
> > > > >
> > > > > > Namjae Jeon (10):
> > > > > > fs: Add new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
> > > > > > xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > > >
> > > > > I've pushed these to the following branch:
> > > > >
> > > > > git://oss.sgi.com/xfs/xfs.git xfs-collapse-range
> > > > >
> > > > > And so they'll be in tomorrow's linux-next tree.
> > > > >
> > > > > > ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> > > > >
> > > > > I've left this one alone for the ext4 guys to sort out.
> > > >
> > > > So presumably that xfs tree branch is now completely stable and so Ted
> > > > could just merge that branch into the ext4 tree as well and put the ext4
> > > > part on top of that in his tree.
> > >
> > > Well, for some definition of stable. Right now it's just a topic
> > > branch that is merged into the for-next branch, so in theory it is
> > > still just a set of pending changes in a branch in a repo that has
> > > been pushed to linux-next for testing.
> > >
> > > That said, I don't see that branch changing unless we find bugs in
> > > the code or a problem with the API needs fixing, at which point I
> > > would add more commits to it and rebase the for-next branch that you
> > > are pulling into the linux-next tree.
> > >
> > > Realistically, I'm waiting for Lukas to repost his other pending
> > > fallocate changes (the zero range changes) so I can pull the VFS and
> > > XFS bits of that into the XFS tree and I can test them together
> > > before I'll call the xfs-collapse-range stable and ready to be
> > > merged into some other tree...
> >
> > Thank you, Namjae and Dave, for driving this; and thank you, Ted and
> > Matthew, for raising appropriate mmap concerns (2013-7-31 and 2014-2-2).
> > I was aware of this work in progress, but only now found time to look.
> >
> > I've not studied the implementation, knowing too little of ext4 and
> > xfs; but it sounds like the approach you've taken, writing out dirties
> > and truncating all pagecache from the critical offset onwards, is the
> > sensible approach for now - lame, and leaves me wondering whether an
> > offline tool wouldn't be more appropriate; but a safe place to start
> > if we suppose it will be extended to handle pagecache better in future.
>
> Offline? You mean with the filesystem unmounted, or something else?
"Something else": which I don't expect you to leap into implementing.
>
> > Of course I'm interested in the possibility of extending it to tmpfs;
> > which may not be a worthwhile exercise in itself, except that it would
> > force us to face and solve any pagecache/radixtree issues, if possible,
> > thereby enhancing the support for disk-based filesystems.
> >
> > I doubt we should look into that before Jan Kara's range locking mods
> > arrive, or are rejected. As I understand it, you're going ahead with
> > this, knowing that there can be awkward races with concurrent faults -
> > more likely to cause trinity fuzzer reports, than interfere with daily
> > usage (trinity seems to be good at faulting into holes being punched).
>
> Yes, the caveat is that the applications that use it (TVs, DVRs, NLE
> applications, etc) typically don't use mmap for accessing the data
> stream being modified. Further, it's much less generally useful than
> holepunching, so when these two are combined, the likely exposure to
> issues resulting from mmap deficiencies are pretty damn low.
Agreed, but we do want to define how the interaction behaves, we want
it to be the same across all filesystems supporting COLLAPSE_RANGE,
and we don't want it to lead to system crashes or corruptions.
>
> > That's probably the right pragmatic decision; but I'm a little worried
> > that it's justfied by saying we already have such races in hole-punch.
> > Collapse is significantly more challenging than either hole-punch or
> > truncation: the shifting of pages from one offset to another is new,
> > and might present nastier bugs.
>
> Symptoms might be different, but it's exactly the same problem. i.e.
> mmap_sem locking inversions preventing the filesystem from
> serialising IO path operations like hole punch, truncate and other
> extent manipulation operations against concurrent page faults
> that enter the IO path.
That may (may) be true of the current kick-everything-out-of-pagecache
approach. But in general I stand by "Collapse is significantly more
challenging". Forgive me if that amounts to saying "Hey, here's a
more complicated way to do it. Ooh, this way is more complicated."
The concept of moving a page from one file offset to another is new,
and can be expected to pose new difficulties.
>
> > Emphasis on "might": I expect it's impossible, given your current
> > approach, but something to be on guard against is unmap_mapping_range()
> > failing to find and unmap a pte, because the page is mapped at the
> > "wrong" place in the vma, resulting in BUG_ON(page_mapped(page))
> > in __delete_from_page_cache().
>
> Unmapping occurs before anything is shifted. And even if a fault
> does occur before the file size changes at the end of a collapse
> range operation (via the truncate path), the page in the page cache
> won't be moved about so I don't see how the above problem could
> occur. All that will happen is that you get the wrong data in the
> mmap()d page, just like you will with hole_punch issues.
I think you're probably right. I expect that attempting to fault
a page back from disk while collapse is shifting down, will hit a
mutex and wait. But that's liable to differ from filesystem to
filesystem, so I'm not certain.
>
> > One thing that is slightly wrong in what you have right now, is your
> > use of truncate_pagecache_range(): you'll need to add an "even_cows"
> > arg to that (or make it a wrapper to a __truncate_pagecache_range()
> > taking additional "even_cows" arg). That arg governs what is done
> > with anonymous COW'ed pages in a MAP_PRIVATE mmap of the file.
> >
> > Truncation is required by spec to unmap them; hole punching was up
> > to us to spec, did not unmap them originally (because there was no
> > preliminary call to unmap_mapping_range(), so it happened to rely
> > on the inefficient fallback within truncate_inode_page()), and that
> > seemed fine because, why discard a user's data unnecessarily?
>
> I don't see there being a problem here - it's the same case as hole
> punch. And when we change the file size after shifting extents we
> are doing a truncate so that will behave as expected w.r.t.
> anonymous COW'ed pages in a MAP_PRIVATE mmap of the file.
No, it's only the same case as hole punch, up to a point.
>
> > But your case is different: collapse is much closer to truncation,
> > and if you do not unmap the private COW'ed pages, then pages left
> > behind beyond the EOF will break the spec that requires SIGBUS when
> > touching there, and pages within EOF will be confusingly derived
> > from file data now belonging to another offset or none (move these
> > pages within the user address space? no, I don't think anon_vmas
> > would allow that, and there may be no right place to move them).
>
> See above - we never leave pages beyond the new EOF because setting
> the new EOF is a truncate operation that calls
> truncate_setsize(inode, newsize).
Right, thanks, I now see the truncate_setsize() in the xfs case -
though not in the ext4 case, which looks as if it's just doing an
i_size_write() afterwards.
Yes, truncate_setsize() at the end should answer my SIGBUS objection.
And with that out of the way, although I don't particularly care for
the weirdness of private COW'ed pages becoming associated with file
offsets they never originated from, I don't think I could argue with
you when you tell me "well, that's the weirdness you get from mixing
COLLAPSE_RANGE with MAP_PRIVATE mmaps".
Looks like there's no need for the __truncate_pagecache_range() with
even_cows arg that I was advocating: we just need ext4 to truncate
properly at the end, and document the disassociated private pages.
>
> > It's clear that the right and easy thing to do is just to unmap
> > them (all of them, from critical offset to EOF), in the rare case
> > of there being any such pages. Whether this detail needs to be
> > mentioned in the man page (I don't like throwing away a user's
> > data without warning) I'm not sure, Michael can judge.
> >
> > FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> > but probably seven months too late to object. It surprises me that
> > you're doing all this work to deflate a part of the file, without
> > the obvious complementary work to inflate it - presumably all those
> > advertisers whose ads you're cutting out, will come back to us soon
> > to ask for inflation, so that they have somewhere to reinsert them ;)
>
> The name makes no difference to me - it's a filesystem offload
> function of a very specific nature. If we require the opposite
> behaviour - inserting unwritten extents after shifting the data up
> out of the way - then we can just add a new FALLOC_FL_INSERT_RANGE
> command to do that.
>
> But in the absence of anyone needing such functionality, the
> complexity of implementing it is not worth the effort.
Yes, it's not a requirement that it be implemented immediately.
>
> > I should mention that when "we" implemented this thirty years ago,
> > we had a strong conviction that the system call should be idempotent:
> > that is, the len argument should indicate the final i_size, not the
> > amount being removed from it. Now, I don't remember the grounds for
> > that conviction: maybe it was just an idealistic preference for how
> > to design a good system call. I can certainly see that defining it
> > that way round would surprise many app programmers. Just mentioning
> > this in case anyone on these lists sees a practical advantage to
> > doing it that way instead.
>
> I don't see how specifying the end file size as an improvement. What
> happens if you are collapse a range in a file that is still being
> appended to by the application and so you race with a file size
> update? IOWs, with such an API the range to be collapsed is
> completely unpredictable, and IMO that's a fundamentally broken API.
That's fine if you don't see the idempotent API as an improvement,
I just wanted to put it on the table in case someone does see an
advantage to it. But I think I'm missing something in your race
example: I don't see a difference between the two APIs there.
>
> > I see you've included xfstests and xfs_io updates, nice. Did you
> > realize that util-linux has a /usr/bin/fallocate? I hope someone
> > will update that too.
>
> I don't care about /usr/bin/fallocate - I've never used it in my
> life because xfs_io exists on all my systems and is way more
> powerful and useful to me than /usr/bin/fallocate. Regardless,
> I think someone posted patches for it yesterday.
Your scorn is noted: yes, it is pretty simple,
but I'm glad to hear Dongsu is attending to it.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > > Of course I'm interested in the possibility of extending it to tmpfs;
> > > which may not be a worthwhile exercise in itself, except that it would
> > > force us to face and solve any pagecache/radixtree issues, if possible,
> > > thereby enhancing the support for disk-based filesystems.
> > >
> > > I doubt we should look into that before Jan Kara's range locking mods
> > > arrive, or are rejected. As I understand it, you're going ahead with
> > > this, knowing that there can be awkward races with concurrent faults -
> > > more likely to cause trinity fuzzer reports, than interfere with daily
> > > usage (trinity seems to be good at faulting into holes being punched).
> >
> > Yes, the caveat is that the applications that use it (TVs, DVRs, NLE
> > applications, etc) typically don't use mmap for accessing the data
> > stream being modified. Further, it's much less generally useful than
> > holepunching, so when these two are combined, the likely exposure to
> > issues resulting from mmap deficiencies are pretty damn low.
>
> Agreed, but we do want to define how the interaction behaves, we want
> it to be the same across all filesystems supporting COLLAPSE_RANGE,
> and we don't want it to lead to system crashes or corruptions.
We have defined it. It's just a data manipulation operation that
moves file data from one offset to another. How a filesystem
implements that is a filesystem's problem, not a problem with the
API.
ext4 and XFS implement it by removing all cached data and mappings
over the range from memory and then mainpulating extent status.
Other fileystems might be able to do similar things, or we might
just do an internal kernel read/write loop. And there's nothing
ruling out a hardware ior nework protocol copy offload being used to
implement an optimised data copy, either.
tmpfs is different in that it's only data store is the page cache,
so it needs to operate within the constraints of the page cache.
moving pages inside the page cache might be a new operation for
tmpfs, but it's not an operation that is needed by filesystems
to implement this file data manipulation.
> > > That's probably the right pragmatic decision; but I'm a little worried
> > > that it's justfied by saying we already have such races in hole-punch.
> > > Collapse is significantly more challenging than either hole-punch or
> > > truncation: the shifting of pages from one offset to another is new,
> > > and might present nastier bugs.
> >
> > Symptoms might be different, but it's exactly the same problem. i.e.
> > mmap_sem locking inversions preventing the filesystem from
> > serialising IO path operations like hole punch, truncate and other
> > extent manipulation operations against concurrent page faults
> > that enter the IO path.
>
> That may (may) be true of the current kick-everything-out-of-pagecache
> approach. But in general I stand by "Collapse is significantly more
> challenging". Forgive me if that amounts to saying "Hey, here's a
> more complicated way to do it. Ooh, this way is more complicated."
> The concept of moving a page from one file offset to another is new,
> and can be expected to pose new difficulties.
Collapse might be challenging for tmpfs, but it's relatively trivial
for block based filesystems because we have an independent backing
store and so we don't need to move cached data around.
> > > Emphasis on "might": I expect it's impossible, given your current
> > > approach, but something to be on guard against is unmap_mapping_range()
> > > failing to find and unmap a pte, because the page is mapped at the
> > > "wrong" place in the vma, resulting in BUG_ON(page_mapped(page))
> > > in __delete_from_page_cache().
> >
> > Unmapping occurs before anything is shifted. And even if a fault
> > does occur before the file size changes at the end of a collapse
> > range operation (via the truncate path), the page in the page cache
> > won't be moved about so I don't see how the above problem could
> > occur. All that will happen is that you get the wrong data in the
> > mmap()d page, just like you will with hole_punch issues.
>
> I think you're probably right. I expect that attempting to fault
> a page back from disk while collapse is shifting down, will hit a
> mutex and wait. But that's liable to differ from filesystem to
> filesystem, so I'm not certain.
Well, no, we can't do that entirely atomically because of mmap_sem
inversions. Individual extent shifts, yes, but not against the
operation as a whole. i.e. fallocate needs to serialise against IO
operations, but we can't serialise Io operations against page faults
because of mmap_sem inversions....
> > > But your case is different: collapse is much closer to truncation,
> > > and if you do not unmap the private COW'ed pages, then pages left
> > > behind beyond the EOF will break the spec that requires SIGBUS when
> > > touching there, and pages within EOF will be confusingly derived
> > > from file data now belonging to another offset or none (move these
> > > pages within the user address space? no, I don't think anon_vmas
> > > would allow that, and there may be no right place to move them).
> >
> > See above - we never leave pages beyond the new EOF because setting
> > the new EOF is a truncate operation that calls
> > truncate_setsize(inode, newsize).
>
> Right, thanks, I now see the truncate_setsize() in the xfs case -
> though not in the ext4 case, which looks as if it's just doing an
> i_size_write() afterwards.
So that's a bug in the ext4 code ;)
> Yes, truncate_setsize() at the end should answer my SIGBUS objection.
> And with that out of the way, although I don't particularly care for
> the weirdness of private COW'ed pages becoming associated with file
> offsets they never originated from, I don't think I could argue with
> you when you tell me "well, that's the weirdness you get from mixing
> COLLAPSE_RANGE with MAP_PRIVATE mmaps".
>
> Looks like there's no need for the __truncate_pagecache_range() with
> even_cows arg that I was advocating: we just need ext4 to truncate
> properly at the end, and document the disassociated private pages.
*nod*
> > > It's clear that the right and easy thing to do is just to unmap
> > > them (all of them, from critical offset to EOF), in the rare case
> > > of there being any such pages. Whether this detail needs to be
> > > mentioned in the man page (I don't like throwing away a user's
> > > data without warning) I'm not sure, Michael can judge.
> > >
> > > FALLOC_FL_COLLAPSE_RANGE: I'm a little sad at the name COLLAPSE,
> > > but probably seven months too late to object. It surprises me that
> > > you're doing all this work to deflate a part of the file, without
> > > the obvious complementary work to inflate it - presumably all those
> > > advertisers whose ads you're cutting out, will come back to us soon
> > > to ask for inflation, so that they have somewhere to reinsert them ;)
> >
> > The name makes no difference to me - it's a filesystem offload
> > function of a very specific nature. If we require the opposite
> > behaviour - inserting unwritten extents after shifting the data up
> > out of the way - then we can just add a new FALLOC_FL_INSERT_RANGE
> > command to do that.
> >
> > But in the absence of anyone needing such functionality, the
> > complexity of implementing it is not worth the effort.
>
> Yes, it's not a requirement that it be implemented immediately.
>
> >
> > > I should mention that when "we" implemented this thirty years ago,
> > > we had a strong conviction that the system call should be idempotent:
> > > that is, the len argument should indicate the final i_size, not the
> > > amount being removed from it. Now, I don't remember the grounds for
> > > that conviction: maybe it was just an idealistic preference for how
> > > to design a good system call. I can certainly see that defining it
> > > that way round would surprise many app programmers. Just mentioning
> > > this in case anyone on these lists sees a practical advantage to
> > > doing it that way instead.
> >
> > I don't see how specifying the end file size as an improvement. What
> > happens if you are collapse a range in a file that is still being
> > appended to by the application and so you race with a file size
> > update? IOWs, with such an API the range to be collapsed is
> > completely unpredictable, and IMO that's a fundamentally broken API.
>
> That's fine if you don't see the idempotent API as an improvement,
> I just wanted to put it on the table in case someone does see an
> advantage to it. But I think I'm missing something in your race
> example: I don't see a difference between the two APIs there.
Userspace can't sample the inode size via stat(2) and then use the value for a
syscall atomically. i.e. if you specify the offset you want to
collapse at, and the file size you want to have to define the region
to collapse, then the length you need to collapse is (current inode
size - end file size). If "current inode size" can change between
the stat(2) and fallocate() call (and it can), then the length being
collapsed is indeterminate....
> > > I see you've included xfstests and xfs_io updates, nice. Did you
> > > realize that util-linux has a /usr/bin/fallocate? I hope someone
> > > will update that too.
> >
> > I don't care about /usr/bin/fallocate - I've never used it in my
> > life because xfs_io exists on all my systems and is way more
> > powerful and useful to me than /usr/bin/fallocate. Regardless,
> > I think someone posted patches for it yesterday.
>
> Your scorn is noted: yes, it is pretty simple,
> but I'm glad to hear Dongsu is attending to it.
Not scorn - "don't care" was careless phrasing. What I was trying to
say is that fallocate is not really relevant for testing filesystem
implementations as you need a bunch more functionality to be able to
test the syscall and filesystems adequately.....
Cheers,
Dave.
>
> Hugh
>
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, 26 Feb 2014, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> > On Wed, 26 Feb 2014, Dave Chinner wrote:
> > > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > >
> > > > I should mention that when "we" implemented this thirty years ago,
> > > > we had a strong conviction that the system call should be idempotent:
> > > > that is, the len argument should indicate the final i_size, not the
> > > > amount being removed from it. Now, I don't remember the grounds for
> > > > that conviction: maybe it was just an idealistic preference for how
> > > > to design a good system call. I can certainly see that defining it
> > > > that way round would surprise many app programmers. Just mentioning
> > > > this in case anyone on these lists sees a practical advantage to
> > > > doing it that way instead.
> > >
> > > I don't see how specifying the end file size as an improvement. What
> > > happens if you are collapse a range in a file that is still being
> > > appended to by the application and so you race with a file size
> > > update? IOWs, with such an API the range to be collapsed is
> > > completely unpredictable, and IMO that's a fundamentally broken API.
> >
> > That's fine if you don't see the idempotent API as an improvement,
> > I just wanted to put it on the table in case someone does see an
> > advantage to it. But I think I'm missing something in your race
> > example: I don't see a difference between the two APIs there.
>
>
> Userspace can't sample the inode size via stat(2) and then use the value for a
> syscall atomically. i.e. if you specify the offset you want to
> collapse at, and the file size you want to have to define the region
> to collapse, then the length you need to collapse is (current inode
> size - end file size). If "current inode size" can change between
> the stat(2) and fallocate() call (and it can), then the length being
> collapsed is indeterminate....
Thanks for explaining more, I was just about to acknowledge what a good
example that is. Indeed, it seems not unreasonable to be editing the
earlier part of a file while the later part of it is still streaming in.
But damn, it now occurs to me that there's still a problem at the
streaming end: its file write offset won't be updated to reflect
the collapse, so there would be a sparse hole at that end. And
collapse returns -EPERM if IS_APPEND(inode).
Never mind, I'm not campaigning for a change of interface anyway.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, Feb 26, 2014 at 03:08:58PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> > > On Wed, 26 Feb 2014, Dave Chinner wrote:
> > > > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > > >
> > > > > I should mention that when "we" implemented this thirty years ago,
> > > > > we had a strong conviction that the system call should be idempotent:
> > > > > that is, the len argument should indicate the final i_size, not the
> > > > > amount being removed from it. Now, I don't remember the grounds for
> > > > > that conviction: maybe it was just an idealistic preference for how
> > > > > to design a good system call. I can certainly see that defining it
> > > > > that way round would surprise many app programmers. Just mentioning
> > > > > this in case anyone on these lists sees a practical advantage to
> > > > > doing it that way instead.
> > > >
> > > > I don't see how specifying the end file size as an improvement. What
> > > > happens if you are collapse a range in a file that is still being
> > > > appended to by the application and so you race with a file size
> > > > update? IOWs, with such an API the range to be collapsed is
> > > > completely unpredictable, and IMO that's a fundamentally broken API.
> > >
> > > That's fine if you don't see the idempotent API as an improvement,
> > > I just wanted to put it on the table in case someone does see an
> > > advantage to it. But I think I'm missing something in your race
> > > example: I don't see a difference between the two APIs there.
> >
> >
> > Userspace can't sample the inode size via stat(2) and then use the value for a
> > syscall atomically. i.e. if you specify the offset you want to
> > collapse at, and the file size you want to have to define the region
> > to collapse, then the length you need to collapse is (current inode
> > size - end file size). If "current inode size" can change between
> > the stat(2) and fallocate() call (and it can), then the length being
> > collapsed is indeterminate....
>
> Thanks for explaining more, I was just about to acknowledge what a good
> example that is. Indeed, it seems not unreasonable to be editing the
> earlier part of a file while the later part of it is still streaming in.
>
> But damn, it now occurs to me that there's still a problem at the
> streaming end: its file write offset won't be updated to reflect
> the collapse, so there would be a sparse hole at that end. And
> collapse returns -EPERM if IS_APPEND(inode).
Well, we figure that most applications won't be using append only
inode flags for files that they know they want to edit at random
offsets later on. ;)
However, I can see how DVR apps would use open(O_APPEND) to obtain
the fd they write to because that sets the write position to the EOF
on every write() call (i.e. in generic_write_checks()). And collapse
range should behave sanely with this sort of usage.
e.g. XFS calls generic_write_checks() after it has taken the IO lock
to set the current write position to EOF. Hence it will be correctly
serialised against collapse range calls and so O_APPEND writes will
not leave sparse holes if collapse range calls are interleaved with
the write stream....
Cheers,
Dave.
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Thu, 27 Feb 2014, Dave Chinner wrote:
> On Wed, Feb 26, 2014 at 03:08:58PM -0800, Hugh Dickins wrote:
> >
> > Thanks for explaining more, I was just about to acknowledge what a good
> > example that is. Indeed, it seems not unreasonable to be editing the
> > earlier part of a file while the later part of it is still streaming in.
> >
> > But damn, it now occurs to me that there's still a problem at the
> > streaming end: its file write offset won't be updated to reflect
> > the collapse, so there would be a sparse hole at that end. And
> > collapse returns -EPERM if IS_APPEND(inode).
>
> Well, we figure that most applications won't be using append only
> inode flags for files that they know they want to edit at random
> offsets later on. ;)
>
> However, I can see how DVR apps would use open(O_APPEND) to obtain
> the fd they write to because that sets the write position to the EOF
> on every write() call (i.e. in generic_write_checks()). And collapse
> range should behave sanely with this sort of usage.
>
> e.g. XFS calls generic_write_checks() after it has taken the IO lock
> to set the current write position to EOF. Hence it will be correctly
> serialised against collapse range calls and so O_APPEND writes will
> not leave sparse holes if collapse range calls are interleaved with
> the write stream....
Right, I was getting confused between O_APPEND and APPEND_Only!
Thanks, I'm back to being convinced by your example.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>