I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.
But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:
1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
file size in this case. I also tend to agree with them. Does anyone
has an argument in favor of changing the filesize ?
If not, I will remove the code which changes the filesize, before I
resubmit the concerned ext4 patch.
2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
we need to finalize on the convention here as a general guideline
to all the filesystems that implement fallocate.
3) If above is true, the file size will need to be changed
for "unallocation" when block holding the EOF gets unallocated.
- If we do not "unallocate" normal (non-preallocated) blocks and we
do not change the file size on preallocation, then this is a
non-issue.
4) Should we update mtime & ctime on a successfull allocation/
unallocation ?
- David Chinner raised this question in following post:
http://lkml.org/lkml/2007/4/29/407
I think it makes sense to update the [mc]time for a successfull
preallocation/unallocation. Does anyone feel otherwise ?
It will be interesting to know how XFS behaves currently. Does XFS
update [mc]time for preallocation ?
--
Regards,
Amit Arora
On May 09, 2007 21:31 +0530, Amit K. Arora wrote:
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> of normal (non-preallocated) blocks (blocks allocated via
> regular write/truncate operations) also (i.e. work as punch()) ?
> - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> we need to finalize on the convention here as a general guideline
> to all the filesystems that implement fallocate.
I would only allow this on FA_ALLOCATE extents. That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.
> 3) If above is true, the file size will need to be changed
> for "unallocation" when block holding the EOF gets unallocated.
> - If we do not "unallocate" normal (non-preallocated) blocks and we
> do not change the file size on preallocation, then this is a
> non-issue.
Not necessarily. That will just make the file sparse. If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.
> 4) Should we update mtime & ctime on a successfull allocation/
> unallocation ?
I would say yes. If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods. Also, it just makes sense from the "this file was modified"
point of view.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
>
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
>
> 1) Should the file size change when preallocation is done beyond EOF ?
> - Andreas and Chris Wedgwood are in favor of not changing the
> file size in this case. I also tend to agree with them. Does anyone
> has an argument in favor of changing the filesize ?
> If not, I will remove the code which changes the filesize, before I
> resubmit the concerned ext4 patch.
>
If we chose not to update the file size beyong EOF, then for filesystem
without fallocate() support (ext2,3 currently), posix_fallocate() will
follow the hard way(zero-out) to do preallocation. Then we will get
different behavior on filesystems w/o fallocate() support. It make sense
to be consistent, IMO.
My point of view, preallocation is just a efficient way to allocating
blocks for files without zero-out, other than this, the new behavior
should be consistent with the old way: file size update,mtime/ctime,
ENOSPC etc.
Mingming
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
>
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
>
> 1) Should the file size change when preallocation is done beyond EOF ?
> - Andreas and Chris Wedgwood are in favor of not changing the
> file size in this case. I also tend to agree with them. Does anyone
> has an argument in favor of changing the filesize ?
> If not, I will remove the code which changes the filesize, before I
> resubmit the concerned ext4 patch.
I think there needs to be both. If we don't have a mechanism to
atomically change the file size with the preallocation, then
applications that use stat() to work out if they need to preallocate
more space will end up racing.
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> of normal (non-preallocated) blocks (blocks allocated via
> regular write/truncate operations) also (i.e. work as punch()) ?
Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what i did for FA_UNALLOCATE as well.
> - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> we need to finalize on the convention here as a general guideline
> to all the filesystems that implement fallocate.
>
> 3) If above is true, the file size will need to be changed
> for "unallocation" when block holding the EOF gets unallocated.
No - we punch a hole. If you want the filesize to change, then
you use ftruncate() to remove the blocks at EOF and change the
file size atomically.
> 4) Should we update mtime & ctime on a successfull allocation/
> unallocation ?
> - David Chinner raised this question in following post:
> http://lkml.org/lkml/2007/4/29/407
> I think it makes sense to update the [mc]time for a successfull
> preallocation/unallocation. Does anyone feel otherwise ?
> It will be interesting to know how XFS behaves currently. Does XFS
> update [mc]time for preallocation ?
No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
changes. If the filesize changes, it behaves exactly the same way that
ftruncate() behaves.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > I have the updated patches ready which take care of Andrew's comments.
> > Will run some tests and post them soon.
> >
> > But, before submitting these patches, I think it will be better to finalize
> > on certain things which might be worth some discussion here:
> >
> > 1) Should the file size change when preallocation is done beyond EOF ?
> > - Andreas and Chris Wedgwood are in favor of not changing the
> > file size in this case. I also tend to agree with them. Does anyone
> > has an argument in favor of changing the filesize ?
> > If not, I will remove the code which changes the filesize, before I
> > resubmit the concerned ext4 patch.
>
> I think there needs to be both. If we don't have a mechanism to
> atomically change the file size with the preallocation, then
> applications that use stat() to work out if they need to preallocate
> more space will end up racing.
By "both" above, do you mean we should give user the flexibility if it
wants the filesize changed or not ? It can be done by having *two* modes
for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE
mode is used, fallocate() will change the filesize if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.
This will be helpfull for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called. :(
Hence it may be a good idea to give user the flexibility if it wants to
atomically change the file size with preallocation or not. But, with
more flexibility there comes inconsistency in behavior, which is worth
considering.
>
> > 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> > of normal (non-preallocated) blocks (blocks allocated via
> > regular write/truncate operations) also (i.e. work as punch()) ?
>
> Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
> what i did for FA_UNALLOCATE as well.
Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
> > - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> > we need to finalize on the convention here as a general guideline
> > to all the filesystems that implement fallocate.
> >
> > 3) If above is true, the file size will need to be changed
> > for "unallocation" when block holding the EOF gets unallocated.
>
> No - we punch a hole. If you want the filesize to change, then
> you use ftruncate() to remove the blocks at EOF and change the
> file size atomically.
Ok.
>
> > 4) Should we update mtime & ctime on a successfull allocation/
> > unallocation ?
> > - David Chinner raised this question in following post:
> > http://lkml.org/lkml/2007/4/29/407
> > I think it makes sense to update the [mc]time for a successfull
> > preallocation/unallocation. Does anyone feel otherwise ?
> > It will be interesting to know how XFS behaves currently. Does XFS
> > update [mc]time for preallocation ?
>
> No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
> changes. If the filesize changes, it behaves exactly the same way that
> ftruncate() behaves.
Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.
--
Regards,
Amit Arora
On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > I have the updated patches ready which take care of Andrew's comments.
> > > Will run some tests and post them soon.
> > >
> > > But, before submitting these patches, I think it will be better to
> > > finalize on certain things which might be worth some discussion here:
> > >
> > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > in this case. I also tend to agree with them. Does anyone has an
> > > argument in favor of changing the filesize ? If not, I will remove the
> > > code which changes the filesize, before I resubmit the concerned ext4
> > > patch.
> >
> > I think there needs to be both. If we don't have a mechanism to atomically
> > change the file size with the preallocation, then applications that use
> > stat() to work out if they need to preallocate more space will end up
> > racing.
>
> By "both" above, do you mean we should give user the flexibility if it wants
> the filesize changed or not ? It can be done by having *two* modes for
> preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> will change the filesize if required (i.e. when allocation is beyond EOF)
> and also update [cm]time. This way, the application can decide what it
> wants.
Yes, that's right.
> This will be helpfull for the partial allocation scenario also. Think of the
> case when we do not change the filesize in fallocate() and expect
> applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> Now if fallocate() results in a partial allocation with -ENOSPC error
> returned, applications/posix_fallocate() will not know for what length
> ftruncate() has to be called. :(
Well, posix_fallocate() either gets all the space or it fails. If
you truncate to extend the file size after an ENOSPC, then that is
a buggy implementation.
The same could be said for any application, or even the fallocate()
call itself if it changes the filesize without having completely
preallocated the space asked....
> Hence it may be a good idea to give user the flexibility if it wants to
> atomically change the file size with preallocation or not. But, with more
> flexibility there comes inconsistency in behavior, which is worth
> considering.
We've got different modes to specify different behaviour. That's
what the mode field was put there for in the first place - the
interface is *designed* to support different preallocation
behaviours....
> > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > normal (non-preallocated) blocks (blocks allocated via regular
> > > write/truncate operations) also (i.e. work as punch()) ?
> >
> > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > i did for FA_UNALLOCATE as well.
>
> Ok. But, some people may not expect/like this. I think, we can keep it on
> the backburner for a while, till other issues are sorted out.
How can it be a "backburner" issue when it defines the
implementation? I've already implemented some thing in XFS that
sort of does what I think that the interface is supposed to do, but
I need that interface to be nailed down before proceeding any
further.
All I'm really interested in right now is that the fallocate
_interface_ can be used as a *complete replacement* for the
pre-existing XFS-specific ioctls that are already used by
applications. What ext4 can or can't do right now is irrelevant to
this discussion - the interface definition needs to take priority
over implementation....
Cheers,
Dave,
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's comments.
> > > > Will run some tests and post them soon.
> > > >
> > > > But, before submitting these patches, I think it will be better to
> > > > finalize on certain things which might be worth some discussion here:
> > > >
> > > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > > in this case. I also tend to agree with them. Does anyone has an
> > > > argument in favor of changing the filesize ? If not, I will remove the
> > > > code which changes the filesize, before I resubmit the concerned ext4
> > > > patch.
> > >
> > > I think there needs to be both. If we don't have a mechanism to atomically
> > > change the file size with the preallocation, then applications that use
> > > stat() to work out if they need to preallocate more space will end up
> > > racing.
> >
> > By "both" above, do you mean we should give user the flexibility if it wants
> > the filesize changed or not ? It can be done by having *two* modes for
> > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> > will change the filesize if required (i.e. when allocation is beyond EOF)
> > and also update [cm]time. This way, the application can decide what it
> > wants.
>
> Yes, that's right.
>
> > This will be helpfull for the partial allocation scenario also. Think of the
> > case when we do not change the filesize in fallocate() and expect
> > applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> > Now if fallocate() results in a partial allocation with -ENOSPC error
> > returned, applications/posix_fallocate() will not know for what length
> > ftruncate() has to be called. :(
>
> Well, posix_fallocate() either gets all the space or it fails. If
> you truncate to extend the file size after an ENOSPC, then that is
> a buggy implementation.
>
> The same could be said for any application, or even the fallocate()
> call itself if it changes the filesize without having completely
> preallocated the space asked....
>
> > Hence it may be a good idea to give user the flexibility if it wants to
> > atomically change the file size with preallocation or not. But, with more
> > flexibility there comes inconsistency in behavior, which is worth
> > considering.
>
> We've got different modes to specify different behaviour. That's
> what the mode field was put there for in the first place - the
> interface is *designed* to support different preallocation
> behaviours....
>
> > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > > normal (non-preallocated) blocks (blocks allocated via regular
> > > > write/truncate operations) also (i.e. work as punch()) ?
> > >
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > > i did for FA_UNALLOCATE as well.
> >
> > Ok. But, some people may not expect/like this. I think, we can keep it on
> > the backburner for a while, till other issues are sorted out.
>
> How can it be a "backburner" issue when it defines the
> implementation? I've already implemented some thing in XFS that
> sort of does what I think that the interface is supposed to do, but
> I need that interface to be nailed down before proceeding any
> further.
>
> All I'm really interested in right now is that the fallocate
> _interface_ can be used as a *complete replacement* for the
> pre-existing XFS-specific ioctls that are already used by
> applications. What ext4 can or can't do right now is irrelevant to
> this discussion - the interface definition needs to take priority
> over implementation....
Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?
One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now.
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.
Regards
Suparna
>
> Cheers,
>
> Dave,
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > All I'm really interested in right now is that the fallocate
> > _interface_ can be used as a *complete replacement* for the
> > pre-existing XFS-specific ioctls that are already used by
> > applications. What ext4 can or can't do right now is irrelevant to
> > this discussion - the interface definition needs to take priority
> > over implementation....
>
> Would you like to write up an interface definition description (likely
> man page) and post it for review, possibly with a mention of apps using
> it today ?
Yeah, I started doing that yesterday as i figured it was the only way
to cut the discussion short....
> One reason for introducing the mode parameter was to allow the interface to
> evolve incrementally as more options / semantic questions are proposed, so
> that we don't have to make all the decisions right now.
> So it would be good to start with a *minimal* definition, even just one mode.
> The rest could follow as subsequent patches, each being reviewed and debated
> separately. Otherwise this discussion can drag on for a long time.
Minimal definition to replace what applicaitons use on XFS and to
support poasix_fallocate are the thre that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
all in a man page...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > > All I'm really interested in right now is that the fallocate
> > > _interface_ can be used as a *complete replacement* for the
> > > pre-existing XFS-specific ioctls that are already used by
> > > applications. What ext4 can or can't do right now is irrelevant to
> > > this discussion - the interface definition needs to take priority
> > > over implementation....
> >
> > Would you like to write up an interface definition description (likely
> > man page) and post it for review, possibly with a mention of apps using
> > it today ?
>
> Yeah, I started doing that yesterday as i figured it was the only way
> to cut the discussion short....
>
> > One reason for introducing the mode parameter was to allow the interface to
> > evolve incrementally as more options / semantic questions are proposed, so
> > that we don't have to make all the decisions right now.
> > So it would be good to start with a *minimal* definition, even just one mode.
> > The rest could follow as subsequent patches, each being reviewed and debated
> > separately. Otherwise this discussion can drag on for a long time.
>
> Minimal definition to replace what applicaitons use on XFS and to
> support poasix_fallocate are the thre that have been mentioned so
> far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> all in a man page...
Hi Dave,
Did you get time to write the above man page ? It will help to push
further patches in time (eg. for FA_PREALLOCATE mode).
The idea I had was to push the patch with bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
new mode(s) based on the man page you planned to provide.
Thanks!
--
Regards,
Amit Arora
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> > Minimal definition to replace what applicaitons use on XFS and to
> > support poasix_fallocate are the thre that have been mentioned so
> > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> > all in a man page...
>
> Hi Dave,
>
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).
No, I didn't. Instead of working on new preallocation stuff, I've
been spending all my time fixing bugs found by new and interesting
(ab)uses of preallocation and hole punching.
> The idea I had was to push the patch with bare minimum functionality
> (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
> new mode(s) based on the man page you planned to provide.
Push them. I'll just make XFS work with whatever is provided.
Is there a test harness for the syscall yet?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).
First pass is attached.
`nroff -man fallocate.2 | less` to view.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Jun 14, 2007 09:52 +1000, David Chinner wrote:
> B FA_PREALLOCATE
> provides the same functionality as
> B FA_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising
> append workloads.
> TP
> B FA_DEALLOCATE
> removes the underlying disk space with the given range. The disk space
> shall be removed regardless of it's contents so both allocated space
> from
> B FA_ALLOCATE
> and
> B FA_PREALLOCATE
> as well as from
> B write(3)
> will be removed.
> B FA_DEALLOCATE
> shall never remove disk blocks outside the range specified.
So this is essentially the same as "punch". There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
end.
> B FA_DEALLOCATE
> shall never change the file size. If changing the file size
> is required when deallocating blocks from an offset to end
> of file (or beyond end of file) is required,
> B ftuncate64(3)
> should be used.
This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions. How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE? Say, FA_PUNCH (which
would be as you describe here - deletes all data in the specified
range changing the file size if it overlaps EOF, and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?
We might also consider making @mode be a mask instead of an enumeration:
FA_FL_DEALLOC 0x01 (default allocate)
FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC)
We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.
I suppose it might be a bit late in the game to add a "goal"
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS? The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP). I guess the one potential limitation here is the
number of function parameters on some architectures.
> B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> IR fd.
Should probably say whether space is removed on failure or not. In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> On Jun 14, 2007 09:52 +1000, David Chinner wrote:
> > B FA_PREALLOCATE
> > provides the same functionality as
> > B FA_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> > append workloads.
> > TP
> > B FA_DEALLOCATE
> > removes the underlying disk space with the given range. The disk space
> > shall be removed regardless of it's contents so both allocated space
> > from
> > B FA_ALLOCATE
> > and
> > B FA_PREALLOCATE
> > as well as from
> > B write(3)
> > will be removed.
> > B FA_DEALLOCATE
> > shall never remove disk blocks outside the range specified.
>
> So this is essentially the same as "punch".
Depends on your definition of "punch".
> There doesn't seem to be
> a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> end.
ftruncate()
> > B FA_DEALLOCATE
> > shall never change the file size. If changing the file size
> > is required when deallocating blocks from an offset to end
> > of file (or beyond end of file) is required,
> > B ftuncate64(3)
> > should be used.
>
> This also seems to be a bit of a wart, since it isn't a natural converse
> of either of the above functions. How about having two modes,
> similar to FA_ALLOCATE and FA_PREALLOCATE?
<shrug>
whatever.
> Say, FA_PUNCH (which
> would be as you describe here - deletes all data in the specified
> range changing the file size if it overlaps EOF,
Punch means different things to different people. To me (and probably
most XFS aware ppl) punch implies no change to the file size.
i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.
> and FA_DEALLOCATE,
> which only deallocates unused FA_{PRE,}ALLOCATE space?
That's an "unwritten-to-hole" extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.
Anyway, because we can't agree on a single pair of flags:
FA_ALLOCATE == posix_fallocate()
FA_DEALLOCATE == unwritten-to-hole ???
FA_RESV_SPACE == XFS_IOC_RESVSP64
FA_UNRESV_SPACE == XFS_IOC_UNRESVSP64
> We might also consider making @mode be a mask instead of an enumeration:
>
> FA_FL_DEALLOC 0x01 (default allocate)
> FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
> FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC)
i.e:
#define FA_ALLOCATE 0
#define FA_DEALLOCATE FA_FL_DEALLOC
#define FA_RESV_SPACE FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA
> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?
It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.....
> The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP). I guess the one potential limitation here is the
> number of function parameters on some architectures.
To be useful it needs to __u64.
> > B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > IR fd.
>
> Should probably say whether space is removed on failure or not. In
Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Jun 14, 2007 22:04 +1000, David Chinner wrote:
> On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> > > B FA_DEALLOCATE
> > > removes the underlying disk space with the given range. The disk space
> > > shall be removed regardless of it's contents so both allocated space
> > > from
> > > B FA_ALLOCATE
> > > and
> > > B FA_PREALLOCATE
> > > as well as from
> > > B write(3)
> > > will be removed.
> > > B FA_DEALLOCATE
> > > shall never remove disk blocks outside the range specified.
> >
> > So this is essentially the same as "punch".
>
> Depends on your definition of "punch".
>
> > There doesn't seem to be
> > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> > end.
>
> ftruncate()
No, that will delete written data also. What I'm thinking is in cases
where an application does fallocate() to reserve a lot of space, and
when the application is finished it wants to unreserve any unused space.
> > > B FA_DEALLOCATE
> > > shall never change the file size. If changing the file size
> > > is required when deallocating blocks from an offset to end
> > > of file (or beyond end of file) is required,
> > > B ftuncate64(3)
> > > should be used.
> >
> > This also seems to be a bit of a wart, since it isn't a natural converse
> > of either of the above functions. How about having two modes,
> > similar to FA_ALLOCATE and FA_PREALLOCATE?
>
> <shrug>
>
> whatever.
>
> > Say, FA_PUNCH (which
> > would be as you describe here - deletes all data in the specified
> > range changing the file size if it overlaps EOF,
>
> Punch means different things to different people. To me (and probably
> most XFS aware ppl) punch implies no change to the file size.
If "punch" does not change the file size, how is it possible to determine
the end of the actual written data? Say you have a file with records
in it, and these records are cancelled as they are processed (e.g. a
journal of sorts). One usage model for punch() that we had in the past
is to punch out each record after it finishes processing, so that it will
not be re-processed after a crash. If the file size doesn't change with
punch then there is no way to know when the last record is hit and the
rest of the file needs to be scanned.
> i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
> holes to leave the file size unchanged. This is the behaviour I
> described for FA_DEALLOCATE.
>
> > and FA_DEALLOCATE,
> > which only deallocates unused FA_{PRE,}ALLOCATE space?
>
> That's an "unwritten-to-hole" extent conversion. Is that really
> useful for anything? That's easily implemented with FIEMAP
> and FA_DEALLOCATE.
But why force the application to do this instead of making the
fallocate API sensible and allowing it to be done directly?
> Anyway, because we can't agree on a single pair of flags:
>
> FA_ALLOCATE == posix_fallocate()
> FA_DEALLOCATE == unwritten-to-hole ???
I'd think this makes sense, being natural opposites of each other.
FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE
shouldn't erase existing data. If FA_ALLOCATE extends the file size,
then FA_DEALLOCATE should shrink it if there is no data at the end.
> FA_RESV_SPACE == XFS_IOC_RESVSP64
> FA_UNRESV_SPACE == XFS_IOC_UNRESVSP64
> > We might also consider making @mode be a mask instead of an enumeration:
> >
> > FA_FL_DEALLOC 0x01 (default allocate)
> > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
> > FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC)
>
> #define FA_ALLOCATE 0
> #define FA_DEALLOCATE FA_FL_DEALLOC
> #define FA_RESV_SPACE FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA
OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64
clear at least. The benefit is that it would also be possible (I'm
not necessarily advocating this as a flag, just an example) to have
semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while
preallocating) with:
#define FA_ZERO_SPACE FA_DEL_DATA
or whatever semantics the caller actually wants, instead of restricting
them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE
(whatever it is we decide on in the end).
> > > B ENOSPC
> > > There is not enough space left on the device containing the file
> > > referred to by
> > > IR fd.
> >
> > Should probably say whether space is removed on failure or not. In
>
> Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
> allocated was freed back up. That way the error handling in the allocate
> functions is much simpler (i.e. no need to undo there).
Hmm, another flag? FA_FL_FREE_ENOSPC? I can imagine applications like
PVRs to want to preallocate, say, an estimated 30 min of space for a show
but if they only get 25 min of space returned they know some cleanup is
in order (which can be done asynchronously while the show is filling the
first 25 min of preallocated space). Otherwise, they have to loop in
userspace trying decreasing preallocations until they fit, or starting
small and incrementally preallocating space until they get an error.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
N O T E:
-------
1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
of ext4 patch queue git tree hosted by Ted.
2) The above new patches (4/7 and 7/7) are based on the dicussion
between Andreas Dilger and David Chinner on the mode argument,
when later posted a man page on fallocate.
3) All of these patches are based on 2.6.22-rc4 kernel and apply to
2.6.22-rc5 too (with some successfull hunks, though - since the
ext4 patch queue git tree has some other patches as well before
fallocate patches in the patch series).
Changelog:
---------
Changes from Take4 to Take5:
1) New Patch 4/7 implements new flags and values for mode
argument of fallocate system call.
2) New Patch 7/7 implements 2 (out of 4) modes in ext4.
Implementation of rest of the (two) modes is yet to be done.
3) Updated the interface description below to mention new modes
being supported.
4) Removed "extent overlap check" bugfix (patch 4/6 in TAKE4,
since it is now part of mainline.
5) Corrected format of couple of multi-line comments, which got
missed in earlier take.
Changes from Take2 to Take3:
1) Return type is now described in the interface description
above.
2) Patches rebased to 2.6.22-rc1 kernel.
** Each post will have an individual changelog for a particular patch.
Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.
Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.
Interface:
---------
The system call's layout is:
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
fd: The descriptor of the open file.
mode*: This specifies the behavior of the system call. Currently the
system call supports four modes - FA_ALLOCATE, FA_DEALLOCATE,
FA_RESV_SPACE and FA_UNRESV_SPACE.
FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successfull allocation.
FA_FA_RESV_SPACE: This mode is quite same as FA_ALLOCATE. The only
difference being that the file size will not be changed.
FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime. This is reverse of FA_ALLOCATE mode.
FA_UNRESV_SPACE: This mode is quite same as FA_DEALLOCATE. The
difference being that the file size is not changed and the data is
also deleted.
* New modes might get added in future.
offset: This is the offset in bytes, from where the preallocation should
start.
len: This is the number of bytes requested for preallocation (from
offset).
RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate().
sys_fallocate() on s390:
-----------------------
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.
Known Problem:
-------------
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583
Since there is a talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this later some time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198
ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ia64, ppc64 and s390(x)).
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
a) to support fallocate() system call
b) to make posix_fallocate() and posix_fallocate64() call fallocate()
Following patches follow:
Patch 1/6 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/7 : fallocate() on s390(x)
Patch 3/7 : fallocate() on ia64
Patch 4/7 : support new modes in fallocate
Patch 5/7 : ext4: fallocate support in ext4
Patch 6/7 : ext4: write support for preallocated blocks
Patch 7/7 : ext4: support new modes
This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.
Changelog:
---------
Changes from Take3 to Take4:
1) Do not update c/mtime. Let each filesystem update ctime (update of
mtime will not be required for allocation since we touch only
metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
1) Added description before sys_fallocate() definition.
2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0.
3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not.
5) Check for wrap through zero.
6) Update c/mtime if fallocate() succeeds.
7) Added mode descriptions in fs.h
8) Added variable names to function definition (fallocate inode op)
Signed-off-by: Amit Arora <[email protected]>
Index: linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+ .long sys_fallocate
Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
}
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+ u32 lenhi, u32 lenlo)
+{
+ return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+ ((loff_t)lenhi << 32) | lenlo);
+}
+
asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
unsigned long low)
{
Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+ .quad sys_fallocate
ia32_syscall_end:
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
#endif
/*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ * FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0 : On SUCCESS a value of zero is returned.
+ * error : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ * support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+ struct file *file;
+ struct inode *inode;
+ long ret = -EINVAL;
+
+ if (offset < 0 || len <= 0)
+ goto out;
+
+ /* Return error if mode is not supported */
+ ret = -EOPNOTSUPP;
+ if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+ goto out;
+
+ ret = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto out;
+ if (!(file->f_mode & FMODE_WRITE))
+ goto out_fput;
+
+ inode = file->f_path.dentry->d_inode;
+
+ ret = -ESPIPE;
+ if (S_ISFIFO(inode->i_mode))
+ goto out_fput;
+
+ ret = -ENODEV;
+ /*
+ * Let individual file system decide if it supports preallocation
+ * for directories or not.
+ */
+ if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+ goto out_fput;
+
+ ret = -EFBIG;
+ /* Check for wrap through zero too */
+ if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+ goto out_fput;
+
+ if (inode->i_op && inode->i_op->fallocate)
+ ret = inode->i_op->fallocate(inode, mode, offset, len);
+ else
+ ret = -ENOSYS;
+
+out_fput:
+ fput(file);
+out:
+ return ret;
+}
+
+/*
* access() needs to use the real uid/gid, not the effective uid/gid.
* We do this by temporarily clearing all FS-related capabilities and
* switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22-rc4/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-i386/unistd.h
+++ linux-2.6.22-rc4/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
#define __NR_signalfd 321
#define __NR_timerfd 322
#define __NR_eventfd 323
+#define __NR_fallocate 324
#ifdef __KERNEL__
-#define NR_syscalls 324
+#define NR_syscalls 325
#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc4/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22-rc4/include/asm-powerpc/systbl.h
@@ -308,6 +308,7 @@ COMPAT_SYS_SPU(move_pages)
SYSCALL_SPU(getcpu)
COMPAT_SYS(epoll_pwait)
COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
COMPAT_SYS_SPU(signalfd)
COMPAT_SYS_SPU(timerfd)
SYSCALL_SPU(eventfd)
Index: linux-2.6.22-rc4/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22-rc4/include/asm-powerpc/unistd.h
@@ -330,10 +330,11 @@
#define __NR_signalfd 305
#define __NR_timerfd 306
#define __NR_eventfd 307
+#define __NR_fallocate 308
#ifdef __KERNEL__
-#define __NR_syscalls 308
+#define __NR_syscalls 309
#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
Index: linux-2.6.22-rc4/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22-rc4/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
__SYSCALL(__NR_timerfd, sys_timerfd)
#define __NR_eventfd 283
__SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate 284
+__SYSCALL(__NR_fallocate, sys_fallocate)
#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -266,6 +266,17 @@ extern int dir_notify_enable;
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE : This is the preallocate mode, using which an application/user
+ * may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ * the preallocated blocks.
+ */
+#define FA_ALLOCATE 0x1
+#define FA_DEALLOCATE 0x2
+
#ifdef __KERNEL__
#include <linux/linkage.h>
@@ -1138,6 +1149,8 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+ loff_t len);
};
struct seq_file;
Index: linux-2.6.22-rc4/include/linux/syscalls.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/syscalls.h
+++ linux-2.6.22-rc4/include/linux/syscalls.h
@@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si
asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
const struct itimerspec __user *utmr);
asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.
He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.
.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15) /* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm %r6,%r7,96(%r15) /* load loff_t len from stack */
svc SYS_ify(fallocate)
lm %r6,%r7,28(%r15) /* restore %r6/%r7 from stack */
br %r14
PSEUDO_END(__fallocate)
Here are the comments and the patch to linux kernel from him.
-------------
From: Martin Schwidefsky <[email protected]>
This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.
Signed-off-by: Martin Schwidefsky <[email protected]>
Index: linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:27:29.000000000 -0700
@@ -1683,6 +1683,16 @@
llgtr %r3,%r3 # struct compat_timeval *
jg compat_sys_utimes
+ .globl sys_fallocate_wrapper
+sys_fallocate_wrapper:
+ lgfr %r2,%r2 # int
+ lgfr %r3,%r3 # int
+ sllg %r4,%r4,32 # get high word of 64bit loff_t
+ lr %r4,%r5 # get low word of 64bit loff_t
+ sllg %r5,%r6,32 # get high word of 64bit loff_t
+ l %r5,164(%r15) # get low word of 64bit loff_t
+ jg sys_fallocate
+
.globl compat_sys_utimensat_wrapper
compat_sys_utimensat_wrapper:
llgfr %r2,%r2 # unsigned int
Index: linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/sys_s390.c 2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c 2007-06-11 16:27:29.000000000 -0700
@@ -265,3 +265,32 @@
return -EFAULT;
return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
}
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ * %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ * %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+ u32 len_high, u32 len_low)
+{
+ union {
+ u64 len;
+ struct {
+ u32 high;
+ u32 low;
+ };
+ } cv;
+ cv.high = len_high;
+ cv.low = len_low;
+ return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc4/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/syscalls.S 2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/syscalls.S 2007-06-11 16:27:29.000000000 -0700
@@ -322,6 +322,7 @@
SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
NI_SYSCALL /* 314 sys_fallocate */
SYSCALL(sys_utimensat,sys_utimensat,compat_sys_utimensat_wrapper) /* 315 */
SYSCALL(sys_signalfd,sys_signalfd,compat_sys_signalfd_wrapper)
Index: linux-2.6.22-rc4/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-s390/unistd.h 2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/include/asm-s390/unistd.h 2007-06-11 16:27:29.000000000 -0700
@@ -256,7 +256,8 @@
#define __NR_signalfd 316
#define __NR_timerfd 317
#define __NR_eventfd 318
-#define NR_syscalls 319
+#define __NR_fallocate 319
+#define NR_syscalls 320
/*
* There are some system calls that are not present on 64 bit, some
fallocate() on ia64
ia64 fallocate syscall support.
Signed-off-by: Dave Chinner <[email protected]>
Index: linux-2.6.22-rc4/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/ia64/kernel/entry.S 2007-06-11 17:22:15.000000000 -0700
+++ linux-2.6.22-rc4/arch/ia64/kernel/entry.S 2007-06-11 17:30:37.000000000 -0700
@@ -1588,5 +1588,6 @@
data8 sys_signalfd
data8 sys_timerfd
data8 sys_eventfd
+ data8 sys_fallocate // 1310
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc4/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-ia64/unistd.h 2007-06-11 17:22:15.000000000 -0700
+++ linux-2.6.22-rc4/include/asm-ia64/unistd.h 2007-06-11 17:30:37.000000000 -0700
@@ -299,11 +299,12 @@
#define __NR_signalfd 1307
#define __NR_timerfd 1308
#define __NR_eventfd 1309
+#define __NR_fallocate 1310
#ifdef __KERNEL__
-#define NR_syscalls 286 /* length of syscall table */
+#define NR_syscalls 287 /* length of syscall table */
/*
* The following defines stop scripts/checksyscalls.sh from complaining about
Implement new flags and values for mode argument.
This patch implements the new flags and values for the "mode" argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the man page proposed (by the later)
on fallocate.
Signed-off-by: Amit Arora <[email protected]>
Index: linux-2.6.22-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -267,15 +267,16 @@ extern int dir_notify_enable;
#define SYNC_FILE_RANGE_WAIT_AFTER 4
/*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE : This is the preallocate mode, using which an application/user
- * may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- * the preallocated blocks.
+ * sys_fallocate mode flags and values
*/
-#define FA_ALLOCATE 0x1
-#define FA_DEALLOCATE 0x2
+#define FA_FL_DEALLOC 0x01 /* default is allocate */
+#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
+#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
+
+#define FA_ALLOCATE 0
+#define FA_DEALLOCATE FA_FL_DEALLOC
+#define FA_RESV_SPACE FA_FL_KEEP_SIZE
+#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
#ifdef __KERNEL__
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
* sys_fallocate - preallocate blocks or free preallocated blocks
* @fd: the file descriptor
* @mode: mode specifies if fallocate should preallocate blocks OR free
- * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- * FA_DEALLOCATE modes are supported.
+ * (unallocate) preallocated blocks.
* @offset: The offset within file, from where (un)allocation is being
* requested. It should not have a negative value.
* @len: The amount (in bytes) of space to be (un)allocated, from the offset.
*
* This system call, depending on the mode, preallocates or unallocates blocks
* for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FA_ALLOCATE and
+ * FA_RESV_SPACE modes, if the sys_fallocate()
* system call succeeds, subsequent writes to the file in the given range
* (specified by offset & len) should not fail - even if the file system
* later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
*
* It is expected that the ->fallocate() inode operation implemented by the
* individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
*
* Note: Incase the file system does not support preallocation,
* posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
/* Return error if mode is not supported */
ret = -EOPNOTSUPP;
- if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+ if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
+ mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
goto out;
ret = -EBADF;
This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.
Current implementation only supports preallocation for regular files
(directories not supported as of date) with extent maps. This patch
does not support block-mapped files currently.
Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a <ToDo> item.
Changelog:
---------
Changes from Take3 to Take4:
1) Changed ext4_fllocate() declaration and definition to return a
"long"
and not an "int", to match with ->fallocate() inode op.
2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
1) Patch rebased to 2.6.22-rc1 kernel version.
2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
1) Added more description for ext4_fallocate().
2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
3) Moved journal_start & journal_stop inside the while loop.
4) Replaced BUG_ON with WARN_ON & ext4_error.
5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
6) Added variable names in the function declaration of ext4_fallocate()
7) Converted macros that handle uninitialized extents into inline
functions.
Signed-off-by: Amit Arora <[email protected]>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -316,7 +316,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug(" %d:%d:%llu ",
le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
ext_pblock(path->p_ext));
} else
ext_debug(" []");
@@ -339,7 +339,7 @@ static void ext4_ext_show_leaf(struct in
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
}
@@ -455,7 +455,7 @@ ext4_ext_binsearch(struct inode *inode,
ext_debug(" -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
- le16_to_cpu(path->p_ext->ee_len));
+ ext4_ext_get_actual_len(path->p_ext));
#ifdef CHECK_BINSEARCH
{
@@ -713,7 +713,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
- le16_to_cpu(path[depth].p_ext->ee_len),
+ ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1133,7 +1133,19 @@ static int
ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
{
- if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+ unsigned short ext1_ee_len, ext2_ee_len;
+
+ /*
+ * Make sure that either both extents are uninitialized, or
+ * both are _not_.
+ */
+ if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+ return 0;
+
+ ext1_ee_len = ext4_ext_get_actual_len(ex1);
+ ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+ if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
@@ -1142,14 +1154,14 @@ ext4_can_extents_be_merged(struct inode
* as an RO_COMPAT feature, refuse to merge to extents if
* this can result in the top bit of ee_len being set.
*/
- if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+ if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
#endif
- if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+ if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
}
@@ -1171,7 +1183,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
b1 = le32_to_cpu(newext->ee_block);
- len1 = le16_to_cpu(newext->ee_len);
+ len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1218,8 +1230,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+ unsigned uninitialized = 0;
- BUG_ON(newext->ee_len == 0);
+ BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1227,14 +1240,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
err = ext4_ext_get_access(handle, inode, path + depth);
if (err)
return err;
- ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
- + le16_to_cpu(newext->ee_len));
+
+ /*
+ * ext4_can_extents_be_merged should have checked that either
+ * both extents are uninitialized, or both aren't. Thus we
+ * need to check only one of them here.
+ */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(newext));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
eh = path[depth].p_hdr;
nearex = ex;
goto merge;
@@ -1290,7 +1313,7 @@ has_space:
ext_debug("first extent in the leaf: %d:%llu:%d\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len));
+ ext4_ext_get_actual_len(newext));
path[depth].p_ext = EXT_FIRST_EXTENT(eh);
} else if (le32_to_cpu(newext->ee_block)
> le32_to_cpu(nearex->ee_block)) {
@@ -1303,7 +1326,7 @@ has_space:
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 2, nearex + 1, len);
}
@@ -1316,7 +1339,7 @@ has_space:
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 1, nearex, len);
path[depth].p_ext = nearex;
@@ -1335,8 +1358,13 @@ merge:
if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
break;
/* merge with next extent! */
- nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
- + le16_to_cpu(nearex[1].ee_len));
+ if (ext4_ext_is_uninitialized(nearex))
+ uninitialized = 1;
+ nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+ + ext4_ext_get_actual_len(nearex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(nearex);
+
if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
len = (EXT_LAST_EXTENT(eh) - nearex - 1)
* sizeof(struct ext4_extent);
@@ -1406,8 +1434,8 @@ int ext4_ext_walk_space(struct inode *in
end = le32_to_cpu(ex->ee_block);
if (block + num < end)
end = block + num;
- } else if (block >=
- le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+ } else if (block >= le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex)) {
/* need to allocate space after found extent */
start = block;
end = block + num;
@@ -1419,7 +1447,8 @@ int ext4_ext_walk_space(struct inode *in
* by found extent
*/
start = block;
- end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+ end = le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex);
if (block + num < end)
end = block + num;
exists = 1;
@@ -1435,7 +1464,7 @@ int ext4_ext_walk_space(struct inode *in
cbex.ec_type = EXT4_EXT_CACHE_GAP;
} else {
cbex.ec_block = le32_to_cpu(ex->ee_block);
- cbex.ec_len = le16_to_cpu(ex->ee_len);
+ cbex.ec_len = ext4_ext_get_actual_len(ex);
cbex.ec_start = ext_pblock(ex);
cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
}
@@ -1508,15 +1537,15 @@ ext4_ext_put_gap_in_cache(struct inode *
ext_debug("cache gap(before): %lu [%lu:%lu]",
(unsigned long) block,
(unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) le16_to_cpu(ex->ee_len));
+ (unsigned long) ext4_ext_get_actual_len(ex));
} else if (block >= le32_to_cpu(ex->ee_block)
- + le16_to_cpu(ex->ee_len)) {
+ + ext4_ext_get_actual_len(ex)) {
lblock = le32_to_cpu(ex->ee_block)
- + le16_to_cpu(ex->ee_len);
+ + ext4_ext_get_actual_len(ex);
len = ext4_ext_next_allocated_block(path);
ext_debug("cache gap(after): [%lu:%lu] %lu",
(unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) le16_to_cpu(ex->ee_len),
+ (unsigned long) ext4_ext_get_actual_len(ex),
(unsigned long) block);
BUG_ON(len == lblock);
len = len - lblock;
@@ -1646,12 +1675,12 @@ static int ext4_remove_blocks(handle_t *
unsigned long from, unsigned long to)
{
struct buffer_head *bh;
+ unsigned short ee_len = ext4_ext_get_actual_len(ex);
int i;
#ifdef EXTENTS_STATS
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- unsigned short ee_len = le16_to_cpu(ex->ee_len);
spin_lock(&sbi->s_ext_stats_lock);
sbi->s_ext_blocks += ee_len;
sbi->s_ext_extents++;
@@ -1665,12 +1694,12 @@ static int ext4_remove_blocks(handle_t *
}
#endif
if (from >= le32_to_cpu(ex->ee_block)
- && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
/* tail removal */
unsigned long num;
ext4_fsblk_t start;
- num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
- start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+ num = le32_to_cpu(ex->ee_block) + ee_len - from;
+ start = ext_pblock(ex) + ee_len - num;
ext_debug("free last %lu blocks starting %llu\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1678,12 +1707,12 @@ static int ext4_remove_blocks(handle_t *
}
ext4_free_blocks(handle, inode, start, num);
} else if (from == le32_to_cpu(ex->ee_block)
- && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
printk("strange request: removal %lu-%lu from %u:%u\n",
- from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ from, to, le32_to_cpu(ex->ee_block), ee_len);
} else {
printk("strange request: removal(2) %lu-%lu from %u:%u\n",
- from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ from, to, le32_to_cpu(ex->ee_block), ee_len);
}
return 0;
}
@@ -1698,6 +1727,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
unsigned a, b, block, num;
unsigned long ex_ee_block;
unsigned short ex_ee_len;
+ unsigned uninitialized = 0;
struct ext4_extent *ex;
/* the header must be checked already in ext4_ext_remove_space() */
@@ -1711,7 +1741,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
ex = EXT_LAST_EXTENT(eh);
ex_ee_block = le32_to_cpu(ex->ee_block);
- ex_ee_len = le16_to_cpu(ex->ee_len);
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex_ee_len = ext4_ext_get_actual_len(ex);
while (ex >= EXT_FIRST_EXTENT(eh) &&
ex_ee_block + ex_ee_len > start) {
@@ -1779,6 +1811,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
err = ext4_ext_dirty(handle, inode, path + depth);
if (err)
@@ -1788,7 +1822,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
ext_pblock(ex));
ex--;
ex_ee_block = le32_to_cpu(ex->ee_block);
- ex_ee_len = le16_to_cpu(ex->ee_len);
+ ex_ee_len = ext4_ext_get_actual_len(ex);
}
if (correct_index && eh->eh_entries)
@@ -2062,7 +2096,7 @@ int ext4_ext_get_blocks(handle_t *handle
if (ex) {
unsigned long ee_block = le32_to_cpu(ex->ee_block);
ext4_fsblk_t ee_start = ext_pblock(ex);
- unsigned short ee_len = le16_to_cpu(ex->ee_len);
+ unsigned short ee_len;
/*
* Allow future support for preallocated extents to be added
@@ -2070,8 +2104,9 @@ int ext4_ext_get_blocks(handle_t *handle
* Uninitialized extents are treated as holes, except that
* we avoid (fail) allocating new blocks during a write.
*/
- if (ee_len > EXT_MAX_LEN)
+ if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
goto out2;
+ ee_len = ext4_ext_get_actual_len(ex);
/* if found extent covers block, simply return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
newblock = iblock - ee_block + ee_start;
@@ -2079,8 +2114,11 @@ int ext4_ext_get_blocks(handle_t *handle
allocated = ee_len - (iblock - ee_block);
ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
ee_block, ee_len, newblock);
- ext4_ext_put_in_cache(inode, ee_block, ee_len,
- ee_start, EXT4_EXT_CACHE_EXTENT);
+ /* Do not put uninitialized extent in the cache */
+ if (!ext4_ext_is_uninitialized(ex))
+ ext4_ext_put_in_cache(inode, ee_block,
+ ee_len, ee_start,
+ EXT4_EXT_CACHE_EXTENT);
goto out;
}
}
@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
+ ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
/* free data blocks we just allocated */
@@ -2137,8 +2177,10 @@ int ext4_ext_get_blocks(handle_t *handle
newblock = ext_pblock(&newex);
__set_bit(BH_New, &bh_result->b_state);
- ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
- EXT4_EXT_CACHE_EXTENT);
+ /* Cache only when it is _not_ an uninitialized extent */
+ if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+ ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+ EXT4_EXT_CACHE_EXTENT);
out:
if (allocated > max_blocks)
allocated = max_blocks;
@@ -2241,3 +2283,129 @@ int ext4_ext_writepage_trans_blocks(stru
return needed;
}
+
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+ handle_t *handle;
+ ext4_fsblk_t block, max_blocks;
+ ext4_fsblk_t nblocks = 0;
+ int ret = 0;
+ int ret2 = 0;
+ int retries = 0;
+ struct buffer_head map_bh;
+ unsigned int credits, blkbits = inode->i_blkbits;
+
+ /*
+ * currently supporting (pre)allocate mode for extent-based
+ * files _only_
+ */
+ if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EOPNOTSUPP;
+
+ /* preallocation to directories is currently not supported */
+ if (S_ISDIR(inode->i_mode))
+ return -ENODEV;
+
+ block = offset >> blkbits;
+ max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+ - block;
+
+ /*
+ * credits to insert 1 extent into extent tree + buffers to be able to
+ * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ */
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+ while (ret >= 0 && ret < max_blocks) {
+ block = block + ret;
+ max_blocks = max_blocks - ret;
+ handle = ext4_journal_start(inode, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ break;
+ }
+
+ ret = ext4_ext_get_blocks(handle, inode, block,
+ max_blocks, &map_bh,
+ EXT4_CREATE_UNINITIALIZED_EXT, 0);
+ WARN_ON(!ret);
+ if (!ret) {
+ ext4_error(inode->i_sb, "ext4_fallocate",
+ "ext4_ext_get_blocks returned 0! inode#%lu"
+ ", block=%llu, max_blocks=%llu",
+ inode->i_ino, block, max_blocks);
+ ret = -EIO;
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ if (ret > 0) {
+ /* check wrap through sign-bit/zero here */
+ if ((block + ret) < 0 || (block + ret) < block) {
+ ret = -EIO;
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ if (buffer_new(&map_bh) && ((block + ret) >
+ (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+ >> blkbits)))
+ nblocks = nblocks + ret;
+ }
+
+ /* Update ctime if new blocks get allocated */
+ if (nblocks) {
+ struct timespec now;
+ now = current_fs_time(inode->i_sb);
+ if (!timespec_equal(&inode->i_ctime, &now))
+ inode->i_ctime = now;
+ }
+
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+ }
+
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ /*
+ * Time to update the file size.
+ * Update only when preallocation was requested beyond the file size.
+ */
+ if ((offset + len) > i_size_read(inode)) {
+ if (ret > 0) {
+ /*
+ * if no error, we assume preallocation succeeded
+ * completely
+ */
+ mutex_lock(&inode->i_mutex);
+ i_size_write(inode, offset + len);
+ EXT4_I(inode)->i_disksize = i_size_read(inode);
+ mutex_unlock(&inode->i_mutex);
+ } else if (ret < 0 && nblocks) {
+ /* Handle partial allocation scenario */
+ loff_t newsize;
+
+ mutex_lock(&inode->i_mutex);
+ newsize = (nblocks << blkbits) + i_size_read(inode);
+ i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+ EXT4_I(inode)->i_disksize = i_size_read(inode);
+ mutex_unlock(&inode->i_mutex);
+ }
+ }
+
+ return ret > 0 ? ret2 : ret;
+}
+
Index: linux-2.6.22-rc4/fs/ext4/file.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/file.c
+++ linux-2.6.22-rc4/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
.removexattr = generic_removexattr,
#endif
.permission = ext4_permission,
+ .fallocate = ext4_fallocate,
};
Index: linux-2.6.22-rc4/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
EXT4_GOOD_OLD_FIRST_INO : \
(s)->s_first_ino)
#endif
+#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size), (1 << (blkbits)))
/*
* Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
__u32 free_blocks_count;
};
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialzed extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT 2
/*
* ioctl commands
@@ -984,6 +990,8 @@ extern int ext4_ext_get_blocks(handle_t
extern void ext4_ext_truncate(struct inode *, struct page *);
extern void ext4_ext_init(struct super_block *);
extern void ext4_ext_release(struct super_block *);
+extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+ loff_t len);
static inline int
ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
@@ -188,6 +188,21 @@ ext4_ext_invalidate_cache(struct inode *
EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
}
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
+{
+ ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
+{
+ return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
+{
+ return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.
Changelog:
---------
Changes from Take3 to Take4:
- no change -
Changes from Take2 to Take3:
1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
1) Replaced BUG_ON with WARN_ON & ext4_error.
2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
3) Updated variable declarations to use multiple-definitions-per-line.
4) "if((a=foo())).." was broken into "a=foo(); if(a).."
5) Removed extra spaces.
Signed-off-by: Amit Arora <[email protected]>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -1167,6 +1167,53 @@ ext4_can_extents_be_merged(struct inode
}
/*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+ struct ext4_extent_header *eh;
+ unsigned int depth, len;
+ int merge_done = 0;
+ int uninitialized = 0;
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ eh = path[depth].p_hdr;
+
+ while (ex < EXT_LAST_EXTENT(eh)) {
+ if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+ break;
+ /* merge with next extent! */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(ex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
+
+ if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+ len = (EXT_LAST_EXTENT(eh) - ex - 1)
+ * sizeof(struct ext4_extent);
+ memmove(ex + 1, ex + 2, len);
+ }
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+ merge_done = 1;
+ WARN_ON(eh->eh_entries == 0);
+ if (!eh->eh_entries)
+ ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+ "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+ }
+
+ return merge_done;
+}
+
+/*
* check if a portion of the "newext" extent overlaps with an
* existing extent.
*
@@ -1354,25 +1401,7 @@ has_space:
merge:
/* try to merge extents to the right */
- while (nearex < EXT_LAST_EXTENT(eh)) {
- if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
- break;
- /* merge with next extent! */
- if (ext4_ext_is_uninitialized(nearex))
- uninitialized = 1;
- nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
- + ext4_ext_get_actual_len(nearex + 1));
- if (uninitialized)
- ext4_ext_mark_uninitialized(nearex);
-
- if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
- len = (EXT_LAST_EXTENT(eh) - nearex - 1)
- * sizeof(struct ext4_extent);
- memmove(nearex + 1, nearex + 2, len);
- }
- eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
- BUG_ON(eh->eh_entries == 0);
- }
+ ext4_ext_try_to_merge(inode, path, nearex);
/* try to merge extents to the left */
@@ -2035,15 +2064,158 @@ void ext4_ext_release(struct super_block
#endif
}
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ * a> There is no split required: Entire extent should be initialized
+ * b> Splits in two extents: Write is happening at either end of the extent
+ * c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+ struct ext4_ext_path *path,
+ ext4_fsblk_t iblock,
+ unsigned long max_blocks)
+{
+ struct ext4_extent *ex, newex;
+ struct ext4_extent *ex1 = NULL;
+ struct ext4_extent *ex2 = NULL;
+ struct ext4_extent *ex3 = NULL;
+ struct ext4_extent_header *eh;
+ unsigned int allocated, ee_block, ee_len, depth;
+ ext4_fsblk_t newblock;
+ int err = 0;
+ int ret = 0;
+
+ depth = ext_depth(inode);
+ eh = path[depth].p_hdr;
+ ex = path[depth].p_ext;
+ ee_block = le32_to_cpu(ex->ee_block);
+ ee_len = ext4_ext_get_actual_len(ex);
+ allocated = ee_len - (iblock - ee_block);
+ newblock = iblock - ee_block + ext_pblock(ex);
+ ex2 = ex;
+
+ /* ex1: ee_block to iblock - 1 : uninitialized */
+ if (iblock > ee_block) {
+ ex1 = ex;
+ ex1->ee_len = cpu_to_le16(iblock - ee_block);
+ ext4_ext_mark_uninitialized(ex1);
+ ex2 = &newex;
+ }
+ /*
+ * for sanity, update the length of the ex2 extent before
+ * we insert ex3, if ex1 is NULL. This is to avoid temporary
+ * overlap of blocks.
+ */
+ if (!ex1 && allocated > max_blocks)
+ ex2->ee_len = cpu_to_le16(max_blocks);
+ /* ex3: to ee_block + ee_len : uninitialised */
+ if (allocated > max_blocks) {
+ unsigned int newdepth;
+ ex3 = &newex;
+ ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+ ext4_ext_store_pblock(ex3, newblock + max_blocks);
+ ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+ ext4_ext_mark_uninitialized(ex3);
+ err = ext4_ext_insert_extent(handle, inode, path, ex3);
+ if (err)
+ goto out;
+ /*
+ * The depth, and hence eh & ex might change
+ * as part of the insert above.
+ */
+ newdepth = ext_depth(inode);
+ if (newdepth != depth) {
+ depth = newdepth;
+ path = ext4_ext_find_extent(inode, iblock, NULL);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ goto out;
+ }
+ eh = path[depth].p_hdr;
+ ex = path[depth].p_ext;
+ if (ex2 != &newex)
+ ex2 = ex;
+ }
+ allocated = max_blocks;
+ }
+ /*
+ * If there was a change of depth as part of the
+ * insertion of ex3 above, we need to update the length
+ * of the ex1 extent again here
+ */
+ if (ex1 && ex1 != ex) {
+ ex1 = ex;
+ ex1->ee_len = cpu_to_le16(iblock - ee_block);
+ ext4_ext_mark_uninitialized(ex1);
+ ex2 = &newex;
+ }
+ /* ex2: iblock to iblock + maxblocks-1 : initialised */
+ ex2->ee_block = cpu_to_le32(iblock);
+ ex2->ee_start = cpu_to_le32(newblock);
+ ext4_ext_store_pblock(ex2, newblock);
+ ex2->ee_len = cpu_to_le16(allocated);
+ if (ex2 != ex)
+ goto insert;
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (err)
+ goto out;
+ /*
+ * New (initialized) extent starts from the first block
+ * in the current extent. i.e., ex2 == ex
+ * We have to see if it can be merged with the extent
+ * on the left.
+ */
+ if (ex2 > EXT_FIRST_EXTENT(eh)) {
+ /*
+ * To merge left, pass "ex2 - 1" to try_to_merge(),
+ * since it merges towards right _only_.
+ */
+ ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+ if (ret) {
+ err = ext4_ext_correct_indexes(handle, inode, path);
+ if (err)
+ goto out;
+ depth = ext_depth(inode);
+ ex2--;
+ }
+ }
+ /*
+ * Try to Merge towards right. This might be required
+ * only when the whole extent is being written to.
+ * i.e. ex2 == ex and ex3 == NULL.
+ */
+ if (!ex3) {
+ ret = ext4_ext_try_to_merge(inode, path, ex2);
+ if (ret) {
+ err = ext4_ext_correct_indexes(handle, inode, path);
+ if (err)
+ goto out;
+ }
+ }
+ /* Mark modified extent as dirty */
+ err = ext4_ext_dirty(handle, inode, path + depth);
+ goto out;
+insert:
+ err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+ return err ? err : allocated;
+}
+
int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create, int extend_disksize)
{
struct ext4_ext_path *path = NULL;
+ struct ext4_extent_header *eh;
struct ext4_extent newex, *ex;
ext4_fsblk_t goal, newblock;
- int err = 0, depth;
+ int err = 0, depth, ret;
unsigned long allocated = 0;
__clear_bit(BH_New, &bh_result->b_state);
@@ -2056,8 +2228,10 @@ int ext4_ext_get_blocks(handle_t *handle
if (goal) {
if (goal == EXT4_EXT_CACHE_GAP) {
if (!create) {
- /* block isn't allocated yet and
- * user doesn't want to allocate it */
+ /*
+ * block isn't allocated yet and
+ * user doesn't want to allocate it
+ */
goto out2;
}
/* we should allocate requested block */
@@ -2091,6 +2265,7 @@ int ext4_ext_get_blocks(handle_t *handle
* this is why assert can't be put in ext4_ext_find_extent()
*/
BUG_ON(path[depth].p_ext == NULL && depth != 0);
+ eh = path[depth].p_hdr;
ex = path[depth].p_ext;
if (ex) {
@@ -2099,13 +2274,9 @@ int ext4_ext_get_blocks(handle_t *handle
unsigned short ee_len;
/*
- * Allow future support for preallocated extents to be added
- * as an RO_COMPAT feature:
* Uninitialized extents are treated as holes, except that
- * we avoid (fail) allocating new blocks during a write.
+ * we split out initialized portions during a write.
*/
- if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
- goto out2;
ee_len = ext4_ext_get_actual_len(ex);
/* if found extent covers block, simply return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2114,12 +2285,27 @@ int ext4_ext_get_blocks(handle_t *handle
allocated = ee_len - (iblock - ee_block);
ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
ee_block, ee_len, newblock);
+
/* Do not put uninitialized extent in the cache */
- if (!ext4_ext_is_uninitialized(ex))
+ if (!ext4_ext_is_uninitialized(ex)) {
ext4_ext_put_in_cache(inode, ee_block,
ee_len, ee_start,
EXT4_EXT_CACHE_EXTENT);
- goto out;
+ goto out;
+ }
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+ goto out;
+ if (!create)
+ goto out2;
+
+ ret = ext4_ext_convert_to_initialized(handle, inode,
+ path, iblock,
+ max_blocks);
+ if (ret <= 0)
+ goto out2;
+ else
+ allocated = ret;
+ goto outnew;
}
}
@@ -2128,8 +2314,10 @@ int ext4_ext_get_blocks(handle_t *handle
* we couldn't try to create block if create flag is zero
*/
if (!create) {
- /* put just found gap into cache to speed up
- * subsequent requests */
+ /*
+ * put just found gap into cache to speed up
+ * subsequent requests
+ */
ext4_ext_put_gap_in_cache(inode, path, iblock);
goto out2;
}
@@ -2175,6 +2363,7 @@ int ext4_ext_get_blocks(handle_t *handle
/* previous routine could use block we allocated */
newblock = ext_pblock(&newex);
+outnew:
__set_bit(BH_New, &bh_result->b_state);
/* Cache only when it is _not_ an uninitialized extent */
@@ -2244,7 +2433,8 @@ void ext4_ext_truncate(struct inode * in
err = ext4_ext_remove_space(inode, last_block);
/* In a multi-transaction truncate, we only make the final
- * transaction synchronous. */
+ * transaction synchronous.
+ */
if (IS_SYNC(inode))
handle->h_sync = 1;
Index: linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
@@ -205,6 +205,9 @@ static inline int ext4_ext_get_actual_le
extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
Support new values of mode in ext4.
This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FA_ALLOCATE and FA_RESV_SPACE modes. Supporting
FA_DEALLOCATE and FA_UNRESV_SPACE fallocate modes in ext4 is a work for
future.
Signed-off-by: Amit Arora <[email protected]>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -2477,7 +2477,8 @@ int ext4_ext_writepage_trans_blocks(stru
/*
* preallocate space for a file. This implements ext4's fallocate inode
* operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * Currently only FA_ALLOCATE and FA_RESV_SPACE modes are supported on
+ * extent based files.
* We may have more modes supported in future - like FA_DEALLOCATE, which
* tells fallocate to unallocate previously (pre)allocated blocks.
* For block-mapped files, posix_fallocate should fall back to the method
@@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
* currently supporting (pre)allocate mode for extent-based
* files _only_
*/
- if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+ !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
return -EOPNOTSUPP;
/* preallocation to directories is currently not supported */
@@ -2572,9 +2574,10 @@ retry:
/*
* Time to update the file size.
- * Update only when preallocation was requested beyond the file size.
+ * Update only when preallocation was requested beyond the file size
+ * and when FA_FL_KEEP_SIZE mode is not specified!
*/
- if ((offset + len) > i_size_read(inode)) {
+ if (!(mode & FA_FL_KEEP_SIZE) && (offset + len) > i_size_read(inode)) {
if (ret > 0) {
/*
* if no error, we assume preallocation succeeded
I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
*suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post.
If it is decided that these flags are also needed, I will update this
patch. Thanks!
On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
> Implement new flags and values for mode argument.
>
> This patch implements the new flags and values for the "mode" argument
> of the fallocate system call. It is based on the discussion between
> Andreas Dilger and David Chinner on the man page proposed (by the later)
> on fallocate.
>
> Signed-off-by: Amit Arora <[email protected]>
>
> Index: linux-2.6.22-rc4/include/linux/fs.h
> ===================================================================
> --- linux-2.6.22-rc4.orig/include/linux/fs.h
> +++ linux-2.6.22-rc4/include/linux/fs.h
> @@ -267,15 +267,16 @@ extern int dir_notify_enable;
> #define SYNC_FILE_RANGE_WAIT_AFTER 4
>
> /*
> - * sys_fallocate modes
> - * Currently sys_fallocate supports two modes:
> - * FA_ALLOCATE : This is the preallocate mode, using which an application/user
> - * may request (pre)allocation of blocks.
> - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
> - * the preallocated blocks.
> + * sys_fallocate mode flags and values
> */
> -#define FA_ALLOCATE 0x1
> -#define FA_DEALLOCATE 0x2
> +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
> +
> +#define FA_ALLOCATE 0
> +#define FA_DEALLOCATE FA_FL_DEALLOC
> +#define FA_RESV_SPACE FA_FL_KEEP_SIZE
> +#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
>
> #ifdef __KERNEL__
>
> Index: linux-2.6.22-rc4/fs/open.c
> ===================================================================
> --- linux-2.6.22-rc4.orig/fs/open.c
> +++ linux-2.6.22-rc4/fs/open.c
> @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
> * sys_fallocate - preallocate blocks or free preallocated blocks
> * @fd: the file descriptor
> * @mode: mode specifies if fallocate should preallocate blocks OR free
> - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
> - * FA_DEALLOCATE modes are supported.
> + * (unallocate) preallocated blocks.
> * @offset: The offset within file, from where (un)allocation is being
> * requested. It should not have a negative value.
> * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
> *
> * This system call, depending on the mode, preallocates or unallocates blocks
> * for a file. The range of blocks depends on the value of offset and len
> - * arguments provided by the user/application. For FA_ALLOCATE mode, if this
> + * arguments provided by the user/application. For FA_ALLOCATE and
> + * FA_RESV_SPACE modes, if the sys_fallocate()
> * system call succeeds, subsequent writes to the file in the given range
> * (specified by offset & len) should not fail - even if the file system
> * later becomes full. Hence the preallocation done is persistent (valid
> - * even after reopen of the file and remount/reboot).
> + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
> + * is passed, the file size will not be changed even if the preallocation
> + * is beyond EOF.
> *
> * It is expected that the ->fallocate() inode operation implemented by the
> * individual file systems will update the file size and/or ctime/mtime
> - * depending on the mode and also on the success of the operation.
> + * depending on the mode (change is visible to user or not - say file size)
> + * and obviously, on the success of the operation.
> *
> * Note: Incase the file system does not support preallocation,
> * posix_fallocate() should fall back to the library implementation (i.e.
> @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
>
> /* Return error if mode is not supported */
> ret = -EOPNOTSUPP;
> - if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
> + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
> + mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
> goto out;
>
> ret = -EBADF;
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote:
> I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post.
> If it is decided that these flags are also needed, I will update this
> patch. Thanks!
Can you clarify - what is the current behaviour when ENOSPC (or some other
error) is hit? Does it keep the current fallocate() or does it free it?
For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
don't want to expose uninitialized disk blocks to userspace. I'm not
sure if this makes sense at all.
> On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
> > Implement new flags and values for mode argument.
> >
> > This patch implements the new flags and values for the "mode" argument
> > of the fallocate system call. It is based on the discussion between
> > Andreas Dilger and David Chinner on the man page proposed (by the later)
> > on fallocate.
> >
> > Signed-off-by: Amit Arora <[email protected]>
> >
> > Index: linux-2.6.22-rc4/include/linux/fs.h
> > ===================================================================
> > --- linux-2.6.22-rc4.orig/include/linux/fs.h
> > +++ linux-2.6.22-rc4/include/linux/fs.h
> > @@ -267,15 +267,16 @@ extern int dir_notify_enable;
> > #define SYNC_FILE_RANGE_WAIT_AFTER 4
> >
> > /*
> > - * sys_fallocate modes
> > - * Currently sys_fallocate supports two modes:
> > - * FA_ALLOCATE : This is the preallocate mode, using which an application/user
> > - * may request (pre)allocation of blocks.
> > - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
> > - * the preallocated blocks.
> > + * sys_fallocate mode flags and values
> > */
> > -#define FA_ALLOCATE 0x1
> > -#define FA_DEALLOCATE 0x2
> > +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
> > +
> > +#define FA_ALLOCATE 0
> > +#define FA_DEALLOCATE FA_FL_DEALLOC
> > +#define FA_RESV_SPACE FA_FL_KEEP_SIZE
> > +#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> >
> > #ifdef __KERNEL__
> >
> > Index: linux-2.6.22-rc4/fs/open.c
> > ===================================================================
> > --- linux-2.6.22-rc4.orig/fs/open.c
> > +++ linux-2.6.22-rc4/fs/open.c
> > @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
> > * sys_fallocate - preallocate blocks or free preallocated blocks
> > * @fd: the file descriptor
> > * @mode: mode specifies if fallocate should preallocate blocks OR free
> > - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
> > - * FA_DEALLOCATE modes are supported.
> > + * (unallocate) preallocated blocks.
> > * @offset: The offset within file, from where (un)allocation is being
> > * requested. It should not have a negative value.
> > * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
> > *
> > * This system call, depending on the mode, preallocates or unallocates blocks
> > * for a file. The range of blocks depends on the value of offset and len
> > - * arguments provided by the user/application. For FA_ALLOCATE mode, if this
> > + * arguments provided by the user/application. For FA_ALLOCATE and
> > + * FA_RESV_SPACE modes, if the sys_fallocate()
> > * system call succeeds, subsequent writes to the file in the given range
> > * (specified by offset & len) should not fail - even if the file system
> > * later becomes full. Hence the preallocation done is persistent (valid
> > - * even after reopen of the file and remount/reboot).
> > + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
> > + * is passed, the file size will not be changed even if the preallocation
> > + * is beyond EOF.
> > *
> > * It is expected that the ->fallocate() inode operation implemented by the
> > * individual file systems will update the file size and/or ctime/mtime
> > - * depending on the mode and also on the success of the operation.
> > + * depending on the mode (change is visible to user or not - say file size)
> > + * and obviously, on the success of the operation.
> > *
> > * Note: Incase the file system does not support preallocation,
> > * posix_fallocate() should fall back to the library implementation (i.e.
> > @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
> >
> > /* Return error if mode is not supported */
> > ret = -EOPNOTSUPP;
> > - if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
> > + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
> > + mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
> > goto out;
> >
> > ret = -EBADF;
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote:
> +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
each extent. For some workloads this would be much faster than truncate
and reallocate of all the blocks in a file.
In that light, please change the comment to /* default is keep existing data */
so that it doesn't imply this is only for DEALLOC.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Jun 25, 2007 19:20 +0530, Amit K. Arora wrote:
> @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
> * currently supporting (pre)allocate mode for extent-based
> * files _only_
> */
> - if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> + !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
> return -EOPNOTSUPP;
This should probably just check for the individual flags it can support
(e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).
I also thought another proposed flag was to determine whether mtime (and
maybe ctime) is changed when doing prealloc/dealloc space? Default should
probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
should decide if we want to allow changing the file w/o changing ctime, if
that is required even though the file is not visibly changing. Maybe the
ctime update should be implicit if the size or mtime are changing?
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote:
> > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post.
> > If it is decided that these flags are also needed, I will update this
> > patch. Thanks!
>
> Can you clarify - what is the current behaviour when ENOSPC (or some other
> error) is hit? Does it keep the current fallocate() or does it free it?
Currently it is left on the file system implementation. In ext4, we do
not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
end up with partial (pre)allocation. This is inline with dd and
posix_fallocate, which also do not free the partially allocated space.
> For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> don't want to expose uninitialized disk blocks to userspace. I'm not
> sure if this makes sense at all.
I don't think we need to make it default - atleast for filesystems which
have a mechanism to distinguish preallocated blocks from "regular" ones.
In ext4, for example, we will have a way to mark uninitialized extents.
All the preallocated blocks will be part of these uninitialized extents.
And any read on these extents will treat them as a hole, returning
zeroes to user land. Thus any existing data on uninitialized blocks will
not be exposed to the userspace.
--
Regards,
Amit Arora
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote:
> > +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
>
> In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> each extent. For some workloads this would be much faster than truncate
> and reallocate of all the blocks in a file.
In ext4, we already mark each extent having preallocated blocks as
uninitialized. This is done as part of following code (which is part of
patch 5/7) in ext4_ext_get_blocks() :
@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
+ ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
/* free data blocks we just allocated */
> In that light, please change the comment to /* default is keep existing data */
> so that it doesn't imply this is only for DEALLOC.
Ok. Will update the comment.
Thanks!
--
Regards,
Amit Arora
On Mon, Jun 25, 2007 at 03:56:25PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 19:20 +0530, Amit K. Arora wrote:
> > @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
> > * currently supporting (pre)allocate mode for extent-based
> > * files _only_
> > */
> > - if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> > + !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
> > return -EOPNOTSUPP;
>
> This should probably just check for the individual flags it can support
> (e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).
Hmm.. I am thinking of a scenario when the file system supports some
individual flags, but does not support a particular combination of them.
Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
for some reason). This means that although we support FA_FL_DEALLOC,
FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
combination of all these flags (which is nothing but FA_UNRESV_SPACE).
> I also thought another proposed flag was to determine whether mtime (and
> maybe ctime) is changed when doing prealloc/dealloc space? Default should
> probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> should decide if we want to allow changing the file w/o changing ctime, if
> that is required even though the file is not visibly changing. Maybe the
> ctime update should be implicit if the size or mtime are changing?
Is it really required ? I mean, why should we allow users not to update
ctime/mtime even if the file metadata/data gets updated ? It sounds
a bit "unnatural" to me.
Is there any application scenario in your mind, when you suggest of
giving this flexibility to userspace ?
I think, modifying ctime/mtime should be dependent on the other flags.
E.g., if we do not zero out data blocks on allocation/deallocation,
update only ctime. Otherwise, update ctime and mtime both.
--
Regards,
Amit Arora
> Index: linux-2.6.22-rc4/arch/s390/kernel/syscalls.S
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/s390/kernel/syscalls.S 2007-06-11 16:16:01.000000000 -0700
> +++ linux-2.6.22-rc4/arch/s390/kernel/syscalls.S 2007-06-11 16:27:29.000000000 -0700
> @@ -322,6 +322,7 @@
> SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
> SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
> SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
> +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
> NI_SYSCALL /* 314 sys_fallocate */
You need to remove the NI_SYSCALL line. Otherwise all following entries
will be wrong.
> SYSCALL(sys_utimensat,sys_utimensat,compat_sys_utimensat_wrapper) /* 315 */
> SYSCALL(sys_signalfd,sys_signalfd,compat_sys_signalfd_wrapper)
> Index: linux-2.6.22-rc4/include/asm-s390/unistd.h
> ===================================================================
> --- linux-2.6.22-rc4.orig/include/asm-s390/unistd.h 2007-06-11 16:16:01.000000000 -0700
> +++ linux-2.6.22-rc4/include/asm-s390/unistd.h 2007-06-11 16:27:29.000000000 -0700
> @@ -256,7 +256,8 @@
> #define __NR_signalfd 316
> #define __NR_timerfd 317
> #define __NR_eventfd 318
> -#define NR_syscalls 319
> +#define __NR_fallocate 319
> +#define NR_syscalls 320
Erm... no. You use slot 314 in the syscall table but assign number 319.
That won't work. Please use 314 for both.
I assume this got broken when updating to newer kernel versions.
On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit? Does it keep the current fallocate() or does it free it?
>
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.
Since I believe the XFS allocation ioctls do it the opposite way (free
preallocated space on error) this should be encoded into the flags.
Having it "filesystem dependent" just means that nobody will be happy.
> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace. I'm not
> > sure if this makes sense at all.
>
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.
What I mean is that any data read from the file should have the "appearance"
of being zeroed (whether zeroes are actually written to disk or not). What
I _think_ David is proposing is to allow fallocate() to return without
marking the blocks even "uninitialized" and subsequent reads would return
the old data from the disk.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > each extent. For some workloads this would be much faster than truncate
> > and reallocate of all the blocks in a file.
>
> In ext4, we already mark each extent having preallocated blocks as
> uninitialized. This is done as part of following code (which is part of
> patch 5/7) in ext4_ext_get_blocks() :
What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.
So, if you had a sparse file with some data in it:
AAAAA BBBBBB
fallocate() would allocate the holes:
00000AAAAA000000000BBBBBB00000000
XFS_IOC_ALLOCSP would overwrite everything:
000000000000000000000000000000000
In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation). This is farily easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> Hmm.. I am thinking of a scenario when the file system supports some
> individual flags, but does not support a particular combination of them.
> Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> for some reason). This means that although we support FA_FL_DEALLOC,
> FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> combination of all these flags (which is nothing but FA_UNRESV_SPACE).
That is up to the filesystem to determine then. I just thought it should
be clear to return an error for flags (or as you say combinations thereof)
that the filesystem doesn't understand.
That said, I'd think in most cases the flags are orthogonal, so if you
support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
then you will also support other combinations of those flags just from
the way it is coded.
> > I also thought another proposed flag was to determine whether mtime (and
> > maybe ctime) is changed when doing prealloc/dealloc space? Default should
> > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> > should decide if we want to allow changing the file w/o changing ctime, if
> > that is required even though the file is not visibly changing. Maybe the
> > ctime update should be implicit if the size or mtime are changing?
>
> Is it really required ? I mean, why should we allow users not to update
> ctime/mtime even if the file metadata/data gets updated ? It sounds
> a bit "unnatural" to me.
> Is there any application scenario in your mind, when you suggest of
> giving this flexibility to userspace ?
One reason is that XFS does NOT update the mtime/ctime when doing the
XFS_IOC_* allocation ioctls.
> I think, modifying ctime/mtime should be dependent on the other flags.
> E.g., if we do not zero out data blocks on allocation/deallocation,
> update only ctime. Otherwise, update ctime and mtime both.
I'm only being the advocate for requirements David Chinner has put
forward due to existing behaviour in XFS. This is one of the reasons
why I think the "flags" mechanism we now have - we can encode the
various different behaviours in any way we want and leave it to the
caller.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit? Does it keep the current fallocate() or does it free it?
> >
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
>
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.
Ok, got your point. Maybe we can have a flag for this, as you suggested.
But, default behavior IMHO should be _not_ to undo partial allocation
(thus the file system will have the option of supporting this flag or
not and it will be inline with posix_fallocate; XFS will obviously
like to support this flag, inline with its existing behavior).
> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace. I'm not
> > > sure if this makes sense at all.
> >
> > I don't think we need to make it default - atleast for filesystems which
> > have a mechanism to distinguish preallocated blocks from "regular" ones.
>
> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not). What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.
I can't think of a good reason for this (i.e. returning stale data from
preallocated blocks). It is infact a security issue to me.
Anyhow, this may though be beneficial for file systems which have
noticable overhead in marking the blocks "uninitialized/preallocated".
Can you or David please throw some light on how this option might really
be helpful ? Thanks!
--
Regards,
Amit Arora
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > > each extent. For some workloads this would be much faster than truncate
> > > and reallocate of all the blocks in a file.
> >
> > In ext4, we already mark each extent having preallocated blocks as
> > uninitialized. This is done as part of following code (which is part of
> > patch 5/7) in ext4_ext_get_blocks() :
>
> What I meant is that with XFS_IOC_ALLOCSP the previously-written data
> is ZEROED OUT, unlike with fallocate() which leaves previously-written
> data alone and only allocates in holes.
>
> In order to specify this for allocation, FA_FL_DEL_DATA would need to make
> sense for allocations (as well as the deallocation). This is farily easy
> to do - just mark all of the existing extents as unallocated, and their
> data disappears.
Ok, agreed. Will add the FA_ZERO_SPACE mode too.
Thanks!
--
Regards,
Amit Arora
On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > Hmm.. I am thinking of a scenario when the file system supports some
> > individual flags, but does not support a particular combination of them.
> > Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> > file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> > FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> > for some reason). This means that although we support FA_FL_DEALLOC,
> > FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> > combination of all these flags (which is nothing but FA_UNRESV_SPACE).
>
> That is up to the filesystem to determine then. I just thought it should
> be clear to return an error for flags (or as you say combinations thereof)
> that the filesystem doesn't understand.
>
> That said, I'd think in most cases the flags are orthogonal, so if you
> support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
> then you will also support other combinations of those flags just from
> the way it is coded.
Ok.
> > > I also thought another proposed flag was to determine whether mtime (and
> > > maybe ctime) is changed when doing prealloc/dealloc space? Default should
> > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> > > should decide if we want to allow changing the file w/o changing ctime, if
> > > that is required even though the file is not visibly changing. Maybe the
> > > ctime update should be implicit if the size or mtime are changing?
> >
> > Is it really required ? I mean, why should we allow users not to update
> > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > a bit "unnatural" to me.
> > Is there any application scenario in your mind, when you suggest of
> > giving this flexibility to userspace ?
>
> One reason is that XFS does NOT update the mtime/ctime when doing the
> XFS_IOC_* allocation ioctls.
Hmm.. I personally will call it a bug in XFS code then. :)
> > I think, modifying ctime/mtime should be dependent on the other flags.
> > E.g., if we do not zero out data blocks on allocation/deallocation,
> > update only ctime. Otherwise, update ctime and mtime both.
>
> I'm only being the advocate for requirements David Chinner has put
> forward due to existing behaviour in XFS. This is one of the reasons
> why I think the "flags" mechanism we now have - we can encode the
> various different behaviours in any way we want and leave it to the
> caller.
I understand. May be we can confirm once more with David Chinner if this
is really required. Will it really be a compatibility issue if new XFS
preallocations (ie. via fallocate) update mtime/ctime ? Will old
applications really get affected ? If yes, then it might be worth
implementing - even though I personally don't like it.
David, can you please confirm ? Thanks!
--
Regards,
Amit Arora
> Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
> +++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
> @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
> return sys_truncate(path, (high << 32) | low);
> }
>
> +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
> + u32 lenhi, u32 lenlo)
> +{
> + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
> + ((loff_t)lenhi << 32) | lenlo);
> +}
> +
> asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
> unsigned long low)
> {
> Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
> +++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
> @@ -719,4 +719,5 @@ ia32_sys_call_table:
> .quad compat_sys_signalfd
> .quad compat_sys_timerfd
> .quad sys_eventfd
> + .quad sys_fallocate
> ia32_syscall_end:
Btw. this is also (still?) broken. x86_64 needs a compat syscall here.
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote:
> > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post.
> > If it is decided that these flags are also needed, I will update this
> > patch. Thanks!
>
> Can you clarify - what is the current behaviour when ENOSPC (or some other
> error) is hit? Does it keep the current fallocate() or does it free it?
>
> For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> don't want to expose uninitialized disk blocks to userspace. I'm not
> sure if this makes sense at all.
Someone on the XFs list had an interesting request - preallocated
swap files. You can't use unwritten extents for this because
of sys_swapon()s use of bmap() (XFS returns holes for reading
unwritten extents), so we need a method of preallocating that does
not zero or mark the extent unread. i.e. FA_MKSWAP.
I thinkthis would be:
#define FA_FL_NO_ZERO_SPACE 0x08 /* default is to zero space */
#define FA_MKSWAP (FA_ALLOCATE | FA_FL_NO_ZERO_SPACE)
That way we can allocate large swap files that don't need zeroing
in a single, fast operation, and hence potentially bring new
swap space online without needed very much memory at all (i.e.
should succeed in most near-OOM conditions).
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Mon, Jun 25, 2007 at 06:58:10PM +0530, Amit K. Arora wrote:
> 2) The above new patches (4/7 and 7/7) are based on the dicussion
> between Andreas Dilger and David Chinner on the mode argument,
> when later posted a man page on fallocate.
Can you include the man page in this patch set, please? That
way it can be kept up to date with the rest of the patch set.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit? Does it keep the current fallocate() or does it free it?
> >
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
>
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.
No, XFs does not free preallocated space on error. it is up to the
application to clean up.
> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not). What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.
Correct, but for swap files that's not an issue - no user should be able
too read them, and FA_MKSWAP would really need root privileges to execute.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote:
> > +#define FA_FL_DEALLOC 0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
>
> In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
No, none of the XFS allocation modes do that.
XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and
writes zeros in the range between the old file size and the new file size.
XFS_IOC_RESVSP, which alocates unwritten extents, only allocates
where extents do not currently exist. It does not zero existing
extents.
IOWs, you can't overwrite existing data with XFS preallocation.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > > each extent. For some workloads this would be much faster than truncate
> > > and reallocate of all the blocks in a file.
> >
> > In ext4, we already mark each extent having preallocated blocks as
> > uninitialized. This is done as part of following code (which is part of
> > patch 5/7) in ext4_ext_get_blocks() :
>
> What I meant is that with XFS_IOC_ALLOCSP the previously-written data
> is ZEROED OUT, unlike with fallocate() which leaves previously-written
> data alone and only allocates in holes.
>
> So, if you had a sparse file with some data in it:
>
> AAAAA BBBBBB
>
> fallocate() would allocate the holes:
>
> 00000AAAAA000000000BBBBBB00000000
>
> XFS_IOC_ALLOCSP would overwrite everything:
>
> 000000000000000000000000000000000
No, it wouldn't. XFS_IOC_ALLOCSP would give you:
AAAAA BBBBBB00000000
because it only allocates the space between the old EOF and the new
EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k:
budgie:~ # xfs_io -f \
> -c "pwrite 4096 4096" \
> -c "pwrite 16384 4096" \
> -c "bmap -vvp" \
> -c "allocsp 32768 0" \
> -c "bmap -vvp" \
> /mnt/test/alfred
wrote 4096/4096 bytes at offset 4096
4 KiB, 1 ops; 0.0000 sec (108.507 MiB/sec and 27777.7778 ops/sec)
wrote 4096/4096 bytes at offset 16384
4 KiB, 1 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
/mnt/test/alfred:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: hole 8
1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8
2: [16..31]: hole 16
3: [32..39]: 5226888..5226895 4 (1022184..1022191) 8
/mnt/test/alfred:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: hole 8
1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8
2: [16..31]: hole 16
3: [32..63]: 5226888..5226919 4 (1022184..1022215) 32
budgie:~ #
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > > > I also thought another proposed flag was to determine whether mtime (and
> > > > maybe ctime) is changed when doing prealloc/dealloc space? Default should
> > > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> > > > should decide if we want to allow changing the file w/o changing ctime, if
> > > > that is required even though the file is not visibly changing. Maybe the
> > > > ctime update should be implicit if the size or mtime are changing?
> > >
> > > Is it really required ? I mean, why should we allow users not to update
> > > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > > a bit "unnatural" to me.
> > > Is there any application scenario in your mind, when you suggest of
> > > giving this flexibility to userspace ?
> >
> > One reason is that XFS does NOT update the mtime/ctime when doing the
> > XFS_IOC_* allocation ioctls.
Not totally correct.
XFS_IOC_ALLOCSP/FREESP change timestamps if they change
the file size (via the truncate call made to change the file size).
If they don't change the file size, then they are a no-op and should
not change the file size.
XFS_IOC_RESVSP/UNRESVSP don't change timestamps just like they don't
change file size. That is by design AFAICT so these calls can be
used by HSM-type applications that don't want to change timestamps
when punching out data blocks or preallocating new ones.
> Hmm.. I personally will call it a bug in XFS code then. :)
No, I'd call it useful. :)
> > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > update only ctime. Otherwise, update ctime and mtime both.
> >
> > I'm only being the advocate for requirements David Chinner has put
> > forward due to existing behaviour in XFS. This is one of the reasons
> > why I think the "flags" mechanism we now have - we can encode the
> > various different behaviours in any way we want and leave it to the
> > caller.
>
> I understand. May be we can confirm once more with David Chinner if this
> is really required. Will it really be a compatibility issue if new XFS
> preallocations (ie. via fallocate) update mtime/ctime?
It should be left up to the filesystem to decide. Only the
filesystem knows whether something changed and the timestamp should
or should not be updated.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <[email protected]> wrote:
> N O T E:
> -------
> 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
> of ext4 patch queue git tree hosted by Ted.
Why the heck are replacements for these things being sent out again when
they're already in -mm and they're already in Ted's queue (from which I
need to diligently drop them each time I remerge)?
Are we all supposed to re-review the entire patchset (or at least #4 and
#7) again?
The core kernel changes are not appropriate to the ext4 tree.
For a start, the syscall numbers in Ted's queue are wrong (other new
syscalls are pending).
Patches which add syscalls are an utter PITA to carry due to all the patch
conflicts and to the relatively frequent syscall renumbering (they don't
get numbered in time-of-arrival order due to differing rates at which patches
mature).
Please drop the non-ext4 patches from the ext4 tree and send incremental
patches against the (non-ext4) fallocate patches in -mm.
And try to get the code finished? Time is pressing.
Thanks.
On Thu, 2007-06-28 at 02:55 -0700, Andrew Morton wrote:
> Please drop the non-ext4 patches from the ext4 tree and send incremental
> patches against the (non-ext4) fallocate patches in -mm.
>
The ext4 fallocate() patches are dependent on the core fallocate()
patches, so ext4 patch-queue and git tree won't compile (it's not based
on mm tree) without the core changes.
We can send ext4 fallocate patches (incremental patches against mm tree)
and drop the full fallocate patches(ext4 and non ext4 part) from ext4
patch queue if you prefer this way.
> And try to get the code finished? Time is pressing.
>
I looked at the mm tree, there are other ext4 features/changes that are
currently in ext4-patch-queue(not ext4 git tree) that not in part of
ext4 series yet. Ted, can you merge those patches to your git tree?
Thanks!
Thanks for your patience.
Mingming.
On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > N O T E:
> > -------
> > 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
> > of ext4 patch queue git tree hosted by Ted.
>
> Why the heck are replacements for these things being sent out again when
> they're already in -mm and they're already in Ted's queue (from which I
> need to diligently drop them each time I remerge)?
>
> Are we all supposed to re-review the entire patchset (or at least #4 and
> #7) again?
As I mentioned in the note above, only patches #4 and #7 were new and
thus these needed to be reviewed. Other patches are _not_ replacements
of any of the patches which are already part of -mm and/or in Ted's
patch queue. They were posted again as just "placeholders" so that the
two new patches (#4 & #7) could be reviewed. Sorry for any confusion.
> Please drop the non-ext4 patches from the ext4 tree and send incremental
> patches against the (non-ext4) fallocate patches in -mm.
Please let us know what you think of Mingming's suggestion of posting
all the fallocate patches including the ext4 ones as incremental ones
against the -mm.
--
Regards,
Amit Arora
On Wed, Jun 27, 2007 at 10:04:56AM +1000, David Chinner wrote:
> On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> > On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > > On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > > update only ctime. Otherwise, update ctime and mtime both.
> > >
> > > I'm only being the advocate for requirements David Chinner has put
> > > forward due to existing behaviour in XFS. This is one of the reasons
> > > why I think the "flags" mechanism we now have - we can encode the
> > > various different behaviours in any way we want and leave it to the
> > > caller.
> >
> > I understand. May be we can confirm once more with David Chinner if this
> > is really required. Will it really be a compatibility issue if new XFS
> > preallocations (ie. via fallocate) update mtime/ctime?
>
> It should be left up to the filesystem to decide. Only the
> filesystem knows whether something changed and the timestamp should
> or should not be updated.
Since Andreas had suggested FA_FL_NO_MTIME flag thinking it as a
requirement from XFS (whereas XFS does not need this flag), I don't think
we need to add this new flag.
Please let know if someone still feels FA_FL_NO_MTIME flag can be
useful.
--
Regards,
Amit Arora
On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > error) is hit? Does it keep the current fallocate() or does it free it?
> > >
> > > Currently it is left on the file system implementation. In ext4, we do
> > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > end up with partial (pre)allocation. This is inline with dd and
> > > posix_fallocate, which also do not free the partially allocated space.
> >
> > Since I believe the XFS allocation ioctls do it the opposite way (free
> > preallocated space on error) this should be encoded into the flags.
> > Having it "filesystem dependent" just means that nobody will be happy.
>
> No, XFs does not free preallocated space on error. it is up to the
> application to clean up.
Since XFS also does not free preallocated space on error and this
behavior is inline with dd, posix_fallocate() and the current ext4
implementation, do we still need FA_FL_FREE_ENOSPC flag ?
> > What I mean is that any data read from the file should have the "appearance"
> > of being zeroed (whether zeroes are actually written to disk or not). What
> > I _think_ David is proposing is to allow fallocate() to return without
> > marking the blocks even "uninitialized" and subsequent reads would return
> > the old data from the disk.
>
> Correct, but for swap files that's not an issue - no user should be able
> too read them, and FA_MKSWAP would really need root privileges to execute.
Will the FA_MKSWAP mode still be required with your suggested change of
teaching do_mpage_readpage() about unwritten extents being in place ?
Or, will you still like to have FA_MKSWAP mode ?
--
Regards,
Amit Arora
On Thu, 28 Jun 2007 23:27:57 +0530 "Amit K. Arora" <[email protected]> wrote:
> > Please drop the non-ext4 patches from the ext4 tree and send incremental
> > patches against the (non-ext4) fallocate patches in -mm.
>
> Please let us know what you think of Mingming's suggestion of posting
> all the fallocate patches including the ext4 ones as incremental ones
> against the -mm.
I think Mingming was asking that Ted move the current quilt tree into git,
presumably because she's working off git.
I'm not sure what to do, really. The core kernel patches need to be in
Ted's tree for testing but that'll create a mess for me.
ug.
Options might be
a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
any needed build fixes, then just wait for it all to mature and then
think about it again
b) We do what we normally don't do and reserve the syscall slots in mainline.
On Thu, 2007-06-28 at 11:33 -0700, Andrew Morton wrote:
> On Thu, 28 Jun 2007 23:27:57 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > > Please drop the non-ext4 patches from the ext4 tree and send incremental
> > > patches against the (non-ext4) fallocate patches in -mm.
> >
> > Please let us know what you think of Mingming's suggestion of posting
> > all the fallocate patches including the ext4 ones as incremental ones
> > against the -mm.
>
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.
I moved the fallocate patches to the very end of the series in the quilt
tree. This way the patches will be in the quilt tree for testing, but
Ted can easily leave them out of the git tree so you and Linus won't
pull them with the ext4 patches.
Fortunately, the ext4-specific fallocate patches don't conflict with the
other patches in the queue, so they can (at least for now) be handled
independently in the -mm tree.
> I'm not sure what to do, really. The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.
>
> ug.
>
> Options might be
>
> a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
> any needed build fixes, then just wait for it all to mature and then
> think about it again
>
> b) We do what we normally don't do and reserve the syscall slots in mainline.
--
David Kleikamp
IBM Linux Technology Center
Andrew Morton wrote:
> b) We do what we normally don't do and reserve the syscall slots in mainline.
If everyone agrees it's going to happen... why not?
Jeff
On Jun 28, 2007 23:27 +0530, Amit K. Arora wrote:
> On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> > Are we all supposed to re-review the entire patchset (or at least #4 and
> > #7) again?
>
> As I mentioned in the note above, only patches #4 and #7 were new and
> thus these needed to be reviewed. Other patches are _not_ replacements
> of any of the patches which are already part of -mm and/or in Ted's
> patch queue. They were posted again as just "placeholders" so that the
> two new patches (#4 & #7) could be reviewed. Sorry for any confusion.
The new patches are definitely a big improvement over the previous API,
and need to go in before fallocate() goes into mainline. This last set
of changes allows the behaviour of these syscalls to accomodate the various
different semantics desired by XFS in a sensible manner instead of tying
all of the individual behaviours (time update, size update, alloc/free, etc)
into monolithic modes that will never make everyone happy.
My understanding is that you only need to grab #4 and #7 to get your tree
into get fallocate in sync with the ext4 patch queue (i.e. they are
incremental over the previous set).
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Thu, 2007-06-28 at 23:49 +0530, Amit K. Arora wrote:
>
> > Correct, but for swap files that's not an issue - no user should be
> able
> > too read them, and FA_MKSWAP would really need root privileges to
> execute.
>
> Will the FA_MKSWAP mode still be required with your suggested change
> of
> teaching do_mpage_readpage() about unwritten extents being in place ?
> Or, will you still like to have FA_MKSWAP mode ?
There's no need for a MKSWAP flag.
cheers.
--
Nathan
On Thu, Jun 28, 2007 at 11:49:13PM +0530, Amit K. Arora wrote:
> On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
> > On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> > > On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > > error) is hit? Does it keep the current fallocate() or does it free it?
> > > >
> > > > Currently it is left on the file system implementation. In ext4, we do
> > > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > > end up with partial (pre)allocation. This is inline with dd and
> > > > posix_fallocate, which also do not free the partially allocated space.
> > >
> > > Since I believe the XFS allocation ioctls do it the opposite way (free
> > > preallocated space on error) this should be encoded into the flags.
> > > Having it "filesystem dependent" just means that nobody will be happy.
> >
> > No, XFs does not free preallocated space on error. it is up to the
> > application to clean up.
>
> Since XFS also does not free preallocated space on error and this
> behavior is inline with dd, posix_fallocate() and the current ext4
> implementation, do we still need FA_FL_FREE_ENOSPC flag ?
Not at the moment.
> > > What I mean is that any data read from the file should have the "appearance"
> > > of being zeroed (whether zeroes are actually written to disk or not). What
> > > I _think_ David is proposing is to allow fallocate() to return without
> > > marking the blocks even "uninitialized" and subsequent reads would return
> > > the old data from the disk.
> >
> > Correct, but for swap files that's not an issue - no user should be able
> > too read them, and FA_MKSWAP would really need root privileges to execute.
>
> Will the FA_MKSWAP mode still be required with your suggested change of
> teaching do_mpage_readpage() about unwritten extents being in place ?
> Or, will you still like to have FA_MKSWAP mode ?
budgie:/mnt/test # xfs_io -f -c "resvsp 0 1048576" -c "truncate 1048576" swap_file
budgie:/mnt/test # mkswap swap_file
Setting up swapspace version 1, size = 1032 kB
budgie:/mnt/test # swapon -v swap_file
swapon on swap_file
budgie:/mnt/test # swapon -s
Filename Type Size Used Priority
/dev/sda2 partition 9437152 0 -1
/mnt/test/swap_file file 992 0 -2
budgie:/mnt/test # xfs_bmap -vvp swap_file
swap_file:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..31]: 96..127 0 (96..127) 32
1: [32..2047]: 128..2143 0 (128..2143) 2016 10000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
Looks like the changes work, so FA_MKSWAP is not necessary for XFS.
We can drop that for the moment unless anyone else sees a need for it.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.
>
> I'm not sure what to do, really. The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.
Could we please stop this stupid ext4-centrism? XFS is ready so we can
put in the syscalls backed by XFS. We have already done this with the xattr
syscalls in 2.4, btw.
Then again I don't think we should put it in quite yet, because this thread
has degraded into creeping featurism, please give me some more time to
preparate a semi-coheret rants about this..
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
> > Please let us know what you think of Mingming's suggestion of posting
> > all the fallocate patches including the ext4 ones as incremental ones
> > against the -mm.
>
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.
No, mingming and I both work off of the patch queue (which is also
stored in git). So what mingming was asking for exactly was just
posting the incremental patches and tagging them appropriately to
avoid confusion.
I tried building the patch queue earlier in the week and it there were
multiple oops/panics as I ran things through various regression tests,
but that may have been fixed since (the tree was broken over the
weekend and I may have grabbed a broken patch series) or it may have
been a screw up on my part feeding them into our testing grid. I
haven't had time to try again this week, but I'll try to put together
a new tested ext4 patchset over the weekend.
> I'm not sure what to do, really. The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.
I don't think we have a problem here. What we have now is fine, and
it was just people kvetching that Amit reposted patches that were
already in -mm and ext4.
In any case, the plan is to push all of the core bits into Linus tree
for 2.6.22 once it opens up, which should be Real Soon Now, it looks
like.
- Ted
Theodore Tso wrote:
> I don't think we have a problem here. What we have now is fine, and
It's fine for ext4, but not the wider world. This is a common problem
created by parallel development when code dependencies exist.
> In any case, the plan is to push all of the core bits into Linus tree
> for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> like.
Presumably you mean 2.6.23.
Jeff
Theodore Tso wrote:
> On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
>
>>> Please let us know what you think of Mingming's suggestion of posting
>>> all the fallocate patches including the ext4 ones as incremental ones
>>> against the -mm.
>>>
>> I think Mingming was asking that Ted move the current quilt tree into git,
>> presumably because she's working off git.
>>
>
> No, mingming and I both work off of the patch queue (which is also
> stored in git). So what mingming was asking for exactly was just
> posting the incremental patches and tagging them appropriately to
> avoid confusion.
>
> I tried building the patch queue earlier in the week and it there were
> multiple oops/panics as I ran things through various regression tests,but that may have been fixed since (the tree was broken over the
> weekend and I may have grabbed a broken patch series) or it may have
> been a screw up on my part feeding them into our testing grid. I
> haven't had time to try again this week, but I'll try to put together
> a new tested ext4 patchset over the weekend.
>
>
I think the ext4 patch queue is in good shape now. Shaggy have tested
in on dbench, fsx, and tiobench, tests runs fine. and BULL team has
benchmarked the latest ext4 patch queue with iozone and FFSB.
Regards,
Mingming
>> I'm not sure what to do, really. The core kernel patches need to be in
>> Ted's tree for testing but that'll create a mess for me.
>>
>
> I don't think we have a problem here. What we have now is fine, and
> it was just people kvetching that Amit reposted patches that were
> already in -mm and ext4.
>
> In any case, the plan is to push all of the core bits into Linus tree
> for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> like.
>
> - Ted
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote:
> >In any case, the plan is to push all of the core bits into Linus tree
> >for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> >like.
>
> Presumably you mean 2.6.23.
Yes, sorry. I meant once Linus releases 2.6.22, and we would be
aiming to merge before the 2.6.23-rc1 window.
- Ted
On Fri, 29 Jun 2007 11:50:04 -0400
Mingming Caoc <[email protected]> wrote:
> I think the ext4 patch queue is in good shape now.
Which ext4 patches are you intending to merge into 2.6.23?
Please send all those out to lkml for review?
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS? The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP). I guess the one potential limitation here is the
> number of function parameters on some architectures.
This isn't really about "more suitable for XFS" but more about more
suitable for sophisticated layout decisions.
But I'm still not confident this should be shohorned into this
syscall. In fact I'm already rather unhappy about the feature churn in
the current patch series.
The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.
On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit? Does it keep the current fallocate() or does it free it?
>
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.
I can't find anything in the specification of posix_fallocate
(http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
that tells what should happen to allocate blocks on error.
But common sense would be to not leak disk space on failure of this
syscall, and this definitively should not be left up to the filesystem,
either we always leak it or always free it, and I'd strongly favour
the latter variant.
> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace. I'm not
> > sure if this makes sense at all.
>
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.
> In ext4, for example, we will have a way to mark uninitialized extents.
> All the preallocated blocks will be part of these uninitialized extents.
> And any read on these extents will treat them as a hole, returning
> zeroes to user land. Thus any existing data on uninitialized blocks will
> not be exposed to the userspace.
This is the xfs unwritten extent behaviour. But anyway, the important bit
is uninitialized blocks should never ever leak to userspace, so there is
not need for the flag.