On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
> This patch implements the fallocate() system call and adds support for
> i386, x86_64 and powerpc.
>
> ...
>
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
Please add a comment over this function which specifies its behaviour.
Really it should be enough material from which a full manpage can be
written.
If that's all too much, this material should at least be spelled out in the
changelog. Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.
If we 100% implement some standard then a URL for what we claim to
implement would suffice. Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.
And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?
> +{
> + struct file *file;
> + struct inode *inode;
> + long ret = -EINVAL;
> +
> + if (len == 0 || offset < 0)
> + goto out;
The posix spec implies that negative `len' is permitted - presumably "allocate
ahead of `offset'". How peculiar.
> + ret = -EBADF;
> + file = fget(fd);
> + if (!file)
> + goto out;
> + if (!(file->f_mode & FMODE_WRITE))
> + goto out_fput;
> +
> + inode = file->f_path.dentry->d_inode;
> +
> + ret = -ESPIPE;
> + if (S_ISFIFO(inode->i_mode))
> + goto out_fput;
> +
> + ret = -ENODEV;
> + if (!S_ISREG(inode->i_mode))
> + goto out_fput;
So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
seems a bit silly of them.
> + ret = -EFBIG;
> + if (offset + len > inode->i_sb->s_maxbytes)
> + goto out_fput;
This code does handle offset+len going negative, but only by accident, I
suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment
here would settle the reader's mind.
> + if (inode->i_op && inode->i_op->fallocate)
> + ret = inode->i_op->fallocate(inode, mode, offset, len);
> + else
> + ret = -ENOSYS;
If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.
If we're not going to handle negative `len' then we should check for it.
> +out_fput:
> + fput(file);
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);
I don't believe this needs to be exported to modules?
> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE 0x1
> +#define FA_DEALLOCATE 0x2
Now those aren't in posix. They should be documented, along with their
expected semantics.
> #ifdef __KERNEL__
>
> #include <linux/linkage.h>
> @@ -1125,6 +1131,7 @@ struct inode_operations {
> ssize_t (*listxattr) (struct dentry *, char *, size_t);
> int (*removexattr) (struct dentry *, const char *);
> void (*truncate_range)(struct inode *, loff_t, loff_t);
> + long (*fallocate)(struct inode *, int, loff_t, loff_t);
I really do think it's better to put the variable names in definitions such
as this. Especially when we have two identically-typed variables next to
each other like that. Quick: which one is the offset and which is the
length?
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton <[email protected]> wrote:
> > + ret = -EFBIG;
> > + if (offset + len > inode->i_sb->s_maxbytes)
> > + goto out_fput;
>
> This code does handle offset+len going negative, but only by accident, I
> suspect.
But it doesn't handle offset+len wrapping through zero.
Andrew Morton writes:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> >
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
>
> Please add a comment over this function which specifies its behaviour.
> Really it should be enough material from which a full manpage can be
> written.
This looks like it will have the same problem on s390 as
sys_sync_file_range. Maybe the prototype should be:
asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
Paul.
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> >
> > ...
> > +{
> > + struct file *file;
> > + struct inode *inode;
> > + long ret = -EINVAL;
> > +
> > + if (len == 0 || offset < 0)
> > + goto out;
>
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'". How peculiar.
I just checked the man page for posix_fallocate() and it says:
EINVAL offset or len was less than zero.
We should probably follow this lead.
> > +
> > + ret = -ENODEV;
> > + if (!S_ISREG(inode->i_mode))
> > + goto out_fput;
>
> So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
> seems a bit silly of them.
Hmmmm - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[email protected]> wrote:
> On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
> >
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > >
> > > ...
> > > +{
> > > + struct file *file;
> > > + struct inode *inode;
> > > + long ret = -EINVAL;
> > > +
> > > + if (len == 0 || offset < 0)
> > > + goto out;
> >
> > The posix spec implies that negative `len' is permitted - presumably "allocate
> > ahead of `offset'". How peculiar.
>
> I just checked the man page for posix_fallocate() and it says:
>
> EINVAL offset or len was less than zero.
>
> We should probably follow this lead.
Yes, I think so. I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy. Or I can't read.
I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment? Head spins.
However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).
> > > +
> > > + ret = -ENODEV;
> > > + if (!S_ISREG(inode->i_mode))
> > > + goto out_fput;
> >
> > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
> > seems a bit silly of them.
>
> Hmmmm - I thought that the intention of sys_fallocate() was to
> be generic enough to eventually allow preallocation on directories.
> If that is the case, then this check will prevent that....
The above opengroup page only permits S_ISREG. Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it. And there's a decent case to
be made for emulating it - run-anywhere reasons. Does glibc emulation support
directories? Quite unlikely.
But yes, sounds like a desirable thing. Would XFS support it easily if the above
check was relaxed?
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> > > The posix spec implies that negative `len' is permitted - presumably "allocate
> > > ahead of `offset'". How peculiar.
> >
> > I just checked the man page for posix_fallocate() and it says:
> >
> > EINVAL offset or len was less than zero.
That describes the current glibc implementation.
> > We should probably follow this lead.
>
> Yes, I think so. I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy. Or I can't read.
>
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment? Head spins.
>
> However it would be neat if someone could test $OTHER_OS and, perhaps more
> importantly, the present glibc emulation (which I assume your manpage is
> referring to, so this would be a manpage test ;)).
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
struct stat64 st;
struct statfs f;
/* `off_t' is a signed type. Therefore we can determine whether
OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
if (offset + len < 0)
return EFBIG;
/* First thing we have to make sure is that this is really a regular
file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
return ESPIPE;
if (! S_ISREG (st.st_mode))
return ENODEV;
if (len == 0)
{
if (st.st_size < offset)
{
int ret = __ftruncate (fd, offset);
if (ret != 0)
ret = errno;
return ret;
}
return 0;
}
...
is what glibc does ATM. Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail". But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for len bytes"
doesn't make sense for negative size.
And given the general
"Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring."
I believe returning EINVAL for len < 0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
len has undefined or implementation defined behavior.
> The above opengroup page only permits S_ISREG. Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it. And there's a decent case to
> be made for emulating it - run-anywhere reasons. Does glibc emulation support
> directories? Quite unlikely.
No, see above.
Jakub
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[email protected]> wrote:
> > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
> > >
> > > > This patch implements the fallocate() system call and adds support for
> > > > i386, x86_64 and powerpc.
> > > >
> > > > ...
> > > > +{
> > > > + struct file *file;
> > > > + struct inode *inode;
> > > > + long ret = -EINVAL;
> > > > +
> > > > + if (len == 0 || offset < 0)
> > > > + goto out;
> > >
> > > The posix spec implies that negative `len' is permitted - presumably "allocate
> > > ahead of `offset'". How peculiar.
> >
> > I just checked the man page for posix_fallocate() and it says:
> >
> > EINVAL offset or len was less than zero.
> >
> > We should probably follow this lead.
>
> Yes, I think so. I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy. Or I can't read.
>
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment? Head spins.
I don't think we should care. If we provide a syscall with the
semantics of "allocate from offset to offset+len" then glibc's
implementation can turn negative length into two separate
fallocate syscalls....
> > > > + ret = -ENODEV;
> > > > + if (!S_ISREG(inode->i_mode))
> > > > + goto out_fput;
> > >
> > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
> > > seems a bit silly of them.
> >
> > Hmmmm - I thought that the intention of sys_fallocate() was to
> > be generic enough to eventually allow preallocation on directories.
> > If that is the case, then this check will prevent that....
>
> The above opengroup page only permits S_ISREG. Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it. And there's a decent case to
> be made for emulating it - run-anywhere reasons. Does glibc emulation support
> directories? Quite unlikely.
>
> But yes, sounds like a desirable thing. Would XFS support it easily if the above
> check was relaxed?
No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.
It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).
IIRC directory preallocation has been talked about more for ext3/4....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Andrew,
Thanks for the review comments!
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> >
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
>
> Please add a comment over this function which specifies its behaviour.
> Really it should be enough material from which a full manpage can be
> written.
>
> If that's all too much, this material should at least be spelled out in the
> changelog. Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
>
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice. Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
>
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?
Ok. I will add a detailed comment here.
>
> > +{
> > + struct file *file;
> > + struct inode *inode;
> > + long ret = -EINVAL;
> > +
> > + if (len == 0 || offset < 0)
> > + goto out;
>
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'". How peculiar.
I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.
>
> > + ret = -EBADF;
> > + file = fget(fd);
> > + if (!file)
> > + goto out;
> > + if (!(file->f_mode & FMODE_WRITE))
> > + goto out_fput;
> > +
> > + inode = file->f_path.dentry->d_inode;
> > +
> > + ret = -ESPIPE;
> > + if (S_ISFIFO(inode->i_mode))
> > + goto out_fput;
> > +
> > + ret = -ENODEV;
> > + if (!S_ISREG(inode->i_mode))
> > + goto out_fput;
>
> So we return ENODEV against an S_ISBLK fd, as per the posix spec. That
> seems a bit silly of them.
True.
> > + ret = -EFBIG;
> > + if (offset + len > inode->i_sb->s_maxbytes)
> > + goto out_fput;
>
> This code does handle offset+len going negative, but only by accident, I
> suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment
> here would settle the reader's mind.
Ok. I will add a check here for wrap though zero.
> > + if (inode->i_op && inode->i_op->fallocate)
> > + ret = inode->i_op->fallocate(inode, mode, offset, len);
> > + else
> > + ret = -ENOSYS;
>
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
>
> If we're not going to handle negative `len' then we should check for it.
Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
> > +out_fput:
> > + fput(file);
> > +out:
> > + return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
>
> I don't believe this needs to be exported to modules?
Ok. Will remove it.
> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE 0x1
> > +#define FA_DEALLOCATE 0x2
>
> Now those aren't in posix. They should be documented, along with their
> expected semantics.
Will add a comment describing the role of these modes.
> > #ifdef __KERNEL__
> >
> > #include <linux/linkage.h>
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> > ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > int (*removexattr) (struct dentry *, const char *);
> > void (*truncate_range)(struct inode *, loff_t, loff_t);
> > + long (*fallocate)(struct inode *, int, loff_t, loff_t);
>
> I really do think it's better to put the variable names in definitions such
> as this. Especially when we have two identically-typed variables next to
> each other like that. Quick: which one is the offset and which is the
> length?
Ok. Will add the variable names here.
--
Regards,
Amit Arora
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG. Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it. And there's a decent case to
> be made for emulating it - run-anywhere reasons. Does glibc emulation support
> directories? Quite unlikely.
>
> But yes, sounds like a desirable thing. Would XFS support it easily if the above
> check was relaxed?
I think we may relax the check here and let the individual file system
decide if they support preallocation for directories or not. What do you
think ?
One thing to be thought in this case is the error code which should be
returned by the file system implementation, incase it doesn't support
preallocation for directories. Should it be -ENODEV (to match with what
posix says) , or something else (which might make more sense in this
case) ?
--
Regards,
Amit Arora
Jakub Jelinek wrote:
> is what glibc does ATM. Seems we violate the case where len == 0, as
> EINVAL in that case is "shall fail". But reading the standard to imply
> negative len is ok is too much guessing, there is no word what it means
> when len is negative and
> "required storage for regular file data starting at offset and continuing for len bytes"
> doesn't make sense for negative size.
This wording has already been cleaned up. The current draft for the
next revision reads:
[EINVAL] The len argument is less than or equal to zero, or the offset
argument is less than zero, or the underlying file system does not
support this operation.
I still don't like it since len==0 shouldn't create an error (it's
inconsistent) but len<0 is already outlawed.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
>
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[email protected]> wrote:
> >
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > >
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > Please add a comment over this function which specifies its behaviour.
> > Really it should be enough material from which a full manpage can be
> > written.
>
> This looks like it will have the same problem on s390 as
> sys_sync_file_range. Maybe the prototype should be:
>
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)
http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)
So we are kind of deadlocked, aren't we ?
The debates on the proposed solution for s390
http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2
Are there any better ideas ?
Regards
Suparna
>
> Paul.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
Suparna Bhattacharya writes:
> > This looks like it will have the same problem on s390 as
> > sys_sync_file_range. Maybe the prototype should be:
> >
> > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
>
> Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> first be maintained as a convention like other syscalls (see the following
> posts)
Of course the interface used by an application program would have the
fd first. Glibc can do the translation.
Paul.
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
>
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range. Maybe the prototype should be:
> > >
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> >
> > Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> > first be maintained as a convention like other syscalls (see the following
> > posts)
>
> Of course the interface used by an application program would have the
> fd first. Glibc can do the translation.
I think that was understood.
Regards
Suparna
>
> Paul.
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
Suparna Bhattacharya writes:
> > Of course the interface used by an application program would have the
> > fd first. Glibc can do the translation.
>
> I think that was understood.
OK, then what does it matter what the glibc/kernel interface is, as
long as it works?
It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).
Paul.
On 5/9/07, Paul Mackerras <[email protected]> wrote:
> Suparna Bhattacharya writes:
>
> > > Of course the interface used by an application program would have the
> > > fd first. Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).
Ah, almost but not quite the point. But I admit it is hard to understand..
The trouble started with the futex call which has been the first
system call with 6 arguments. s390 supported only 5 arguments up to
that point (%r2 - %r6). For futex we added a wrapper to the glibc that
loaded the 6th argument to %r7. In entry.S we set up things so that
%r7 gets stored to the kernel stack where normal C code expects the
first overflow argument. This enabled us to use the standard futex
system call with 6 arguments.
fallocate now has an additional problem: the last argument is a 64 bit
integers AND registers %r2-%r5 are already used. In this case the 64
bit number would have to be split into the high part in %r6 and the
low part on the stack so that the glibc wrapper can load the low part
to %r7. But the C compiler will skip %r6 and store the 64 bit number
on the stack.
If the order of the arguments if modified so that %r6 is assigned to a
32-bit argument, then the entry.S magic with %r7 would work.
--
blue skies,
Martin
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
>
> > > Of course the interface used by an application program would have the
> > > fd first. Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).
You are right to say that. But, it may not be _that_ a minor point,
especially for the arch which is getting affected. It has
other implications like what Heiko noticed in his post below:
http://lkml.org/lkml/2007/4/27/377
- implications like modifying glibc and *trace utilities for a particular
arch.
--
Regards,
Amit Arora