by Amit K. Arora

This is a heads-up on a few patches that we will be posting soon.
These patches implement a new system call, sys_fallocate(), and a
new inode operation, "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

While we are developing and testing the required patches, we decided to
post a preliminary patch and get input from the community to steer it
in the right direction. First, a short description of the feature.

Persistent preallocation is a file system feature that lets an
application (say, a relational database server) explicitly
preallocate blocks to a particular file. It can be used to
reserve space for a file, with two main benefits:
1> contiguity - less fragmentation and thus faster access, and
2> a guarantee of minimum space availability (depending on how many
blocks were preallocated) for the file, even if the file system becomes
full.

XFS already has an implementation of this, using an ioctl interface, and
ext4 is now adding the feature. In time we may see a few more file
systems implementing it. Thus, it makes sense to have a more
standard interface for this, such as this new system call.
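
For readers who want to see what preallocation looks like from an application, here is a minimal userspace sketch. It assumes the existing POSIX call posix_fallocate() rather than the system call proposed in this mail, and the function name and scratch-file path are made up for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a scratch file, reserve `len` bytes for it, and return the
 * resulting file size, or -1 on any failure. The path template is an
 * arbitrary choice made for this sketch. */
off_t reserve_scratch_file(off_t len)
{
    char path[] = "/tmp/prealloc-XXXXXX";
    struct stat st;
    off_t size = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */

    /* posix_fallocate() returns 0 or an error number directly. */
    if (posix_fallocate(fd, 0, len) == 0 && fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}
```

A database-style caller would do this once, right after creating the file, so that later writes into the reserved range should not fail for lack of space.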

Here is the initial, incomplete version of the patch, which can be
used for discussion until we come up with a more complete set of
patches.

---
arch/i386/kernel/syscall_table.S | 1 +
fs/ext4/file.c | 1 +
fs/open.c | 18 ++++++++++++++++++
include/asm-i386/unistd.h | 3 ++-
include/linux/fs.h | 1 +
include/linux/syscalls.h | 1 +
6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_fallocate /* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===================================================================
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
.removexattr = generic_removexattr,
#endif
.permission = ext4_permission,
+ .fallocate = ext4_fallocate,
};

Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
}
#endif

+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+ struct file *file;
+ struct inode *inode;
+ long ret = -EINVAL;
+ file = fget(fd);
+ if (!file)
+ goto out;
+ inode = file->f_path.dentry->d_inode;
+ if (inode->i_op && inode->i_op->fallocate)
+ ret = inode->i_op->fallocate(inode, offset, len);
+ else
+ ret = -ENOTTY;
+ fput(file);
+out:
+ return ret;
+}
+
/*
* access() needs to use the real uid/gid, not the effective uid/gid.
* We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_fallocate 320

#ifdef __KERNEL__

-#define NR_syscalls 320
+#define NR_syscalls 321

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ long (*fallocate)(struct inode *, loff_t, loff_t);
};

struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

--
Regards,
Amit Arora
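
The dispatch in the fs/open.c hunk can be tried out in userspace with a toy model. Everything below is hypothetical scaffolding (the toy_* types and functions are not kernel APIs); it only imitates the "call the file system's handler if it has one, otherwise fail" logic. One deliberate difference, stated as an assumption: a missing file is reported as -EBADF here, whereas the preliminary patch returns -EINVAL in that case.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures; only the
 * members needed for the illustration are modelled. */
struct toy_inode;

struct toy_inode_operations {
    long (*fallocate)(struct toy_inode *inode, long long offset, long long len);
};

struct toy_inode {
    const struct toy_inode_operations *i_op;
    long long reserved;  /* highest byte offset "preallocated" so far */
};

/* A file system that supports preallocation supplies its own handler. */
static long toy_fs_fallocate(struct toy_inode *inode, long long offset,
                             long long len)
{
    if (offset < 0 || len <= 0)
        return -EINVAL;
    if (offset + len > inode->reserved)
        inode->reserved = offset + len;
    return 0;
}

const struct toy_inode_operations toy_fs_ops = { .fallocate = toy_fs_fallocate };
const struct toy_inode_operations plain_ops = { .fallocate = NULL };

/* Same shape as the dispatch in the patch: use the handler when present,
 * otherwise report the operation as unsupported (-ENOTTY, as posted). */
long toy_sys_fallocate(struct toy_inode *inode, long long offset, long long len)
{
    if (inode == NULL)
        return -EBADF;  /* assumption: the posted patch uses -EINVAL here */
    if (inode->i_op && inode->i_op->fallocate)
        return inode->i_op->fallocate(inode, offset, len);
    return -ENOTTY;
}
```

An inode set up with toy_fs_ops accepts the request, while one set up with plain_ops gets -ENOTTY, which matches how a file system without the new operation would behave under the posted code.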

2007-03-01 19:15:19

by Eric Sandeen

2007-03-02 06:17:51

by Andrew Morton

Subject: Re: [RFC] Heads up on sys_fallocate()

On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero;
otherwise, an error number shall be returned to indicate the error.
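
The practical consequence of that wording is that callers compare the return value against error numbers instead of checking for -1 and reading errno. A minimal sketch, with a made-up wrapper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Returns 0 on success, otherwise the error number handed back by
 * posix_fallocate() itself; errno is not consulted. */
int try_reserve(int fd, off_t offset, off_t len)
{
    int err = posix_fallocate(fd, offset, len);

    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}
```

Calling try_reserve() with an invalid descriptor such as -1 should hand back EBADF as the return value.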

2007-03-02 07:12:08

Jörn Engel wrote:
>> Of course. You call posix_fallocate once for the lifetime of the file
>> when it is created to ensure that all future uses will work.
>
> That part is not quite clear from the manpage but I trust most people
> would assume the same.

Not only that, it is what this function is for. In the POSIX committee
we've looked at the functions in detail before adding them, even if some
information is not in the man page but instead in the Rationale.

> Still, it is quite obvious that no one designing this interface has given
> much thought to compressing filesystems.

You already have problems supporting the functionality that
posix_fallocate is meant to support. You cannot reliably support MAP_SHARED
files if all of a sudden the compression causes an expansion of a block
and that causes an ENOSPC error. So, don't expect pity. This is a
function in support of a real and reliable implementation of memory
mapped files. You don't use MAP_SHARED on such filesystems, it'll eat
your kittens sooner or later anyway.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2007-03-05 16:02:55

by Theodore Ts'o

Subject: Re: [RFC] Heads up on sys_fallocate()

On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now. The kernel has access to the bitmasks which say which blocks have
> already been allocated. The libc code does not and we have to be very
> simple-minded and simply touch every block. And this means reading it
> and then writing it back. The kernel would know when the reading part
> is not necessary. Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel. The libc
> implementation can then go away completely, which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache. So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.
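
To make the "touch every block" cost concrete, here is a rough userspace sketch of that kind of emulation. It is an illustration only and not the actual libc code: the function names are invented, fstatvfs() stands in for the fstatfs() call mentioned above, and the read-then-write pair is not atomic with respect to concurrent writers.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one byte and write it back. Past EOF the read returns 0 bytes and
 * a zero byte is written, which extends the file. */
static int touch_byte(int fd, off_t pos)
{
    char byte = 0;
    ssize_t n = pread(fd, &byte, 1, pos);

    if (n < 0)
        return errno;
    n = pwrite(fd, &byte, 1, pos);
    if (n < 0)
        return errno;
    return n == 1 ? 0 : EIO;
}

/* Naive preallocation of [offset, offset + len): touch one byte in every
 * block of the range, then the last byte so the size reaches the end. */
int touch_every_block(int fd, off_t offset, off_t len)
{
    struct statvfs vfs;
    off_t end, pos, step;
    int err;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstatvfs(fd, &vfs) != 0)
        return errno;
    step = vfs.f_bsize > 0 ? (off_t)vfs.f_bsize : 512;
    end = offset + len;

    /* Start at offset, then continue at each following block boundary. */
    for (pos = offset; pos < end; pos = (pos / step + 1) * step) {
        err = touch_byte(fd, pos);
        if (err != 0)
            return err;
    }
    return touch_byte(fd, end - 1);
}

/* Usage sketch: returns 0 if a scratch file grows to 10000 bytes while
 * its existing contents survive. */
int touch_demo(void)
{
    char path[] = "/tmp/touch-XXXXXX";
    char buf[3];
    struct stat st;
    int ok, fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    ok = pwrite(fd, "abc", 3, 0) == 3
        && touch_every_block(fd, 0, 10000) == 0
        && fstat(fd, &st) == 0 && st.st_size == 10000
        && pread(fd, buf, 3, 0) == 3 && memcmp(buf, "abc", 3) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

Every block costs a read and a write through the page cache, even when the block is already allocated, which is the churn that an in-kernel implementation could avoid.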

Regards,

- Ted

2007-03-05 16:07:24

On Sat, 17 Mar 2007 15:30:43 +0100 Heiko Carstens wrote:
>
> sys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags)
>
> But from what I read, it's currently not possible for 32-bit powerpc to
> wire up the already present sync_file_range system call.

32-bit native is fine (as the ABI in user mode is the same as that in the
kernel). For 32-bit on a 64-bit kernel you need the arch-specific compat
routine that I used in the patch I posted a little while ago.
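
The reason a compat routine is needed at all is that a 32-bit caller cannot pass a 64-bit loff_t in one register, so the value arrives as two 32-bit halves that must be reassembled. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the real register pairing and ordering depend on the architecture's ABI, which this does not model.

```c
#include <stdint.h>

/* Split a 64-bit offset into the two 32-bit halves that a 32-bit caller
 * would place in separate registers. */
void split_offset(int64_t value, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)((uint64_t)value >> 32);
    *lo = (uint32_t)((uint64_t)value & 0xffffffffu);
}

/* What a compat wrapper has to do before calling the native 64-bit entry
 * point: rebuild the original value from its halves. */
int64_t join_offset(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

With two loff_t arguments, as in sys_sync_file_range() or the proposed sys_fallocate(), a wrapper of this kind has to do the join twice, once per argument.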

--
Cheers,
Stephen Rothwell
http://www.canb.auug.org.au/~sfr/

2007-03-17 14:42:14

On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit? Does it keep the current fallocate() or does it free it?
> >
> > Currently it is left to the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is in line with dd and
> > posix_fallocate, which also do not free the partially allocated space.
>
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.

OK, got your point. Maybe we can have a flag for this, as you suggested.
But the default behavior IMHO should be _not_ to undo partial allocation
(thus the file system will have the option of supporting this flag or
not, and it will be in line with posix_fallocate; XFS will obviously
want to support this flag, in line with its existing behavior).

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace. I'm not
> > > sure if this makes sense at all.
> >
> > I don't think we need to make it default - at least for filesystems which
> > have a mechanism to distinguish preallocated blocks from "regular" ones.
>
> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not). What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.

I can't think of a good reason for this (i.e. returning stale data from
preallocated blocks). In fact, it looks like a security issue to me.
That said, it might be beneficial for file systems which have
noticeable overhead in marking the blocks "uninitialized/preallocated".
Can you or David please shed some light on how this option might really
be helpful? Thanks!
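
The "appearance of being zeroed" that Andreas describes can be checked from userspace. The sketch below assumes posix_fallocate() as the preallocation call, since the interface under discussion is not final; the function name and scratch path are invented for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Preallocate `len` bytes in a fresh scratch file and count how many of
 * them read back as non-zero. Returns -1 on failure. */
long count_nonzero_after_prealloc(size_t len)
{
    char path[] = "/tmp/zeroed-XXXXXX";
    unsigned char *buf = malloc(len);
    long nonzero = 0;
    size_t i;
    int fd;

    if (buf == NULL)
        return -1;
    fd = mkstemp(path);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    unlink(path);

    if (posix_fallocate(fd, 0, (off_t)len) != 0
        || pread(fd, buf, len, 0) != (ssize_t)len) {
        nonzero = -1;
    } else {
        for (i = 0; i < len; i++)
            if (buf[i] != 0)
                nonzero++;
    }

    close(fd);
    free(buf);
    return nonzero;
}
```

A result of 0 is what an application expects; any other positive count would mean old disk contents are visible through the preallocated range, which is the exposure being questioned here.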

--
Regards,
Amit Arora

2007-06-26 19:12:09

by Amit K. Arora

2007-06-26 23:26:49

by David Chinner

On Jul 12, 2007 13:56 +0530, Amit K. Arora wrote:
> As you suggest, let us just have two modes for the time being:
>
> #define FALLOC_ALLOCATE 0x1
> #define FALLOC_ALLOCATE_KEEP_SIZE 0x2
>
> As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
> will result in file size not being changed even if the preallocation is
> beyond EOF.

What does FALLOC_ALLOCATE mean vs. not passing this flag? I have no
objection to this as long as the code remains with these as "flags"
instead of "modes"... Essentially just dropping the FALLOC_FL_DEALLOCATE
and FALLOC_FL_DEL_DATA from the interface.
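
For illustration, here is how a "keep size" request looks with the Linux fallocate() call and FALLOC_FL_KEEP_SIZE flag as found in current glibc and kernel headers. Note the assumption: the names proposed in this thread (FALLOC_ALLOCATE, FALLOC_ALLOCATE_KEEP_SIZE) are drafts and differ from those, and the helper name and scratch path below are invented. The sketch is Linux-specific.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve `len` bytes past EOF in an empty scratch file without changing
 * its size. Returns the file size afterwards (expected: 0), or -1 if the
 * call failed, for example on a file system without fallocate support. */
off_t prealloc_keep_size(off_t len)
{
    char path[] = "/tmp/keepsize-XXXXXX";
    struct stat st;
    off_t size = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);

    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) == 0
        && fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}
```

Passing 0 instead of FALLOC_FL_KEEP_SIZE would extend the reported size to cover the preallocated range.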

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.