2007-05-17 14:11:26

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 0/6][TAKE4] fallocate system call

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.

Interface:
---------
The proposed system call's layout is:

asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successfull allocation.
FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
already under discussion is FA_PREALLOCATE, which when used will
preallocate space but will not change the filesize and [cm]time.
Since the semantics of this new mode is not clear and agreed upon yet,
this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
start.

len: This is the number of bytes requested for preallocation (from
offset).

RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate().

sys_fallocate() on s390:
-----------------------
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is a talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this later some time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
a) to support fallocate() system call
b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
---------
Changes from Take2 to Take3:
1) Return type is now described in the interface description
above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Following patches follow:
Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora


2007-05-19 06:49:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/6][TAKE4] fallocate system call

On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" <[email protected]> wrote:

> fallocate() is a new system call being proposed here which will allow
> applications to preallocate space to any file(s) in a file system.

I merged the first three patches into -mm, thanks.

All the system call numbers got changed due to recent additions. They
may change in the future, too - nothing is stable until the code lands
in mainline.

I didn't merge any of the ext4 changes as they appear to be in Ted's
devel tree. Although I didn't check that they are 100% the same in
that tree.

What's the plan to get some ext4 updates into mainline, btw? Things
seem to be rather gradual.

2007-05-21 05:25:03

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 0/6][TAKE4] fallocate system call

On Fri, 2007-05-18 at 23:44 -0700, Andrew Morton wrote:
> On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> > fallocate() is a new system call being proposed here which will allow
> > applications to preallocate space to any file(s) in a file system.
>
> I merged the first three patches into -mm, thanks.
>
> All the system call numbers got changed due to recent additions. They
> may change in the future, too - nothing is stable until the code lands
> in mainline.
>
In case you haven't realize it, the ia64 fallocate() patch comes with
Amit's takes 4 fallocate patch series (3/6) missing one line change,
thus fail to compile on ia64.

Here is the updated one. Patch tested on ia64. (compile and fsx)

fallocate() on ia64

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <[email protected]>

---
arch/ia64/kernel/entry.S | 1 +
include/asm-ia64/unistd.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-18 16:30:16.000000000 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-18 16:32:45.000000000 -0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait // 1305
data8 sys_utimensat
+ data8 sys_fallocate

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-18 16:30:16.000000000 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-18 17:34:58.000000000 -0700
@@ -296,11 +296,12 @@
#define __NR_getcpu 1304
#define __NR_epoll_pwait 1305
#define __NR_utimensat 1306
+#define __NR_fallocate 1307

#ifdef __KERNEL__


-#define NR_syscalls 283 /* length of syscall table */
+#define NR_syscalls 285 /* length of syscall table */

#define __ARCH_WANT_SYS_RT_SIGACTION
#define __ARCH_WANT_SYS_RT_SIGSUSPEND


> I didn't merge any of the ext4 changes as they appear to be in Ted's
> devel tree. Although I didn't check that they are 100% the same in
> that tree.
>
Since both Amit and Ted are traveling, I will jump in...

Most likely it's not the same one. What in Ted's devel tree is "takes 2"
patches.

I have incorporated takes 4 patches in the backing ext4 patch git tree
here:
http://repo.or.cz/w/ext4-patch-queue.git

I have tested these patch series on ia64,ppc64,x86 and x86_64. I am not
sure if Ted got a chance to update his ext4 git tree from this patch
queue git tree yet.

> What's the plan to get some ext4 updates into mainline, btw? Things
> seem to be rather gradual.


Last time Ted and I discussed we all agree fallocate patches should go
into mainline. Actually most patches marked before the "unstable
patches" can get into mainline, especially the following patches
(contains a few bug fixes patches)

# New patch to fix whitespace before applying new patches
whitespace.patch

#New patch to remove unnecessary exported symbols
ext4_remove_exported_symbles.patch

# New patch to add mount option to turn off extents
ext4_noextent_mount_opt.patch
# Now Turn on extents feature by default
ext4_extents_on_by_default.patch

#New patch to propagate inode flags
ext4-propagate_flags.patch

#New patch to add extent sanity checks
ext4-extent-sanity-checks.patch

#New patch to free blocks when failed to insert an extent
ext4-free-blocks-on-insert-extent-failure.patch

We already missed rc-1 window, but if possible, I would like to see ext4
fallocate patches and above patches in mainline 2.6.22. The nanosecond
timestamp patch is probably good to go also.

Regards,
Mingming
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html