2007-07-10 20:11:52

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 0/7][TAKE6] fallocate system call

This is the latest fallocate patchset and is rebased to 2.6.22.

Following are the changes from TAKE5:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintaners can
add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
alignment in ext4 (Patch 6/6). Please refer following post:
http://www.mail-archive.com/[email protected]/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
David Chinner).


Todos:
-----
1> Implementation on other architectures (other than i386, x86_64,
and ppc64). s390(x) and ia64 patches are ready and will be pushed
by platform maintaners when the fallocate is in mainline.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
a) to support fallocate() system call
b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> A testcase to test the system call. Will post it soon.


Following patches follow:
Patch 1/7 : manpage for fallocate
Patch 2/7 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/7 : support new modes in fallocate
Patch 4/7 : ext4: fallocate support in ext4
Patch 5/7 : ext4: write support for preallocated blocks
Patch 6/7 : ext4: support new modes in ext4
Patch 7/7 : ext4: change for better extent-to-group alignment


--
Regards,
Amit Arora


2007-07-10 20:18:16

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 1/7] manpage for fallocate

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.


.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include <sys/syscall.h>
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are four modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space. If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range that was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.TP
.B FALLOC_DEALLOCATE
removes any preallocated space within the given range. The file size
may change if deallocation is towards the end of the file.
.TP
.B FALLOC_UNRESV_SPACE
removes the underlying disk space within the given range. The disk space
shall be removed regardless of it's contents so both allocated space
from
.B FALLOC_ALLOCATE
and
.B FALLOC_RESV_SPACE
as well as from
.B write(3)
will be removed.
.B FALLOC_UNRESV_SPACE
shall never remove disk blocks outside the range specified.
.B FALLOC_UNRESV_SPACE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file) is required,
.B ftuncate64(3)
or
.B FALLOC_DEALLOCATE
should be used.

.SH "RETURN VALUE"
.BR fallocate()
returns zero on success, or an error number on failure.
Note that
.IR errno
is not set.
.SH "ERRORS"
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.I offset+len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
or
.I len
was less than 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd.
.TP
.B ESPIPE
.I fd
refers to a pipe of file descriptor.
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.SH AVAILABILITY
The
.BR fallocate ()
system call is available since 2.6.XX
.SH "SEE ALSO"
.BR syscall (2),
.BR posix_fadvise (3)
.BR ftruncate (3)

2007-07-10 20:19:55

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

From: Amit Arora <[email protected]>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.


Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+ .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
}

+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+ u32 lenhi, u32 lenlo)
+{
+ return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+ ((loff_t)lenhi << 32) | lenlo);
+}
+
asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
unsigned long low)
{
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+ .quad sys32_fallocate
ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===================================================================
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
#endif

/*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ * FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0 : On SUCCESS a value of zero is returned.
+ * error : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ * support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+ struct file *file;
+ struct inode *inode;
+ long ret = -EINVAL;
+
+ if (offset < 0 || len <= 0)
+ goto out;
+
+ /* Return error if mode is not supported */
+ ret = -EOPNOTSUPP;
+ if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+ goto out;
+
+ ret = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto out;
+ if (!(file->f_mode & FMODE_WRITE))
+ goto out_fput;
+
+ inode = file->f_path.dentry->d_inode;
+
+ ret = -ESPIPE;
+ if (S_ISFIFO(inode->i_mode))
+ goto out_fput;
+
+ ret = -ENODEV;
+ /*
+ * Let individual file system decide if it supports preallocation
+ * for directories or not.
+ */
+ if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+ goto out_fput;
+
+ ret = -EFBIG;
+ /* Check for wrap through zero too */
+ if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+ goto out_fput;
+
+ if (inode->i_op && inode->i_op->fallocate)
+ ret = inode->i_op->fallocate(inode, mode, offset, len);
+ else
+ ret = -ENOSYS;
+
+out_fput:
+ fput(file);
+out:
+ return ret;
+}
+
+/*
* access() needs to use the real uid/gid, not the effective uid/gid.
* We do this by temporarily clearing all FS-related capabilities and
* switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/unistd.h
+++ linux-2.6.22/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
#define __NR_signalfd 321
#define __NR_timerfd 322
#define __NR_eventfd 323
+#define __NR_fallocate 324

#ifdef __KERNEL__

-#define NR_syscalls 324
+#define NR_syscalls 325

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22/include/asm-powerpc/systbl.h
@@ -308,6 +308,7 @@ COMPAT_SYS_SPU(move_pages)
SYSCALL_SPU(getcpu)
COMPAT_SYS(epoll_pwait)
COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
COMPAT_SYS_SPU(signalfd)
COMPAT_SYS_SPU(timerfd)
SYSCALL_SPU(eventfd)
Index: linux-2.6.22/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22/include/asm-powerpc/unistd.h
@@ -331,10 +331,11 @@
#define __NR_timerfd 306
#define __NR_eventfd 307
#define __NR_sync_file_range2 308
+#define __NR_fallocate 309

#ifdef __KERNEL__

-#define __NR_syscalls 309
+#define __NR_syscalls 310

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
Index: linux-2.6.22/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
__SYSCALL(__NR_timerfd, sys_timerfd)
#define __NR_eventfd 284
__SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate 285
+__SYSCALL(__NR_fallocate, sys_fallocate)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22/include/linux/fs.h
===================================================================
--- linux-2.6.22.orig/include/linux/fs.h
+++ linux-2.6.22/include/linux/fs.h
@@ -266,6 +266,17 @@ extern int dir_notify_enable;
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4

+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE : This is the preallocate mode, using which an application/user
+ * may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ * the preallocated blocks.
+ */
+#define FA_ALLOCATE 0x1
+#define FA_DEALLOCATE 0x2
+
#ifdef __KERNEL__

#include <linux/linkage.h>
@@ -1138,6 +1149,8 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+ loff_t len);
};

struct seq_file;
Index: linux-2.6.22/include/linux/syscalls.h
===================================================================
--- linux-2.6.22.orig/include/linux/syscalls.h
+++ linux-2.6.22/include/linux/syscalls.h
@@ -610,6 +610,7 @@ asmlinkage long sys_signalfd(int ufd, si
asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
const struct itimerspec __user *utmr);
asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

Index: linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
===================================================================
--- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
+++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
@@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd,
return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
len, advice);
}
+
+asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
+ unsigned offset_hi, unsigned len_lo,
+ unsigned len_hi)
+{
+ return sys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+ ((u64)len_hi << 32) | len_lo);
+}

2007-07-10 20:21:12

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 3/7] support new modes in fallocate

From: Amit Arora <[email protected]>
Implement new flags and values for mode argument.

This patch implements the new flags and values for the "mode" argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the man page proposed (by the later)
on fallocate.

Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/include/linux/fs.h
===================================================================
--- linux-2.6.22.orig/include/linux/fs.h
+++ linux-2.6.22/include/linux/fs.h
@@ -267,15 +267,17 @@ extern int dir_notify_enable;
#define SYNC_FILE_RANGE_WAIT_AFTER 4

/*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE : This is the preallocate mode, using which an application/user
- * may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- * the preallocated blocks.
+ * sys_fallocate mode flags and values
*/
-#define FA_ALLOCATE 0x1
-#define FA_DEALLOCATE 0x2
+#define FALLOC_FL_DEALLOC 0x01 /* default is allocate */
+#define FALLOC_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
+#define FALLOC_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */
+
+#define FALLOC_ALLOCATE 0
+#define FALLOC_DEALLOCATE FALLOC_FL_DEALLOC
+#define FALLOC_RESV_SPACE FALLOC_FL_KEEP_SIZE
+#define FALLOC_UNRESV_SPACE (FALLOC_FL_DEALLOC | FALLOC_FL_KEEP_SIZE | \
+ FALLOC_FL_DEL_DATA)

#ifdef __KERNEL__

Index: linux-2.6.22/fs/open.c
===================================================================
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
* sys_fallocate - preallocate blocks or free preallocated blocks
* @fd: the file descriptor
* @mode: mode specifies if fallocate should preallocate blocks OR free
- * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- * FA_DEALLOCATE modes are supported.
+ * (unallocate) preallocated blocks.
* @offset: The offset within file, from where (un)allocation is being
* requested. It should not have a negative value.
* @len: The amount (in bytes) of space to be (un)allocated, from the offset.
*
* This system call, depending on the mode, preallocates or unallocates blocks
* for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FALLOC_ALLOCATE and
+ * FALLOC_RESV_SPACE modes, if the sys_fallocate()
* system call succeeds, subsequent writes to the file in the given range
* (specified by offset & len) should not fail - even if the file system
* later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FALLOC_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
*
* It is expected that the ->fallocate() inode operation implemented by the
* individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
*
* Note: Incase the file system does not support preallocation,
* posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in

/* Return error if mode is not supported */
ret = -EOPNOTSUPP;
- if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+ if (!(mode == FALLOC_ALLOCATE || mode == FALLOC_DEALLOCATE ||
+ mode == FALLOC_RESV_SPACE || mode == FALLOC_UNRESV_SPACE))
goto out;

ret = -EBADF;

2007-07-10 20:22:42

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 4/7] ext4: fallocate support in ext4

From: Amit Arora <[email protected]>

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a <ToDo> item.


Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:33.000000000 -0700
+++ linux-2.6.22/fs/ext4/extents.c 2007-07-09 15:24:39.000000000 -0700
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug(" %d:%d:%llu ",
le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
ext_pblock(path->p_ext));
} else
ext_debug(" []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in

for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
}
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode,
ext_debug(" -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
- le16_to_cpu(path->p_ext->ee_len));
+ ext4_ext_get_actual_len(path->p_ext));

#ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
- le16_to_cpu(path[depth].p_ext->ee_len),
+ ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
{
- if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+ unsigned short ext1_ee_len, ext2_ee_len;
+
+ /*
+ * Make sure that either both extents are uninitialized, or
+ * both are _not_.
+ */
+ if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+ return 0;
+
+ ext1_ee_len = ext4_ext_get_actual_len(ex1);
+ ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+ if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;

@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode
* as an RO_COMPAT feature, refuse to merge to extents if
* this can result in the top bit of ee_len being set.
*/
- if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+ if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
#endif

- if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+ if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
}
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;

b1 = le32_to_cpu(newext->ee_block);
- len1 = le16_to_cpu(newext->ee_len);
+ len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+ unsigned uninitialized = 0;

- BUG_ON(newext->ee_len == 0);
+ BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
err = ext4_ext_get_access(handle, inode, path + depth);
if (err)
return err;
- ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
- + le16_to_cpu(newext->ee_len));
+
+ /*
+ * ext4_can_extents_be_merged should have checked that either
+ * both extents are uninitialized, or both aren't. Thus we
+ * need to check only one of them here.
+ */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(newext));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
eh = path[depth].p_hdr;
nearex = ex;
goto merge;
@@ -1263,7 +1286,7 @@ has_space:
ext_debug("first extent in the leaf: %d:%llu:%d\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len));
+ ext4_ext_get_actual_len(newext));
path[depth].p_ext = EXT_FIRST_EXTENT(eh);
} else if (le32_to_cpu(newext->ee_block)
> le32_to_cpu(nearex->ee_block)) {
@@ -1276,7 +1299,7 @@ has_space:
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 2, nearex + 1, len);
}
@@ -1289,7 +1312,7 @@ has_space:
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
- le16_to_cpu(newext->ee_len),
+ ext4_ext_get_actual_len(newext),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 1, nearex, len);
path[depth].p_ext = nearex;
@@ -1308,8 +1331,13 @@ merge:
if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
break;
/* merge with next extent! */
- nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
- + le16_to_cpu(nearex[1].ee_len));
+ if (ext4_ext_is_uninitialized(nearex))
+ uninitialized = 1;
+ nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+ + ext4_ext_get_actual_len(nearex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(nearex);
+
if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
len = (EXT_LAST_EXTENT(eh) - nearex - 1)
* sizeof(struct ext4_extent);
@@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in
end = le32_to_cpu(ex->ee_block);
if (block + num < end)
end = block + num;
- } else if (block >=
- le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+ } else if (block >= le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex)) {
/* need to allocate space after found extent */
start = block;
end = block + num;
@@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in
* by found extent
*/
start = block;
- end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+ end = le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex);
if (block + num < end)
end = block + num;
exists = 1;
@@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in
cbex.ec_type = EXT4_EXT_CACHE_GAP;
} else {
cbex.ec_block = le32_to_cpu(ex->ee_block);
- cbex.ec_len = le16_to_cpu(ex->ee_len);
+ cbex.ec_len = ext4_ext_get_actual_len(ex);
cbex.ec_start = ext_pblock(ex);
cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
}
@@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode *
ext_debug("cache gap(before): %lu [%lu:%lu]",
(unsigned long) block,
(unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) le16_to_cpu(ex->ee_len));
+ (unsigned long) ext4_ext_get_actual_len(ex));
} else if (block >= le32_to_cpu(ex->ee_block)
- + le16_to_cpu(ex->ee_len)) {
+ + ext4_ext_get_actual_len(ex)) {
lblock = le32_to_cpu(ex->ee_block)
- + le16_to_cpu(ex->ee_len);
+ + ext4_ext_get_actual_len(ex);
len = ext4_ext_next_allocated_block(path);
ext_debug("cache gap(after): [%lu:%lu] %lu",
(unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) le16_to_cpu(ex->ee_len),
+ (unsigned long) ext4_ext_get_actual_len(ex),
(unsigned long) block);
BUG_ON(len == lblock);
len = len - lblock;
@@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t *
unsigned long from, unsigned long to)
{
struct buffer_head *bh;
+ unsigned short ee_len = ext4_ext_get_actual_len(ex);
int i;

#ifdef EXTENTS_STATS
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- unsigned short ee_len = le16_to_cpu(ex->ee_len);
spin_lock(&sbi->s_ext_stats_lock);
sbi->s_ext_blocks += ee_len;
sbi->s_ext_extents++;
@@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t *
}
#endif
if (from >= le32_to_cpu(ex->ee_block)
- && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
/* tail removal */
unsigned long num;
ext4_fsblk_t start;
- num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
- start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+ num = le32_to_cpu(ex->ee_block) + ee_len - from;
+ start = ext_pblock(ex) + ee_len - num;
ext_debug("free last %lu blocks starting %llu\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t *
}
ext4_free_blocks(handle, inode, start, num);
} else if (from == le32_to_cpu(ex->ee_block)
- && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
printk("strange request: removal %lu-%lu from %u:%u\n",
- from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ from, to, le32_to_cpu(ex->ee_block), ee_len);
} else {
printk("strange request: removal(2) %lu-%lu from %u:%u\n",
- from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ from, to, le32_to_cpu(ex->ee_block), ee_len);
}
return 0;
}
@@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
unsigned a, b, block, num;
unsigned long ex_ee_block;
unsigned short ex_ee_len;
+ unsigned uninitialized = 0;
struct ext4_extent *ex;

ext_debug("truncate since %lu in leaf\n", start);
@@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
ex = EXT_LAST_EXTENT(eh);

ex_ee_block = le32_to_cpu(ex->ee_block);
- ex_ee_len = le16_to_cpu(ex->ee_len);
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex_ee_len = ext4_ext_get_actual_len(ex);

while (ex >= EXT_FIRST_EXTENT(eh) &&
ex_ee_block + ex_ee_len > start) {
@@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc

ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);

err = ext4_ext_dirty(handle, inode, path + depth);
if (err)
@@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
ext_pblock(ex));
ex--;
ex_ee_block = le32_to_cpu(ex->ee_block);
- ex_ee_len = le16_to_cpu(ex->ee_len);
+ ex_ee_len = ext4_ext_get_actual_len(ex);
}

if (correct_index && eh->eh_entries)
@@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle
if (ex) {
unsigned long ee_block = le32_to_cpu(ex->ee_block);
ext4_fsblk_t ee_start = ext_pblock(ex);
- unsigned short ee_len = le16_to_cpu(ex->ee_len);
+ unsigned short ee_len;

/*
* Allow future support for preallocated extents to be added
@@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle
* Uninitialized extents are treated as holes, except that
* we avoid (fail) allocating new blocks during a write.
*/
- if (ee_len > EXT_MAX_LEN)
+ if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
goto out2;
+ ee_len = ext4_ext_get_actual_len(ex);
/* if found extent covers block, simply return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
newblock = iblock - ee_block + ee_start;
@@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle
allocated = ee_len - (iblock - ee_block);
ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
ee_block, ee_len, newblock);
- ext4_ext_put_in_cache(inode, ee_block, ee_len,
- ee_start, EXT4_EXT_CACHE_EXTENT);
+ /* Do not put uninitialized extent in the cache */
+ if (!ext4_ext_is_uninitialized(ex))
+ ext4_ext_put_in_cache(inode, ee_block,
+ ee_len, ee_start,
+ EXT4_EXT_CACHE_EXTENT);
goto out;
}
}
@@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
+ ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
/* free data blocks we just allocated */
@@ -2113,8 +2153,10 @@ int ext4_ext_get_blocks(handle_t *handle
newblock = ext_pblock(&newex);
__set_bit(BH_New, &bh_result->b_state);

- ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
- EXT4_EXT_CACHE_EXTENT);
+ /* Cache only when it is _not_ an uninitialized extent */
+ if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+ ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+ EXT4_EXT_CACHE_EXTENT);
out:
if (allocated > max_blocks)
allocated = max_blocks;
@@ -2217,3 +2259,129 @@ int ext4_ext_writepage_trans_blocks(stru

return needed;
}
+
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+ handle_t *handle;
+ ext4_fsblk_t block, max_blocks;
+ ext4_fsblk_t nblocks = 0;
+ int ret = 0;
+ int ret2 = 0;
+ int retries = 0;
+ struct buffer_head map_bh;
+ unsigned int credits, blkbits = inode->i_blkbits;
+
+ /*
+ * currently supporting (pre)allocate mode for extent-based
+ * files _only_
+ */
+ if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EOPNOTSUPP;
+
+ /* preallocation to directories is currently not supported */
+ if (S_ISDIR(inode->i_mode))
+ return -ENODEV;
+
+ block = offset >> blkbits;
+ max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+ - block;
+
+ /*
+ * credits to insert 1 extent into extent tree + buffers to be able to
+ * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ */
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+ while (ret >= 0 && ret < max_blocks) {
+ block = block + ret;
+ max_blocks = max_blocks - ret;
+ handle = ext4_journal_start(inode, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ break;
+ }
+
+ ret = ext4_ext_get_blocks(handle, inode, block,
+ max_blocks, &map_bh,
+ EXT4_CREATE_UNINITIALIZED_EXT, 0);
+ WARN_ON(!ret);
+ if (!ret) {
+ ext4_error(inode->i_sb, "ext4_fallocate",
+ "ext4_ext_get_blocks returned 0! inode#%lu"
+ ", block=%llu, max_blocks=%llu",
+ inode->i_ino, block, max_blocks);
+ ret = -EIO;
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ if (ret > 0) {
+ /* check wrap through sign-bit/zero here */
+ if ((block + ret) < 0 || (block + ret) < block) {
+ ret = -EIO;
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ if (buffer_new(&map_bh) && ((block + ret) >
+ (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+ >> blkbits)))
+ nblocks = nblocks + ret;
+ }
+
+ /* Update ctime if new blocks get allocated */
+ if (nblocks) {
+ struct timespec now;
+ now = current_fs_time(inode->i_sb);
+ if (!timespec_equal(&inode->i_ctime, &now))
+ inode->i_ctime = now;
+ }
+
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+ }
+
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ /*
+ * Time to update the file size.
+ * Update only when preallocation was requested beyond the file size.
+ */
+ if ((offset + len) > i_size_read(inode)) {
+ if (ret > 0) {
+ /*
+ * if no error, we assume preallocation succeeded
+ * completely
+ */
+ mutex_lock(&inode->i_mutex);
+ i_size_write(inode, offset + len);
+ EXT4_I(inode)->i_disksize = i_size_read(inode);
+ mutex_unlock(&inode->i_mutex);
+ } else if (ret < 0 && nblocks) {
+ /* Handle partial allocation scenario */
+ loff_t newsize;
+
+ mutex_lock(&inode->i_mutex);
+ newsize = (nblocks << blkbits) + i_size_read(inode);
+ i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+ EXT4_I(inode)->i_disksize = i_size_read(inode);
+ mutex_unlock(&inode->i_mutex);
+ }
+ }
+
+ return ret > 0 ? ret2 : ret;
+}
+
Index: linux-2.6.22/fs/ext4/file.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/file.c 2007-07-09 15:24:33.000000000 -0700
+++ linux-2.6.22/fs/ext4/file.c 2007-07-09 15:24:39.000000000 -0700
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
.removexattr = generic_removexattr,
#endif
.permission = ext4_permission,
+ .fallocate = ext4_fallocate,
};

Index: linux-2.6.22/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs.h 2007-07-09 15:24:33.000000000 -0700
+++ linux-2.6.22/include/linux/ext4_fs.h 2007-07-09 15:24:39.000000000 -0700
@@ -102,6 +102,7 @@
EXT4_GOOD_OLD_FIRST_INO : \
(s)->s_first_ino)
#endif
+#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size), (1 << (blkbits)))

/*
* Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
__u32 free_blocks_count;
};

+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialzed extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT 2

/*
* ioctl commands
@@ -983,6 +989,8 @@ extern int ext4_ext_get_blocks(handle_t
extern void ext4_ext_truncate(struct inode *, struct page *);
extern void ext4_ext_init(struct super_block *);
extern void ext4_ext_release(struct super_block *);
+extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+ loff_t len);
static inline int
ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h 2007-07-09 15:24:33.000000000 -0700
+++ linux-2.6.22/include/linux/ext4_fs_extents.h 2007-07-09 15:24:39.000000000 -0700
@@ -188,6 +188,21 @@ ext4_ext_invalidate_cache(struct inode *
EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
}

+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
+{
+ ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
+{
+ return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
+{
+ return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);

2007-07-10 20:23:53

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 5/7] ext4: write support for preallocated blocks

From: Amit Arora <[email protected]>

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.

Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:39.000000000 -0700
+++ linux-2.6.22/fs/ext4/extents.c 2007-07-09 15:24:48.000000000 -0700
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode
}

/*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+ struct ext4_extent_header *eh;
+ unsigned int depth, len;
+ int merge_done = 0;
+ int uninitialized = 0;
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ eh = path[depth].p_hdr;
+
+ while (ex < EXT_LAST_EXTENT(eh)) {
+ if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+ break;
+ /* merge with next extent! */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(ex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
+
+ if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+ len = (EXT_LAST_EXTENT(eh) - ex - 1)
+ * sizeof(struct ext4_extent);
+ memmove(ex + 1, ex + 2, len);
+ }
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+ merge_done = 1;
+ WARN_ON(eh->eh_entries == 0);
+ if (!eh->eh_entries)
+ ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+ "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+ }
+
+ return merge_done;
+}
+
+/*
* check if a portion of the "newext" extent overlaps with an
* existing extent.
*
@@ -1327,25 +1374,7 @@ has_space:

merge:
/* try to merge extents to the right */
- while (nearex < EXT_LAST_EXTENT(eh)) {
- if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
- break;
- /* merge with next extent! */
- if (ext4_ext_is_uninitialized(nearex))
- uninitialized = 1;
- nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
- + ext4_ext_get_actual_len(nearex + 1));
- if (uninitialized)
- ext4_ext_mark_uninitialized(nearex);
-
- if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
- len = (EXT_LAST_EXTENT(eh) - nearex - 1)
- * sizeof(struct ext4_extent);
- memmove(nearex + 1, nearex + 2, len);
- }
- eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
- BUG_ON(eh->eh_entries == 0);
- }
+ ext4_ext_try_to_merge(inode, path, nearex);

/* try to merge extents to the left */

@@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block
#endif
}

+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ * a> There is no split required: Entire extent should be initialized
+ * b> Splits in two extents: Write is happening at either end of the extent
+ * c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+ struct ext4_ext_path *path,
+ ext4_fsblk_t iblock,
+ unsigned long max_blocks)
+{
+ struct ext4_extent *ex, newex;
+ struct ext4_extent *ex1 = NULL;
+ struct ext4_extent *ex2 = NULL;
+ struct ext4_extent *ex3 = NULL;
+ struct ext4_extent_header *eh;
+ unsigned int allocated, ee_block, ee_len, depth;
+ ext4_fsblk_t newblock;
+ int err = 0;
+ int ret = 0;
+
+ depth = ext_depth(inode);
+ eh = path[depth].p_hdr;
+ ex = path[depth].p_ext;
+ ee_block = le32_to_cpu(ex->ee_block);
+ ee_len = ext4_ext_get_actual_len(ex);
+ allocated = ee_len - (iblock - ee_block);
+ newblock = iblock - ee_block + ext_pblock(ex);
+ ex2 = ex;
+
+ /* ex1: ee_block to iblock - 1 : uninitialized */
+ if (iblock > ee_block) {
+ ex1 = ex;
+ ex1->ee_len = cpu_to_le16(iblock - ee_block);
+ ext4_ext_mark_uninitialized(ex1);
+ ex2 = &newex;
+ }
+ /*
+ * for sanity, update the length of the ex2 extent before
+ * we insert ex3, if ex1 is NULL. This is to avoid temporary
+ * overlap of blocks.
+ */
+ if (!ex1 && allocated > max_blocks)
+ ex2->ee_len = cpu_to_le16(max_blocks);
+ /* ex3: to ee_block + ee_len : uninitialised */
+ if (allocated > max_blocks) {
+ unsigned int newdepth;
+ ex3 = &newex;
+ ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+ ext4_ext_store_pblock(ex3, newblock + max_blocks);
+ ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+ ext4_ext_mark_uninitialized(ex3);
+ err = ext4_ext_insert_extent(handle, inode, path, ex3);
+ if (err)
+ goto out;
+ /*
+ * The depth, and hence eh & ex might change
+ * as part of the insert above.
+ */
+ newdepth = ext_depth(inode);
+ if (newdepth != depth) {
+ depth = newdepth;
+ path = ext4_ext_find_extent(inode, iblock, NULL);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ goto out;
+ }
+ eh = path[depth].p_hdr;
+ ex = path[depth].p_ext;
+ if (ex2 != &newex)
+ ex2 = ex;
+ }
+ allocated = max_blocks;
+ }
+ /*
+ * If there was a change of depth as part of the
+ * insertion of ex3 above, we need to update the length
+ * of the ex1 extent again here
+ */
+ if (ex1 && ex1 != ex) {
+ ex1 = ex;
+ ex1->ee_len = cpu_to_le16(iblock - ee_block);
+ ext4_ext_mark_uninitialized(ex1);
+ ex2 = &newex;
+ }
+ /* ex2: iblock to iblock + maxblocks-1 : initialised */
+ ex2->ee_block = cpu_to_le32(iblock);
+ ex2->ee_start = cpu_to_le32(newblock);
+ ext4_ext_store_pblock(ex2, newblock);
+ ex2->ee_len = cpu_to_le16(allocated);
+ if (ex2 != ex)
+ goto insert;
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (err)
+ goto out;
+ /*
+ * New (initialized) extent starts from the first block
+ * in the current extent. i.e., ex2 == ex
+ * We have to see if it can be merged with the extent
+ * on the left.
+ */
+ if (ex2 > EXT_FIRST_EXTENT(eh)) {
+ /*
+ * To merge left, pass "ex2 - 1" to try_to_merge(),
+ * since it merges towards right _only_.
+ */
+ ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+ if (ret) {
+ err = ext4_ext_correct_indexes(handle, inode, path);
+ if (err)
+ goto out;
+ depth = ext_depth(inode);
+ ex2--;
+ }
+ }
+ /*
+ * Try to Merge towards right. This might be required
+ * only when the whole extent is being written to.
+ * i.e. ex2 == ex and ex3 == NULL.
+ */
+ if (!ex3) {
+ ret = ext4_ext_try_to_merge(inode, path, ex2);
+ if (ret) {
+ err = ext4_ext_correct_indexes(handle, inode, path);
+ if (err)
+ goto out;
+ }
+ }
+ /* Mark modified extent as dirty */
+ err = ext4_ext_dirty(handle, inode, path + depth);
+ goto out;
+insert:
+ err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+ return err ? err : allocated;
+}
+
int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create, int extend_disksize)
{
struct ext4_ext_path *path = NULL;
+ struct ext4_extent_header *eh;
struct ext4_extent newex, *ex;
ext4_fsblk_t goal, newblock;
- int err = 0, depth;
+ int err = 0, depth, ret;
unsigned long allocated = 0;

__clear_bit(BH_New, &bh_result->b_state);
@@ -2032,8 +2204,10 @@ int ext4_ext_get_blocks(handle_t *handle
if (goal) {
if (goal == EXT4_EXT_CACHE_GAP) {
if (!create) {
- /* block isn't allocated yet and
- * user doesn't want to allocate it */
+ /*
+ * block isn't allocated yet and
+ * user doesn't want to allocate it
+ */
goto out2;
}
/* we should allocate requested block */
@@ -2067,6 +2241,7 @@ int ext4_ext_get_blocks(handle_t *handle
* this is why assert can't be put in ext4_ext_find_extent()
*/
BUG_ON(path[depth].p_ext == NULL && depth != 0);
+ eh = path[depth].p_hdr;

ex = path[depth].p_ext;
if (ex) {
@@ -2075,13 +2250,9 @@ int ext4_ext_get_blocks(handle_t *handle
unsigned short ee_len;

/*
- * Allow future support for preallocated extents to be added
- * as an RO_COMPAT feature:
* Uninitialized extents are treated as holes, except that
- * we avoid (fail) allocating new blocks during a write.
+ * we split out initialized portions during a write.
*/
- if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
- goto out2;
ee_len = ext4_ext_get_actual_len(ex);
/* if found extent covers block, simply return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2090,12 +2261,27 @@ int ext4_ext_get_blocks(handle_t *handle
allocated = ee_len - (iblock - ee_block);
ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
ee_block, ee_len, newblock);
+
/* Do not put uninitialized extent in the cache */
- if (!ext4_ext_is_uninitialized(ex))
+ if (!ext4_ext_is_uninitialized(ex)) {
ext4_ext_put_in_cache(inode, ee_block,
ee_len, ee_start,
EXT4_EXT_CACHE_EXTENT);
- goto out;
+ goto out;
+ }
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+ goto out;
+ if (!create)
+ goto out2;
+
+ ret = ext4_ext_convert_to_initialized(handle, inode,
+ path, iblock,
+ max_blocks);
+ if (ret <= 0)
+ goto out2;
+ else
+ allocated = ret;
+ goto outnew;
}
}

@@ -2104,8 +2290,10 @@ int ext4_ext_get_blocks(handle_t *handle
* we couldn't try to create block if create flag is zero
*/
if (!create) {
- /* put just found gap into cache to speed up
- * subsequent requests */
+ /*
+ * put just found gap into cache to speed up
+ * subsequent requests
+ */
ext4_ext_put_gap_in_cache(inode, path, iblock);
goto out2;
}
@@ -2151,6 +2339,7 @@ int ext4_ext_get_blocks(handle_t *handle

/* previous routine could use block we allocated */
newblock = ext_pblock(&newex);
+outnew:
__set_bit(BH_New, &bh_result->b_state);

/* Cache only when it is _not_ an uninitialized extent */
@@ -2220,7 +2409,8 @@ void ext4_ext_truncate(struct inode * in
err = ext4_ext_remove_space(inode, last_block);

/* In a multi-transaction truncate, we only make the final
- * transaction synchronous. */
+ * transaction synchronous.
+ */
if (IS_SYNC(inode))
handle->h_sync = 1;

Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h 2007-07-09 15:24:39.000000000 -0700
+++ linux-2.6.22/include/linux/ext4_fs_extents.h 2007-07-09 15:24:48.000000000 -0700
@@ -205,6 +205,9 @@ static inline int ext4_ext_get_actual_le

extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);

2007-07-10 20:24:28

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 6/7] ext4: support new modes in ext4

From: Amit Arora <[email protected]>
Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes. Supporting
FALLOC_DEALLOCATE and FALLOC_UNRESV_SPACE fallocate modes in ext4 is a work for
future.

Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -2453,8 +2453,9 @@ int ext4_ext_writepage_trans_blocks(stru
/*
* preallocate space for a file. This implements ext4's fallocate inode
* operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
- * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * Currently only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported on
+ * extent based files.
+ * We may have more modes supported in future - like FALLOC_DEALLOCATE, which
* tells fallocate to unallocate previously (pre)allocated blocks.
* For block-mapped files, posix_fallocate should fall back to the method
* of writing zeroes to the required new blocks (the same behavior which is
@@ -2475,7 +2476,8 @@ long ext4_fallocate(struct inode *inode,
* currently supporting (pre)allocate mode for extent-based
* files _only_
*/
- if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+ !(mode == FALLOC_ALLOCATE || mode == FALLOC_RESV_SPACE))
return -EOPNOTSUPP;

/* preallocation to directories is currently not supported */
@@ -2548,9 +2550,11 @@ retry:

/*
* Time to update the file size.
- * Update only when preallocation was requested beyond the file size.
+ * Update only when preallocation was requested beyond the file size
+ * and when FALLOC_FL_KEEP_SIZE mode is not specified!
*/
- if ((offset + len) > i_size_read(inode)) {
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ (offset + len) > i_size_read(inode)) {
if (ret > 0) {
/*
* if no error, we assume preallocation succeeded

2007-07-10 20:26:03

by Amit K. Arora

[permalink] [raw]
Subject: [PATCH 7/7] ext4: change for better extent-to-group alignment

From: Amit Arora <[email protected]>
Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger as part of the following
post:
http://www.mail-archive.com/[email protected]/msg02445.html

This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.

Signed-off-by: Amit Arora <[email protected]>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
{
- unsigned short ext1_ee_len, ext2_ee_len;
+ unsigned short ext1_ee_len, ext2_ee_len, max_len;

/*
* Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;

+ if (ext4_ext_is_uninitialized(ex1))
+ max_len = EXT_UNINIT_MAX_LEN;
+ else
+ max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);

@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode
* as an RO_COMPAT feature, refuse to merge to extents if
* this can result in the top bit of ee_len being set.
*/
- if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+ if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
#ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc

ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
- if (uninitialized)
+ /*
+ * Do not mark uninitialized if all the blocks in the
+ * extent have been removed.
+ */
+ if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);

err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,18 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);

+ /*
+ * See if request is beyond maximum number of blocks we can have in
+ * a single extent. For an initialized extent this limit is
+ * EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+ * EXT_UNINIT_MAX_LEN.
+ */
+ if (max_blocks > EXT_INIT_MAX_LEN && create != EXT4_CREATE_UNINITIALIZED_EXT)
+ max_blocks = EXT_INIT_MAX_LEN;
+ else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+ create == EXT4_CREATE_UNINITIALIZED_EXT)
+ max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru

#define EXT_MAX_BLOCK 0xffffffff

-#define EXT_MAX_LEN ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN (1UL << 15)
+#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)


#define EXT_FIRST_EXTENT(__hdr__) \
@@ -190,17 +208,21 @@ ext4_ext_invalidate_cache(struct inode *

static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
{
+ /* We can not have an uninitialized extent of zero length! */
+ BUG_ON((le16_to_cpu(ext->ee_len) & ~0x8000) == 0);
ext->ee_len |= cpu_to_le16(0x8000);
}

static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
{
- return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+ /* Extent with ee_len of 0x8000 is treated as an initialized extent */
+ return (le16_to_cpu(ext->ee_len) > 0x8000);
}

static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
{
- return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+ return (le16_to_cpu(ext->ee_len) <= 0x8000 ? le16_to_cpu(ext->ee_len) :
+ (le16_to_cpu(ext->ee_len) - 0x8000));
}

extern int ext4_extent_tree_init(handle_t *, struct inode *);

2007-07-10 21:37:01

by Heikki Orsila

[permalink] [raw]
Subject: Re: [PATCH 1/7] manpage for fallocate

On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote:
> .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);

Correction: "int syscall(int fd, int mode, ...)",

> .SH "ERRORS"
> .TP
> .B EBADF
> .I fd
> is not a valid file descriptor, or is not opened for writing.
> .TP
> .B EFBIG
> .I offset+len
> exceeds the maximum file size.
> .TP
> .B EINVAL
> .I offset
> or
> .I len
> was less than 0.
> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd.
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe of file descriptor.
> .B ENOSYS
> The filesystem underlying the file descriptor does not support this
> operation.

EINTR?

--
Heikki Orsila Barbie's law:
[email protected] "Math is hard, let's go shopping!"
http://www.iki.fi/shd

2007-07-11 01:34:44

by Barry Naujok

[permalink] [raw]
Subject: Re: [PATCH 1/7] manpage for fallocate

On Wed, 11 Jul 2007 06:18:20 +1000, Amit K. Arora
<[email protected]> wrote:

> Following is the modified version of the manpage originally submitted by
> David Chinner. Please use `nroff -man fallocate.2 | less` to view.

A few more touch-ups attached.

Regards,
Barry.


Attachments:
fallocate.2 (2.76 kB)

2007-07-11 02:10:34

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[email protected]> wrote:
>
> --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd,
> return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
> len, advice);
> }
> +
> +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> + unsigned offset_hi, unsigned len_lo,
> + unsigned len_hi)

Please call this compat_sys_fallocate in line with the powerpc version -
it gives us a hint that maybe we should think about how to consolidate
them. I know other stuff in that file is called sys32_ ... but it is time
for a change :-)

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (831.00 B)
(No filename) (189.00 B)
Download all attachments

2007-07-11 05:00:04

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
> On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[email protected]> wrote:
> >
> > --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> > +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> > @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd,
> > return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
> > len, advice);
> > }
> > +
> > +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> > + unsigned offset_hi, unsigned len_lo,
> > + unsigned len_hi)
>
> Please call this compat_sys_fallocate in line with the powerpc version -
> it gives us a hint that maybe we should think about how to consolidate
> them. I know other stuff in that file is called sys32_ ... but it is time
> for a change :-)

Maybe it would be worth to finally consider this:

http://marc.info/?l=linux-kernel&m=118411511620432&w=2

?

2007-07-11 08:11:28

by Amit K. Arora

[permalink] [raw]
Subject: Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
> On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[email protected]> wrote:
> >
> > --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> > +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> > @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd,
> > return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
> > len, advice);
> > }
> > +
> > +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> > + unsigned offset_hi, unsigned len_lo,
> > + unsigned len_hi)
>
> Please call this compat_sys_fallocate in line with the powerpc version -
> it gives us a hint that maybe we should think about how to consolidate
> them. I know other stuff in that file is called sys32_ ... but it is time
> for a change :-)

I think this can be handled as a separate patch once this patchset
is in mainline. Since, anyhow we will need to do this for other sys32_
calls which are already there...

--
Regards,
Amit Arora

2007-07-11 09:12:24

by Amit K. Arora

[permalink] [raw]
Subject: Re: [PATCH 1/7] manpage for fallocate

On Wed, Jul 11, 2007 at 12:37:01AM +0300, Heikki Orsila wrote:
> On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote:
> > .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
>
> Correction: "int syscall(int fd, int mode, ...)",

Here, we have syscall() with first argument being the system call number
- so what you suggested is not correct.

But, yes, the synopsis should change at some time. Maybe to something
like:

#include <fcntl.h>

long fallocate(int fd, int mode, loff_t offset, loff_t len);

> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd.
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .B ENOSYS
> > The filesystem underlying the file descriptor does not support this
> > operation.
>
> EINTR?

Will add following errors:

EINTR A signal was caught during execution
EIO An I/O error occurred while reading from or writing to
a file system.
EOPNOTSUPP The mode is not supported on the file descriptor.

and will update following :

EINVAL offset was less than 0, or len was less than or equal to 0.

--
Regards,
Amit Arora