When the discard granularity of a block device is bigger than the filesystem block size,
fstrim does not effectively release device blocks. During the filesystem's life,
some files are deleted while others remain, and as a result many
device blocks end up only partially used (of course, this is not the only reason,
but the other reasons are not filesystem problems, so they are outside the scope
of this patchset). This results in lost space on thin-provisioned devices.
Consider a filesystem on a block device that is itself provided by another filesystem
(say, a distributed network filesystem). Partially used blocks of the block device
result in bad performance and worse space usage in the underlying filesystem.
Another example is ext4 with 4k blocks on a loop device on top of ext4 with 1m blocks. This
case also results in bad disk space usage.
Choosing a bigger block size is not a solution here, since small files then
take much more disk space than they did before, and the resulting excess
disk usage is the same.
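To make the effect concrete, here is a minimal sketch (not part of the patchset) that counts how many discard granules are completely free, and so can be released by a discard, versus only partially used. The bitmap layout and the names count_granules() and struct granule_stats are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how discard granules are occupied. */
struct granule_stats {
	size_t free;     /* no fs block in use: a discard can release it */
	size_t partial;  /* some fs blocks in use: stays allocated on the device */
	size_t full;     /* every fs block in use */
};

/*
 * used[i] is true when filesystem block i is in use.
 * blocks_per_granule = discard granularity / filesystem block size.
 * A trailing incomplete granule is ignored for simplicity.
 */
static struct granule_stats count_granules(const bool *used, size_t nr_blocks,
					   size_t blocks_per_granule)
{
	struct granule_stats st = { 0, 0, 0 };

	if (blocks_per_granule == 0)
		return st;

	for (size_t g = 0; g + blocks_per_granule <= nr_blocks;
	     g += blocks_per_granule) {
		size_t in_use = 0;

		for (size_t b = 0; b < blocks_per_granule; b++)
			in_use += used[g + b];

		if (in_use == 0)
			st.free++;
		else if (in_use == blocks_per_granule)
			st.full++;
		else
			st.partial++;
	}
	return st;
}
```

With a 4k filesystem block and a 1m granularity, blocks_per_granule is 256, so a single live 4k block keeps the whole 1m granule allocated on the device.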
The proposed solution is defragmentation of files based on knowledge of the block device
discard granularity. Files that have not been modified for a long time,
read-only files, small files, etc., may be placed together in the same block device
block. I.e., some device blocks are compacted, which results
in other device blocks being released.
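One possible way to choose what to compact is sketched below. This is only a greedy illustration under assumed inputs (a per-granule count of used filesystem blocks); plan_move() and struct move_plan are hypothetical names, and the patchset itself does not prescribe any selection policy.

```c
#include <stddef.h>

/* Hypothetical result of one planning step. */
struct move_plan {
	int src;   /* granule to empty, or -1 if no move is possible */
	int dst;   /* granule that receives the blocks, or -1 */
};

/*
 * used[i] = number of filesystem blocks in use in granule i;
 * cap     = number of filesystem blocks per granule.
 * Pick the least-used partially filled granule as the source and the
 * fullest other non-empty granule that still has room for all of the
 * source's blocks as the destination.
 */
static struct move_plan plan_move(const unsigned *used, int nr, unsigned cap)
{
	struct move_plan p = { -1, -1 };
	int src = -1;
	int dst = -1;

	for (int i = 0; i < nr; i++)
		if (used[i] > 0 && used[i] < cap &&
		    (src < 0 || used[i] < used[src]))
			src = i;
	if (src < 0)
		return p;

	for (int i = 0; i < nr; i++) {
		/* Moving into an empty granule would release nothing. */
		if (i == src || used[i] == 0 || used[i] + used[src] > cap)
			continue;
		if (dst < 0 || used[i] > used[dst])
			dst = i;
	}
	if (dst >= 0) {
		p.src = src;
		p.dst = dst;
	}
	return p;
}
```

After the planned move the source granule is empty and can be discarded, while the destination granule becomes fuller.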
The problem is that the current fallocate() does not allow an effective
implementation of such defragmentation. The description below covers ext4,
but this should apply to all filesystems.
fallocate() goes through the standard block allocator, which tries to behave very
well for real-life allocation cases: block placement and future file size
prediction, delayed block allocation, etc. But it is almost impossible
to allocate blocks from a specified place for our specific case. The only
ext4 block allocator behavior we can use is that the allocator first
tries to allocate blocks from the same block group that the inode belongs to.
But this is not enough for effective file compaction.
This patchset implements an extension of fallocate():
fallocate2(int fd, int mode, loff_t offset, loff_t len,
unsigned long long physical)
The new argument is @physical, the offset from the start of the device, which is mandatory
for the block allocation. If the [@physical, @physical + len] block range
is available for allocation, the syscall assigns the corresponding extent/extents
to the inode. If the range or part of it is occupied, the syscall
returns an error (a smaller range may have been allocated; the behavior
is the same as when fallocate() runs out of space in the middle).
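From userspace the call could be wrapped roughly as follows. This is a sketch under several assumptions: the syscall number 486 is the one this series adds to the x86 tables and is not assigned in mainline, the wrapper assumes a 64-bit ABI, and place_range() with its block-size alignment check is a hypothetical helper, not something the patchset requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Syscall number used by this patchset for x86; NOT a mainline number. */
#define NR_fallocate2_PROPOSED 486

/* Sentinel the series uses for "no physical placement requested". */
#define FALLOCATE2_NO_PHYSICAL UINT64_MAX

/* Hypothetical wrapper; assumes loff_t fits in one register (64-bit ABI). */
static int fallocate2(int fd, int mode, off_t offset, off_t len,
		      uint64_t physical)
{
	return (int)syscall(NR_fallocate2_PROPOSED, fd, mode, offset, len,
			    physical);
}

/*
 * Ask for [offset, offset + len) of the file to be backed by device
 * bytes starting at @physical. The alignment check against @blocksize
 * is an assumption of this sketch (filesystems allocate whole blocks).
 * Returns 0 on success, -1 with errno set otherwise; errno is ENOSYS
 * on kernels without the patchset.
 */
static int place_range(int fd, off_t offset, off_t len, uint64_t physical,
		       uint64_t blocksize)
{
	if (blocksize == 0 || offset < 0 || len <= 0 ||
	    (uint64_t)offset % blocksize || (uint64_t)len % blocksize ||
	    physical % blocksize) {
		errno = EINVAL;
		return -1;
	}
	return fallocate2(fd, 0, offset, len, physical);
}
```

A defragmenter would call place_range() on a donor or target file and fall back to another candidate location when the call fails because the range is occupied.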
This interface makes it possible to solve the problem formulated above. Also note that this
interface may allow the existing e4defrag algorithm to be improved: it could reduce
the number of file extents more effectively.
[1-2/5] are refactoring.
[3/5] adds fallocate2() body.
[4/5] prepares ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
[5/5] adds fallocate2() support for ext4
Any comments are welcome!
---
Kirill Tkhai (5):
fs: Add new argument to file_operations::fallocate()
fs: Add new argument to vfs_fallocate()
fs: Add fallocate2() syscall
ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
ext4: Add fallocate2() support
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/sys_ia32.c | 10 +++++++
drivers/block/loop.c | 2 +
drivers/nvme/target/io-cmd-file.c | 4 +--
drivers/staging/android/ashmem.c | 2 +
drivers/target/target_core_file.c | 2 +
fs/block_dev.c | 4 +--
fs/btrfs/file.c | 4 ++-
fs/ceph/file.c | 5 +++-
fs/cifs/cifsfs.c | 7 +++--
fs/cifs/smb2ops.c | 5 +++-
fs/ext4/ext4.h | 5 +++-
fs/ext4/extents.c | 35 ++++++++++++++++++++-----
fs/ext4/inode.c | 14 ++++++++++
fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++-------
fs/f2fs/file.c | 4 ++-
fs/fat/file.c | 7 ++++-
fs/fuse/file.c | 5 +++-
fs/gfs2/file.c | 5 +++-
fs/hugetlbfs/inode.c | 5 +++-
fs/io_uring.c | 2 +
fs/ioctl.c | 5 ++--
fs/nfs/nfs4file.c | 6 ++++
fs/nfsd/vfs.c | 2 +
fs/ocfs2/file.c | 4 ++-
fs/open.c | 21 +++++++++++----
fs/overlayfs/file.c | 8 ++++--
fs/xfs/xfs_file.c | 5 +++-
include/linux/fs.h | 4 +--
include/linux/syscalls.h | 8 +++++-
ipc/shm.c | 6 ++--
mm/madvise.c | 2 +
mm/shmem.c | 4 ++-
34 files changed, 190 insertions(+), 59 deletions(-)
--
Signed-off-by: Kirill Tkhai <[email protected]>
After this patch the prototype will look as follows:
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len, u64 physical);
@physical is the new argument. This patch contains no
functional changes; the argument will be used in later patches.
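The diff below passes (u64)-1 at every existing call site and makes each handler reject any other value. That convention can be illustrated in userspace roughly like this; mock_fallocate() and legacy_caller() are hypothetical stand-ins, not kernel functions.

```c
#include <errno.h>
#include <stdint.h>

/* Same sentinel as (u64)-1 in the diff: no physical placement requested. */
#define NO_PHYSICAL ((uint64_t)-1)

/* Userspace stand-in for a ->fallocate() handler without placement support. */
static long mock_fallocate(int mode, int64_t offset, int64_t len,
			   uint64_t physical)
{
	(void)mode;
	(void)offset;
	(void)len;

	if (physical != NO_PHYSICAL)
		return -EOPNOTSUPP;	/* placement requested but not implemented */

	return 0;			/* pretend the ordinary allocation succeeded */
}

/* Existing callers keep their behavior by always passing the sentinel. */
static long legacy_caller(int mode, int64_t offset, int64_t len)
{
	return mock_fallocate(mode, offset, len, NO_PHYSICAL);
}
```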
Signed-off-by: Kirill Tkhai <[email protected]>
---
drivers/block/loop.c | 2 +-
drivers/staging/android/ashmem.c | 2 +-
drivers/target/target_core_file.c | 2 +-
fs/block_dev.c | 4 ++--
fs/btrfs/file.c | 4 +++-
fs/ceph/file.c | 5 ++++-
fs/cifs/cifsfs.c | 7 ++++---
fs/cifs/smb2ops.c | 5 ++++-
fs/ext4/ext4.h | 2 +-
fs/ext4/extents.c | 6 +++++-
fs/f2fs/file.c | 4 +++-
fs/fat/file.c | 7 +++++--
fs/fuse/file.c | 5 ++++-
fs/gfs2/file.c | 5 ++++-
fs/hugetlbfs/inode.c | 5 ++++-
fs/nfs/nfs4file.c | 6 +++++-
fs/ocfs2/file.c | 4 +++-
fs/open.c | 2 +-
fs/overlayfs/file.c | 6 +++++-
fs/xfs/xfs_file.c | 5 ++++-
include/linux/fs.h | 2 +-
ipc/shm.c | 6 +++---
mm/shmem.c | 4 +++-
23 files changed, 71 insertions(+), 29 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da8ec0b9d909..6416111a2ae1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -438,7 +438,7 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
goto out;
}
- ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
+ ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq), (u64)-1);
if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
ret = -EIO;
out:
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 8044510d8ec6..ea05ff484ebe 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -489,7 +489,7 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
mutex_unlock(&ashmem_mutex);
f->f_op->fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- start, end - start);
+ start, end - start, (u64)-1);
fput(f);
if (atomic_dec_and_test(&ashmem_shrink_inflight))
wake_up_all(&ashmem_shrink_wait);
diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7143d03f0e02..feafb731bbd9 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -581,7 +581,7 @@ fd_execute_unmap(struct se_cmd *cmd, sector_t lba, sector_t nolb)
if (!file->f_op->fallocate)
return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
- ret = file->f_op->fallocate(file, mode, pos, len);
+ ret = file->f_op->fallocate(file, mode, pos, len, (u64)-1);
if (ret < 0) {
pr_warn("FILEIO: fallocate() failed: %d\n", ret);
return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..d356f7d7f666 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2078,7 +2078,7 @@ static const struct address_space_operations def_blk_aops = {
FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
static long blkdev_fallocate(struct file *file, int mode, loff_t start,
- loff_t len)
+ loff_t len, u64 physical)
{
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
struct address_space *mapping;
@@ -2087,7 +2087,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
int error;
/* Fail if we don't recognize the flags. */
- if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+ if ((mode & ~BLKDEV_FALLOC_FL_SUPPORTED) || physical != (u64)-1)
return -EOPNOTSUPP;
/* Don't go off the end of the device. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6f6f1805e6fd..5d80da6d14eb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3174,7 +3174,7 @@ static int btrfs_zero_range(struct inode *inode,
}
static long btrfs_fallocate(struct file *file, int mode,
- loff_t offset, loff_t len)
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct extent_state *cached_state = NULL;
@@ -3201,6 +3201,8 @@ static long btrfs_fallocate(struct file *file, int mode,
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
if (mode & FALLOC_FL_PUNCH_HOLE)
return btrfs_punch_hole(inode, offset, len);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7e0190b1f821..948694b478a4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1775,7 +1775,7 @@ static int ceph_zero_objects(struct inode *inode, loff_t offset, loff_t length)
}
static long ceph_fallocate(struct file *file, int mode,
- loff_t offset, loff_t length)
+ loff_t offset, loff_t length, u64 physical)
{
struct ceph_file_info *fi = file->private_data;
struct inode *inode = file_inode(file);
@@ -1790,6 +1790,9 @@ static long ceph_fallocate(struct file *file, int mode,
if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
if (!S_ISREG(inode->i_mode))
return -EOPNOTSUPP;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index fa77fe5258b0..ddf7888798af 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -281,14 +281,15 @@ cifs_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}
-static long cifs_fallocate(struct file *file, int mode, loff_t off, loff_t len)
+static long cifs_fallocate(struct file *file, int mode,
+ loff_t off, loff_t len, u64 physical)
{
struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file);
struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
struct TCP_Server_Info *server = tcon->ses->server;
- if (server->ops->fallocate)
- return server->ops->fallocate(file, tcon, mode, off, len);
+ if (server->ops->fallocate && physical == (u64)-1)
+ return server->ops->fallocate(file, tcon, mode, off, len, (u64)-1);
return -EOPNOTSUPP;
}
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 5fa34225a99b..30cb1b911ebf 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -3460,8 +3460,11 @@ static int smb3_fiemap(struct cifs_tcon *tcon,
}
static long smb3_fallocate(struct file *file, struct cifs_tcon *tcon, int mode,
- loff_t off, loff_t len)
+ loff_t off, loff_t len, u64 physical)
{
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
/* KEEP_SIZE already checked for by do_fallocate */
if (mode & FALLOC_FL_PUNCH_HOLE)
return smb3_punch_hole(file, tcon, off, len);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a052052..5a98081c5369 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3347,7 +3347,7 @@ extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
extern void ext4_ext_init(struct super_block *);
extern void ext4_ext_release(struct super_block *);
extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
- loff_t len);
+ loff_t len, u64 physical);
extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
loff_t offset, ssize_t len);
extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 954013d6076b..10d0188a712d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4835,7 +4835,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
* of writing zeroes to the required new blocks (the same behavior which is
* expected for file systems which do not support fallocate() system call).
*/
-long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+long ext4_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
loff_t new_size = 0;
@@ -4861,6 +4862,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
FALLOC_FL_INSERT_RANGE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
if (mode & FALLOC_FL_PUNCH_HOLE)
return ext4_punch_hole(inode, offset, len);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0d4da644df3b..2dfd886a2e75 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1685,7 +1685,7 @@ static int expand_inode_data(struct inode *inode, loff_t offset,
}
static long f2fs_fallocate(struct file *file, int mode,
- loff_t offset, loff_t len)
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
long ret = 0;
@@ -1696,6 +1696,8 @@ static long f2fs_fallocate(struct file *file, int mode,
return -ENOSPC;
if (!f2fs_is_compress_backend_ready(inode))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
/* f2fs only support ->fallocate for regular file */
if (!S_ISREG(inode->i_mode))
diff --git a/fs/fat/file.c b/fs/fat/file.c
index bdc4503c00a3..4febd1e4f5af 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -19,7 +19,7 @@
#include "fat.h"
static long fat_fallocate(struct file *file, int mode,
- loff_t offset, loff_t len);
+ loff_t offset, loff_t len, u64 physical);
static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
{
@@ -257,7 +257,7 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
* allocate and zero out clusters via an expanding truncate.
*/
static long fat_fallocate(struct file *file, int mode,
- loff_t offset, loff_t len)
+ loff_t offset, loff_t len, u64 physical)
{
int nr_cluster; /* Number of clusters to be allocated */
loff_t mm_bytes; /* Number of bytes to be allocated for file */
@@ -271,6 +271,9 @@ static long fat_fallocate(struct file *file, int mode,
if (mode & ~FALLOC_FL_KEEP_SIZE)
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
/* No support for dir */
if (!S_ISREG(inode->i_mode))
return -EOPNOTSUPP;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..5981ad057b7c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3166,7 +3166,7 @@ static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
}
static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
- loff_t length)
+ loff_t length, u64 physical)
{
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
@@ -3186,6 +3186,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
if (fc->no_fallocate)
return -EOPNOTSUPP;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index cb26be6f4351..40f958ea0fde 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1114,7 +1114,8 @@ static long __gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t
return error;
}
-static long gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+static long gfs2_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -1127,6 +1128,8 @@ static long gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t le
/* fallocate is needed by gfs2_grow to reserve space in the rindex */
if (gfs2_is_jdata(ip) && inode != sdp->sd_rindex)
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
inode_lock(inode);
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index aff8642f0c2e..98d9af6529fa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -563,7 +563,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
}
static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
- loff_t len)
+ loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
@@ -580,6 +580,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
if (mode & FALLOC_FL_PUNCH_HOLE)
return hugetlbfs_punch_hole(inode, offset, len);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 1297919e0fce..51061872e9fc 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -214,7 +214,8 @@ static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
}
}
-static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t len)
+static long nfs42_fallocate(struct file *filep, int mode,
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(filep);
long ret;
@@ -225,6 +226,9 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
if ((mode != 0) && (mode != (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
ret = inode_newsize_ok(inode, offset + len);
if (ret < 0)
return ret;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6cd5e4924e4d..a749ff71b8e4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2022,7 +2022,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
}
static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
- loff_t len)
+ loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
@@ -2032,6 +2032,8 @@ static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
if (!ocfs2_writes_unwritten_extents(osb))
return -EOPNOTSUPP;
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..73f27c9b518c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -306,7 +306,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;
file_start_write(file);
- ret = file->f_op->fallocate(file, mode, offset, len);
+ ret = file->f_op->fallocate(file, mode, offset, len, (u64)-1);
/*
* Create inotify and fanotify events.
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index a5317216de73..abe34162d9d4 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -460,13 +460,17 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
return ret;
}
-static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+static long ovl_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct fd real;
const struct cred *old_cred;
int ret;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
+
ret = ovl_real_fdget(file, &real);
if (ret)
return ret;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8a4a3f29b36..61ca96469fa0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -802,7 +802,8 @@ xfs_file_fallocate(
struct file *file,
int mode,
loff_t offset,
- loff_t len)
+ loff_t len,
+ u64 physical)
{
struct inode *inode = file_inode(file);
struct xfs_inode *ip = XFS_I(inode);
@@ -816,6 +817,8 @@ xfs_file_fallocate(
return -EINVAL;
if (mode & ~XFS_FALLOC_FL_SUPPORTED)
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
xfs_ilock(ip, iolock);
error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f814ccd8d929..17c111e164d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1846,7 +1846,7 @@ struct file_operations {
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
- loff_t len);
+ loff_t len, u64 physical);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
diff --git a/ipc/shm.c b/ipc/shm.c
index ce1ca9f7c6e9..3ab15a1c2d91 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -532,13 +532,13 @@ static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
}
static long shm_fallocate(struct file *file, int mode, loff_t offset,
- loff_t len)
+ loff_t len, u64 physical)
{
struct shm_file_data *sfd = shm_file_data(file);
- if (!sfd->file->f_op->fallocate)
+ if (!sfd->file->f_op->fallocate || physical != (u64)-1)
return -EOPNOTSUPP;
- return sfd->file->f_op->fallocate(file, mode, offset, len);
+ return sfd->file->f_op->fallocate(file, mode, offset, len, (u64)-1);
}
static unsigned long shm_get_unmapped_area(struct file *file,
diff --git a/mm/shmem.c b/mm/shmem.c
index 31b4bcc95f17..a07afc5b06d0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2724,7 +2724,7 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
}
static long shmem_fallocate(struct file *file, int mode, loff_t offset,
- loff_t len)
+ loff_t len, u64 physical)
{
struct inode *inode = file_inode(file);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
@@ -2735,6 +2735,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
+ if (physical != (u64)-1)
+ return -EOPNOTSUPP;
inode_lock(inode);
This introduces a new syscall and propagates @physical there.
Also, architecture-dependent definitions for x86 are added.
Signed-off-by: Kirill Tkhai <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/sys_ia32.c | 10 ++++++++++
fs/open.c | 16 +++++++++++++---
include/linux/syscalls.h | 8 +++++++-
5 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c17cb77eb150..62b3692df584 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
435 i386 clone3 sys_clone3 __ia32_sys_clone3
437 i386 openat2 sys_openat2 __ia32_sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd
+486 i386 fallocate2 sys_fallocate2 __ia32_compat_sys_x86_fallocate2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..b106a39509ee 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
435 common clone3 __x64_sys_clone3/ptregs
437 common openat2 __x64_sys_openat2
438 common pidfd_getfd __x64_sys_pidfd_getfd
+486 common fallocate2 __x64_sys_fallocate2
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index 21790307121e..1757bfe1a19c 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -230,6 +230,16 @@ COMPAT_SYSCALL_DEFINE6(x86_fallocate, int, fd, int, mode,
((u64)len_hi << 32) | len_lo);
}
+COMPAT_SYSCALL_DEFINE6(x86_fallocate2, int, fd, int, mode,
+ unsigned int, offset_lo, unsigned int, offset_hi,
+ unsigned int, len_lo, unsigned int, len_hi,
+ unsigned int physical_lo, unsigned int physical_hi)
+{
+ return ksys_fallocate2(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+ ((u64)len_hi << 32) | len_lo,
+ ((u64)physical_hi << 32) | physical_lo);
+}
+
/*
* The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS
*/
diff --git a/fs/open.c b/fs/open.c
index 596fd3dc3988..1b964a37ecc2 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -290,6 +290,10 @@ int vfs_fallocate(struct file *file, int mode,
if (ret)
return ret;
+ if (physical != (u64)-1 &&
+ !ns_capable(inode->i_sb->s_user_ns, CAP_FOWNER))
+ return -EPERM;
+
if (S_ISFIFO(inode->i_mode))
return -ESPIPE;
@@ -324,13 +328,13 @@ int vfs_fallocate(struct file *file, int mode,
}
EXPORT_SYMBOL_GPL(vfs_fallocate);
-int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+int ksys_fallocate2(int fd, int mode, loff_t offset, loff_t len, u64 physical)
{
struct fd f = fdget(fd);
int error = -EBADF;
if (f.file) {
- error = vfs_fallocate(f.file, mode, offset, len, (u64)-1);
+ error = vfs_fallocate(f.file, mode, offset, len, physical);
fdput(f);
}
return error;
@@ -338,7 +342,13 @@ int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
SYSCALL_DEFINE4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
{
- return ksys_fallocate(fd, mode, offset, len);
+ return ksys_fallocate2(fd, mode, offset, len, (u64)-1);
+}
+
+SYSCALL_DEFINE5(fallocate2, int, fd, int, mode, loff_t, offset, loff_t, len,
+ unsigned long long, physical)
+{
+ return ksys_fallocate2(fd, mode, offset, len, physical);
}
/*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..1999493b03e9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -427,6 +427,8 @@ asmlinkage long sys_truncate64(const char __user *path, loff_t length);
asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
#endif
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_fallocate2(int fd, int mode, loff_t offset, loff_t len,
+ unsigned long long physical);
asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode);
asmlinkage long sys_chdir(const char __user *filename);
asmlinkage long sys_fchdir(unsigned int fd);
@@ -1255,7 +1257,11 @@ ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
loff_t pos);
ssize_t ksys_pwrite64(unsigned int fd, const char __user *buf,
size_t count, loff_t pos);
-int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+int ksys_fallocate2(int fd, int mode, loff_t offset, loff_t len, u64 physical);
+static inline int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+ return ksys_fallocate2(fd, mode, offset, len, (u64)-1);
+}
#ifdef CONFIG_ADVISE_SYSCALLS
int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
#else
EXT4_MB_HINT_GOAL_ONLY is currently unused. This patch teaches
ext4_mb_discard_preallocations() to discard only the preallocated
range that contains the specified block when the flag is set;
other preallocated ranges are then left alone.
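The filtering rule amounts to a half-open interval check. A minimal userspace sketch, using a hypothetical reduced struct prealloc in place of the real ext4 preallocation structure:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stands in for (ext4_fsblk_t)-1: no specific goal block. */
#define NO_GOAL UINT64_MAX

/* Hypothetical reduced view of a preallocation: [pstart, pstart + len). */
struct prealloc {
	uint64_t pstart;
	uint64_t len;
};

/* True when the preallocation must be skipped, i.e. not discarded. */
static bool skip_prealloc(const struct prealloc *pa, uint64_t goal)
{
	if (goal == NO_GOAL)
		return false;	/* no goal: every range is a candidate */

	/* Skip ranges that do not contain the goal block. */
	return goal < pa->pstart || goal >= pa->pstart + pa->len;
}
```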
Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/ext4/mballoc.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 51a78eb65f3c..b1b3c5526d1a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3894,8 +3894,8 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
* 1) how many requested
*/
static noinline_for_stack int
-ext4_mb_discard_group_preallocations(struct super_block *sb,
- ext4_group_t group, int needed)
+ext4_mb_discard_group_preallocations(struct super_block *sb, ext4_group_t group,
+ int needed, ext4_fsblk_t goal)
{
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
struct buffer_head *bitmap_bh = NULL;
@@ -3947,6 +3947,12 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
continue;
}
+ if (goal != (ext4_fsblk_t)-1 &&
+ (goal < pa->pa_pstart || goal >= pa->pa_pstart + pa->pa_len)) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+
/* seems this one can be freed ... */
pa->pa_deleted = 1;
@@ -4462,15 +4468,23 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
return 0;
}
-static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
+static int ext4_mb_discard_preallocations(struct super_block *sb,
+ struct ext4_allocation_context *ac)
{
- ext4_group_t i, ngroups = ext4_get_groups_count(sb);
+ ext4_group_t i = 0, ngroups = ext4_get_groups_count(sb);
+ int needed = ac->ac_o_ex.fe_len;
+ ext4_fsblk_t goal = (ext4_fsblk_t)-1;
int ret;
int freed = 0;
+ if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) {
+ i = ac->ac_o_ex.fe_group;
+ ngroups = i + 1;
+ goal = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
+ }
trace_ext4_mb_discard_preallocations(sb, needed);
- for (i = 0; i < ngroups && needed > 0; i++) {
- ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+ for (; i < ngroups && needed > 0; i++) {
+ ret = ext4_mb_discard_group_preallocations(sb, i, needed, goal);
freed += ret;
needed -= ret;
}
@@ -4585,7 +4599,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
- freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+ freed = ext4_mb_discard_preallocations(sb, ac);
if (freed)
goto repeat;
*errp = -ENOSPC;
> fallocate() goes through the standard block allocator, which tries to behave very
> well for real-life allocation cases: block placement and future file size
> prediction, delayed block allocation, etc. But it is almost impossible
> to allocate blocks from a specified place for our specific case. The only
> ext4 block allocator behavior we can use is that the allocator first
> tries to allocate blocks from the same block group that the inode belongs to.
> But this is not enough for effective file compaction.
>
> This patchset implements an extension of fallocate():
>
> fallocate2(int fd, int mode, loff_t offset, loff_t len,
> unsigned long long physical)
>
> The new argument is @physical, the offset from the start of the device, which is mandatory
> for the block allocation. If the [@physical, @physical + len] block range
> is available for allocation, the syscall assigns the corresponding extent/extents
> to the inode. If the range or part of it is occupied, the syscall
> returns an error (a smaller range may have been allocated; the behavior
> is the same as when fallocate() runs out of space in the middle).
Doesn't this interface kill the whole philosophy of letting the filesystem
decide which blocks it sees as the best fit for allocation? IMHO, having the user
pass the actual physical location from which the FS should allocate
does not sound like a good interface.
-ritesh
Hi, Ted,
On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
> Kirill,
>
> In a couple of your comments on this patch series, you mentioned
> "defragmentation". Is that because you're trying to use this as part
> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>
> If that's the case, you should note that input parameter for that
> ioctl is:
>
> struct move_extent {
> __u32 reserved; /* should be zero */
> __u32 donor_fd; /* donor file descriptor */
> __u64 orig_start; /* logical start offset in block for orig */
> __u64 donor_start; /* logical start offset in block for donor */
> __u64 len; /* block length to be moved */
> __u64 moved_len; /* moved block length */
> };
>
> Note that the donor_start is separate from the start of the file that
> is being defragged. So you could have the userspace application
> fallocate a large chunk of space for that donor file, and then use
> that donor file to defrag multiple files if you want to close pack
> them.
Practice shows it's not so. Your suggestion was the first thing we tried,
but it works badly and just doubles or triples the IO.
Say we have two 512Kb files, and they are placed in separate 1Mb clusters:
[[512Kb file][512Kb free]][[512Kb file][512Kb free]]
We want to pack both files into the same 1Mb cluster. Packed together on the block device,
they will be on the same server of the underlying distributed storage file system.
This gives a big performance improvement, and this is the prize I am aiming for.
If I fallocate a large chunk for both of them, I have to move them
both to this new chunk. So, instead of moving 512Kb of data, we will have to move
1Mb of data, i.e. double the size, which is counterproductive.
Imagine another situation, when we have
[[1020Kb file][4Kb free]][[4Kb file][1020Kb free]]
Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
us to move 1Mb instead of 4Kb, which makes the IO 256 times worse! This is terrible!
And this is what I am trying to prevent by finding a suitable new interface.
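The arithmetic for the two layouts above can be checked with a small sketch. The helper names are hypothetical; sizes are in Kb, and only the two-file case from the examples is modelled.

```c
#include <stdint.h>

/*
 * Donor-file approach: both files are copied into a freshly
 * fallocate()d chunk, so the data of both files is moved.
 */
static uint64_t donor_moved_kb(uint64_t file_a_kb, uint64_t file_b_kb)
{
	return file_a_kb + file_b_kb;
}

/*
 * Targeted placement: only the smaller file is moved into the free
 * space next to the larger one (assumes that free space is big enough,
 * as in both examples).
 */
static uint64_t targeted_moved_kb(uint64_t file_a_kb, uint64_t file_b_kb)
{
	return file_a_kb < file_b_kb ? file_a_kb : file_b_kb;
}
```

For the first layout this gives 1024Kb against 512Kb (a factor of 2), and for the second 1024Kb against 4Kb (a factor of 256).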
> Many years ago, back when LSF/MM colocated with a larger
> storage-focused conference so we could manage to organize an ext4
> developer's workshop, we had talked about ways we could create kernel
> support for a more powerful userspace defragger, which could also
> defragment the free space, so that future block allocations were more
> likely to be successful.
>
> The discussions surrounded interfaces where userspace could block (or
> at least strongly dissuade unless the only other alternative was
> returning ENOSPC) the kernel from allocating out of a certain number
> of block groups. And then also to have an interface where for a
> particular process (namely, the defragger), to make the kernel
> strongly prefer that allocations come out of an ordered list of block
> groups.
>
> (Of course these days, now that the cool kids are all embracing eBPF,
> one could imagine a privileged interface where the defragger could
> install some kind of eBPF program which provided enhanced policy to
> ext4's block allocator.)
>
> No one ever really followed through with this, in part because the
> details of allowing userspace (and it would have to be privileged
> userspace) to dictate policy to the block allocator has all sorts of
> potential pitfalls, and in part because no company was really
> interested in funding the engineering work. In addition, I'll note
> that in the Windows world, the need and interest for defragging has gone
> down significantly with the advent of more sophisticated file systems
> like NTFSv5, which doesn't need defragging nearly as often as say, the
> FAT file system. And I think if anything, the interest in doing work
> with e4defrag has decreased even more over the years.
>
> That being said, there has been some interest in making changes to
> both the block allocator and some kind of on-line defrag which is
> optimized for low-end flash (such as the kind found in android
> handsets). There, the need to be careful that we don't end up
> increasing the write wearout becomes even more critical, although the
> GC work which f2fs does involves extra moving around of data blocks,
> and phones have seemed to do fine. Of course, the typical phone only
> has to last 2-3 years before the battery dies, the screen gets
> cracked, and/or the owner decides they want the latest cool toy from
> the phone manufacturers. :-)
>
> In any case, if your goal is really some interface to support on-line
> defragmentation for ext4, you want to consider whether the
> EXT4_IOC_MOVE_EXT interface is sufficiently powerful such that you
> don't really need to mess around with new block allocation hints.
It's powerful, but it does not allow me to build an effective defragmentation
tool for my use case. See the examples above. I do not want to replace
EXT4_IOC_MOVE_EXT. I just want an interface that lets me allocate
space close to some existing file and reduce IO at defragmentation time.
This is the only thing I need from this patchset.
I can't climb into the maintainers' heads and find the thing that will suit
you. I made my attempt and suggested an interface. If it's not OK
for you, could you please suggest another one that will work for my use case?
The thesis "EXT4_IOC_MOVE_EXT is enough for everything" does not work for me :(
Are you OK with the interface suggested by Andreas?
Thanks,
Kirill
On Tue, Mar 03, 2020 at 12:57:15PM +0300, Kirill Tkhai wrote:
> Practice shows otherwise. Your suggestion was the first thing we tried,
> but it works badly and just doubles or triples the IO.
>
> Suppose we have two 512Kb files placed in separate 1Mb clusters:
>
> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>
> We want to pack both files into the same 1Mb cluster. Packed together on the block device,
> they will live on the same server of the underlying distributed storage file system.
> This gives a big performance improvement, and this is the prize I am aiming for.
>
> If I fallocate a large hunk for both of them, I have to move them
> both to this new hunk. So, instead of moving 512Kb of data, we have to move
> 1Mb of data, i.e. double the size, which is counterproductive.
>
> Imagine another situation, where we have
> [[1020Kb file][4Kb free]][[4Kb file][1020Kb free]]
>
> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
> us to move 1Mb instead of 4Kb, which makes the IO 256 times worse! This is terrible!
> And this is what I am trying to prevent by finding a suitable new interface.
OK, so you aren't trying to *defragment*. You want to have files
placed "properly" ab initio.
It sounds like what you *think* is the best way to go is to simply
have files packed tightly together. So effectively what you want as a
block allocation strategy is something which just finds the next free
space big enough for the requested fallocate space, and just plops it
down right there.
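As a rough sketch of the strategy described above (not ext4's actual allocator, and not anything from the patchset), a first-fit search over a list of free extents could look like the following C. struct free_extent, alloc_first_fit() and total_free() are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one run of free blocks. */
struct free_extent {
	uint64_t start;	/* first free block of the run */
	uint64_t len;	/* number of free blocks in the run */
};

/*
 * First fit: take the request from the first free run that is big
 * enough. Returns 0 and stores the allocated start block in *out_start,
 * or returns -1 when no single run can hold the request.
 */
int alloc_first_fit(struct free_extent *ext, size_t n, uint64_t want,
		    uint64_t *out_start)
{
	for (size_t i = 0; i < n; i++) {
		if (ext[i].len >= want) {
			*out_start = ext[i].start;
			ext[i].start += want;	/* carve from the front */
			ext[i].len -= want;
			return 0;
		}
	}
	return -1;
}

/* Total number of free blocks, regardless of how they are scattered. */
uint64_t total_free(const struct free_extent *ext, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += ext[i].len;
	return sum;
}
```

With free runs of 4, 2 and 8 blocks, a first request for 6 blocks succeeds out of the 8-block run. A second request for 6 blocks then fails even though 8 blocks are still free in total, because they are scattered over runs of 4, 2 and 2.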
OK, so what happens once you've allocated all of the free space, and
the pattern of deletes leaves the file system with a lot of holes?
I could imagine trying to implement this as a mount option which uses
an alternate block allocation strategy, but it's not clear what your
end game is after all of the "easy" spaces have been taken. It's much
like proposals I've seen for a log-structured file system, where the
garbage collector is left as a "we'll get to it later" TODO item. (If
I had a dollar each time I've read a paper proposing a log structured
file system which leaves out the garbage collector as an
implementation detail....)
> It's powerful, but it does not allow me to build an effective defragmentation
> tool for my use case. See the examples above. I do not want to replace
> EXT4_IOC_MOVE_EXT. I just want an interface that lets me allocate
> space close to some existing file and reduce IO at defragmentation time.
> This is the only thing I need from this patchset.
"At defragmentation time"? So you do want to run a defragger?
It might be helpful to see the full design of what you have in mind,
and not just a request for interfaces....
> I can't climb into the maintainers' heads and find the thing that will suit
> you. I made my attempt and suggested an interface. If it's not OK
> for you, could you please suggest another one that will work for my use case?
> The thesis "EXT4_IOC_MOVE_EXT is enough for everything" does not work for me :(
> Are you OK with the interface suggested by Andreas?
Like you, I can't climb into your head and figure out exactly how your
entire system design is going to work. And I'd really rather not
propose or bless an interface until I do, since it may be that it's
better to make some minor changes to your system design, instead of
trying to twist ext4 for your particular use case....
- Ted
On 12.03.2020 03:31, Andreas Dilger wrote:
> On Mar 11, 2020, at 2:29 PM, Kirill Tkhai wrote:
>> On 11.03.2020 22:26, Andreas Dilger wrote:
>>> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai wrote:
>>>>
>>>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>>>> Kirill,
>>>>>
>>>>> In a couple of your comments on this patch series, you mentioned
>>>>> "defragmentation". Is that because you're trying to use this as part
>>>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>>>
>>>>> If that's the case, you should note that input parameter for that
>>>>> ioctl is:
>>>>>
>>>>> struct move_extent {
>>>>> __u32 reserved; /* should be zero */
>>>>> __u32 donor_fd; /* donor file descriptor */
>>>>> __u64 orig_start; /* logical start offset in block for orig */
>>>>> __u64 donor_start; /* logical start offset in block for donor */
>>>>> __u64 len; /* block length to be moved */
>>>>> __u64 moved_len; /* moved block length */
>>>>> };
>>>>>
>>>>> Note that the donor_start is separate from the start of the file that
>>>>> is being defragged. So you could have the userspace application
>>>>> fallocate a large chunk of space for that donor file, and then use
>>>>> that donor file to defrag multiple files if you want to close pack
>>>>> them.
>>>>
>>>> Practice shows otherwise. Your suggestion was the first thing we tried,
>>>> but it works badly and just doubles or triples the IO.
>>>>
>>>> Suppose we have two 512Kb files placed in separate 1Mb clusters:
>>>>
>>>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>>>
>>>> We want to pack both files into the same 1Mb cluster. Packed together on the block
>>>> device, they will live on the same server of the underlying distributed storage file
>>>> system. This gives a big performance improvement, and this is the prize I am aiming for.
>>>>
>>>> If I fallocate a large hunk for both of them, I have to move them
>>>> both to this new hunk. So, instead of moving 512Kb of data, we have to move
>>>> 1Mb of data, i.e. double the size, which is counterproductive.
>>>>
>>>> Imagine another situation, where we have
>>>> [[1020Kb file][4Kb free]][[4Kb file][1020Kb free]]
>>>>
>>>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>>>> forces us to move 1Mb instead of 4Kb, which makes the IO 256 times worse! This is
>>>> terrible! And this is what I am trying to prevent by finding a new interface.
>>>
>>> One idea I had, which may work for your use case, is to run fallocate() on
>>> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
>>> as the donor file for the 1MB+4KB file. The ext4 allocation algorithms should
>>> always give you that 4KB chunk if it is free, and that avoids the need to try
>>> and force the allocator to select that block through some other method.
>>
>> Do you mean the following:
>>
>> 1) fallocate() 4K at the end of the first (*1MB-4KB*) file (==> this increases the file length).
>
> You can use FALLOC_FL_KEEP_SIZE to avoid changing the size of the file.
OK, but there still remains the problem of fallocating a block in front of the target file:
[4KB hole][1MB-4KB file][hole][4KB file]
Is there a high-probability way to make the ext4 allocator return a block before the target file?
>> 2) EXT4_IOC_MOVE_EXT the second (*4KB*) file into that new hunk.
>> 3) truncate 4KB at the end of the first file.
>>
>> If so, this can't be an online defrag, since some process may want to extend
>> the *1MB-4KB* file in between. This will just lead to data corruption.
>
> You previously stated that one of the main reasons to do the defrag is because
> the files are not being modified. It would be possible to detect the case of
> the file being modified by the file version and/or size and/or time change
> before removing the fallocated block.
Yes, files should not be modified in parallel, but there is no 100% guarantee.
Statistically it's almost 100%, but since this is a multi-user system (I defrag
the fs on a VPS), exceptions are possible. And the whole architecture must be
safe in all cases.
File version does not look acceptable for online defrag, since there is no way
to check the version and remove the fallocated block *atomically*.
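For reference, the kind of check being discussed can be sketched in C as a comparison of two stat() snapshots. file_changed() is a hypothetical helper, not an existing interface, and it only looks at size and second-granularity timestamps. The comment marks the window that makes this approach non-atomic.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Compare a snapshot taken before the defrag step with one taken
 * afterwards. Returns true when size, mtime or ctime differ.
 * Timestamps are compared with one-second granularity only, so a
 * write within the same second can go unnoticed.
 */
bool file_changed(const struct stat *before, const struct stat *after)
{
	return before->st_size != after->st_size ||
	       before->st_mtime != after->st_mtime ||
	       before->st_ctime != after->st_ctime;
}

/*
 * Intended use (sketch):
 *
 *   stat(path, &before);
 *   ... fallocate the extra block, move extents ...
 *   stat(path, &after);
 *   if (!file_changed(&before, &after))
 *           ... remove the fallocated block ...
 *
 * Another task can still write between the second stat() and the
 * removal, so the check and the removal are not one atomic step.
 */
```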
>> Another problem is that a power loss between 1 and 3 will result in the file
>> length remaining *1MB* instead of *1MB-4KB*.
>
> With FALLOC_FL_KEEP_SIZE you can just use this file as the donor file to
> allocate the blocks, then migrate it to another file without having written
> anything into it.
If some task decides to write into the fallocated block, there will be data
corruption.