2022-06-16 20:18:49

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 0/8] make statx() return DIO alignment information

This patchset makes the statx() system call return direct I/O (DIO)
alignment information. This allows userspace to easily determine
whether a file supports DIO, and if so with what alignment restrictions.

Patch 1 adds the basic VFS support for STATX_DIOALIGN. Patch 2 wires it
up for all block device files. The remaining patches wire it up for
regular files on ext4 and f2fs. Support for regular files on other
filesystems can be added later.

I've also written a man-pages patch, which I'm sending separately.

Note, f2fs has one corner case where DIO reads are allowed but not DIO
writes. The proposed statx fields can't represent this. My proposal
(patch 6) is to just eliminate this case, as it seems much too weird.
But I'd appreciate any feedback on that part.

This patchset applies to v5.19-rc2.

Changed v2 => v3:
- Dropped the stx_offset_align_optimal field, since its purpose
wasn't clearly distinguished from the existing stx_blksize.

- Renamed STATX_IOALIGN to STATX_DIOALIGN, to reflect the new focus
on DIO only.

- Similarly, renamed stx_{mem,offset}_align_dio to
stx_dio_{mem,offset}_align, to reflect the new focus on DIO only.

- Wired up STATX_DIOALIGN on block device files.

Changed v1 => v2:
- No changes.

Eric Biggers (8):
statx: add direct I/O alignment information
vfs: support STATX_DIOALIGN on block devices
fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
ext4: support STATX_DIOALIGN
f2fs: move f2fs_force_buffered_io() into file.c
f2fs: don't allow DIO reads but not DIO writes
f2fs: simplify f2fs_force_buffered_io()
f2fs: support STATX_DIOALIGN

fs/crypto/inline_crypt.c | 48 ++++++++++++++++++++-------------------
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 10 ++++----
fs/ext4/inode.c | 29 +++++++++++++++++++++++
fs/f2fs/f2fs.h | 45 ------------------------------------
fs/f2fs/file.c | 45 +++++++++++++++++++++++++++++++++++-
fs/stat.c | 37 ++++++++++++++++++++++++++++++
include/linux/fscrypt.h | 7 ++----
include/linux/stat.h | 2 ++
include/uapi/linux/stat.h | 4 +++-
10 files changed, 147 insertions(+), 81 deletions(-)

--
2.36.1

base-commit: b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3


2022-06-16 20:18:49

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 2/8] vfs: support STATX_DIOALIGN on block devices

From: Eric Biggers <[email protected]>

Add support for STATX_DIOALIGN to block devices, so that direct I/O
alignment restrictions are exposed to userspace in a generic way.

Note that this breaks the tradition of stat operating only on the block
device node, not the block device itself. However, it was felt that
doing this is preferable, in order to make the interface useful and
avoid needing separate interfaces for regular files and block devices.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/stat.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index a7930d7444830..c1ce447c1a383 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -5,6 +5,7 @@
* Copyright (C) 1991, 1992 Linus Torvalds
*/

+#include <linux/blkdev.h>
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/errno.h>
@@ -198,6 +199,35 @@ int getname_statx_lookup_flags(int flags)
return lookup_flags;
}

+/* Handle STATX_DIOALIGN for block devices. */
+static inline void handle_bdev_dioalign(struct path *path, u32 request_mask,
+ struct kstat *stat)
+{
+#ifdef CONFIG_BLOCK
+ struct inode *inode;
+ struct block_device *bdev;
+ unsigned int lbs;
+
+ if (likely(!(request_mask & STATX_DIOALIGN)))
+ return;
+
+ inode = d_backing_inode(path->dentry);
+ if (!S_ISBLK(inode->i_mode))
+ return;
+
+ bdev = blkdev_get_no_open(inode->i_rdev);
+ if (!bdev)
+ return;
+
+ lbs = bdev_logical_block_size(bdev);
+ stat->dio_mem_align = lbs;
+ stat->dio_offset_align = lbs;
+ stat->result_mask |= STATX_DIOALIGN;
+
+ blkdev_put_no_open(bdev);
+#endif /* CONFIG_BLOCK */
+}
+
/**
* vfs_statx - Get basic and extra attributes by filename
* @dfd: A file descriptor representing the base dir for a relative filename
@@ -230,11 +260,16 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
goto out;

error = vfs_getattr(&path, stat, request_mask, flags);
+
stat->mnt_id = real_mount(path.mnt)->mnt_id;
stat->result_mask |= STATX_MNT_ID;
+
if (path.mnt->mnt_root == path.dentry)
stat->attributes |= STATX_ATTR_MOUNT_ROOT;
stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
+
+ handle_bdev_dioalign(&path, request_mask, stat);
+
path_put(&path);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
--
2.36.1

2022-06-16 20:18:56

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 1/8] statx: add direct I/O alignment information

From: Eric Biggers <[email protected]>

Traditionally, the conditions for when DIO (direct I/O) is supported
were fairly simple. For both block devices and regular files, DIO had
to be aligned to the logical block size of the block device.

However, due to filesystem features that have been added over time (e.g.
multi-device support, data journalling, inline data, encryption, verity,
compression, checkpoint disabling, log-structured mode), the conditions
for when DIO is allowed on a regular file have gotten increasingly
complex. Whether a particular regular file supports DIO, and with what
alignment, can depend on various file attributes and filesystem mount
options, as well as which block device(s) the file's data is located on.

Moreover, the general rule of DIO needing to be aligned to the block
device's logical block size is being relaxed to allow user buffers (but
not file offsets) aligned to the DMA alignment instead
(https://lore.kernel.org/linux-block/[email protected]/T/#u).

XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
Uplifting this to the VFS is one possibility. However, as discussed
(https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u),
this ioctl is rarely used and not known to be used outside of
XFS-specific code. It was also never intended to indicate when a file
doesn't support DIO at all, nor was it intended for block devices.

Therefore, let's expose this information via statx(). Add the
STATX_DIOALIGN flag and two new statx fields associated with it:

* stx_dio_mem_align: the alignment (in bytes) required for user memory
buffers for DIO, or 0 if DIO is not supported on the file.

* stx_dio_offset_align: the alignment (in bytes) required for file
offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
on the file. This will only be nonzero if stx_dio_mem_align is
nonzero, and vice versa.

Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
in the returned statx struct, then these new fields won't be filled in.
This will happen if the file is neither a regular file nor a block
device, or if the file is a regular file and the filesystem doesn't
support STATX_DIOALIGN. It might also happen if the caller didn't
include STATX_DIOALIGN in the request mask, since statx() isn't required
to return unrequested information.

This commit only adds the VFS-level plumbing for STATX_DIOALIGN. For
regular files, individual filesystems will still need to add code to
support it. For block devices, a separate commit will wire it up too.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/stat.c | 2 ++
include/linux/stat.h | 2 ++
include/uapi/linux/stat.h | 4 +++-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/stat.c b/fs/stat.c
index 9ced8860e0f35..a7930d7444830 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_dev_major = MAJOR(stat->dev);
tmp.stx_dev_minor = MINOR(stat->dev);
tmp.stx_mnt_id = stat->mnt_id;
+ tmp.stx_dio_mem_align = stat->dio_mem_align;
+ tmp.stx_dio_offset_align = stat->dio_offset_align;

return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 7df06931f25d8..ff277ced50e9f 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -50,6 +50,8 @@ struct kstat {
struct timespec64 btime; /* File creation time */
u64 blocks;
u64 mnt_id;
+ u32 dio_mem_align;
+ u32 dio_offset_align;
};

#endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1500a0f58041a..7cab2c65d3d7f 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -124,7 +124,8 @@ struct statx {
__u32 stx_dev_minor;
/* 0x90 */
__u64 stx_mnt_id;
- __u64 __spare2;
+ __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
+ __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
/* 0xa0 */
__u64 __spare3[12]; /* Spare space for future expansion */
/* 0x100 */
@@ -152,6 +153,7 @@ struct statx {
#define STATX_BASIC_STATS 0x000007ffU /* The stuff in the normal stat struct */
#define STATX_BTIME 0x00000800U /* Want/got stx_btime */
#define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
+#define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */

#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */

--
2.36.1

2022-06-16 20:18:59

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 4/8] ext4: support STATX_DIOALIGN

From: Eric Biggers <[email protected]>

Add support for STATX_DIOALIGN to ext4, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 15 ++++-----------
fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++
3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 75b8d81b24692..68e964394e917 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2968,6 +2968,7 @@ extern struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct user_namespace *, struct dentry *,
struct iattr *);
+extern u32 ext4_dio_alignment(struct inode *inode);
extern int ext4_getattr(struct user_namespace *, const struct path *,
struct kstat *, u32, unsigned int);
extern void ext4_evict_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 26d7426208970..ecede6b01b7f0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -39,19 +39,12 @@
static bool ext4_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
{
struct inode *inode = file_inode(iocb->ki_filp);
+ u32 dio_align = ext4_dio_alignment(inode);

- if (IS_ENCRYPTED(inode)) {
- if (!fscrypt_dio_supported(inode))
- return false;
- if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter),
- i_blocksize(inode)))
- return false;
- }
- if (fsverity_active(inode))
- return false;
- if (ext4_should_journal_data(inode))
+ if (!dio_align)
return false;
- if (ext4_has_inline_data(inode))
+ if (dio_align > bdev_logical_block_size(inode->i_sb->s_bdev) &&
+ !IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), dio_align))
return false;
return true;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3dce7d058985b..78180c98f3847 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5536,6 +5536,22 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
return error;
}

+u32 ext4_dio_alignment(struct inode *inode)
+{
+ if (fsverity_active(inode))
+ return 0;
+ if (ext4_should_journal_data(inode))
+ return 0;
+ if (ext4_has_inline_data(inode))
+ return 0;
+ if (IS_ENCRYPTED(inode)) {
+ if (!fscrypt_dio_supported(inode))
+ return 0;
+ return i_blocksize(inode);
+ }
+ return bdev_logical_block_size(inode->i_sb->s_bdev);
+}
+
int ext4_getattr(struct user_namespace *mnt_userns, const struct path *path,
struct kstat *stat, u32 request_mask, unsigned int query_flags)
{
@@ -5551,6 +5567,19 @@ int ext4_getattr(struct user_namespace *mnt_userns, const struct path *path,
stat->btime.tv_nsec = ei->i_crtime.tv_nsec;
}

+ /*
+ * Return the DIO alignment restrictions if requested. We only return
+ * this information when requested, since on encrypted files it might
+ * take a fair bit of work to get if the file wasn't opened recently.
+ */
+ if ((request_mask & STATX_DIOALIGN) && S_ISREG(inode->i_mode)) {
+ u32 dio_align = ext4_dio_alignment(inode);
+
+ stat->result_mask |= STATX_DIOALIGN;
+ stat->dio_mem_align = dio_align;
+ stat->dio_offset_align = dio_align;
+ }
+
flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
if (flags & EXT4_APPEND_FL)
stat->attributes |= STATX_ATTR_APPEND;
--
2.36.1

2022-06-16 20:19:04

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 6/8] f2fs: don't allow DIO reads but not DIO writes

From: Eric Biggers <[email protected]>

Currently, if an f2fs filesystem is mounted with the mode=lfs and
io_bits mount options, DIO reads are allowed but DIO writes are not.
Allowing DIO reads but not DIO writes is an unusual restriction, which
is likely to be surprising to applications, namely any application that
both reads and writes from a file (using O_DIRECT). This behavior is
also incompatible with the proposed STATX_DIOALIGN extension to statx.
Given this, let's drop the support for DIO reads in this configuration.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/f2fs/file.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5e5c97fccfb4e..ad0212848a1ab 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -823,7 +823,6 @@ static inline bool f2fs_force_buffered_io(struct inode *inode,
struct kiocb *iocb, struct iov_iter *iter)
{
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
- int rw = iov_iter_rw(iter);

if (!fscrypt_dio_supported(inode))
return true;
@@ -841,7 +840,7 @@ static inline bool f2fs_force_buffered_io(struct inode *inode,
*/
if (f2fs_sb_has_blkzoned(sbi))
return true;
- if (f2fs_lfs_mode(sbi) && (rw == WRITE)) {
+ if (f2fs_lfs_mode(sbi)) {
if (block_unaligned_IO(inode, iocb, iter))
return true;
if (F2FS_IO_ALIGNED(sbi))
--
2.36.1

2022-06-16 20:19:06

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 3/8] fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN

From: Eric Biggers <[email protected]>

To prepare for STATX_DIOALIGN support, make two changes to
fscrypt_dio_supported().

First, remove the filesystem-block-alignment check and make the
filesystems handle it instead. It previously made sense to have it in
fs/crypto/; however, to support STATX_DIOALIGN the alignment restriction
would have to be returned to filesystems. It ends up being simpler if
filesystems handle this part themselves, especially for f2fs which only
allows fs-block-aligned DIO in the first place.

Second, make fscrypt_dio_supported() work on inodes whose encryption key
hasn't been set up yet, by making it set up the key if needed. This is
required for statx(), since statx() doesn't require a file descriptor.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/crypto/inline_crypt.c | 48 +++++++++++++++++++++-------------------
fs/ext4/file.c | 9 ++++++--
fs/f2fs/f2fs.h | 2 +-
include/linux/fscrypt.h | 7 ++----
4 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/fs/crypto/inline_crypt.c b/fs/crypto/inline_crypt.c
index 90f3e68f166e3..388e2330d6263 100644
--- a/fs/crypto/inline_crypt.c
+++ b/fs/crypto/inline_crypt.c
@@ -401,43 +401,45 @@ bool fscrypt_mergeable_bio_bh(struct bio *bio,
EXPORT_SYMBOL_GPL(fscrypt_mergeable_bio_bh);

/**
- * fscrypt_dio_supported() - check whether a DIO (direct I/O) request is
- * supported as far as encryption is concerned
- * @iocb: the file and position the I/O is targeting
- * @iter: the I/O data segment(s)
+ * fscrypt_dio_supported() - check whether DIO (direct I/O) is supported on an
+ * inode, as far as encryption is concerned
+ * @inode: the inode in question
*
* Return: %true if there are no encryption constraints that prevent DIO from
* being supported; %false if DIO is unsupported. (Note that in the
* %true case, the filesystem might have other, non-encryption-related
- * constraints that prevent DIO from actually being supported.)
+ * constraints that prevent DIO from actually being supported. Also, on
+ * encrypted files the filesystem is still responsible for only allowing
+ * DIO when requests are filesystem-block-aligned.)
*/
-bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
+bool fscrypt_dio_supported(struct inode *inode)
{
- const struct inode *inode = file_inode(iocb->ki_filp);
- const unsigned int blocksize = i_blocksize(inode);
+ int err;

/* If the file is unencrypted, no veto from us. */
if (!fscrypt_needs_contents_encryption(inode))
return true;

- /* We only support DIO with inline crypto, not fs-layer crypto. */
- if (!fscrypt_inode_uses_inline_crypto(inode))
- return false;
-
/*
- * Since the granularity of encryption is filesystem blocks, the file
- * position and total I/O length must be aligned to the filesystem block
- * size -- not just to the block device's logical block size as is
- * traditionally the case for DIO on many filesystems.
+ * We only support DIO with inline crypto, not fs-layer crypto.
*
- * We require that the user-provided memory buffers be filesystem block
- * aligned too. It is simpler to have a single alignment value required
- * for all properties of the I/O, as is normally the case for DIO.
- * Also, allowing less aligned buffers would imply that data units could
- * cross bvecs, which would greatly complicate the I/O stack, which
- * assumes that bios can be split at any bvec boundary.
+ * To determine whether the inode is using inline crypto, we have to set
+ * up the key if it wasn't already done. This is because in the current
+ * design of fscrypt, the decision of whether to use inline crypto or
+ * not isn't made until the inode's encryption key is being set up. In
+ * the DIO read/write case, the key will always be set up already, since
+ * the file will be open. But in the case of statx(), the key might not
+ * be set up yet, as the file might not have been opened yet.
*/
- if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), blocksize))
+ err = fscrypt_require_key(inode);
+ if (err) {
+ /*
+ * Key unavailable or couldn't be set up. This edge case isn't
+ * worth worrying about; just report that DIO is unsupported.
+ */
+ return false;
+ }
+ if (!fscrypt_inode_uses_inline_crypto(inode))
return false;

return true;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 109d07629f81f..26d7426208970 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -40,8 +40,13 @@ static bool ext4_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
{
struct inode *inode = file_inode(iocb->ki_filp);

- if (!fscrypt_dio_supported(iocb, iter))
- return false;
+ if (IS_ENCRYPTED(inode)) {
+ if (!fscrypt_dio_supported(inode))
+ return false;
+ if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter),
+ i_blocksize(inode)))
+ return false;
+ }
if (fsverity_active(inode))
return false;
if (ext4_should_journal_data(inode))
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index d9bbecd008d22..7869e749700fc 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -4453,7 +4453,7 @@ static inline bool f2fs_force_buffered_io(struct inode *inode,
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
int rw = iov_iter_rw(iter);

- if (!fscrypt_dio_supported(iocb, iter))
+ if (!fscrypt_dio_supported(inode))
return true;
if (fsverity_active(inode))
return true;
diff --git a/include/linux/fscrypt.h b/include/linux/fscrypt.h
index e60d57c99cb6f..0f9f5ed5b34d3 100644
--- a/include/linux/fscrypt.h
+++ b/include/linux/fscrypt.h
@@ -763,7 +763,7 @@ bool fscrypt_mergeable_bio(struct bio *bio, const struct inode *inode,
bool fscrypt_mergeable_bio_bh(struct bio *bio,
const struct buffer_head *next_bh);

-bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter);
+bool fscrypt_dio_supported(struct inode *inode);

u64 fscrypt_limit_io_blocks(const struct inode *inode, u64 lblk, u64 nr_blocks);

@@ -796,11 +796,8 @@ static inline bool fscrypt_mergeable_bio_bh(struct bio *bio,
return true;
}

-static inline bool fscrypt_dio_supported(struct kiocb *iocb,
- struct iov_iter *iter)
+static inline bool fscrypt_dio_supported(struct inode *inode)
{
- const struct inode *inode = file_inode(iocb->ki_filp);
-
return !fscrypt_needs_contents_encryption(inode);
}

--
2.36.1

2022-06-16 20:19:09

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 5/8] f2fs: move f2fs_force_buffered_io() into file.c

From: Eric Biggers <[email protected]>

f2fs_force_buffered_io() is only used in file.c, so move it into there.
No behavior change. This makes it easier to review later patches.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/f2fs/f2fs.h | 45 ---------------------------------------------
fs/f2fs/file.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 7869e749700fc..d187b7d7ed243 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -4426,17 +4426,6 @@ static inline void f2fs_i_compr_blocks_update(struct inode *inode,
f2fs_mark_inode_dirty_sync(inode, true);
}

-static inline int block_unaligned_IO(struct inode *inode,
- struct kiocb *iocb, struct iov_iter *iter)
-{
- unsigned int i_blkbits = READ_ONCE(inode->i_blkbits);
- unsigned int blocksize_mask = (1 << i_blkbits) - 1;
- loff_t offset = iocb->ki_pos;
- unsigned long align = offset | iov_iter_alignment(iter);
-
- return align & blocksize_mask;
-}
-
static inline bool f2fs_allow_multi_device_dio(struct f2fs_sb_info *sbi,
int flag)
{
@@ -4447,40 +4436,6 @@ static inline bool f2fs_allow_multi_device_dio(struct f2fs_sb_info *sbi,
return sbi->aligned_blksize;
}

-static inline bool f2fs_force_buffered_io(struct inode *inode,
- struct kiocb *iocb, struct iov_iter *iter)
-{
- struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
- int rw = iov_iter_rw(iter);
-
- if (!fscrypt_dio_supported(inode))
- return true;
- if (fsverity_active(inode))
- return true;
- if (f2fs_compressed_file(inode))
- return true;
-
- /* disallow direct IO if any of devices has unaligned blksize */
- if (f2fs_is_multi_device(sbi) && !sbi->aligned_blksize)
- return true;
- /*
- * for blkzoned device, fallback direct IO to buffered IO, so
- * all IOs can be serialized by log-structured write.
- */
- if (f2fs_sb_has_blkzoned(sbi))
- return true;
- if (f2fs_lfs_mode(sbi) && (rw == WRITE)) {
- if (block_unaligned_IO(inode, iocb, iter))
- return true;
- if (F2FS_IO_ALIGNED(sbi))
- return true;
- }
- if (is_sbi_flag_set(F2FS_I_SB(inode), SBI_CP_DISABLED))
- return true;
-
- return false;
-}
-
static inline bool f2fs_need_verity(const struct inode *inode, pgoff_t idx)
{
return fsverity_active(inode) &&
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index bd14cef1b08fd..5e5c97fccfb4e 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -808,6 +808,51 @@ int f2fs_truncate(struct inode *inode)
return 0;
}

+static int block_unaligned_IO(struct inode *inode, struct kiocb *iocb,
+ struct iov_iter *iter)
+{
+ unsigned int i_blkbits = READ_ONCE(inode->i_blkbits);
+ unsigned int blocksize_mask = (1 << i_blkbits) - 1;
+ loff_t offset = iocb->ki_pos;
+ unsigned long align = offset | iov_iter_alignment(iter);
+
+ return align & blocksize_mask;
+}
+
+static inline bool f2fs_force_buffered_io(struct inode *inode,
+ struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+ int rw = iov_iter_rw(iter);
+
+ if (!fscrypt_dio_supported(inode))
+ return true;
+ if (fsverity_active(inode))
+ return true;
+ if (f2fs_compressed_file(inode))
+ return true;
+
+ /* disallow direct IO if any of devices has unaligned blksize */
+ if (f2fs_is_multi_device(sbi) && !sbi->aligned_blksize)
+ return true;
+ /*
+ * for blkzoned device, fallback direct IO to buffered IO, so
+ * all IOs can be serialized by log-structured write.
+ */
+ if (f2fs_sb_has_blkzoned(sbi))
+ return true;
+ if (f2fs_lfs_mode(sbi) && (rw == WRITE)) {
+ if (block_unaligned_IO(inode, iocb, iter))
+ return true;
+ if (F2FS_IO_ALIGNED(sbi))
+ return true;
+ }
+ if (is_sbi_flag_set(F2FS_I_SB(inode), SBI_CP_DISABLED))
+ return true;
+
+ return false;
+}
+
int f2fs_getattr(struct user_namespace *mnt_userns, const struct path *path,
struct kstat *stat, u32 request_mask, unsigned int query_flags)
{
--
2.36.1

2022-06-16 20:19:13

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 8/8] f2fs: support STATX_DIOALIGN

From: Eric Biggers <[email protected]>

Add support for STATX_DIOALIGN to f2fs, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/f2fs/file.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 1b452bb75af29..11d75aa3da185 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -852,6 +852,21 @@ int f2fs_getattr(struct user_namespace *mnt_userns, const struct path *path,
stat->btime.tv_nsec = fi->i_crtime.tv_nsec;
}

+ /*
+ * Return the DIO alignment restrictions if requested. We only return
+ * this information when requested, since on encrypted files it might
+ * take a fair bit of work to get if the file wasn't opened recently.
+ */
+ if ((request_mask & STATX_DIOALIGN) && S_ISREG(inode->i_mode)) {
+ unsigned int bsize = i_blocksize(inode);
+
+ stat->result_mask |= STATX_DIOALIGN;
+ if (!f2fs_force_buffered_io(inode)) {
+ stat->dio_mem_align = bsize;
+ stat->dio_offset_align = bsize;
+ }
+ }
+
flags = fi->i_flags;
if (flags & F2FS_COMPR_FL)
stat->attributes |= STATX_ATTR_COMPRESSED;
--
2.36.1

2022-06-16 20:19:15

by Eric Biggers

[permalink] [raw]
Subject: [PATCH v3 7/8] f2fs: simplify f2fs_force_buffered_io()

From: Eric Biggers <[email protected]>

f2fs only allows direct I/O that is aligned to the filesystem block
size. Given that fact, simplify f2fs_force_buffered_io() by removing
the redundant call to block_unaligned_IO().

This makes it easier to reuse this code for STATX_DIOALIGN.

Signed-off-by: Eric Biggers <[email protected]>
---
fs/f2fs/file.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ad0212848a1ab..1b452bb75af29 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -808,19 +808,7 @@ int f2fs_truncate(struct inode *inode)
return 0;
}

-static int block_unaligned_IO(struct inode *inode, struct kiocb *iocb,
- struct iov_iter *iter)
-{
- unsigned int i_blkbits = READ_ONCE(inode->i_blkbits);
- unsigned int blocksize_mask = (1 << i_blkbits) - 1;
- loff_t offset = iocb->ki_pos;
- unsigned long align = offset | iov_iter_alignment(iter);
-
- return align & blocksize_mask;
-}
-
-static inline bool f2fs_force_buffered_io(struct inode *inode,
- struct kiocb *iocb, struct iov_iter *iter)
+static bool f2fs_force_buffered_io(struct inode *inode)
{
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);

@@ -840,12 +828,8 @@ static inline bool f2fs_force_buffered_io(struct inode *inode,
*/
if (f2fs_sb_has_blkzoned(sbi))
return true;
- if (f2fs_lfs_mode(sbi)) {
- if (block_unaligned_IO(inode, iocb, iter))
- return true;
- if (F2FS_IO_ALIGNED(sbi))
- return true;
- }
+ if (f2fs_lfs_mode(sbi) && F2FS_IO_ALIGNED(sbi))
+ return true;
if (is_sbi_flag_set(F2FS_I_SB(inode), SBI_CP_DISABLED))
return true;

@@ -4205,7 +4189,7 @@ static bool f2fs_should_use_dio(struct inode *inode, struct kiocb *iocb,
if (!(iocb->ki_flags & IOCB_DIRECT))
return false;

- if (f2fs_force_buffered_io(inode, iocb, iter))
+ if (f2fs_force_buffered_io(inode))
return false;

/*
--
2.36.1

2022-06-19 11:31:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information


On 16/06/2022 23.14, Eric Biggers wrote:
> From: Eric Biggers <[email protected]>
>
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple. For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
>
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex. Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
>
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/[email protected]/T/#u).
>
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility. However, as discussed
> (https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code. It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
>
> Therefore, let's expose this information via statx(). Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
>
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
> buffers for DIO, or 0 if DIO is not supported on the file.
>
> * stx_dio_offset_align: the alignment (in bytes) required for file
> offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
> on the file. This will only be nonzero if stx_dio_mem_align is
> nonzero, and vice versa.


If you consider AIO, this is actually three alignments:

1. offset alignment for reads (sector size in XFS)

2. offset alignment for overwrites (sector size in XFS since
ed1128c2d0c87e, block size earlier)

3. offset alignment for appending writes (block size)


This is critical for linux-aio since violation of these alignments will
stall the io_submit system call. Perhaps io_uring handles it better by
bouncing to a workqueue, but there is a significant performance and
latency penalty for that.


Small appending writes are important for database commit logs (and so
it's better to overwrite a pre-formatted file to avoid aligning to block
size).


It would be good to expose these differences.


>
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN. It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
>
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN. For
> regular files, individual filesystems will still need to add code to
> support it. For block devices, a separate commit will wire it up too.
>
> Signed-off-by: Eric Biggers <[email protected]>
> ---
> fs/stat.c | 2 ++
> include/linux/stat.h | 2 ++
> include/uapi/linux/stat.h | 4 +++-
> 3 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35..a7930d7444830 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> tmp.stx_dev_major = MAJOR(stat->dev);
> tmp.stx_dev_minor = MINOR(stat->dev);
> tmp.stx_mnt_id = stat->mnt_id;
> + tmp.stx_dio_mem_align = stat->dio_mem_align;
> + tmp.stx_dio_offset_align = stat->dio_offset_align;
>
> return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
> }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..ff277ced50e9f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
> struct timespec64 btime; /* File creation time */
> u64 blocks;
> u64 mnt_id;
> + u32 dio_mem_align;
> + u32 dio_offset_align;
> };
>
> #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041a..7cab2c65d3d7f 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
> __u32 stx_dev_minor;
> /* 0x90 */
> __u64 stx_mnt_id;
> - __u64 __spare2;
> + __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
> + __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
> /* 0xa0 */
> __u64 __spare3[12]; /* Spare space for future expansion */
> /* 0x100 */
> @@ -152,6 +153,7 @@ struct statx {
> #define STATX_BASIC_STATS 0x000007ffU /* The stuff in the normal stat struct */
> #define STATX_BTIME 0x00000800U /* Want/got stx_btime */
> #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
> +#define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
>
> #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
>

2022-06-23 15:59:04

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

On Thu, Jun 16, 2022 at 01:14:59PM -0700, Eric Biggers wrote:
> From: Eric Biggers <[email protected]>
>
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple. For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
>
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex. Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
>
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/[email protected]/T/#u).
>
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility. However, as discussed
> (https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code. It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
>
> Therefore, let's expose this information via statx(). Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
>
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
> buffers for DIO, or 0 if DIO is not supported on the file.
>
> * stx_dio_offset_align: the alignment (in bytes) required for file
> offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
> on the file. This will only be nonzero if stx_dio_mem_align is
> nonzero, and vice versa.
>
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN. It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
>
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN. For
> regular files, individual filesystems will still need to add code to
> support it. For block devices, a separate commit will wire it up too.
>
> Signed-off-by: Eric Biggers <[email protected]>
> ---
> fs/stat.c | 2 ++
> include/linux/stat.h | 2 ++
> include/uapi/linux/stat.h | 4 +++-
> 3 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35..a7930d7444830 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> tmp.stx_dev_major = MAJOR(stat->dev);
> tmp.stx_dev_minor = MINOR(stat->dev);
> tmp.stx_mnt_id = stat->mnt_id;
> + tmp.stx_dio_mem_align = stat->dio_mem_align;
> + tmp.stx_dio_offset_align = stat->dio_offset_align;
>
> return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
> }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..ff277ced50e9f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
> struct timespec64 btime; /* File creation time */
> u64 blocks;
> u64 mnt_id;
> + u32 dio_mem_align;
> + u32 dio_offset_align;

Hmm. Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
this?

struct xfs_buftarg *target = xfs_inode_buftarg(ip);

kstat.dio_mem_align = target->bt_logical_sectorsize;
kstat.dio_offset_align = target->bt_logical_sectorsize;
kstat.result_mask |= STATX_DIOALIGN;

And I guess you're tabling the "optimal" IO discussions for now, because
there are too many variants of what that means?

--D

> };
>
> #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041a..7cab2c65d3d7f 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
> __u32 stx_dev_minor;
> /* 0x90 */
> __u64 stx_mnt_id;
> - __u64 __spare2;
> + __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
> + __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
> /* 0xa0 */
> __u64 __spare3[12]; /* Spare space for future expansion */
> /* 0x100 */
> @@ -152,6 +153,7 @@ struct statx {
> #define STATX_BASIC_STATS 0x000007ffU /* The stuff in the normal stat struct */
> #define STATX_BTIME 0x00000800U /* Want/got stx_btime */
> #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
> +#define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
>
> #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
>
> --
> 2.36.1
>

2022-06-23 18:19:22

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > index 7df06931f25d8..ff277ced50e9f 100644
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> > @@ -50,6 +50,8 @@ struct kstat {
> > struct timespec64 btime; /* File creation time */
> > u64 blocks;
> > u64 mnt_id;
> > + u32 dio_mem_align;
> > + u32 dio_offset_align;
>
> Hmm. Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> this?
>
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>
> kstat.dio_mem_align = target->bt_logical_sectorsize;
> kstat.dio_offset_align = target->bt_logical_sectorsize;
> kstat.result_mask |= STATX_DIOALIGN;

Yes, I think so.

However, if we need more fields as Avi Kivity requested at
https://lore.kernel.org/r/[email protected]
that is going to complicate things. I haven't had a chance to look
into whether those extra fields are really needed. Your opinion on whether XFS
(and any other filesystem) needs them would be appreciated.

>
> And I guess you're tabling the "optimal" IO discussions for now, because
> there are too many variants of what that means?
>

Yes, that's omitted for now due to the apparent redundancy with stx_blksize.

- Eric

2022-06-23 19:30:27

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

On Thu, Jun 23, 2022 at 10:23:20AM -0700, Eric Biggers wrote:
> On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> > > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > > index 7df06931f25d8..ff277ced50e9f 100644
> > > --- a/include/linux/stat.h
> > > +++ b/include/linux/stat.h
> > > @@ -50,6 +50,8 @@ struct kstat {
> > > struct timespec64 btime; /* File creation time */
> > > u64 blocks;
> > > u64 mnt_id;
> > > + u32 dio_mem_align;
> > > + u32 dio_offset_align;
> >
> > Hmm. Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> > this?
> >
> > struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> >
> > kstat.dio_mem_align = target->bt_logical_sectorsize;
> > kstat.dio_offset_align = target->bt_logical_sectorsize;
> > kstat.result_mask |= STATX_DIOALIGN;
>
> Yes, I think so.
>

By the way, the patchset "[PATCHv6 00/11] direct-io dma alignment"
(https://lore.kernel.org/linux-block/[email protected]/T/#u),
which is currently queued in linux-block/for-next for 5.20, will relax the user
buffer alignment requirement to the dma alignment for all filesystems using the
iomap direct I/O implementation. If that goes in, the XFS implementation of
STATX_DIOALIGN, as well as the ext4 and f2fs ones, will need to be changed
accordingly. Also, the existing XFS_IOC_DIOINFO will need to be changed.

- Eric

2022-06-26 07:45:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

On Sun, Jun 19, 2022 at 02:30:47PM +0300, Avi Kivity wrote:
> > * stx_dio_offset_align: the alignment (in bytes) required for file
> > offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
> > on the file. This will only be nonzero if stx_dio_mem_align is
> > nonzero, and vice versa.
>
>
> If you consider AIO, this is actually three alignments:
>
> 1. offset alignment for reads (sector size in XFS)
>
> 2. offset alignment for overwrites (sector size in XFS since ed1128c2d0c87e,
> block size earlier)
>
> 3. offset alignment for appending writes (block size)
>
>
> This is critical for linux-aio since violation of these alignments will
> stall the io_submit system call. Perhaps io_uring handles it better by
> bouncing to a workqueue, but there is a significant performance and latency
> penalty for that.

I think you are mixing things up here. We actually have two limits that
matter:

a) the hard limit, which if violated will return an error.
This has been sector size for all common file systems for years,
but can be bigger than that with fscrypt in the game (which
triggered this series)
b) an optimal write size, which can be done asynchronous and
without exclusive locking.
This is what your cases 2) and 3) above refer to.

Exposting this additional optimal performance size might be a good idea
in addition to what is proposed here, even if matters a little less
with io_uring. But I'm not sure I'd additional split it into append
vs overwrite vs hole filling but just round up to the maximum of those.

2022-06-26 07:45:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-06-26 08:03:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information

On Thu, Jun 23, 2022 at 08:58:12AM -0700, Darrick J. Wong wrote:
> Hmm. Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
> this?
>
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>
> kstat.dio_mem_align = target->bt_logical_sectorsize;
> kstat.dio_offset_align = target->bt_logical_sectorsize;
> kstat.result_mask |= STATX_DIOALIGN;

Yes, I think so. And it would be very good to include the XFS conversion
with this series as the only file systems that already supports
reporting alignment constraints.

I also suspect that lifting XFS_IOC_DIOINFO to common code by calling
->getattr would be useful because now all existing software using that
will also do the right thing on ext4 and f2fs now.

2022-06-26 10:21:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] statx: add direct I/O alignment information


On 26/06/2022 10.44, Christoph Hellwig wrote:
> On Sun, Jun 19, 2022 at 02:30:47PM +0300, Avi Kivity wrote:
>>> * stx_dio_offset_align: the alignment (in bytes) required for file
>>> offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>>> on the file. This will only be nonzero if stx_dio_mem_align is
>>> nonzero, and vice versa.
>>
>> If you consider AIO, this is actually three alignments:
>>
>> 1. offset alignment for reads (sector size in XFS)
>>
>> 2. offset alignment for overwrites (sector size in XFS since ed1128c2d0c87e,
>> block size earlier)
>>
>> 3. offset alignment for appending writes (block size)
>>
>>
>> This is critical for linux-aio since violation of these alignments will
>> stall the io_submit system call. Perhaps io_uring handles it better by
>> bouncing to a workqueue, but there is a significant performance and latency
>> penalty for that.
> I think you are mixing things up here.


Yes.


> We actually have two limits that
> matter:
>
> a) the hard limit, which if violated will return an error.
> This has been sector size for all common file systems for years,
> but can be bigger than that with fscrypt in the game (which
> triggered this series)
> b) an optimal write size, which can be done asynchronous and
> without exclusive locking.
> This is what your cases 2) and 3) above refer to.
>
> Exposting this additional optimal performance size might be a good idea
> in addition to what is proposed here, even if matters a little less
> with io_uring. But I'm not sure I'd additional split it into append
> vs overwrite vs hole filling but just round up to the maximum of those.


Rounding up will penalize database workloads, with and without io_uring.
Database commit logs are characterized by frequent small writes. Telling
the database to round up to 4k vs 512 bytes means large write
amplification. The disk probably won't care (or maybe it will - it will
also have to generate more erase blocks), but the database will run out
of commitlog space much sooner and will have to compensate in expensive
ways.


Of course, people that care can continue to use internal filesystem
knowledge, and maybe there are few enough of those that the API can
choose to ignore them.


2022-07-20 05:05:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] make statx() return DIO alignment information

Hi Eric,

can you resend the series based on the received feedback?