2024-04-29 17:53:37

by John Garry

[permalink] [raw]
Subject: [PATCH v3 00/21] block atomic writes for XFS

This series expands atomic write support to filesystems, specifically
XFS. Extent alignment is based on new feature forcealign.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

XFS can be formatted for atomic writes as follows:
mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1 /dev/sda

atomic-writes=1 just enables atomic writes in the SB, but does not auto-
enable atomic writes for each file.

Support can be enabled through xfs_io command:
$xfs_io -c "lsattr -v" filename
[extsize, force-align]
$xfs_io -c "extsize" filename
[16384] filename
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[extsize, force-align, atomic-writes] filename
$xfs_io -c statx filename
..
stat.stx_atomic_write_unit_min = 4096
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
..

A couple of patches are marked as RFC, as I am anything but 100% confident
in them. Indeed, from testing, there is an issue that I get ENOSPC at what
appears to be a moderate FS usage, like ~60%, and spaceman is reporting
appropriately-sized free extents. I notice that when I disable any
fallocate punch calls in my testing, then this issue goes away. More
details available upon request.

Stripe alignment testing for forcealign changes should be noted at
https://lore.kernel.org/linux-xfs/[email protected]/

Baseline is following series (which is based on v6.9-rc1):
https://lore.kernel.org/linux-block/[email protected]/

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/forcealign_and_atomicwrites_for_v3_xfs_block_atomic_writes

Patches for this series can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.9-v6-fs-v3/

Changes since v2:
https://lore.kernel.org/linux-xfs/[email protected]/
- Incorporate forcealign patches from
https://lore.kernel.org/linux-xfs/[email protected]/
- Put bdev awu min and max in buftarg
- Extra forcealign patches to deal with truncate and fallocate punch,
insert, collapse
- Add generic_atomic_write_valid_size()
- Change iomap.extent_shift -> .extent_size

Changes since v1:
https://lore.kernel.org/linux-xfs/[email protected]/
- Add blk_validate_atomic_write_op_size() (Darrick suggested idea)
- Swap forcealign for rtvol support (Dave requested forcealign)
- Sub-extent DIO zeroing (Dave wanted rid of XFS_BMAPI_ZERO usage)
- Improve coding for XFS statx support (Darrick, Ojaswin)
- Improve conditions for setting FMODE_CAN_ATOMIC_WRITE (Darrick)
- Improve commit message for FS_XFLAG_ATOMICWRITES flag (Darrick)
- Validate atomic writes in xfs_file_dio_write()
- Drop IOMAP_ATOMIC

Darrick J. Wong (2):
xfs: Introduce FORCEALIGN inode flag
xfs: Enable file data forcealign feature

Dave Chinner (6):
xfs: only allow minlen allocations when near ENOSPC
xfs: always tail align maxlen allocations
xfs: simplify extent allocation alignment
xfs: make EOF allocation simpler
xfs: introduce forced allocation alignment
fs: xfs: align args->minlen for forced allocation alignment

John Garry (13):
fs: Add generic_atomic_write_valid_size()
xfs: Do not free EOF blocks for forcealign
xfs: Update xfs_is_falloc_aligned() mask for forcealign
xfs: Unmap blocks according to forcealign
xfs: Only free full extents for forcealign
iomap: Sub-extent zeroing
fs: xfs: iomap: Sub-extent zeroing
fs: Add FS_XFLAG_ATOMICWRITES flag
iomap: Atomic write support
xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
xfs: Support atomic write for statx
xfs: Validate atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE

fs/iomap/direct-io.c | 27 ++-
fs/xfs/libxfs/xfs_alloc.c | 33 ++--
fs/xfs/libxfs/xfs_alloc.h | 3 +-
fs/xfs/libxfs/xfs_bmap.c | 302 ++++++++++++++++++----------------
fs/xfs/libxfs/xfs_format.h | 16 +-
fs/xfs/libxfs/xfs_ialloc.c | 12 +-
fs/xfs/libxfs/xfs_inode_buf.c | 86 ++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +
fs/xfs/libxfs/xfs_sb.c | 4 +
fs/xfs/xfs_bmap_util.c | 14 +-
fs/xfs/xfs_buf.c | 15 +-
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_file.c | 64 +++++--
fs/xfs/xfs_inode.c | 14 ++
fs/xfs/xfs_inode.h | 15 ++
fs/xfs/xfs_ioctl.c | 55 ++++++-
fs/xfs/xfs_iomap.c | 13 +-
fs/xfs/xfs_iops.c | 28 ++++
fs/xfs/xfs_mount.h | 4 +
fs/xfs/xfs_super.c | 8 +
fs/xfs/xfs_trace.h | 8 +-
include/linux/fs.h | 12 ++
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 3 +
24 files changed, 549 insertions(+), 195 deletions(-)

--
2.31.1



2024-04-29 17:53:45

by John Garry

[permalink] [raw]
Subject: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag

From: "Darrick J. Wong" <[email protected]>

Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint. Having a separate COW extent size hint is no
longer allowed.

The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.

jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
alignment for forcealign
Signed-off-by: "Darrick J. Wong" <[email protected]>
Co-developed-by: John Garry <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 6 ++++-
fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_inode.c | 12 +++++++++
fs/xfs/xfs_inode.h | 2 +-
fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++-
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 +++
include/uapi/linux/fs.h | 2 ++
10 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2b2f9050fbfb..4dd295b047f8 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1084,16 +1085,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5

#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)

#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)

static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index d0dcce462bf4..12f128f12824 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -616,6 +616,14 @@ xfs_dinode_verify(
!xfs_has_bigtime(mp))
return __this_address;

+ if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+ fa = xfs_inode_validate_forcealign(mp, mode, flags,
+ be32_to_cpu(dip->di_extsize),
+ be32_to_cpu(dip->di_cowextsize));
+ if (fa)
+ return fa;
+ }
+
return NULL;
}

@@ -783,3 +791,45 @@ xfs_inode_validate_cowextsize(

return NULL;
}
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+ struct xfs_mount *mp,
+ uint16_t mode,
+ uint16_t flags,
+ uint32_t extsize,
+ uint32_t cowextsize)
+{
+ /* superblock rocompat feature flag */
+ if (!xfs_has_forcealign(mp))
+ return __this_address;
+
+ /* Only regular files and directories */
+ if (!S_ISDIR(mode) && !S_ISREG(mode))
+ return __this_address;
+
+ /* Doesn't apply to realtime files */
+ if (flags & XFS_DIFLAG_REALTIME)
+ return __this_address;
+
+ /* Requires a non-zero power-of-2 extent size hint */
+ if (extsize == 0 || !is_power_of_2(extsize) ||
+ (mp->m_sb.sb_agblocks % extsize))
+ return __this_address;
+
+ /* Requires agsize be a multiple of extsize */
+ if (mp->m_sb.sb_agblocks % extsize)
+ return __this_address;
+
+ /* Requires stripe unit+width (if set) be a multiple of extsize */
+ if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
+ (mp->m_swidth && (mp->m_swidth % extsize)))
+ return __this_address;
+
+ /* Requires no cow extent size hint */
+ if (cowextsize != 0)
+ return __this_address;
+
+ return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
uint32_t cowextsize, uint16_t mode, uint16_t flags,
uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+ uint16_t mode, uint16_t flags, uint32_t extsize,
+ uint32_t cowextsize);

static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d991eec05436..e746c57c4cc4 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_REFLINK;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
features |= XFS_FEAT_INOBTCNT;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ features |= XFS_FEAT_FORCEALIGN;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ea48774f6b76..db5a0f66a121 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -607,6 +607,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_DAX;
if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
flags |= FS_XFLAG_COWEXTSIZE;
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ flags |= FS_XFLAG_FORCEALIGN;
}

if (xfs_inode_has_attr_fork(ip))
@@ -736,6 +738,8 @@ xfs_inode_inherit_flags2(
}
if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+ if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;

/* Don't let invalid cowextsize hints propagate. */
failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -744,6 +748,14 @@ xfs_inode_inherit_flags2(
ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
ip->i_cowextsize = 0;
}
+
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+ ip->i_cowextsize);
+ if (failaddr)
+ ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+ }
}

/*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 67f10349a6ed..065028789473 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -313,7 +313,7 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)

static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
{
- return false;
+ return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}

/*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d0e2cec6210d..d1126509ceb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_DAX;
if (xflags & FS_XFLAG_COWEXTSIZE)
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+ if (xflags & FS_XFLAG_FORCEALIGN)
+ di_flags2 |= XFS_DIFLAG2_FORCEALIGN;

return di_flags2;
}
@@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
if (i_flags2 && !xfs_has_v3inodes(mp))
return -EINVAL;

+ /*
+ * Force-align requires a nonzero extent size hint and a zero cow
+ * extent size hint. It doesn't apply to realtime files.
+ */
+ if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+ if (!xfs_has_forcealign(mp))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+ return -EINVAL;
+ if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+ FS_XFLAG_EXTSZINHERIT)))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;

@@ -1232,6 +1250,7 @@ xfs_ioctl_setattr_check_extsize(
struct xfs_mount *mp = ip->i_mount;
xfs_failaddr_t failaddr;
uint16_t new_diflags;
+ uint16_t new_diflags2;

if (!fa->fsx_valid)
return 0;
@@ -1244,6 +1263,7 @@ xfs_ioctl_setattr_check_extsize(
return -EINVAL;

new_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
+ new_diflags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);

/*
* Inode verifiers do not check that the extent size hint is an integer
@@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
failaddr = xfs_inode_validate_extsize(ip->i_mount,
XFS_B_TO_FSB(mp, fa->fsx_extsize),
VFS_I(ip)->i_mode, new_diflags);
- return failaddr != NULL ? -EINVAL : 0;
+ if (failaddr)
+ return -EINVAL;
+
+ if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, new_diflags,
+ XFS_B_TO_FSB(mp, fa->fsx_extsize),
+ XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
+ if (failaddr)
+ return -EINVAL;
+ }
+
+ return 0;
}

static int
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e880aa48de68..a8266cf654c4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
+#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */

/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)

/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c21f10ab0f5d..63d4312785ef 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1706,6 +1706,10 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}

+ if (xfs_has_forcealign(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 191a7e88a8ab..6a6bcb53594a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,6 +158,8 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN 0x00020000
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */

/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1


2024-04-29 17:53:48

by John Garry

[permalink] [raw]
Subject: [PATCH v3 21/21] xfs: Support setting FMODE_CAN_ATOMIC_WRITE

For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
flag. Only direct IO is currently supported, so check for that also.

We rely on the block layer to reject atomic writes which exceed the bdev
request_queue limits, so don't bother checking any such thing here.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 256d05c1be6a..5a17748eb6bd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1243,6 +1243,18 @@ xfs_file_remap_range(
return remapped > 0 ? remapped : ret;
}

+static bool xfs_file_open_can_atomicwrite(
+ struct inode *inode,
+ struct file *file)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+
+ if (!(file->f_flags & O_DIRECT))
+ return false;
+
+ return xfs_inode_has_atomicwrites(ip);
+}
+
STATIC int
xfs_file_open(
struct inode *inode,
@@ -1252,6 +1264,8 @@ xfs_file_open(
return -EIO;
file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+ if (xfs_file_open_can_atomicwrite(inode, file))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
return generic_file_open(inode, file);
}

--
2.31.1


2024-04-29 17:53:52

by John Garry

[permalink] [raw]
Subject: [PATCH v3 04/21] xfs: simplify extent allocation alignment

From: Dave Chinner <[email protected]>

We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.

The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first there are
also slop variables that allow for those allocation attempts to
account for aligned allocation if they fail.

This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.

Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.

Signed-off-by: Dave Chinner <[email protected]>
jpg: fixup alignslop references in xfs_trace.h and xfs_ialloc.c
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 4 +-
fs/xfs/libxfs/xfs_alloc.h | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 96 +++++++++++++++++---------------------
fs/xfs/libxfs/xfs_ialloc.c | 10 ++--
fs/xfs/xfs_trace.h | 8 ++--
5 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e21fd5c1f802..563599e956a6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2393,7 +2393,7 @@ xfs_alloc_space_available(
reservation = xfs_ag_resv_needed(pag, args->resv);

/* do we have enough contiguous free space for the allocation? */
- alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
if (longest < alloc_len)
return false;
@@ -2422,7 +2422,7 @@ xfs_alloc_space_available(
* allocation as we know that will definitely succeed and match the
* callers alignment constraints.
*/
- alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->maxlen + (args->alignment - 1) + args->alignslop;
if (longest < alloc_len) {
args->maxlen = args->minlen;
ASSERT(args->maxlen > 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..aa2c103d98f0 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
xfs_extlen_t minleft; /* min blocks must be left after us */
xfs_extlen_t total; /* total blocks needed in xaction */
xfs_extlen_t alignment; /* align answer to multiple of this */
- xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
+ xfs_extlen_t alignslop; /* slop for alignment calcs */
xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
xfs_agblock_t max_agbno; /* ... */
xfs_extlen_t len; /* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..d56c82c07505 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3295,6 +3295,10 @@ xfs_bmap_select_minlen(
xfs_extlen_t blen)
{

+ /* Adjust best length for extent start alignment. */
+ if (blen > args->alignment)
+ blen -= args->alignment;
+
/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
* possible that there is enough contiguous free space for this request.
@@ -3310,6 +3314,7 @@ xfs_bmap_select_minlen(
if (blen < args->maxlen)
return blen;
return args->maxlen;
+
}

static int
@@ -3403,35 +3408,43 @@ xfs_bmap_alloc_account(
xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
}

-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
xfs_bmap_compute_alignments(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */
- int stripe_align = 0;

/* stripe alignment for allocation is determined by mount parameters */
if (mp->m_swidth && xfs_has_swalloc(mp))
- stripe_align = mp->m_swidth;
+ args->alignment = mp->m_swidth;
else if (mp->m_dalign)
- stripe_align = mp->m_dalign;
+ args->alignment = mp->m_dalign;

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
&ap->length))
ASSERT(0);
ASSERT(ap->length);
- }

- /* apply extent size hints if obtained earlier */
- if (align) {
args->prod = align;
div_u64_rem(ap->offset, args->prod, &args->mod);
if (args->mod)
@@ -3446,7 +3459,6 @@ xfs_bmap_compute_alignments(
args->mod = args->prod - args->mod;
}

- return stripe_align;
}

static void
@@ -3518,7 +3530,7 @@ xfs_bmap_exact_minlen_extent_alloc(
args.total = ap->total;

args.alignment = 1;
- args.minalignslop = 0;
+ args.alignslop = 0;

args.minleft = ap->minleft;
args.wasdel = ap->wasdel;
@@ -3558,7 +3570,6 @@ xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen,
- int stripe_align,
bool ag_only)
{
struct xfs_mount *mp = args->mp;
@@ -3572,23 +3583,15 @@ xfs_bmap_btalloc_at_eof(
* allocation.
*/
if (ap->offset) {
- xfs_extlen_t nextminlen = 0;
+ xfs_extlen_t alignment = args->alignment;

/*
- * Compute the minlen+alignment for the next case. Set slop so
- * that the value of minlen+alignment+slop doesn't go up between
- * the calls.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignemnt space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
args->alignment = 1;
- if (blen > stripe_align && blen <= args->maxlen)
- nextminlen = blen - stripe_align;
- else
- nextminlen = args->minlen;
- if (nextminlen + stripe_align > args->minlen + 1)
- args->minalignslop = nextminlen + stripe_align -
- args->minlen - 1;
- else
- args->minalignslop = 0;
+ args->alignslop = alignment - args->alignment;

if (!caller_pag)
args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3606,19 +3609,8 @@ xfs_bmap_btalloc_at_eof(
* Exact allocation failed. Reset to try an aligned allocation
* according to the original allocation specification.
*/
- args->alignment = stripe_align;
- args->minlen = nextminlen;
- args->minalignslop = 0;
- } else {
- /*
- * Adjust minlen to try and preserve alignment if we
- * can't guarantee an aligned maxlen extent.
- */
- args->alignment = stripe_align;
- if (blen > args->alignment &&
- blen <= args->maxlen + args->alignment)
- args->minlen = blen - args->alignment;
- args->minalignslop = 0;
+ args->alignment = alignment;
+ args->alignslop = 0;
}

if (ag_only) {
@@ -3636,9 +3628,8 @@ xfs_bmap_btalloc_at_eof(
return 0;

/*
- * Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * Aligned allocation failed, so all fallback paths from here drop the
+ * start alignment requirement as we know it will not succeed.
*/
args->alignment = 1;
return 0;
@@ -3646,7 +3637,9 @@ xfs_bmap_btalloc_at_eof(

/*
* We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
* allocation whilst still maintaining necessary total block reservation
* requirements.
*
@@ -3663,6 +3656,7 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3682,13 +3676,11 @@ xfs_bmap_btalloc_low_space(
static int
xfs_bmap_btalloc_filestreams(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error = 0;

-
error = xfs_filestream_select_ag(ap, args, &blen);
if (error)
return error;
@@ -3707,8 +3699,7 @@ xfs_bmap_btalloc_filestreams(

args->minlen = xfs_bmap_select_minlen(ap, args, blen);
if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- true);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);

if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3732,8 +3723,7 @@ xfs_bmap_btalloc_filestreams(
static int
xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error;
@@ -3757,8 +3747,7 @@ xfs_bmap_btalloc_best_length(
* trying.
*/
if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- false);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
if (error || args->fsbno != NULLFSBLOCK)
return error;
}
@@ -3785,27 +3774,26 @@ xfs_bmap_btalloc(
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
.alignment = 1,
- .minalignslop = 0,
+ .alignslop = 0,
};
xfs_fileoff_t orig_offset;
xfs_extlen_t orig_length;
int error;
- int stripe_align;

ASSERT(ap->length);
orig_offset = ap->offset;
orig_length = ap->length;

- stripe_align = xfs_bmap_compute_alignments(ap, &args);
+ xfs_bmap_compute_alignments(ap, &args);

/* Trim the allocation back to the maximum an AG can fit. */
args.maxlen = min(ap->length, mp->m_ag_max_usable);

if ((ap->datatype & XFS_ALLOC_USERDATA) &&
xfs_inode_is_filestream(ap->ip))
- error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_filestreams(ap, &args);
else
- error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_best_length(ap, &args);
if (error)
return error;

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e5ac3e5430c4..164b6dcdbb44 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
*
* For an exact allocation, alignment must be 1,
* however we need to take cluster alignment into account when
- * fixing up the freelist. Use the minalignslop field to
- * indicate that extra blocks might be required for alignment,
- * but not to use them in the actual exact allocation.
+ * fixing up the freelist. Use the alignslop field to indicate
+ * that extra blocks might be required for alignment, but not
+ * to use them in the actual exact allocation.
*/
args.alignment = 1;
- args.minalignslop = igeo->cluster_align - 1;
+ args.alignslop = igeo->cluster_align - 1;

/* Allow space for the inode btree to split. */
args.minleft = igeo->inobt_maxlevels;
@@ -783,7 +783,7 @@ xfs_ialloc_ag_alloc(
* on, so reset minalignslop to ensure it is not included in
* subsequent requests.
*/
- args.minalignslop = 0;
+ args.alignslop = 0;
}

if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index aea97fc074f8..14679d64558a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1800,7 +1800,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__field(xfs_extlen_t, minleft)
__field(xfs_extlen_t, total)
__field(xfs_extlen_t, alignment)
- __field(xfs_extlen_t, minalignslop)
+ __field(xfs_extlen_t, alignslop)
__field(xfs_extlen_t, len)
__field(char, wasdel)
__field(char, wasfromfl)
@@ -1819,7 +1819,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft = args->minleft;
__entry->total = args->total;
__entry->alignment = args->alignment;
- __entry->minalignslop = args->minalignslop;
+ __entry->alignslop = args->alignslop;
__entry->len = args->len;
__entry->wasdel = args->wasdel;
__entry->wasfromfl = args->wasfromfl;
@@ -1828,7 +1828,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->highest_agno = args->tp->t_highest_agno;
),
TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
- "prod %u minleft %u total %u alignment %u minalignslop %u "
+ "prod %u minleft %u total %u alignment %u alignslop %u "
"len %u wasdel %d wasfromfl %d resv %d "
"datatype 0x%x highest_agno 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1841,7 +1841,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft,
__entry->total,
__entry->alignment,
- __entry->minalignslop,
+ __entry->alignslop,
__entry->len,
__entry->wasdel,
__entry->wasfromfl,
--
2.31.1


2024-04-29 17:55:13

by John Garry

[permalink] [raw]
Subject: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign

Like we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f26d1570b9bd..1dd45dfb2811 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -847,8 +847,11 @@ xfs_free_file_space(
startoffset_fsb = XFS_B_TO_FSB(mp, offset);
endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);

- /* We can only free complete realtime extents. */
- if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+ /* Free only complete extents. */
+ if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
+ startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
+ endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
+ } else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
}
--
2.31.1


2024-04-29 17:56:57

by John Garry

[permalink] [raw]
Subject: [PATCH v3 03/21] xfs: always tail align maxlen allocations

From: Dave Chinner <[email protected]>

When we do a large allocation, the core free space allocation code
assumes that args->maxlen is aligned to args->prod/args->mod. hence
if we get a maximum sized extent allocated, it does not do tail
alignment of the extent.

However, this assumes that nothing modifies args->maxlen between the
original allocation context setup and trimming the selected free
space extent to size. This assumption has recently been found to be
invalid - xfs_alloc_space_available() modifies args->maxlen in low
space situations - and there may be more situations we haven't yet
found like this.

Force aligned allocation introduces the requirement that extents are
correctly tail aligned, resulting in this occasional latent
alignment failure to e reclassified from an unimportant curiousity
to a must-fix bug.

Removing the assumption about args->maxlen allocations always being
tail aligned is trivial, and should not impact anything because
args->maxlen for inodes with extent size hints configured are
already aligned. Hence all this change does it avoid weird corner
cases that would have resulted in unaligned extent sizes by always
trimming the extent down to an aligned size.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 215265e0f68f..e21fd5c1f802 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -432,20 +432,18 @@ xfs_alloc_compute_diff(
* Fix up the length, based on mod and prod.
* len should be k * prod + mod for some k.
* If len is too small it is returned unchanged.
- * If len hits maxlen it is left alone.
*/
-STATIC void
+static void
xfs_alloc_fix_len(
- xfs_alloc_arg_t *args) /* allocation argument structure */
+ struct xfs_alloc_arg *args)
{
- xfs_extlen_t k;
- xfs_extlen_t rlen;
+ xfs_extlen_t k;
+ xfs_extlen_t rlen = args->len;

ASSERT(args->mod < args->prod);
- rlen = args->len;
ASSERT(rlen >= args->minlen);
ASSERT(rlen <= args->maxlen);
- if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
+ if (args->prod <= 1 || rlen < args->mod ||
(args->mod == 0 && rlen < args->prod))
return;
k = rlen % args->prod;
--
2.31.1


2024-04-29 17:57:00

by John Garry

[permalink] [raw]
Subject: [PATCH v3 19/21] xfs: Support atomic write for statx

Support providing info on atomic write unit min and max for an inode.

For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future. This is required by iomap
DIO.

The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_iops.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..7d2ef3059ca5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,27 @@ xfs_stat_blksize(
return PAGE_SIZE;
}

+static void
+xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *sbp = &mp->m_sb;
+ unsigned int extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
+
+ if (!xfs_inode_has_atomicwrites(ip)) {
+ *unit_min = 0;
+ *unit_max = 0;
+ return;
+ }
+
+ *unit_min = sbp->sb_blocksize;
+ *unit_max = min(target->bt_bdev_awu_max, extsz_bytes);
+}
+
STATIC int
xfs_vn_getattr(
struct mnt_idmap *idmap,
@@ -619,6 +640,13 @@ xfs_vn_getattr(
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
}
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+ generic_fill_statx_atomic_writes(stat,
+ unit_min, unit_max);
+ }
fallthrough;
default:
stat->blksize = xfs_stat_blksize(ip);
--
2.31.1


2024-04-29 18:02:27

by John Garry

[permalink] [raw]
Subject: [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag

Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag. Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STAT_ATTR_DAX
to find out if they really got that IO path.

Signed-off-by: John Garry <[email protected]>
---
include/uapi/linux/fs.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6a6bcb53594a..0eae5383a0b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -160,6 +160,7 @@ struct fsxattr {
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
/* data extent mappings for regular files must be aligned to extent size hint */
#define FS_XFLAG_FORCEALIGN 0x00020000
+#define FS_XFLAG_ATOMICWRITES 0x00040000 /* atomic writes enabled */
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */

/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1


2024-04-29 18:02:57

by John Garry

[permalink] [raw]
Subject: [PATCH v3 13/21] xfs: Enable file data forcealign feature

From: "Darrick J. Wong" <[email protected]>

Enable this feature.

Signed-off-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4dd295b047f8..0c73b96dbefc 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.31.1


2024-04-29 18:03:40

by John Garry

[permalink] [raw]
Subject: [PATCH v3 20/21] xfs: Validate atomic writes

Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
FORCEALIGN and also ATOMICWRITES flags would also need to be set for the
inode.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ee4f94cf6f4e..256d05c1be6a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -712,12 +712,20 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
struct xfs_mount *mp = ip->i_mount;
unsigned int blockmask;

+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+ i_blocksize(inode), XFS_FSB_TO_B(mp, ip->i_extsize))) {
+ return -EINVAL;
+ }
+ }
+
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
--
2.31.1


2024-04-29 18:03:57

by John Garry

[permalink] [raw]
Subject: [PATCH v3 06/21] xfs: introduce forced allocation alignment

From: Dave Chinner <[email protected]>

When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If aligned allocation cannot be done, then the allocation
is failed rather than attempting non-aligned fallbacks.

Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.h | 1 +
fs/xfs/libxfs/xfs_bmap.c | 29 +++++++++++++++++++++++------
fs/xfs/xfs_inode.h | 5 +++++
3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index aa2c103d98f0..7de2e6f64882 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
#define XFS_ALLOC_USERDATA (1 << 0)/* allocation is for user data*/
#define XFS_ALLOC_INITIAL_USER_DATA (1 << 1)/* special case start of file */
#define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN (1 << 3)/* forced extent alignment */

/* freespace limit calculations */
unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c2ddf1875e52..7a0ef0900097 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3411,9 +3411,10 @@ xfs_bmap_alloc_account(
* Calculate the extent start alignment and the extent length adjustments that
* constrain this allocation.
*
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
*
* Low level allocation code is free to either ignore or override these values
* as required.
@@ -3426,11 +3427,18 @@ xfs_bmap_compute_alignments(
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */

- /* stripe alignment for allocation is determined by mount parameters */
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ /*
+ * Forced inode alignment takes preference over stripe alignment.
+ * Stripe alignment for allocation is determined by mount parameters.
+ */
+ if (xfs_inode_has_forcealign(ap->ip)) {
+ args->alignment = xfs_get_extsz_hint(ap->ip);
+ args->datatype |= XFS_ALLOC_FORCEALIGN;
+ } else if (mp->m_swidth && xfs_has_swalloc(mp)) {
args->alignment = mp->m_swidth;
- else if (mp->m_dalign)
+ } else if (mp->m_dalign) {
args->alignment = mp->m_dalign;
+ }

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
@@ -3617,6 +3625,11 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+ args->fsbno = NULLFSBLOCK;
+ return 0;
+ }
+
args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
@@ -3668,6 +3681,8 @@ xfs_bmap_btalloc_filestreams(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
}
@@ -3726,6 +3741,8 @@ xfs_bmap_btalloc_best_length(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ab46ffb3ac19..67f10349a6ed 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -311,6 +311,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}

+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+ return false;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
--
2.31.1


2024-04-29 18:05:48

by John Garry

[permalink] [raw]
Subject: [PATCH v3 14/21] iomap: Sub-extent zeroing

For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
unwritten.

This will be important for atomic writes support, in that atomically
writing over a partially written extent would mean that we would need to
do the unwritten extent conversion write separately, and the write could
no longer be atomic.

It is the task of the FS to set iomap.extent_size per iter to indicate
sub-extent zeroing required.

Signed-off-by: John Garry <[email protected]>
---
fs/iomap/direct-io.c | 17 +++++++++++------
include/linux/iomap.h | 1 +
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..a3ed7cfa95bc 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
{
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
- unsigned int fs_block_size = i_blocksize(inode), pad;
+ unsigned int zeroing_size, pad;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
size_t copied = 0;
size_t orig_count;

+ if (iomap->extent_size)
+ zeroing_size = iomap->extent_size;
+ else
+ zeroing_size = i_blocksize(inode);
+
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
@@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
dio->iocb->ki_flags &= ~IOCB_HIPRI;

if (need_zeroout) {
- /* zero out from the start of the block to the write offset */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the start of the region to the write offset */
+ pad = pos & (zeroing_size - 1);
if (pad)
iomap_dio_zero(iter, dio, pos - pad, pad);
}
@@ -428,10 +433,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
zero_tail:
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
- /* zero out from the end of the write to the end of the block */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the end of the write to the end of the region */
+ pad = pos & (zeroing_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
}
out:
/* Undo iter limitation to current extent */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..42623b1cdc04 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -97,6 +97,7 @@ struct iomap {
u64 length; /* length of mapping, bytes */
u16 type; /* type of mapping */
u16 flags; /* flags for mapping */
+ unsigned int extent_size;
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
--
2.31.1


2024-05-01 10:03:55

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag


>> +/* Validate the forcealign inode flag */
>> +xfs_failaddr_t
>> +xfs_inode_validate_forcealign(
>> + struct xfs_mount *mp,
>> + uint16_t mode,
>
> umode_t mode,

ok. BTW, other functions like xfs_inode_validate_extsize() use uint16_t

>
>> + uint16_t flags,
>> + uint32_t extsize,
>> + uint32_t cowextsize)
>
> extent sizes are xfs_extlen_t types.

ok

>
>> +{
>> + /* superblock rocompat feature flag */
>> + if (!xfs_has_forcealign(mp))
>> + return __this_address;
>> +
>> + /* Only regular files and directories */
>> + if (!S_ISDIR(mode) && !S_ISREG(mode))
>> + return __this_address;
>> +
>> + /* Doesn't apply to realtime files */
>> + if (flags & XFS_DIFLAG_REALTIME)
>> + return __this_address;
>
> Why not? A rt device with an extsize of 1 fsb could make use of
> forced alignment just like the data device to allow larger atomic
> writes to be done. I mean, just because we haven't written the code
> to do this yet doesn't mean it is an illegal on-disk format state.

ok, so where is a better place to disallow forcealign for RT now (since
we have not written the code to support it nor verified it)?

>
>> + /* Requires a non-zero power-of-2 extent size hint */
>> + if (extsize == 0 || !is_power_of_2(extsize) ||
>> + (mp->m_sb.sb_agblocks % extsize))
>> + return __this_address;
>
> Please do these as indiviual checks with their own fail address.

ok

> That way we can tell which check failed from the console output.
> Also, the agblocks check is already split out below, so it's being
> checked twice...
>
> Also, why does force-align require a power-of-2 extent size? Why
> does it require the extent size to be an exact divisor of the AG
> size? Aren't these atomic write alignment restrictions? i.e.
> shouldn't these only be enforced when the atomic writes inode flag
> is set?

With regards the power-of-2 restriction, I think that the code changes
are going to become a lot more complex if we don't enforce this for
forcealign.

For example, consider xfs_file_dio_write(), where we check for an
unaligned write based on forcealign extent mask. It's much simpler to
rely on a power-of-2 size. And same for iomap extent zeroing.

So then it can be asked, for what reason do we want to support
unorthodox, non-power-of-2 sizes? Who would want this?

As for AG size, again I think that it is required to be aligned to the
forcealign extsize. As I remember, when converting from an FSB to a DB,
if the AG itself is not aligned to the forcealign extsize, then the DB
will not be aligned to the forcealign extsize. More below...

>
>> + /* Requires agsize be a multiple of extsize */
>> + if (mp->m_sb.sb_agblocks % extsize)
>> + return __this_address;
>> +
>> + /* Requires stripe unit+width (if set) be a multiple of extsize */
>> + if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
>> + (mp->m_swidth && (mp->m_swidth % extsize)))
>> + return __this_address;
>
> Again, this is an atomic write constraint, isn't it?

So why do we want forcealign? It is to only align extent FSBs? Or to
align extents to DBs? I would have thought the latter. If so, it seems
sensible to do this check also.

>
>> + /* Requires no cow extent size hint */
>> + if (cowextsize != 0)
>> + return __this_address;
>
> What if it's a reflinked file?

Yeah, I think that we want to disallow that.

>
> .....
>
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index d0e2cec6210d..d1126509ceb9 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
>> di_flags2 |= XFS_DIFLAG2_DAX;
>> if (xflags & FS_XFLAG_COWEXTSIZE)
>> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> + if (xflags & FS_XFLAG_FORCEALIGN)
>> + di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>>
>> return di_flags2;
>> }
>> @@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
>> if (i_flags2 && !xfs_has_v3inodes(mp))
>> return -EINVAL;
>>
>> + /*
>> + * Force-align requires a nonzero extent size hint and a zero cow
>> + * extent size hint. It doesn't apply to realtime files.
>> + */
>> + if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
>> + if (!xfs_has_forcealign(mp))
>> + return -EINVAL;
>> + if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
>> + return -EINVAL;
>> + if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
>> + FS_XFLAG_EXTSZINHERIT)))
>> + return -EINVAL;
>> + if (fa->fsx_xflags & FS_XFLAG_REALTIME)
>> + return -EINVAL;
>> + }
>
> What about if the file already has shared extents on it (i.e.
> reflinked or deduped?)

At the top of the function we have this check for RT:

if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
/* Can't change realtime flag if any extents are allocated. */
if (ip->i_df.if_nextents || ip->i_delayed_blks)
return -EINVAL;
}

Would expanding that check for forcealign also suffice? Indeed, later in
this series I expanded this check to cover atomicwrites (when I really
intended it for forcealign).

>
> Also, why is this getting checked here instead of in
> xfs_ioctl_setattr_check_extsize()?
>
>
>> @@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
>> failaddr = xfs_inode_validate_extsize(ip->i_mount,
>> XFS_B_TO_FSB(mp, fa->fsx_extsize),
>> VFS_I(ip)->i_mode, new_diflags);
>> - return failaddr != NULL ? -EINVAL : 0;
>> + if (failaddr)
>> + return -EINVAL;
>> +
>> + if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
>> + failaddr = xfs_inode_validate_forcealign(ip->i_mount,
>> + VFS_I(ip)->i_mode, new_diflags,
>> + XFS_B_TO_FSB(mp, fa->fsx_extsize),
>> + XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
>> + if (failaddr)
>> + return -EINVAL;
>> + }
>
> Oh, it's because you're trying to use on-disk format validation
> routines for user API validation. That, IMO, is a bad idea because
> the on-disk format and kernel/user APIs should not be tied
> together as they have different constraints and error conditions.
>
> That also explains why xfs_inode_validate_forcealign() doesn't just
> get passed the inode to validate - it's because you want to pass
> information from the user API to it. This results in sub-optimal
> code for both on-disk format validation and user API validation.
>
> Can you please separate these and put all the force align user API
> validation checks in the one function?
>

ok, fine. But it would be good to have clarification on function of
forcealign, above, i.e. does it always align extents to disk blocks?

Thanks,
John


2024-05-01 10:24:09

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v3 14/21] iomap: Sub-extent zeroing

On 01/05/2024 02:07, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:39PM +0000, John Garry wrote:
>> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
>> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
>> unwritten.
>>
>> This will be important for atomic writes support, in that atomically
>> writing over a partially written extent would mean that we would need to
>> do the unwritten extent conversion write separately, and the write could
>> no longer be atomic.
>>
>> It is the task of the FS to set iomap.extent_size per iter to indicate
>> sub-extent zeroing required.
>>
>> Signed-off-by: John Garry <[email protected]>
>
> Shouldn't this be done before the XFS feature is enabled in the
> series?

Well, it is done before XFS iomap zeroing support patch. But I can move
this patch to the very beginning of the series.

>
>> ---
>> fs/iomap/direct-io.c | 17 +++++++++++------
>> include/linux/iomap.h | 1 +
>> 2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index f3b43d223a46..a3ed7cfa95bc 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> {
>> const struct iomap *iomap = &iter->iomap;
>> struct inode *inode = iter->inode;
>> - unsigned int fs_block_size = i_blocksize(inode), pad;
>> + unsigned int zeroing_size, pad;
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> @@ -288,6 +288,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> size_t copied = 0;
>> size_t orig_count;
>>
>> + if (iomap->extent_size)
>> + zeroing_size = iomap->extent_size;
>> + else
>> + zeroing_size = i_blocksize(inode);
>
> Oh, the dissonance!
>
> iomap->extent_size isn't an extent size at all.

Right, it's a poorly chosen name

>
> The size of the extent the iomap returns is iomap->length. This new
> variable is the IO specific "block size" that should be assumed by
> the dio code to determine if padding should be done.
>
> IOWs, I think we should add an "io_block_size" field to the iomap,
> and every filesystem that supports iomap should set it to the
> filesystem block size (i_blocksize(inode)). Then the changes to the
> iomap code end up just being:
>
>
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + unsigned int fs_block_size = iomap->io_block_size, pad;
>
> And the patch that introduces that infrastructure change will also
> change all the filesystem implementations to unconditionally set
> iomap->io_block_size to i_blocksize().

ok

>
> Then, in a separate patch, you can add XFS support for large IO
> block sizes when we have either a large rtextsize or extent size
> hints set.

I hadn't been considering large rtextsize for this. I suppose that it
could be added.

>
>> +
>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> return -EINVAL;
>> @@ -354,8 +359,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>
>> if (need_zeroout) {
>> - /* zero out from the start of the block to the write offset */
>> - pad = pos & (fs_block_size - 1);
>> + /* zero out from the start of the region to the write offset */
>> + pad = pos & (zeroing_size - 1);
>> if (pad)
>> iomap_dio_zero(iter, dio, pos - pad, pad);
>> }
>> @@ -428,10 +433,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> zero_tail:
>> if (need_zeroout ||
>> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>> - /* zero out from the end of the write to the end of the block */
>> - pad = pos & (fs_block_size - 1);
>> + /* zero out from the end of the write to the end of the region */
>> + pad = pos & (zeroing_size - 1);
>> if (pad)
>> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
>> + iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
>> }
>> out:
>> /* Undo iter limitation to current extent */
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 6fc1c858013d..42623b1cdc04 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -97,6 +97,7 @@ struct iomap {
>> u64 length; /* length of mapping, bytes */
>> u16 type; /* type of mapping */
>> u16 flags; /* flags for mapping */
>> + unsigned int extent_size;
>
> This needs a descriptive comment. At minimum, it should tell the
> reader what units are used for the variable. If it is bytes, then
> it needs to be a u64, because XFS can have extent size hints well
> beyond 2^32 bytes in length.
>

ok

Thanks,
John


2024-05-01 11:25:35

by John Garry

[permalink] [raw]
Subject: Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign

On 01/05/2024 01:53, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
>> Like we already do for rtvol, only free full extents for forcealign in
>> xfs_free_file_space().
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/xfs_bmap_util.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index f26d1570b9bd..1dd45dfb2811 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -847,8 +847,11 @@ xfs_free_file_space(
>> startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>> endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>>
>> - /* We can only free complete realtime extents. */
>> - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
>> + /* Free only complete extents. */
>> + if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
>> + startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
>> + endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
>> + } else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
>> startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>> endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>> }
>
> When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
> line wrapper around roundup_64().
>
> So lets get rid of the obfuscation that the one line RT wrapper
> introduces, and it turns into this:
>
> rounding = 1;
> if (xfs_inode_has_forcealign(ip)
> rounding = ip->i_extsize;
> else if (XFS_IS_REALTIME_INODE(ip))
> rounding = mp->m_sb.sb_rextsize;
>
> if (rounding > 1) {
> startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> }

ok, and the same idea for xfs_can_free_eofblocks() with
xfs_rtb_roundup_rtx(), right?

>
> What this points out is that the prep steps for fallocate operations
> also need to handle both forced alignment and rtextsize rounding,
> and it does neither right now. xfs_flush_unmap_range() is the main
> offender here, but xfs_prepare_shift() also needs fixing.

When you say fix, is this something to spin off separately for RT? This
series is big enough already...

>
> Hence:
>
> static inline xfs_extlen_t
> xfs_extent_alignment(
> struct xfs_inode *ip)
> {
> if (xfs_inode_has_forcealign(ip))
> return ip->i_extsize;
> if (XFS_IS_REALTIME_INODE(ip))
> return mp->m_sb.sb_rextsize;
> return 1;
> }
>
>
> In xfs_flush_unmap_range():
>
> /*
> * Make sure we extend the flush out to extent alignment
> * boundaries so any extent range overlapping the start/end
> * of the modification we are about to do is clean and idle.
> */
> rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> rounding = max(rounding, PAGE_SIZE);
> ...
>
> in xfs_free_file_space()
>
> /*
> * Round the range we are going to free inwards to extent
> * alignment boundaries so we don't free blocks outside the
> * range requested.
> */
> rounding = xfs_extent_alignment(ip);
> if (rounding > 1 ) {
> startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> }
>
> and in xfs_prepare_shift()
>
> /*
> * Shift operations must stabilize the start block offset boundary along
> * with the full range of the operation. If we don't, a COW writeback
> * completion could race with an insert, front merge with the start
> * extent (after split) during the shift and corrupt the file. Start
> * with the aligned block just prior to the start to stabilize the boundary.
> */
> rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> offset = round_down(offset, rounding);
> if (offset)
> offset -= rounding;
>
> Also, I think that the changes I suggested earlier to
> xfs_is_falloc_aligned() could use this xfs_extent_alignment()
> helper...
>
> Overall this makes the code a whole lot easier to read and it also
> allows forced alignment to work correctly on RT devices...
>

ok, fine

Thanks,
John


2024-05-01 23:53:21

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH RFC v3 12/21] xfs: Only free full extents for forcealign

On Wed, May 01, 2024 at 10:53:28AM +1000, Dave Chinner wrote:
> On Mon, Apr 29, 2024 at 05:47:37PM +0000, John Garry wrote:
> > Like we already do for rtvol, only free full extents for forcealign in
> > xfs_free_file_space().
> >
> > Signed-off-by: John Garry <[email protected]>
> > ---
> > fs/xfs/xfs_bmap_util.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index f26d1570b9bd..1dd45dfb2811 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -847,8 +847,11 @@ xfs_free_file_space(
> > startoffset_fsb = XFS_B_TO_FSB(mp, offset);
> > endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
> >
> > - /* We can only free complete realtime extents. */
> > - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> > + /* Free only complete extents. */
> > + if (xfs_inode_has_forcealign(ip) && ip->i_extsize > 1) {
> > + startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
> > + endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
> > + } else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
> > startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
> > endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
> > }
>
> When you look at xfs_rtb_roundup_rtx() you'll find it's just a one
> line wrapper around roundup_64().

I added this a couple of cycles ago to get ready for realtime
modernization. That will create a bunch *more* churn in my tree just to
convert everything *back*.

Where the hell were you when that was being reviewed?!!!

NO! This is pointless busywork!

--D

> So lets get rid of the obfuscation that the one line RT wrapper
> introduces, and it turns into this:
>
> rounding = 1;
> if (xfs_inode_has_forcealign(ip)
> rounding = ip->i_extsize;
> else if (XFS_IS_REALTIME_INODE(ip))
> rounding = mp->m_sb.sb_rextsize;
>
> if (rounding > 1) {
> startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> }
>
> What this points out is that the prep steps for fallocate operations
> also need to handle both forced alignment and rtextsize rounding,
> and it does neither right now. xfs_flush_unmap_range() is the main
> offender here, but xfs_prepare_shift() also needs fixing.
>
> Hence:
>
> static inline xfs_extlen_t
> xfs_extent_alignment(
> struct xfs_inode *ip)
> {
> if (xfs_inode_has_forcealign(ip))
> return ip->i_extsize;
> if (XFS_IS_REALTIME_INODE(ip))
> return mp->m_sb.sb_rextsize;
> return 1;
> }
>
>
> In xfs_flush_unmap_range():
>
> /*
> * Make sure we extend the flush out to extent alignment
> * boundaries so any extent range overlapping the start/end
> * of the modification we are about to do is clean and idle.
> */
> rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> rounding = max(rounding, PAGE_SIZE);
> ...
>
> in xfs_free_file_space()
>
> /*
> * Round the range we are going to free inwards to extent
> * alignment boundaries so we don't free blocks outside the
> * range requested.
> */
> rounding = xfs_extent_alignment(ip);
> if (rounding > 1 ) {
> startoffset_fsb = roundup_64(startoffset_fsb, rounding);
> endoffset_fsb = rounddown_64(endoffset_fsb, rounding);
> }
>
> and in xfs_prepare_shift()
>
> /*
> * Shift operations must stabilize the start block offset boundary along
> * with the full range of the operation. If we don't, a COW writeback
> * completion could race with an insert, front merge with the start
> * extent (after split) during the shift and corrupt the file. Start
> * with the aligned block just prior to the start to stabilize the boundary.
> */
> rounding = XFS_FSB_TO_B(mp, xfs_extent_alignment(ip));
> offset = round_down(offset, rounding);
> if (offset)
> offset -= rounding;
>
> Also, I think that the changes I suggested earlier to
> xfs_is_falloc_aligned() could use this xfs_extent_alignment()
> helper...
>
> Overall this makes the code a whole lot easier to read and it also
> allows forced alignment to work correctly on RT devices...
>
> -Dave.
> --
> Dave Chinner
> [email protected]
>

2024-05-02 07:59:12

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag

On 02/05/2024 01:50, Dave Chinner wrote:
>> For example, consider xfs_file_dio_write(), where we check for an unaligned
>> write based on forcealign extent mask. It's much simpler to rely on a
>> power-of-2 size. And same for iomap extent zeroing.
> But it's not more complex - we already do this non-power-of-2
> alignment stuff for all the realtime code, so it's just a matter
> of not blindly using bit masking in alignment checks.
>
>> So then it can be asked, for what reason do we want to support unorthodox,
>> non-power-of-2 sizes? Who would want this?
> I'm constantly surprised by the way people use stuff like this
> filesystem and storage alignment constraints are not arbitrarily
> limited to power-of-2 sizes.
>
> For example, code implementation is simple in RAID setups when you
> use power-of-2 chunk sizes and stripe widths. But not all storage
> hardware fits power-of-2 configs like 4+1, 4+2, 8+1, 8+2, etc. THis
> is pretty common - 2.5" 2U drive trays have 24 drive bays. If you
> want to give up 33% of the storage capacity just to use power-of-2
> stripe widths then you would use 4x4+2 RAID6 luns. However, most
> people don't want to waste that much money on redundancy. They are
> much more likely to use 2x10+2 RAID6 luns or 1x21+2 with a hot spare
> to maximise the data storage capacity.

Thanks for sharing this info

>
> If someone wants to force-align allocation to stripe widths on such
> a RAID array config rather than trying to rely on the best effort
> swalloc mount option, then they need non-power-of-2
> alignments to be supported.
>
> It's pretty much a no-brainer - the alignment code already handles
> non-power-of-2 alignments, and it's not very much additional code to
> ensure we can handle any alignment the user specified.

ok, fine

>
>> As for AG size, again I think that it is required to be aligned to the
>> forcealign extsize. As I remember, when converting from an FSB to a DB, if
>> the AG itself is not aligned to the forcealign extsize, then the DB will not
>> be aligned to the forcealign extsize. More below...
>>
>>>> + /* Requires agsize be a multiple of extsize */
>>>> + if (mp->m_sb.sb_agblocks % extsize)
>>>> + return __this_address;
>>>> +
>>>> + /* Requires stripe unit+width (if set) be a multiple of extsize */
>>>> + if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
>>>> + (mp->m_swidth && (mp->m_swidth % extsize)))
>>>> + return __this_address;
>>> Again, this is an atomic write constraint, isn't it?
>> So why do we want forcealign? It is to only align extent FSBs?
> Yes. forced alignment is essentially just extent size guarantees.
>
> This is part of what is needed for atomic writes, but atomic writes
> also require specific physical storage alignment between the
> filesystem and the device. The filesystem setup has to correctly
> align AGs to the physical storage, and stuff like RAID
> configurations need to be specifically compatible with the atomic
> write capabilities of the underlying hardware.
>
> None of these hardware iand storage stack alignment constraints have
> any relevance to the filesystem forced alignment functionality. They
> are completely indepedent. All the forced alignment does is
> guarantees that allocation is aligned according the extent size hint
> on the inode or it fails with ENOSPC.

Fine, so only for atomic writes we just need to ensure FSBs are aligned
to DBs.

And so it is the responsibility of mkfs to ensure AG size aligns to any
forcealign extsize specified and also disk atomic write geometry.

For atomic write only, it is the responsibility of the kernel to check
the forcealign extsize is compatible with any stripe alignment and AG size.

>>>
>>> Can you please separate these and put all the force align user API
>>> validation checks in the one function?
>>>
>> ok, fine. But it would be good to have clarification on function of
>> forcealign, above, i.e. does it always align extents to disk blocks?
> No, it doesn't. XFS has never done this - physical extent alignment
> is always done relative to the start of the AG, not the underlying
> disk geometry.
>
> IOWs, forced alignement is not aligning to disk blocks at all - it
> is aligning extents logically to file offset and physically to the
> offset from the start of the allocation group. Hence there are no
> real constraints on forced alignment - we can do any sort of
> alignment as long it is smaller than half the max size of a physical
> extent.
>
> For allocation to then be aligned to physical storage, we need mkfs
> to physically align the start of each AG to the geometry of the
> underlying storage. We already do this for filesystems with a stripe
> unit defined, hence stripe aligned allocation is physically aligned
> to the underlying storage.

Sure

>
> However, if mkfs doesn't get the physical layout of AGs right, there
> is nothing the mounted filesystem can do to guarantee extent
> allocation is aligned to physical disk blocks regardless of whether
> forced alignment is enabled or not...

ok, understood.

Thanks,
John


2024-06-11 07:52:33

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v3 14/21] iomap: Sub-extent zeroing

On 11/06/2024 04:10, Long Li wrote:
>> if (need_zeroout) {
>> - /* zero out from the start of the block to the write offset */
>> - pad = pos & (fs_block_size - 1);
>> + /* zero out from the start of the region to the write offset */
>> + pad = pos & (zeroing_size - 1);
>> if (pad)
>> iomap_dio_zero(iter, dio, pos - pad, pad);
>
> Hi, John
>
> I've been testing and using your atomic write patch series recently. I noticed
> that if zeroing_size is larger than a single page, the length passed to
> iomap_dio_zero() could also be larger than a page size. This seems incorrect
> because iomap_dio_zero() utilizes ZERO_PAGE(0), which is only a single page
> in size.

ok, thanks for the notice.

So
https://lore.kernel.org/linux-xfs/[email protected]/T/#m7ba4ed4f0f0f48be99042703c10b42b72c9fe37c
is changing that same function increase the zero range past PAGE_SIZE.
I'll just need to figure out how to make it support an arbitrary larger
size.

Thanks,
John


2024-06-12 06:57:18

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag

On 12/06/2024 03:10, Long Li wrote:
> On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
>> From: "Darrick J. Wong"<[email protected]>
>>
>> Add a new inode flag to require that all file data extent mappings must
>> be aligned (both the file offset range and the allocated space itself)
>> to the extent size hint. Having a separate COW extent size hint is no
>> longer allowed.
>>
>> The goal here is to enable sysadmins and users to mandate that all space
>> mappings in a file must have a startoff/blockcount that are aligned to
>> (say) a 2MB alignment and that the startblock/blockcount will follow the
>> same alignment.
>>
>> jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
>> alignment for forcealign
>> Signed-off-by: "Darrick J. Wong"<[email protected]>
>> Co-developed-by: John Garry<[email protected]>
>> Signed-off-by: John Garry<[email protected]>
>> ---
>> fs/xfs/libxfs/xfs_format.h | 6 ++++-
>> fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
>> fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
>> fs/xfs/libxfs/xfs_sb.c | 2 ++
>> fs/xfs/xfs_inode.c | 12 +++++++++
>> fs/xfs/xfs_inode.h | 2 +-
>> fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++-
>> fs/xfs/xfs_mount.h | 2 ++
>> fs/xfs/xfs_super.c | 4 +++
>> include/uapi/linux/fs.h | 2 ++
>> 10 files changed, 114 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index 2b2f9050fbfb..4dd295b047f8 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
>> #define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
>> #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
>> #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
>> +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
>
> Hi, John
>
> You know I've been using and testing your atomic writes patch series recently,
> and I'm particularly interested in the changes to the on-disk format. I noticed
> that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
> be the next available bit in sequence.
>
> I'm wondering if using bit 30 is just a temporary solution to avoid conflicts,
> and if the plan is to eventually use bits sequentially, for example, using bit 4?
> I'm looking forward to your explanation.

I really don't know. I'm looking through the history and it has been
like that this the start of my source control records.

Maybe it was a copy-and-paste error from XFS_FEAT_FORCEALIGN, whose
value has changed since.

Anyway, I'll ask a bit more internally, and I'll look to change to (1 <<
4) if ok.

Thanks,
John

2024-06-12 15:43:58

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag

On Wed, Jun 12, 2024 at 07:55:31AM +0100, John Garry wrote:
> On 12/06/2024 03:10, Long Li wrote:
> > On Mon, Apr 29, 2024 at 05:47:33PM +0000, John Garry wrote:
> > > From: "Darrick J. Wong"<[email protected]>
> > >
> > > Add a new inode flag to require that all file data extent mappings must
> > > be aligned (both the file offset range and the allocated space itself)
> > > to the extent size hint. Having a separate COW extent size hint is no
> > > longer allowed.
> > >
> > > The goal here is to enable sysadmins and users to mandate that all space
> > > mappings in a file must have a startoff/blockcount that are aligned to
> > > (say) a 2MB alignment and that the startblock/blockcount will follow the
> > > same alignment.
> > >
> > > jpg: Enforce extsize is a power-of-2 and aligned with afgsize + stripe
> > > alignment for forcealign
> > > Signed-off-by: "Darrick J. Wong"<[email protected]>
> > > Co-developed-by: John Garry<[email protected]>
> > > Signed-off-by: John Garry<[email protected]>
> > > ---
> > > fs/xfs/libxfs/xfs_format.h | 6 ++++-
> > > fs/xfs/libxfs/xfs_inode_buf.c | 50 +++++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
> > > fs/xfs/libxfs/xfs_sb.c | 2 ++
> > > fs/xfs/xfs_inode.c | 12 +++++++++
> > > fs/xfs/xfs_inode.h | 2 +-
> > > fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++-
> > > fs/xfs/xfs_mount.h | 2 ++
> > > fs/xfs/xfs_super.c | 4 +++
> > > include/uapi/linux/fs.h | 2 ++
> > > 10 files changed, 114 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 2b2f9050fbfb..4dd295b047f8 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
> > > #define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
> > > #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> > > #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> > > +#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
> > Hi, John
> >
> > You know I've been using and testing your atomic writes patch series recently,
> > and I'm particularly interested in the changes to the on-disk format. I noticed
> > that XFS_SB_FEAT_RO_COMPAT_FORCEALIGN uses bit 30 instead of bit 4, which would
> > be the next available bit in sequence.
> >
> > I'm wondering if using bit 30 is just a temporary solution to avoid conflicts,
> > and if the plan is to eventually use bits sequentially, for example, using bit 4?
> > I'm looking forward to your explanation.
>
> I really don't know. I'm looking through the history and it has been like
> that this the start of my source control records.
>
> Maybe it was a copy-and-paste error from XFS_FEAT_FORCEALIGN, whose value
> has changed since.
>
> Anyway, I'll ask a bit more internally, and I'll look to change to (1 << 4)
> if ok.

I tend to use upper bits for ondisk features that are still under
development so that (a) there won't be collisions with other features
getting merged and (b) after the feature I'm working on gets merged, any
old fs images in my zoo will no longer mount.

--D

> Thanks,
> John
>