2024-03-04 13:05:32

by John Garry

[permalink] [raw]
Subject: [PATCH v2 00/14] block atomic writes for XFS


This series expands atomic write support to filesystems, specifically
XFS. Extent alignment is based on new feature forcealign (again), and we
do not rely on XFS rtvol extent alignment this time.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

A FS can be formatted for atomic writes as follows:
mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1 /dev/sda

atomic-writes=1 just enables atomic writes in the SB, but does not auto-
enable atomic writes for each file. There are no mkfs checks yet whether
the underlying HW actually supports atomic writes for at least 16K, but
this will be added.

Support can be enabled through xfs_io command:
$xfs_io -c "lsattr -v" filename
[extsize, force-align]
$xfs_io -c "extsize" filename
[16384] filename
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[extsize, force-align, atomic-writes] filename
$xfs_io -c statx filename
..
stat.stx_atomic_write_unit_min = 4096
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
..

Note that, apart from "Do not free EOF blocks for forcealign" change, XFS
forcealign support has not been updated or comments addressed since
originally sent in:
https://lore.kernel.org/linux-xfs/[email protected]/

I was waiting until this series was progressed for updating forcealign:
https://lore.kernel.org/linux-xfs/[email protected]/
I don't know the status of that series.

Baseline is following series (which is based on v6.8-rc5):
https://lore.kernel.org/linux-block/[email protected]/

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/forcealign_and_atomicwrites_for_v2_xfs_block_atomic_writes

Patches for this series can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.8-v5-fs-v2/

Changes since v1:
https://lore.kernel.org/linux-xfs/[email protected]/
- Add blk_validate_atomic_write_op_size() (Darrick suggested idea)
- Swap forcealign for rtvol support (Dave requested forcealign)
- Sub-extent DIO zeroing (Dave wanted rid of XFS_BMAPI_ZERO usage)
- Improve coding for XFS statx support (Darrick, Ojaswin)
- Improve conditions for setting FMODE_CAN_ATOMIC_WRITE (Darrick)
- Improve commit message for FS_XFLAG_ATOMICWRITES flag (Darrick)
- Validate atomic writes in xfs_file_dio_write()
- Drop IOMAP_ATOMIC

Darrick J. Wong (3):
fs: xfs: Introduce FORCEALIGN inode flag
fs: xfs: Make file data allocations observe the 'forcealign' flag
fs: xfs: Enable file data forcealign feature

John Garry (11):
block: Add blk_validate_atomic_write_op_size()
fs: xfs: Don't use low-space allocator for alignment > 1
fs: xfs: Do not free EOF blocks for forcealign
fs: iomap: Sub-extent zeroing
fs: xfs: iomap: Sub-extent zeroing
fs: Add FS_XFLAG_ATOMICWRITES flag
fs: iomap: Atomic write support
fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
fs: xfs: Support atomic write for statx
fs: xfs: Validate atomic writes
fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE

block/blk-core.c | 17 ++++++++++
fs/iomap/direct-io.c | 24 ++++++++++----
fs/xfs/libxfs/xfs_bmap.c | 26 ++++++++++++---
fs/xfs/libxfs/xfs_format.h | 16 ++++++++--
fs/xfs/libxfs/xfs_inode_buf.c | 40 +++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 ++
fs/xfs/libxfs/xfs_sb.c | 4 +++
fs/xfs/xfs_bmap_util.c | 7 +++-
fs/xfs/xfs_file.c | 60 +++++++++++++++++++++++++++--------
fs/xfs/xfs_inode.c | 28 ++++++++++++++++
fs/xfs/xfs_inode.h | 11 +++++++
fs/xfs/xfs_ioctl.c | 49 ++++++++++++++++++++++++++--
fs/xfs/xfs_iomap.c | 19 +++++++++--
fs/xfs/xfs_iops.c | 38 ++++++++++++++++++++++
fs/xfs/xfs_mount.h | 4 +++
fs/xfs/xfs_super.c | 8 +++++
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 3 ++
18 files changed, 324 insertions(+), 34 deletions(-)

--
2.31.1



2024-03-04 13:05:45

by John Garry

[permalink] [raw]
Subject: [PATCH v2 01/14] block: Add blk_validate_atomic_write_op_size()

FSes will not check if the BIO submitted for an atomic write adheres to the
request_queue limits, so check in the BIO submission path.

Signed-off-by: John Garry <[email protected]>
---
We really should return -EINVAL to userspce, so can be have BLK_STS_INVAL
or similar?
block/blk-core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index de771093b526..a1b095a68498 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -718,6 +718,18 @@ void submit_bio_noacct_nocheck(struct bio *bio)
__submit_bio_noacct(bio);
}

+static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
+ struct bio *bio)
+{
+ if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q))
+ return BLK_STS_IOERR;
+
+ if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q))
+ return BLK_STS_IOERR;
+
+ return BLK_STS_OK;
+}
+
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -775,6 +787,11 @@ void submit_bio_noacct(struct bio *bio)
switch (bio_op(bio)) {
case REQ_OP_READ:
case REQ_OP_WRITE:
+ if (bio->bi_opf & REQ_ATOMIC) {
+ status = blk_validate_atomic_write_op_size(q, bio);
+ if (status != BLK_STS_OK)
+ goto end_io;
+ }
break;
case REQ_OP_FLUSH:
/*
--
2.31.1


2024-03-04 13:06:32

by John Garry

[permalink] [raw]
Subject: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

From: "Darrick J. Wong" <[email protected]>

The existing extsize hint code already did the work of expanding file
range mapping requests so that the range is aligned to the hint value.
Now add the code we need to guarantee that the space allocations are
also always aligned.

XXX: still need to check all this with reflink

Signed-off-by: "Darrick J. Wong" <[email protected]>
Co-developed-by: John Garry <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
fs/xfs/xfs_iomap.c | 4 +++-
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 60d100134280..8dee60795cf4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
+ /*
+ * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
+ * set as forcealign and cowextsz_hint are mutually exclusive
+ */
+ if (xfs_inode_forcealign(ap->ip) && align) {
+ args->alignment = align;
+ if (stripe_align % align)
+ stripe_align = align;
+ } else {
+ args->alignment = 1;
+ }
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
@@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
args.minlen = args.maxlen = ap->minlen;
args.total = ap->total;

- args.alignment = 1;
args.minalignslop = 0;

args.minleft = ap->minleft;
@@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ int orig_alignment = args->alignment;
int error;

/*
@@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(

/*
* Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * original state so the caller can proceed on allocation failure as
+ * if this function was never called.
*/
- args->alignment = 1;
+ args->alignment = orig_alignment;
return 0;
}

@@ -3709,7 +3722,6 @@ xfs_bmap_btalloc(
.wasdel = ap->wasdel,
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
- .alignment = 1,
.minalignslop = 0,
};
xfs_fileoff_t orig_offset;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..70fe873951f3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -181,7 +181,9 @@ xfs_eof_alignment(
* If mounted with the "-o swalloc" option the alignment is
* increased from the strip unit size to the stripe width.
*/
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ if (xfs_inode_forcealign(ip))
+ align = xfs_get_extsz_hint(ip);
+ else if (mp->m_swidth && xfs_has_swalloc(mp))
align = mp->m_swidth;
else if (mp->m_dalign)
align = mp->m_dalign;
--
2.31.1


2024-03-04 13:07:06

by John Garry

[permalink] [raw]
Subject: [PATCH v2 03/14] fs: xfs: Introduce FORCEALIGN inode flag

From: "Darrick J. Wong" <[email protected]>

Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint. Having a separate COW extent size hint is no
longer allowed.

The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.

jpg: Enforce extsize is a power-of-2 for forcealign
Signed-off-by: "Darrick J. Wong" <[email protected]>
Co-developed-by: John Garry <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 6 +++++-
fs/xfs/libxfs/xfs_inode_buf.c | 40 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_inode.c | 12 +++++++++++
fs/xfs/xfs_inode.h | 5 +++++
fs/xfs/xfs_ioctl.c | 34 ++++++++++++++++++++++++++++-
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
include/uapi/linux/fs.h | 2 ++
10 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 382ab1e71c0b..db2113cf6e47 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1085,16 +1086,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5

#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)

#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)

static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 137a65bda95d..61cc12cd54db 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -610,6 +610,14 @@ xfs_dinode_verify(
!xfs_has_bigtime(mp))
return __this_address;

+ if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+ fa = xfs_inode_validate_forcealign(mp, mode, flags,
+ be32_to_cpu(dip->di_extsize),
+ be32_to_cpu(dip->di_cowextsize));
+ if (fa)
+ return fa;
+ }
+
return NULL;
}

@@ -777,3 +785,35 @@ xfs_inode_validate_cowextsize(

return NULL;
}
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+ struct xfs_mount *mp,
+ uint16_t mode,
+ uint16_t flags,
+ uint32_t extsize,
+ uint32_t cowextsize)
+{
+ /* superblock rocompat feature flag */
+ if (!xfs_has_forcealign(mp))
+ return __this_address;
+
+ /* Only regular files and directories */
+ if (!S_ISDIR(mode) && !S_ISREG(mode))
+ return __this_address;
+
+ /* Doesn't apply to realtime files */
+ if (flags & XFS_DIFLAG_REALTIME)
+ return __this_address;
+
+ /* Requires a non-zero power-of-2 extent size hint */
+ if (extsize == 0 || !is_power_of_2(extsize))
+ return __this_address;
+
+ /* Requires no cow extent size hint */
+ if (cowextsize != 0)
+ return __this_address;
+
+ return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
uint32_t cowextsize, uint16_t mode, uint16_t flags,
uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+ uint16_t mode, uint16_t flags, uint32_t extsize,
+ uint32_t cowextsize);

static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 5bb6e2bd6dee..f2c16a028fae 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_REFLINK;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
features |= XFS_FEAT_INOBTCNT;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ features |= XFS_FEAT_FORCEALIGN;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1fd94958aa97..2c439df8c47f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -629,6 +629,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_DAX;
if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
flags |= FS_XFLAG_COWEXTSIZE;
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ flags |= FS_XFLAG_FORCEALIGN;
}

if (xfs_inode_has_attr_fork(ip))
@@ -758,6 +760,8 @@ xfs_inode_inherit_flags2(
}
if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+ if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;

/* Don't let invalid cowextsize hints propagate. */
failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -766,6 +770,14 @@ xfs_inode_inherit_flags2(
ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
ip->i_cowextsize = 0;
}
+
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+ ip->i_cowextsize);
+ if (failaddr)
+ ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+ }
}

/*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 97f63bacd4c2..82e2838f6d64 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}

+static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f02b6e558af5..867d8d51a3d0 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_DAX;
if (xflags & FS_XFLAG_COWEXTSIZE)
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+ if (xflags & FS_XFLAG_FORCEALIGN)
+ di_flags2 |= XFS_DIFLAG2_FORCEALIGN;

return di_flags2;
}
@@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
if (i_flags2 && !xfs_has_v3inodes(mp))
return -EINVAL;

+ /*
+ * Force-align requires a nonzero extent size hint and a zero cow
+ * extent size hint. It doesn't apply to realtime files.
+ */
+ if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+ if (!xfs_has_forcealign(mp))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+ return -EINVAL;
+ if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+ FS_XFLAG_EXTSZINHERIT)))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;

@@ -1232,6 +1250,7 @@ xfs_ioctl_setattr_check_extsize(
struct xfs_mount *mp = ip->i_mount;
xfs_failaddr_t failaddr;
uint16_t new_diflags;
+ uint16_t new_diflags2;

if (!fa->fsx_valid)
return 0;
@@ -1244,6 +1263,7 @@ xfs_ioctl_setattr_check_extsize(
return -EINVAL;

new_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
+ new_diflags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);

/*
* Inode verifiers do not check that the extent size hint is an integer
@@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
failaddr = xfs_inode_validate_extsize(ip->i_mount,
XFS_B_TO_FSB(mp, fa->fsx_extsize),
VFS_I(ip)->i_mode, new_diflags);
- return failaddr != NULL ? -EINVAL : 0;
+ if (failaddr)
+ return -EINVAL;
+
+ if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, new_diflags,
+ XFS_B_TO_FSB(mp, fa->fsx_extsize),
+ XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
+ if (failaddr)
+ return -EINVAL;
+ }
+
+ return 0;
}

static int
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 503fe3c7edbf..e1ef31675db3 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -289,6 +289,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
+#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */

/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -352,6 +353,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)

/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a2512d20bd0..74dcafddf6a9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1708,6 +1708,10 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}

+ if (xfs_has_forcealign(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a0975ae81e64..8828822331bf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,8 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN 0x00020000
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */

/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1


2024-03-04 13:07:10

by John Garry

[permalink] [raw]
Subject: [PATCH v2 05/14] fs: xfs: Enable file data forcealign feature

From: "Darrick J. Wong" <[email protected]>

Enable this feature.

Signed-off-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index db2113cf6e47..2d9f5430efc3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.31.1


2024-03-04 13:07:44

by John Garry

[permalink] [raw]
Subject: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing

Set iomap->extent_shift when sub-extent zeroing is required.

We treat a sub-extent write same as an unaligned write, so we can leverage
the existing sub-FSblock unaligned write support, i.e. try a shared lock
with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
lock.

In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 28 +++++++++++++++-------------
fs/xfs/xfs_iomap.c | 15 +++++++++++++--
2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..d0bd9d5f596c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
* Handle block unaligned direct I/O writes
*
* In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
- * them to be done in parallel with reads and other direct I/O writes. However,
- * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
- * to do sub-block zeroing and that requires serialisation against other direct
- * I/O to the same block. In this case we need to serialise the submission of
- * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
- * In the case where sub-block zeroing is not required, we can do concurrent
- * sub-block dios to the same block successfully.
+ * them to be done in parallel with reads and other direct I/O writes.
+ * However if the I/O is not aligned to filesystem blocks/extent, the direct
+ * I/O layer may need to do sub-block/extent zeroing and that requires
+ * serialisation against other direct I/O to the same block/extent. In this
+ * case we need to serialise the submission of the unaligned I/O so that we
+ * don't get racing block/extent zeroing in the dio layer.
+ * In the case where sub-block/extent zeroing is not required, we can do
+ * concurrent sub-block/extent dios to the same block/extent successfully.
*
* Optimistically submit the I/O using the shared lock first, but use the
* IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
- * if block allocation or partial block zeroing would be required. In that case
- * we try again with the exclusive lock.
+ * if block/extent allocation or partial block/extent zeroing would be
+ * required. In that case we try again with the exclusive lock.
*/
static noinline ssize_t
xfs_file_dio_write_unaligned(
@@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
ssize_t ret;

/*
- * Extending writes need exclusivity because of the sub-block zeroing
- * that the DIO code always does for partial tail blocks beyond EOF, so
- * don't even bother trying the fast path in this case.
+ * Extending writes need exclusivity because of the sub-block/extent
+ * zeroing that the DIO code always does for partial tail blocks
+ * beyond EOF, so don't even bother trying the fast path in this case.
*/
if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -709,13 +710,14 @@ xfs_file_dio_write(
struct iov_iter *from)
{
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);

/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
- if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
+ if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
return xfs_file_dio_write_unaligned(ip, iocb, from);
return xfs_file_dio_write_aligned(ip, iocb, from);
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..88cc20bb19c9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ xfs_extlen_t extsz = xfs_get_extsz(ip);

if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
return xfs_alert_fsblock_zero(ip, imap);
@@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(

iomap->validity_cookie = sequence_cookie;
iomap->folio_ops = &xfs_iomap_folio_ops;
+ if (extsz > 1)
+ iomap->extent_shift = ffs(extsz) - 1;
return 0;
}

@@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
xfs_fsize_t i_size;
uint resblks;
int error;
+ xfs_extlen_t extsz = xfs_get_extsz(ip);

trace_xfs_unwritten_convert(ip, offset, count);

- offset_fsb = XFS_B_TO_FSBT(mp, offset);
- count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ if (extsz > 1) {
+ xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
+
+ offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
+ count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
+ } else {
+ offset_fsb = XFS_B_TO_FSBT(mp, offset);
+ count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ }
count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb);

/*
--
2.31.1


2024-03-04 13:07:52

by John Garry

[permalink] [raw]
Subject: [PATCH v2 07/14] fs: iomap: Sub-extent zeroing

For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
unwritten.

This will be important for atomic writes support, in that atomically
writing over a partially written extent would mean that we would need to
do the unwritten extent conversion write separately, and the write could
no longer be atomic.

It is the task of the FS to set iomap.extent_shift per iter to indicate
sub-extent zeroing required.

Maybe a macro like i_blocksize() should be introduced for extent sizes,
instead of using extent_shift. It would also eliminate excessive use
of xfs_get_extss() for XFS in future.

Signed-off-by: John Garry <[email protected]>
---
fs/iomap/direct-io.c | 14 ++++++++------
include/linux/iomap.h | 1 +
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..733f83f839b6 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
{
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
- unsigned int fs_block_size = i_blocksize(inode), pad;
+ unsigned int zeroing_size, pad;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
size_t copied = 0;
size_t orig_count;

+ zeroing_size = i_blocksize(inode) << iomap->extent_shift;
+
if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
return -EINVAL;
@@ -354,8 +356,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
dio->iocb->ki_flags &= ~IOCB_HIPRI;

if (need_zeroout) {
- /* zero out from the start of the block to the write offset */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the start of the region to the write offset */
+ pad = pos & (zeroing_size - 1);
if (pad)
iomap_dio_zero(iter, dio, pos - pad, pad);
}
@@ -427,10 +429,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
zero_tail:
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
- /* zero out from the end of the write to the end of the block */
- pad = pos & (fs_block_size - 1);
+ /* zero out from the end of the write to the end of the region */
+ pad = pos & (zeroing_size - 1);
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ iomap_dio_zero(iter, dio, pos, zeroing_size - pad);
}
out:
/* Undo iter limitation to current extent */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..89cd3dcbb8ec 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -97,6 +97,7 @@ struct iomap {
u64 length; /* length of mapping, bytes */
u16 type; /* type of mapping */
u16 flags; /* flags for mapping */
+ unsigned int extent_shift;
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
--
2.31.1


2024-03-04 13:08:11

by John Garry

[permalink] [raw]
Subject: [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag

Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag. Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time. This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STAT_ATTR_DAX
to find out if they really got that IO path.

Signed-off-by: John Garry <[email protected]>
---
include/uapi/linux/fs.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 8828822331bf..aacf54381718 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -142,6 +142,7 @@ struct fsxattr {
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
/* data extent mappings for regular files must be aligned to extent size hint */
#define FS_XFLAG_FORCEALIGN 0x00020000
+#define FS_XFLAG_ATOMICWRITES 0x00040000 /* atomic writes enabled */
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */

/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1


2024-03-04 13:09:35

by John Garry

[permalink] [raw]
Subject: [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign

For when forcealign is enabled, we want the EOF to be aligned as well, so
do not free EOF blocks.

Add helper function xfs_get_extsz() to get the guaranteed extent size
alignment for forcealign enabled. Since this code is only relevant to
forcealign and forcealign is not possible for RT yet, ignore RT.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 ++++++-
fs/xfs/xfs_inode.c | 14 ++++++++++++++
fs/xfs/xfs_inode.h | 1 +
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c2531c28905c..07bfb03c671a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
* forever.
*/
end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
- if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
+
+ /* Do not free blocks when forcing extent sizes */
+ if (xfs_get_extsz(ip) > 1)
+ end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
+ else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+
last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
if (last_fsb <= end_fsb)
return false;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2c439df8c47f..bbb8886f1d32 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -65,6 +65,20 @@ xfs_get_extsz_hint(
return 0;
}

+/*
+ * Helper function to extract extent size. It will return a power-of-2,
+ * as forcealign requires this.
+ */
+xfs_extlen_t
+xfs_get_extsz(
+ struct xfs_inode *ip)
+{
+ if (xfs_inode_forcealign(ip) && ip->i_extsize)
+ return ip->i_extsize;
+
+ return 1;
+}
+
/*
* Helper function to extract CoW extent size hint from inode.
* Between the extent size hint and the CoW extent size hint, we
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 82e2838f6d64..b6c42c27943e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -547,6 +547,7 @@ void xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
struct xfs_inode *ip1, uint ip1_mode);

xfs_extlen_t xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t xfs_get_extsz(struct xfs_inode *ip);
xfs_extlen_t xfs_get_cowextsz_hint(struct xfs_inode *ip);

int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
--
2.31.1


2024-03-04 13:10:31

by John Garry

[permalink] [raw]
Subject: [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE

For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
flag. We check for direct I/O and also check that the bdev can actually
support atomic writes.

We rely on the block layer to reject atomic writes which exceed the bdev
request_queue limits, so don't bother checking any such thing here.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cecc5428fd7c..e63851be6c15 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1234,6 +1234,25 @@ xfs_file_remap_range(
return remapped > 0 ? remapped : ret;
}

+static bool xfs_file_open_can_atomicwrite(
+ struct inode *inode,
+ struct file *file)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+
+ if (!(file->f_flags & O_DIRECT))
+ return false;
+
+ if (!xfs_inode_atomicwrites(ip))
+ return false;
+
+ if (!bdev_can_atomic_write(target->bt_bdev))
+ return false;
+
+ return true;
+}
+
STATIC int
xfs_file_open(
struct inode *inode,
@@ -1243,6 +1262,8 @@ xfs_file_open(
return -EIO;
file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+ if (xfs_file_open_can_atomicwrite(inode, file))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
return generic_file_open(inode, file);
}

--
2.31.1


2024-03-04 13:10:47

by John Garry

[permalink] [raw]
Subject: [PATCH v2 12/14] fs: xfs: Support atomic write for statx

Support providing info on atomic write unit min and max for an inode.

For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future. This is required by iomap
DIO.

The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_iops.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..6316448083d2 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,37 @@ xfs_stat_blksize(
return PAGE_SIZE;
}

+static void
+xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct block_device *bdev = target->bt_bdev;
+ struct request_queue *q = bdev->bd_queue;
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *sbp = &mp->m_sb;
+ unsigned int awu_min, awu_max;
+ unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
+
+ awu_min = queue_atomic_write_unit_min_bytes(q);
+ awu_max = queue_atomic_write_unit_max_bytes(q);
+
+ if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
+ !xfs_inode_atomicwrites(ip)) {
+ *unit_min = 0;
+ *unit_max = 0;
+ return;
+ }
+
+ /* Floor at FS block size */
+ *unit_min = max(sbp->sb_blocksize, awu_min);
+
+ *unit_max = min(extsz_bytes, awu_max);
+}
+
STATIC int
xfs_vn_getattr(
struct mnt_idmap *idmap,
@@ -619,6 +650,13 @@ xfs_vn_getattr(
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
}
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ unsigned int unit_min, unit_max;
+
+ xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+ generic_fill_statx_atomic_writes(stat,
+ unit_min, unit_max);
+ }
fallthrough;
default:
stat->blksize = xfs_stat_blksize(ip);
--
2.31.1


2024-03-04 13:11:02

by John Garry

[permalink] [raw]
Subject: [PATCH v2 10/14] fs: iomap: Atomic write support

Support atomic writes by producing a single BIO with REQ_ATOMIC flag set.

We rely on the FS to guarantee extent alignment, such that an atomic write
should never straddle two or more extents. The FS should also check for
validity of an atomic write length/alignment.

Signed-off-by: John Garry <[email protected]>
---
fs/iomap/direct-io.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 733f83f839b6..197f1bb6a261 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,6 +275,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
struct iomap_dio *dio)
{
+ bool is_atomic = dio->iocb->ki_flags & IOCB_ATOMIC;
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
unsigned int zeroing_size, pad;
@@ -383,6 +384,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
GFP_KERNEL);
bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
bio->bi_ioprio = dio->iocb->ki_ioprio;
+ if (is_atomic)
+ bio->bi_opf |= REQ_ATOMIC;
+
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

@@ -399,6 +403,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
}

n = bio->bi_iter.bi_size;
+ if (is_atomic && (n != orig_count)) {
+ /* This bio should have covered the complete length */
+ ret = -EINVAL;
+ bio_put(bio);
+ goto out;
+ }
if (dio->flags & IOMAP_DIO_WRITE) {
task_io_account_write(n);
} else {
--
2.31.1


2024-03-04 13:19:48

by John Garry

[permalink] [raw]
Subject: [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1

The low-space allocator doesn't honour the alignment requirement, so don't
attempt to even use it (when we have an alignment requirement).

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index f362345467fa..60d100134280 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ /* The allocator doesn't honour args->alignment */
+ if (args->alignment > 1)
+ return 0;
+
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
--
2.31.1


2024-03-04 13:26:00

by John Garry

[permalink] [raw]
Subject: [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign

Add initial support for FS_XFLAG_ATOMICWRITES for forcealign enabled.

Current kernel support for atomic writes is based on HW support (for atomic
writes). As such, it is required to ensure extent alignment with
atomic_write_unit_max so that an atomic write can result in a single
HW-compliant IO operation.

rtvol also guarantees extent alignment, but we are basing support initially
on forcealign, which is not supported for rtvol yet.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 13 ++++++++++---
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_inode.c | 2 ++
fs/xfs/xfs_inode.h | 5 +++++
fs/xfs/xfs_ioctl.c | 15 +++++++++++++--
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
7 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2d9f5430efc3..5f54f9b3755e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -354,12 +354,16 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
+
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
- XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \
+ XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
@@ -1089,6 +1093,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
/* data extent mappings for regular files must be aligned to extent size hint */
#define XFS_DIFLAG2_FORCEALIGN_BIT 5
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6

#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
@@ -1096,10 +1101,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)

#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
+ XFS_DIFLAG2_ATOMICWRITES)

static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f2c16a028fae..d7bb3e34dd69 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -165,6 +165,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_INOBTCNT;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
features |= XFS_FEAT_FORCEALIGN;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+ features |= XFS_FEAT_ATOMICWRITES;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bbb8886f1d32..14020ab1450c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -645,6 +645,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_COWEXTSIZE;
if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
flags |= FS_XFLAG_FORCEALIGN;
+ if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+ flags |= FS_XFLAG_ATOMICWRITES;
}

if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b6c42c27943e..f56bdbb74ad7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -310,6 +310,11 @@ static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
}

+static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 867d8d51a3d0..f118a1ae39b5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
if (xflags & FS_XFLAG_FORCEALIGN)
di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
+ if (xflags & FS_XFLAG_ATOMICWRITES)
+ di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;

return di_flags2;
}
@@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
{
struct xfs_mount *mp = ip->i_mount;
bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+ bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
uint64_t i_flags2;

- if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
- /* Can't change realtime flag if any extents are allocated. */
+ /* Can't change RT or atomic flags if any extents are allocated. */
+ if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+ atomic_writes != xfs_inode_atomicwrites(ip)) {
if (ip->i_df.if_nextents || ip->i_delayed_blks)
return -EINVAL;
}
@@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
return -EINVAL;
}

+ if (atomic_writes) {
+ if (!xfs_has_atomicwrites(mp))
+ return -EINVAL;
+ if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e1ef31675db3..3b60d8a1d396 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -290,6 +290,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */
+#define XFS_FEAT_ATOMICWRITES (1ULL << 28) /* atomic writes support */

/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -354,6 +355,7 @@ __XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(forcealign, FORCEALIGN)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)

/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 74dcafddf6a9..efe4b4234b2e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
xfs_warn(mp,
"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");

+ if (xfs_has_atomicwrites(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
--
2.31.1


2024-03-04 13:26:06

by John Garry

[permalink] [raw]
Subject: [PATCH v2 13/14] fs: xfs: Validate atomic writes

Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d0bd9d5f596c..cecc5428fd7c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -709,11 +709,20 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);

+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (!generic_atomic_write_valid(iocb->ki_pos, from,
+ i_blocksize(inode),
+ XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
+ return -EINVAL;
+ }
+ }
+
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
--
2.31.1


2024-03-05 00:44:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <[email protected]>
>
> The existing extsize hint code already did the work of expanding file
> range mapping requests so that the range is aligned to the hint value.
> Now add the code we need to guarantee that the space allocations are
> also always aligned.
>
> XXX: still need to check all this with reflink
>
> Signed-off-by: "Darrick J. Wong" <[email protected]>
> Co-developed-by: John Garry <[email protected]>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
> fs/xfs/xfs_iomap.c | 4 +++-
> 2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 60d100134280..8dee60795cf4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
> align = xfs_get_cowextsz_hint(ap->ip);
> else if (ap->datatype & XFS_ALLOC_USERDATA)
> align = xfs_get_extsz_hint(ap->ip);
> +
> + /*
> + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
> + * set as forcealign and cowextsz_hint are mutually exclusive
> + */
> + if (xfs_inode_forcealign(ap->ip) && align) {
> + args->alignment = align;
> + if (stripe_align % align)
> + stripe_align = align;

This kinda does the right thing, but it strikes me as the wrong
thing to be doing - it seems to conflates two different physical
alignment properties in potentially bad ways, and we should never
get this far into the allocator to discover these issues.

Stripe alignment is alignment to physical disks in a RAID array.
Inode forced alignment is file offset alignment to specificly
defined LBA address ranges. The stripe unit/width is set at mkfs
time, the inode extent size hint at runtime.

These can only work together in specific situations:

1. max sized RWF_ATOMIC IO must be the same or smaller size
than the stripe unit. Alternatively: stripe unit must be
an integer (power of 2?) multiple of max atomic IO size.

IOWs, stripe unit vs atomic IO constraints must be
resolved in a compatible manner at mkfs time. If they
can't be resolved (e.g. non-power of 2 stripe unit) then
atomic IO cannot be done on the filesystem at all. Stripe
unit constraints always win over atomic IO.

2. min sized RWF_ATOMIC IO must be an integer divider of
stripe unit so that small atomic IOs are always correctly
aligned to the stripe unit.

3. Inode forced alignment constraints set at runtime (i.e.
extsize hints) must fit within the above stripe unit vs
min/max atomic IO requirements.

i.e. extent size hint will typically need to an integer
multiple of stripe unit size which will always be
compatible with stripe unit and atomic IO size and
alignments...

IOWs, these are static geometry constraints and so should be checked
and rejected at the point where alignments are specified (i.e.
mkfs, mount and ioctls). Then the allocator can simply assume that
forced inode alignments are always stripe alignment compatible and
we don't need separate handling of two possibly incompatible
alignments.

Regardless, I think the code as it stands won't work correctly when
a stripe unit is not set.

The correct functioning of this code appears to be dependent on
stripe_align being set >= the extent size hint. If stripe align is
not set, then what does (0 % align) return? If my checks are
correct, this will return 0, and so the above code will end up with
stripe_align = 0.

Now, consider that args->alignment is used to signal to the
allocator what the -start alignment- of the allocation is supposed
to be, whilst args->prod/args->mod are used to tell the allocator
what the tail adjustment is supposed to be.

Stripe alignment wants start alignment, extent size hints want tail
adjustment and force aligned allocation wants both start alignment
and tail adjustment.

However, the allocator overrides these depending on what it is
doing. e.g. exact block EOF allocation turns off start alignment but
has to ensure space is reserved for start alignment if it fails.
This reserves space for stripe_align in args->minalignslop, but if
force alignment has a start alignment requirement larger than stripe
unit alignment, this code fails to reserve the necessary alignment
slop for the force alignment code.

If we aren't doing exact block EOF allocation, then we do stripe
unit alignment. Again, if this fails and the forced alignment is
larger than stripe alignment, this code does not reserve space for
the larger alignment in args->minalignslop, so it will also do the
wrong when we fall back to an args->alignment > 1 allocation.

Failing to correctly account for minalignslop is known to cause
accounting problems when the AG is very near empty, and the usual
result of that is cancelling of a dirty allocation transaction and a
forced shutdown. So this is something we need to get right.

More below....

> + } else {
> + args->alignment = 1;
> + }

Just initialise the allocation args structure with a value of 1 like
we already do?

> +
> if (align) {
> if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
> ap->eof, 0, ap->conv, &ap->offset,
> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
> args.minlen = args.maxlen = ap->minlen;
> args.total = ap->total;
>
> - args.alignment = 1;
> args.minalignslop = 0;

This likely breaks the debug code. This isn't intended for testing
atomic write capability, it's for exercising low space allocation
algorithms with worst case fragmentation patterns. This is only
called when error injection is turned on to trigger this path, so we
shouldn't need to support forced IO alignment here.

>
> args.minleft = ap->minleft;
> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
> {
> struct xfs_mount *mp = args->mp;
> struct xfs_perag *caller_pag = args->pag;
> + int orig_alignment = args->alignment;
> int error;
>
> /*
> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>
> /*
> * Allocation failed, so turn return the allocation args to their
> - * original non-aligned state so the caller can proceed on allocation
> - * failure as if this function was never called.
> + * original state so the caller can proceed on allocation failure as
> + * if this function was never called.
> */
> - args->alignment = 1;
> + args->alignment = orig_alignment;
> return 0;
> }

As I said above, we can't set an alignment of > 1 here if we haven't
accounted for that alignment in args->minalignslop above. This leads
to unexpected ENOSPC conditions and filesystem shutdowns.

I suspect what we need to do is get rid of the separate stripe_align
variable altogether and always just set args->alignment to what we
need the extent start alignment to be, regardless of whether it is
from stripe alignment or forced alignment.

Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
'stripe_align' is - the exact EOF block allocation can simply save
and restore the args->alignment value and use it for minalignslop
calculations for the initial exact block allocation.

Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
is true, we can abort allocation immediately, and not bother to fall
back on further aligned/unaligned attempts that will also fail or do
the wrong them.

Similarly, if we aren't doing EOF allocation, having args->alignment
set means it will do the right thing for the first allocation
attempt. Again, if that fails, we can check if
xfs_inode_forcealign() is true and fail the aligned allocation
instead of running the low space algorithm. This now makes it clear
that we're failing the allocation because of the forced alignment
requirement, and now the low space allocation code can explicitly
turn off start alignment as it isn't required...


> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..70fe873951f3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -181,7 +181,9 @@ xfs_eof_alignment(
> * If mounted with the "-o swalloc" option the alignment is
> * increased from the strip unit size to the stripe width.
> */
> - if (mp->m_swidth && xfs_has_swalloc(mp))
> + if (xfs_inode_forcealign(ip))
> + align = xfs_get_extsz_hint(ip);
> + else if (mp->m_swidth && xfs_has_swalloc(mp))
> align = mp->m_swidth;
> else if (mp->m_dalign)
> align = mp->m_dalign;

This doesn't work for files with a current size of less than
"align". The next line of code does:

if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
align = 0;

IOWs, the first aligned write allocation will occur with a file size
of zero, and that will result in this function returning 0. i.e.
speculative post EOF delalloc doesn't turn on and align the EOF
until writes up to offset == align have already been completed.

Essentially, this code wants to be:

if (xfs_inode_forcealign(ip))
return xfs_get_extsz_hint(ip);

So that any write to the a force aligned inode will always trigger
extent size hint EOF alignment.

-Dave.
--
Dave Chinner
[email protected]

2024-03-04 22:16:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1

On Mon, Mar 04, 2024 at 01:04:16PM +0000, John Garry wrote:
> The low-space allocator doesn't honour the alignment requirement, so don't
> attempt to even use it (when we have an alignment requirement).
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index f362345467fa..60d100134280 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
> {
> int error;
>
> + /* The allocator doesn't honour args->alignment */
> + if (args->alignment > 1)
> + return 0;

I think that's wrong.

The alignment argument here is purely a best effort consideration -
we ignore it several different allocation situations, not just low
space.

e.g. xfs_bmap_btalloc_at_eof() will try exact block
allocation regardless of whether an alignment parameter is set. It
will then fall back to stripe alignment if exact block fails.

If stripe aligned allocation fails, it will then set args->alignment
= 1 and try a full filesystem allocation scan without alignment.

And if that fails, then we finally get to the low space allocator
with args->alignment = 1 even though we might be trying to allocate
an aligned extent for an atomic IO....

IOWs, I think this indicates deeper surgery is needed to ensure
aligned allocations fail immediately and don't fall back to
unaligned allocations and set XFS_TRANS_LOW_MODE...

-Dave.
--
Dave Chinner
[email protected]

2024-03-05 13:37:59

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1

On 04/03/2024 22:15, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:16PM +0000, John Garry wrote:
>> The low-space allocator doesn't honour the alignment requirement, so don't
>> attempt to even use it (when we have an alignment requirement).
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/libxfs/xfs_bmap.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index f362345467fa..60d100134280 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3584,6 +3584,10 @@ xfs_bmap_btalloc_low_space(
>> {
>> int error;
>>
>> + /* The allocator doesn't honour args->alignment */
>> + if (args->alignment > 1)
>> + return 0;
>
> I think that's wrong.
>
> The alignment argument here is purely a best effort consideration -
> we ignore it several different allocation situations, not just low
> space.

Sure, but I am simply addressing the low-space allocator here.

In this series I am /we are effectively trying to conflate
args->alignment > 1 with forcealign. I thought that args->alignment was
guaranteed to be honoured, with some caveats. For forcealign, we
obviously require a guarantee.

>
> e.g. xfs_bmap_btalloc_at_eof() will try exact block
> allocation regardless of whether an alignment parameter is set.

For this specific issue, I think that we are ok, as:
- in xfs_bmap_compute_alignments(), stripe_align is aligned with
args->alignment for forcealign
- xfs_bmap_btalloc_at_eof() has the optimisation to alloc according to
stripe alignment

But obviously we should not be relying on optimisations.

Please also note that I have a modification later in this series to
always have EOF aligned for forcealign.

> It
> will then fall back to stripe alignment if exact block fails.
>
> If stripe aligned allocation fails, it will then set args->alignment
> = 1 and try a full filesystem allocation scan without alignment.
>
> And if that fails, then we finally get to the low space allocator
> with args->alignment = 1 even though we might be trying to allocate
> an aligned extent for an atomic IO....
>
> IOWs, I think this indicates deeper surgery is needed to ensure
> aligned allocations fail immediately and don't fall back to
> unaligned allocations and set XFS_TRANS_LOW_MODE...
>

ok, I'll look at what you write about all of this in the later patch review.

Thanks,
John


2024-03-05 15:30:19

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

On 05/03/2024 00:44, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <[email protected]>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: "Darrick J. Wong" <[email protected]>
>> Co-developed-by: John Garry <[email protected]>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>> fs/xfs/xfs_iomap.c | 4 +++-
>> 2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 60d100134280..8dee60795cf4 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>> align = xfs_get_cowextsz_hint(ap->ip);
>> else if (ap->datatype & XFS_ALLOC_USERDATA)
>> align = xfs_get_extsz_hint(ap->ip);
>> +
>> + /*
>> + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> + * set as forcealign and cowextsz_hint are mutually exclusive
>> + */
>> + if (xfs_inode_forcealign(ap->ip) && align) {
>> + args->alignment = align;
>> + if (stripe_align % align)
>> + stripe_align = align;
>
> This kinda does the right thing, but it strikes me as the wrong
> thing to be doing - it seems to conflates two different physical
> alignment properties in potentially bad ways, and we should never
> get this far into the allocator to discover these issues.
>
> Stripe alignment is alignment to physical disks in a RAID array.
> Inode forced alignment is file offset alignment to specificly
> defined LBA address ranges. The stripe unit/width is set at mkfs
> time, the inode extent size hint at runtime.
>
> These can only work together in specific situations:
>
> 1. max sized RWF_ATOMIC IO must be the same or smaller size
> than the stripe unit. Alternatively: stripe unit must be
> an integer (power of 2?) multiple of max atomic IO size.

Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe
alignment.

>
> IOWs, stripe unit vs atomic IO constraints must be
> resolved in a compatible manner at mkfs time. If they
> can't be resolved (e.g. non-power of 2 stripe unit) then
> atomic IO cannot be done on the filesystem at all. Stripe
> unit constraints always win over atomic IO.

We can could do that. Indeed, I thought our xfsprogs was doing that, but
we have had a few versions now for forcealign ...

>
> 2. min sized RWF_ATOMIC IO must be an integer divider of
> stripe unit so that small atomic IOs are always correctly
> aligned to the stripe unit.

That's a given from atomic write rules and point 1.

>
> 3. Inode forced alignment constraints set at runtime (i.e.
> extsize hints) must fit within the above stripe unit vs
> min/max atomic IO requirements.
> > i.e. extent size hint will typically need to an integer
> multiple of stripe unit size which will always be
> compatible with stripe unit and atomic IO size and
> alignments...

I think that any extsize hint for forcealign needs to comply with stripe
unit, same as point 1.

>
> IOWs, these are static geometry constraints and so should be checked
> and rejected at the point where alignments are specified (i.e.
> mkfs, mount and ioctls). Then the allocator can simply assume that
> forced inode alignments are always stripe alignment compatible and
> we don't need separate handling of two possibly incompatible
> alignments.

ok, makes sense.

Please note in case missed, I am mandating extsize hint for forcealign
needs to be a power-of-2. It just makes life easier for all the
sub-extent zero'ing later on.

Also we need to enforce that the AG count to be compliant with the
extsize hint for forcealign; but since the extsize hint for forcealign
needs to be compliant with stripe unit, above, and stripe unit would be
compliant wth AG count (right?), then this would be a given.

>
> Regardless, I think the code as it stands won't work correctly when
> a stripe unit is not set.
>
> The correct functioning of this code appears to be dependent on
> stripe_align being set >= the extent size hint. If stripe align is
> not set, then what does (0 % align) return? If my checks are
> correct, this will return 0,

Correct

> and so the above code will end up with
> stripe_align = 0.
>
> Now, consider that args->alignment is used to signal to the
> allocator what the -start alignment- of the allocation is supposed
> to be, whilst args->prod/args->mod are used to tell the allocator
> what the tail adjustment is supposed to be.
>
> Stripe alignment wants start alignment, extent size hints want tail
> adjustment and force aligned allocation wants both start alignment
> and tail adjustment.

Right

>
> However, the allocator overrides these depending on what it is
> doing. e.g. exact block EOF allocation turns off start alignment but
> has to ensure space is reserved for start alignment if it fails.
> This reserves space for stripe_align in args->minalignslop, but if
> force alignment has a start alignment requirement larger than stripe
> unit alignment, this code fails to reserve the necessary alignment
> slop for the force alignment code.
>
> If we aren't doing exact block EOF allocation, then we do stripe
> unit alignment. Again, if this fails and the forced alignment is
> larger than stripe alignment, this code does not reserve space for
> the larger alignment in args->minalignslop, so it will also do the
> wrong when we fall back to an args->alignment > 1 allocation.
>
> Failing to correctly account for minalignslop is known to cause
> accounting problems when the AG is very near empty, and the usual
> result of that is cancelling of a dirty allocation transaction and a
> forced shutdown. So this is something we need to get right.

For sure

>
> More below....
>
>> + } else {
>> + args->alignment = 1;
>> + }
>
> Just initialise the allocation args structure with a value of 1 like
> we already do?

It was being done in this way to have just a single place where the
value is initialised. It can easily be kept as is.

>
>> +
>> if (align) {
>> if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>> ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>> args.minlen = args.maxlen = ap->minlen;
>> args.total = ap->total;
>>
>> - args.alignment = 1;
>> args.minalignslop = 0;
>
> This likely breaks the debug code. This isn't intended for testing
> atomic write capability, it's for exercising low space allocation
> algorithms with worst case fragmentation patterns. This is only
> called when error injection is turned on to trigger this path, so we
> shouldn't need to support forced IO alignment here.

ok, it can be left untouched.

>
>>
>> args.minleft = ap->minleft;
>> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>> {
>> struct xfs_mount *mp = args->mp;
>> struct xfs_perag *caller_pag = args->pag;
>> + int orig_alignment = args->alignment;
>> int error;
>>
>> /*
>> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>>
>> /*
>> * Allocation failed, so turn return the allocation args to their
>> - * original non-aligned state so the caller can proceed on allocation
>> - * failure as if this function was never called.
>> + * original state so the caller can proceed on allocation failure as
>> + * if this function was never called.
>> */
>> - args->alignment = 1;
>> + args->alignment = orig_alignment;
>> return 0;
>> }
>
> As I said above, we can't set an alignment of > 1 here if we haven't
> accounted for that alignment in args->minalignslop above. This leads
> to unexpected ENOSPC conditions and filesystem shutdowns.
>
> I suspect what we need to do is get rid of the separate stripe_align
> variable altogether and always just set args->alignment to what we
> need the extent start alignment to be, regardless of whether it is
> from stripe alignment or forced alignment.

ok, it sounds a bit simpler at least

>
> Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> 'stripe_align' is - the exact EOF block allocation can simply save
> and restore the args->alignment value and use it for minalignslop
> calculations for the initial exact block allocation.
>
> Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> is true, we can abort allocation immediately, and not bother to fall
> back on further aligned/unaligned attempts that will also fail or do
> the wrong them.

ok

>
> Similarly, if we aren't doing EOF allocation, having args->alignment
> set means it will do the right thing for the first allocation
> attempt. Again, if that fails, we can check if
> xfs_inode_forcealign() is true and fail the aligned allocation
> instead of running the low space algorithm. This now makes it clear
> that we're failing the allocation because of the forced alignment
> requirement, and now the low space allocation code can explicitly
> turn off start alignment as it isn't required...

are you saying that low-space allocator can set args->alignment = 1 to
be explicit?

>
>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>> * If mounted with the "-o swalloc" option the alignment is
>> * increased from the strip unit size to the stripe width.
>> */
>> - if (mp->m_swidth && xfs_has_swalloc(mp))
>> + if (xfs_inode_forcealign(ip))

I actually thought that I dropped this chunk as it was causing alignment
issues. I need to check that again.

>> + align = xfs_get_extsz_hint(ip);
>> + else if (mp->m_swidth && xfs_has_swalloc(mp))
>> align = mp->m_swidth;
>> else if (mp->m_dalign)
>> align = mp->m_dalign;
>
> This doesn't work for files with a current size of less than
> "align". The next line of code does:
>
> if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
> align = 0;
>
> IOWs, the first aligned write allocation will occur with a file size
> of zero, and that will result in this function returning 0. i.e.
> speculative post EOF delalloc doesn't turn on and align the EOF
> until writes up to offset == align have already been completed.

Maybe it was working as I have the change to keep EOF aligned for
forcealign. I'll check that.

>
> Essentially, this code wants to be:
>
> if (xfs_inode_forcealign(ip))
> return xfs_get_extsz_hint(ip);
>
> So that any write to the a force aligned inode will always trigger
> extent size hint EOF alignment.
>

ok, sounds reasonable.

Thanks,
John


2024-03-06 09:41:56

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag


>> Please note in case missed, I am mandating extsize hint for forcealign needs
>> to be a power-of-2. It just makes life easier for all the sub-extent
>> zero'ing later on.
>
> That's fine - that will need to be documented in the xfsctl man
> page...
>
>> Also we need to enforce that the AG count to be compliant with the extsize
> ^^^^^ size?

Yes

>
>> hint for forcealign; but since the extsize hint for forcealign needs to be
>> compliant with stripe unit, above, and stripe unit would be compliant wth AG
>> count (right?), then this would be a given.
>
> We already align AG size to stripe unit when a stripe unit is set,
> and ensure that we don't place all the AG headers on the same stripe
> unit.
>
> However, if there is no stripe unit we don't align the AG to
> anything.


> So, yes, AG sizing by mkfs will need to ensure that all
> AGs are correctly aligned to the underlying storage (integer
> multiple of the max atomic write size, right?)...

right, this is really important

>
>>> More below....
>>>
>>>> + } else {
>>>> + args->alignment = 1;
>>>> + }
>>>
>>> Just initialise the allocation args structure with a value of 1 like
>>> we already do?
>>
>> It was being done in this way to have just a single place where the value is
>> initialised. It can easily be kept as is.
>
> I'd prefer it as is, because then the value is always initialised
> correctly and we only override in the special cases....

ok

>>
>> are you saying that low-space allocator can set args->alignment = 1 to be
>> explicit?
>
> Yes.

ok

Thanks,
John


2024-03-07 10:20:06

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 13/14] fs: xfs: Validate atomic writes

On 06/03/2024 21:22, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:27PM +0000, John Garry wrote:
>> Validate that an atomic write adheres to length/offset rules. Since we
>> require extent alignment for atomic writes, this effectively also enforces
>> that the BIO which iomap produces is aligned.
>>
>> Signed-off-by: John Garry<[email protected]>
>> ---
>> fs/xfs/xfs_file.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index d0bd9d5f596c..cecc5428fd7c 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -709,11 +709,20 @@ xfs_file_dio_write(
>> struct kiocb *iocb,
>> struct iov_iter *from)
>> {
>> - struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + struct xfs_inode *ip = XFS_I(inode);
>> struct xfs_mount *mp = ip->i_mount;
>> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> size_t count = iov_iter_count(from);
>>
>> + if (iocb->ki_flags & IOCB_ATOMIC) {
>> + if (!generic_atomic_write_valid(iocb->ki_pos, from,
>> + i_blocksize(inode),
> a.k.a. mp->m_bsize. If you use that here, then the need for the VFS
> inode goes away, too.

ok

>
>> + XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
>> + return -EINVAL;
>> + }
>> + }
> Also, I think the checks are the wrong way around here. We can only
> do IOCB_ATOMIC IO on force aligned/atomic write inodes, so shouldn't
> we be checking that first,

We are checking that, but not here.

In 14/14, we only set FMODE_CAN_ATOMIC_WRITE for when
xfs_inode_has_atomicwrites() is true, and only when
FMODE_CAN_ATOMIC_WRITE is set can we get this far.

I don't see a point in duplicating this xfs_inode_has_atomicwrites()
check, so I will make the commit message clearer on this - ok? Or add a
comment.

> then basing the rest of the checks on the
> assumption that atomic writes are allowed and have been set up
> correctly on the inode? i.e.
>
> if (iocb->ki_flags & IOCB_ATOMIC) {
> if (!xfs_inode_has_atomicwrites(ip))
> return -EINVAL;
> if (!generic_atomic_write_valid(iocb->ki_pos, from,
> mp->m_bsize, ip->i_extsize))
> return -EINVAL;
> }
>
> because xfs_inode_has_atomicwrites() implies ip->i_extsize has been
> set to the required atomic IO size?

I was not too comfortable using ip->i_extsize, as this can be set
without forcealign being set. I know that we would not get this far
without forcealign (being set).

Having said that, I don't like all the xfs_get_extsz() calls, so
something better is required. Let me know you you think.

Thanks,
John


2024-03-07 10:35:59

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] fs: xfs: Support atomic write for statx

On 06/03/2024 21:31, Dave Chinner wrote:
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> + struct block_device *bdev = target->bt_bdev;
>> + struct request_queue *q = bdev->bd_queue;
>> + struct xfs_mount *mp = ip->i_mount;
>> + struct xfs_sb *sbp = &mp->m_sb;
>> + unsigned int awu_min, awu_max;
>> + unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
>> +
>> + awu_min = queue_atomic_write_unit_min_bytes(q);
>> + awu_max = queue_atomic_write_unit_max_bytes(q);
> We really should be storing these in the xfs_buftarg at mount time,
> like we do logical and physical sector sizes.

This has been mentioned previously, and Darrick thought that it was not
safe. Please see first response in:
https://lore.kernel.org/linux-xfs/20231003161029.GG21298@frogsfrogsfrogs/#t

So if this really is true, then I'll stick with something like what I
have here and add a comment on that.

However, in this series the block layer does check for out-of-range
atomic write BIOs in 1/14. So we could store the values in xfs_buftarg,
as you suggest for the lookup here. If the bdev geometry does really
change beneath us, worse thing that happens is that we may report
incorrect values for statx.

> Similar to sector
> sizes, they*must not change* once the filesystem has been created
> on the device, let alone during an active mount. The whole point of
> the xfs_buftarg is to store the information the filesystem
> needs to do IO to the underlying block device so we don't have to
> chase pointers deep into the block device whenever we need to use
> static geometry information.....
>
>> + if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
>> + !xfs_inode_atomicwrites(ip)) {
>> + *unit_min = 0;
>> + *unit_max = 0;
>> + return;
>> + }
> Again, this is comparing static geometry - if the block size doesn't
> allow atomic writes, then the inode flag should never be set. i.e.
> geometry is checked when configuring atomic writes, not in every
> place we need to check if atomic writes are supported. Hence this
> should simply be:
>
> if (!xfs_inode_has_atomic_writes(ip)) {
> *unit_min = 0;
> *unit_max = 0;
> return;
> } >
> before we even look at the xfs_buftarg to get the supported min/max
> values for the given device.

Thanks,
John

2024-03-07 11:42:55

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign


>> */
>> end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
>> - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
>> +
>> + /* Do not free blocks when forcing extent sizes */
>
> That comment seems wrong - this just rounds up where we start
> trimming from to be aligned...

ok

>
>> + if (xfs_get_extsz(ip) > 1)
>> + end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
>> + else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
>> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
>
> I think this would be better written as:
>
> /*
> * Forced extent alignment requires us to round up where we
> * start trimming from so that result is correctly aligned.
> */
> if (xfs_inode_forcealign(ip)) {
> if (ip->i_extsize > 1)
> end_fsb = roundup_64(end_fsb, ip->i_extsize);
> else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
> }
>
> Because we only want end-fsb alignment when forced alignment is
> required.

But why change current rtvol behaviour?

>
> Which then makes me wonder: truncate needs this, too, doesn't it?
> And the various fallocate() ops like hole punching and extent
> shifting?
>

Yes, I would think so. I quickly checked rtvol for truncate and it does
the round up. I would need to check the relevant code for truncate and
fallocate for forcealign now.

I do also wonder if we could factor out this rounding up code for
truncate, facallocate, and whatever else.

>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 2c439df8c47f..bbb8886f1d32 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -65,6 +65,20 @@ xfs_get_extsz_hint(
>> return 0;
>> }
>>
>> +/*
>> + * Helper function to extract extent size. It will return a power-of-2,
>> + * as forcealign requires this.
>> + */
>> +xfs_extlen_t
>> +xfs_get_extsz(
>> + struct xfs_inode *ip)
>> +{
>> + if (xfs_inode_forcealign(ip) && ip->i_extsize)
>> + return ip->i_extsize;
>> +
>> + return 1;
>> +}
>
> This can go away - if it is needed elsewhere, then I think it would
> be better open coded because it better documents what the code is
> doing...
>

I would rather get rid of xfs_get_extsz() for sure.

Thanks,
John


2024-03-07 11:54:02

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] fs: iomap: Sub-extent zeroing

On 06/03/2024 21:14, Dave Chinner wrote:
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..733f83f839b6 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> {
>> const struct iomap *iomap = &iter->iomap;
>> struct inode *inode = iter->inode;
>> - unsigned int fs_block_size = i_blocksize(inode), pad;
>> + unsigned int zeroing_size, pad;
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> @@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> size_t copied = 0;
>> size_t orig_count;
>>
>> + zeroing_size = i_blocksize(inode) << iomap->extent_shift;
> The iomap interfaces use units of bytes for offsets, sizes, ranges,
> etc. Using shifts to define a granularity value seems like a
> throwback to decades old XFS code and just a bit weird nowdays. Can
> we just pass this as a byte count? i.e.:
>
> zeroing_size = i_blocksize(inode);
> if (iomap->extent_size)
> zeroing_size = iomap->extent_size;

Fine

I was also thinking of something like i_extentsize() for vfs, which
would save requiring adding iomap->extent_size, but decided to limit vfs
changes.

Thanks,
John

2024-03-07 11:56:43

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE

On 06/03/2024 21:33, Dave Chinner wrote:
>> +static bool xfs_file_open_can_atomicwrite(
>> + struct inode *inode,
>> + struct file *file)
>> +{
>> + struct xfs_inode *ip = XFS_I(inode);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> +
>> + if (!(file->f_flags & O_DIRECT))
>> + return false;
>> +
>> + if (!xfs_inode_atomicwrites(ip))
>> + return false;
>> +
>> + if (!bdev_can_atomic_write(target->bt_bdev))
>> + return false;
> Again, this is static blockdev information - the inode atomic write
> flag should not be set if the bdev cannot do atomic writes. It
> should be checked at mount time

ok

> - the filesystem probably should
> only mount read-only if it is configured to allow atomic writes and
> the underlying blockdev does not support atomic writes...

Let me know if you really would like to see that change also. It does
seem a bit drastic, considering we can just disallow atomic writes.

Thanks,
John

2024-03-07 12:57:47

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing


>>
>> if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
>> return xfs_alert_fsblock_zero(ip, imap);
>> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>>
>> iomap->validity_cookie = sequence_cookie;
>> iomap->folio_ops = &xfs_iomap_folio_ops;
>> + if (extsz > 1)
>> + iomap->extent_shift = ffs(extsz) - 1;
>
> iomap->extent_size = mp->m_bsize;
> if (xfs_inode_has_force_align(ip))
> iomap->extent_size *= ip->i_extsize;

ok, fine


>
>> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> - offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + if (extsz > 1) {
>> + xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
>> +
>> + offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
>> + count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
>> + } else {
>> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> + count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + }
>
> I don't think this is correct. We should only be converting the
> extent when the entire range has had data written to it. If we are
> doing unaligned writes, we end up running 3 separate unwritten
> conversion transactions - the leading zeroing, the data written and
> the trailing zeroing.

Then I missed that in the code.

For sub-FS block conversion, I thought that this was doing the complete
FS blocks conversion, including for the head and tail zeros. And now for
sub-extent writes, we would be similarly doing the full extent
conversion, including head and tail zeros.

>
> This will end up converting the entire range to written when the
> leading zeroing completes, exposing stale data until the data and
> trailing zeroing completes.

That would not be good.

>
> Concurrent reads (both DIO and buffered) can see this stale data
> while the write is in progress, leading to a mechanism where a user
> can issue sub-atomic write range IO and concurrent overlapping reads
> to read arbitrary stale data from the disk just before it is
> overwritten.
>
> I suspect the only way to fix this for sub-force-aligned DIo writes
> if for iomap_dio_bio_iter() to chain the zeroing and data bios so
> the entire range gets a single completion run on it instead of three
> separate sub-aligned extent IO completions. We only need to do this
> in the zeroing case - this is already the DIo slow path, so
> additional submission overhead is not an issue. It would, however,
> reduce completion overhead and latency, as we only need to run a
> single extent conversion instead of 3, so chaining the bios on
> aligned writes may well be a net win...
>

ok, I'll check that idea.

> Thoughts? Christoph probably needs to weigh in on this one...
>

ok

Cheers,
John


2024-03-07 12:43:42

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign


>> #define XFS_SB_FEAT_RO_COMPAT_ALL \
>> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
>> - XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
>> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
>> + XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
>> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \
>
> Please leave a spave between the feature name and the '| \'.
>

ok

>> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> +

..

>> }
>>
>> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
>> +{
>> + return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
>> +}
>
> I'd really like this to be more readable:
> xfs_inode_has_atomic_writes().
>
> Same for the force align check, now that I notice it:
> xfs_inode_has_force_align().

ok, will change

>
>> +
>> /*
>> * Return the buftarg used for data allocations on a given inode.
>> */
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index 867d8d51a3d0..f118a1ae39b5 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
>> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> if (xflags & FS_XFLAG_FORCEALIGN)
>> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
>> + if (xflags & FS_XFLAG_ATOMICWRITES)
>> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>
>> return di_flags2;
>> }
>> @@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
>> {
>> struct xfs_mount *mp = ip->i_mount;
>> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
>> + bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
>> uint64_t i_flags2;
>>
>> - if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
>> - /* Can't change realtime flag if any extents are allocated. */
>> + /* Can't change RT or atomic flags if any extents are allocated. */
>> + if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
>> + atomic_writes != xfs_inode_atomicwrites(ip)) {
>> if (ip->i_df.if_nextents || ip->i_delayed_blks)
>> return -EINVAL;
>> }
>> @@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
>> return -EINVAL;
>> }
>>
>> + if (atomic_writes) {
>> + if (!xfs_has_atomicwrites(mp))
>> + return -EINVAL;
>
> That looks wrong - if we are trying to turn on atomic writes, then
> shouldn't this be returning an error if atomic writes are already
> configured?

I think that you are talking about a xfs_inode_atomicwrites() check.


>
>> + if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
>> + return -EINVAL;

>
> Where's the check for xfs_has_atomicwrites(mp) here?

please see above

> We can't allow
> this inode flag to be set if the superblock does not have the
> feature bit that says it's a known feature bit set.
>
> Which reminds me: both the forcealign and the atomicwrite inode flag
> need explicit checking in the inode verifier. i.e. checking that if
> the inode flag bit is set, the relevant superblock feature bit is
> set.

We do have that in the xfs_has_atomicwrites() and xfs_has_forcealign()
checks - is that ok?

>
> ....
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 74dcafddf6a9..efe4b4234b2e 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
>> xfs_warn(mp,
>> "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
>>
>> + if (xfs_has_atomicwrites(mp))
>> + xfs_warn(mp,
>> +"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");
>
> "EXPERIMENTAL atomic write IO feature is in use. Use at your own risk!");
>

ok

Thanks,
John


2024-03-06 21:44:41

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign

On Mon, Mar 04, 2024 at 01:04:25PM +0000, John Garry wrote:
> Add initial support for FS_XFLAG_ATOMICWRITES for forcealign enabled.
>
> Current kernel support for atomic writes is based on HW support (for atomic
> writes). As such, it is required to ensure extent alignment with
> atomic_write_unit_max so that an atomic write can result in a single
> HW-compliant IO operation.
>
> rtvol also guarantees extent alignment, but we are basing support initially
> on forcealign, which is not supported for rtvol yet.
>
> Signed-off-by: John Garry <[email protected]>
...
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 2d9f5430efc3..5f54f9b3755e 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -354,12 +354,16 @@ xfs_sb_has_compat_feature(
> #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> #define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 31) /* atomicwrites enabled */
> +
> #define XFS_SB_FEAT_RO_COMPAT_ALL \
> (XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> - XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> - XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> + XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
> + XFS_SB_FEAT_RO_COMPAT_FORCEALIGN| \

Please leave a spave between the feature name and the '| \'.

> + XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +
> #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
> static inline bool
> xfs_sb_has_ro_compat_feature(
> @@ -1089,6 +1093,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
> /* data extent mappings for regular files must be aligned to extent size hint */
> #define XFS_DIFLAG2_FORCEALIGN_BIT 5
> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
>
> #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
> #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
> @@ -1096,10 +1101,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
> #define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
> #define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
> +#define XFS_DIFLAG2_ATOMICWRITES (1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>
> #define XFS_DIFLAG2_ANY \
> (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> - XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
> + XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN | \
> + XFS_DIFLAG2_ATOMICWRITES)
>
> static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> {
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index f2c16a028fae..d7bb3e34dd69 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -165,6 +165,8 @@ xfs_sb_version_to_features(
> features |= XFS_FEAT_INOBTCNT;
> if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
> features |= XFS_FEAT_FORCEALIGN;
> + if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> + features |= XFS_FEAT_ATOMICWRITES;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> features |= XFS_FEAT_FTYPE;
> if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index bbb8886f1d32..14020ab1450c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -645,6 +645,8 @@ xfs_ip2xflags(
> flags |= FS_XFLAG_COWEXTSIZE;
> if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
> flags |= FS_XFLAG_FORCEALIGN;
> + if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> + flags |= FS_XFLAG_ATOMICWRITES;
> }
>
> if (xfs_inode_has_attr_fork(ip))
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index b6c42c27943e..f56bdbb74ad7 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -310,6 +310,11 @@ static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
> return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
> }
>
> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
> +{
> + return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> +}

I'd really like this to be more readable:
xfs_inode_has_atomic_writes().

Same for the force align check, now that I notice it:
xfs_inode_has_force_align().

> +
> /*
> * Return the buftarg used for data allocations on a given inode.
> */
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 867d8d51a3d0..f118a1ae39b5 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1112,6 +1112,8 @@ xfs_flags2diflags2(
> di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> if (xflags & FS_XFLAG_FORCEALIGN)
> di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> + if (xflags & FS_XFLAG_ATOMICWRITES)
> + di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>
> return di_flags2;
> }
> @@ -1124,10 +1126,12 @@ xfs_ioctl_setattr_xflags(
> {
> struct xfs_mount *mp = ip->i_mount;
> bool rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> + bool atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
> uint64_t i_flags2;
>
> - if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> - /* Can't change realtime flag if any extents are allocated. */
> + /* Can't change RT or atomic flags if any extents are allocated. */
> + if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> + atomic_writes != xfs_inode_atomicwrites(ip)) {
> if (ip->i_df.if_nextents || ip->i_delayed_blks)
> return -EINVAL;
> }
> @@ -1164,6 +1168,13 @@ xfs_ioctl_setattr_xflags(
> return -EINVAL;
> }
>
> + if (atomic_writes) {
> + if (!xfs_has_atomicwrites(mp))
> + return -EINVAL;

That looks wrong - if we are trying to turn on atomic writes, then
shouldn't this be returning an error if atomic writes are already
configured?

> + if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
> + return -EINVAL;

Where's the check for xfs_has_atomicwrites(mp) here? We can't allow
this inode flag to be set if the superblock does not have the
feature bit that says it's a known feature bit set.

Which reminds me: both the forcealign and the atomicwrite inode flag
need explicit checking in the inode verifier. i.e. checking that if
the inode flag bit is set, the relevant superblock feature bit is
set.

...
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 74dcafddf6a9..efe4b4234b2e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1712,6 +1712,10 @@ xfs_fs_fill_super(
> xfs_warn(mp,
> "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
>
> + if (xfs_has_atomicwrites(mp))
> + xfs_warn(mp,
> +"EXPERIMENTAL atomicwrites feature in use. Use at your own risk!");

"EXPERIMENTAL atomic write IO feature is in use. Use at your own risk!");

-Dave.
--
Dave Chinner
[email protected]

2024-03-06 22:00:55

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing

On Mon, Mar 04, 2024 at 01:04:22PM +0000, John Garry wrote:
> Set iomap->extent_shift when sub-extent zeroing is required.
>
> We treat a sub-extent write same as an unaligned write, so we can leverage
> the existing sub-FSblock unaligned write support, i.e. try a shared lock
> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> lock.
>
> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_file.c | 28 +++++++++++++++-------------
> fs/xfs/xfs_iomap.c | 15 +++++++++++++--
> 2 files changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e33e5e13b95f..d0bd9d5f596c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
> * Handle block unaligned direct I/O writes
> *
> * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
> - * them to be done in parallel with reads and other direct I/O writes. However,
> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
> - * to do sub-block zeroing and that requires serialisation against other direct
> - * I/O to the same block. In this case we need to serialise the submission of
> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
> - * In the case where sub-block zeroing is not required, we can do concurrent
> - * sub-block dios to the same block successfully.
> + * them to be done in parallel with reads and other direct I/O writes.
> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
> + * I/O layer may need to do sub-block/extent zeroing and that requires
> + * serialisation against other direct I/O to the same block/extent. In this
> + * case we need to serialise the submission of the unaligned I/O so that we
> + * don't get racing block/extent zeroing in the dio layer.
> + * In the case where sub-block/extent zeroing is not required, we can do
> + * concurrent sub-block/extent dios to the same block/extent successfully.
> *
> * Optimistically submit the I/O using the shared lock first, but use the
> * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
> - * if block allocation or partial block zeroing would be required. In that case
> - * we try again with the exclusive lock.
> + * if block/extent allocation or partial block/extent zeroing would be
> + * required. In that case we try again with the exclusive lock.
> */
> static noinline ssize_t
> xfs_file_dio_write_unaligned(
> @@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
> ssize_t ret;
>
> /*
> - * Extending writes need exclusivity because of the sub-block zeroing
> - * that the DIO code always does for partial tail blocks beyond EOF, so
> - * don't even bother trying the fast path in this case.
> + * Extending writes need exclusivity because of the sub-block/extent
> + * zeroing that the DIO code always does for partial tail blocks
> + * beyond EOF, so don't even bother trying the fast path in this case.
> */
> if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -709,13 +710,14 @@ xfs_file_dio_write(
> struct iov_iter *from)
> {
> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
>
> /* direct I/O must be aligned to device logical sector size */
> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> return -EINVAL;
> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> + if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
> return xfs_file_dio_write_unaligned(ip, iocb, from);
> return xfs_file_dio_write_aligned(ip, iocb, from);
> }
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 70fe873951f3..88cc20bb19c9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>
> if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
> return xfs_alert_fsblock_zero(ip, imap);
> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>
> iomap->validity_cookie = sequence_cookie;
> iomap->folio_ops = &xfs_iomap_folio_ops;
> + if (extsz > 1)
> + iomap->extent_shift = ffs(extsz) - 1;

iomap->extent_size = mp->m_bsize;
if (xfs_inode_has_force_align(ip))
iomap->extent_size *= ip->i_extsize;

> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
> xfs_fsize_t i_size;
> uint resblks;
> int error;
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>
> trace_xfs_unwritten_convert(ip, offset, count);
>
> - offset_fsb = XFS_B_TO_FSBT(mp, offset);
> - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> + if (extsz > 1) {
> + xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> + offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
> + count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
> + } else {
> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
> + count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> + }

I don't think this is correct. We should only be converting the
extent when the entire range has had data written to it. If we are
doing unaligned writes, we end up running 3 separate unwritten
conversion transactions - the leading zeroing, the data written and
the trailing zeroing.

This will end up converting the entire range to written when the
leading zeroing completes, exposing stale data until the data and
trailing zeroing completes.

Concurrent reads (both DIO and buffered) can see this stale data
while the write is in progress, leading to a mechanism where a user
can issue sub-atomic write range IO and concurrent overlapping reads
to read arbitrary stale data from the disk just before it is
overwritten.

I suspect the only way to fix this for sub-force-aligned DIo writes
if for iomap_dio_bio_iter() to chain the zeroing and data bios so
the entire range gets a single completion run on it instead of three
separate sub-aligned extent IO completions. We only need to do this
in the zeroing case - this is already the DIo slow path, so
additional submission overhead is not an issue. It would, however,
reduce completion overhead and latency, as we only need to run a
single extent conversion instead of 3, so chaining the bios on
aligned writes may well be a net win...

Thoughts? Christoph probably needs to weigh in on this one...

-Dave.
--
Dave Chinner
[email protected]

2024-03-06 21:34:04

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE

On Mon, Mar 04, 2024 at 01:04:28PM +0000, John Garry wrote:
> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> flag. We check for direct I/O and also check that the bdev can actually
> support atomic writes.
>
> We rely on the block layer to reject atomic writes which exceed the bdev
> request_queue limits, so don't bother checking any such thing here.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_file.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cecc5428fd7c..e63851be6c15 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1234,6 +1234,25 @@ xfs_file_remap_range(
> return remapped > 0 ? remapped : ret;
> }
>
> +static bool xfs_file_open_can_atomicwrite(
> + struct inode *inode,
> + struct file *file)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> +
> + if (!(file->f_flags & O_DIRECT))
> + return false;
> +
> + if (!xfs_inode_atomicwrites(ip))
> + return false;
> +
> + if (!bdev_can_atomic_write(target->bt_bdev))
> + return false;

Again, this is static blockdev information - the inode atomic write
flag should not be set if the bdev cannot do atomic writes. It
should be checked at mount time - the filesystem probably should
only mount read-only if it is configured to allow atomic writes and
the underlying blockdev does not support atomic writes...

-Dave.
--
Dave Chinner
[email protected]

2024-03-06 21:31:32

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] fs: xfs: Support atomic write for statx

On Mon, Mar 04, 2024 at 01:04:26PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
>
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future. This is required by iomap
> DIO.
>
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_iops.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a0d77f5f512e..6316448083d2 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,37 @@ xfs_stat_blksize(
> return PAGE_SIZE;
> }
>
> +static void
> +xfs_get_atomic_write_attr(
> + struct xfs_inode *ip,
> + unsigned int *unit_min,
> + unsigned int *unit_max)
> +{
> + xfs_extlen_t extsz = xfs_get_extsz(ip);
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct block_device *bdev = target->bt_bdev;
> + struct request_queue *q = bdev->bd_queue;
> + struct xfs_mount *mp = ip->i_mount;
> + struct xfs_sb *sbp = &mp->m_sb;
> + unsigned int awu_min, awu_max;
> + unsigned int extsz_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> + awu_min = queue_atomic_write_unit_min_bytes(q);
> + awu_max = queue_atomic_write_unit_max_bytes(q);

We really should be storing these in the xfs_buftarg at mount time,
like we do logical and physical sector sizes. Similar to sector
sizes, they *must not change* once the filesystem has been created
on the device, let alone during an active mount. The whole point of
the xfs_buftarg is to store the information the filesystem
needs to do IO to the underlying block device so we don't have to
chase pointers deep into the block device whenever we need to use
static geometry information.....

> + if (sbp->sb_blocksize > awu_max || awu_min > sbp->sb_blocksize ||
> + !xfs_inode_atomicwrites(ip)) {
> + *unit_min = 0;
> + *unit_max = 0;
> + return;
> + }

Again, this is comparing static geometry - if the block size doesn't
allow atomic writes, then the inode flag should never be set. i.e.
geometry is checked when configuring atomic writes, not in every
place we need to check if atomic writes are supported. Hence this
should simply be:

if (!xfs_inode_has_atomic_writes(ip)) {
*unit_min = 0;
*unit_max = 0;
return;
}

before we even look at the xfs_buftarg to get the supported min/max
values for the given device.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2024-03-06 21:15:09

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] fs: iomap: Sub-extent zeroing

On Mon, Mar 04, 2024 at 01:04:21PM +0000, John Garry wrote:
> For FS_XFLAG_FORCEALIGN support, we want to treat any sub-extent IO like
> sub-fsblock DIO, in that we will zero the sub-extent when the mapping is
> unwritten.
>
> This will be important for atomic writes support, in that atomically
> writing over a partially written extent would mean that we would need to
> do the unwritten extent conversion write separately, and the write could
> no longer be atomic.
>
> It is the task of the FS to set iomap.extent_shift per iter to indicate
> sub-extent zeroing required.
>
> Maybe a macro like i_blocksize() should be introduced for extent sizes,
> instead of using extent_shift. It would also eliminate excessive use
> of xfs_get_extss() for XFS in future.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/iomap/direct-io.c | 14 ++++++++------
> include/linux/iomap.h | 1 +
> 2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..733f83f839b6 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> {
> const struct iomap *iomap = &iter->iomap;
> struct inode *inode = iter->inode;
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + unsigned int zeroing_size, pad;
> loff_t length = iomap_length(iter);
> loff_t pos = iter->pos;
> blk_opf_t bio_opf;
> @@ -288,6 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> size_t copied = 0;
> size_t orig_count;
>
> + zeroing_size = i_blocksize(inode) << iomap->extent_shift;

The iomap interfaces use units of bytes for offsets, sizes, ranges,
etc. Using shifts to define a granularity value seems like a
throwback to decades old XFS code and just a bit weird nowdays. Can
we just pass this as a byte count? i.e.:

zeroing_size = i_blocksize(inode);
if (iomap->extent_size)
zeroing_size = iomap->extent_size;

Cheers,

Dave.
--
Dave Chinner
[email protected]

2024-03-06 21:07:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign

On Mon, Mar 04, 2024 at 01:04:20PM +0000, John Garry wrote:
> For when forcealign is enabled, we want the EOF to be aligned as well, so
> do not free EOF blocks.
>
> Add helper function xfs_get_extsz() to get the guaranteed extent size
> alignment for forcealign enabled. Since this code is only relevant to
> forcealign and forcealign is not possible for RT yet, ignore RT.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_bmap_util.c | 7 ++++++-
> fs/xfs/xfs_inode.c | 14 ++++++++++++++
> fs/xfs/xfs_inode.h | 1 +
> 3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c2531c28905c..07bfb03c671a 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -542,8 +542,13 @@ xfs_can_free_eofblocks(
> * forever.
> */
> end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> +
> + /* Do not free blocks when forcing extent sizes */

That comment seems wrong - this just rounds up where we start
trimming from to be aligned...

> + if (xfs_get_extsz(ip) > 1)
> + end_fsb = roundup_64(end_fsb, xfs_get_extsz(ip));
> + else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
> end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);

I think this would be better written as:

/*
* Forced extent alignment requires us to round up where we
* start trimming from so that result is correctly aligned.
*/
if (xfs_inode_forcealign(ip)) {
if (ip->i_extsize > 1)
end_fsb = roundup_64(end_fsb, ip->i_extsize);
else if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1)
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
}

Because we only want end-fsb alignment when forced alignment is
required.

Which then makes me wonder: truncate needs this, too, doesn't it?
And the various fallocate() ops like hole punching and extent
shifting?

> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2c439df8c47f..bbb8886f1d32 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -65,6 +65,20 @@ xfs_get_extsz_hint(
> return 0;
> }
>
> +/*
> + * Helper function to extract extent size. It will return a power-of-2,
> + * as forcealign requires this.
> + */
> +xfs_extlen_t
> +xfs_get_extsz(
> + struct xfs_inode *ip)
> +{
> + if (xfs_inode_forcealign(ip) && ip->i_extsize)
> + return ip->i_extsize;
> +
> + return 1;
> +}

This can go away - if it is needed elsewhere, then I think it would
be better open coded because it better documents what the code is
doing...

-Dave.

--
Dave Chinner
[email protected]

2024-03-06 21:23:11

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 13/14] fs: xfs: Validate atomic writes

On Mon, Mar 04, 2024 at 01:04:27PM +0000, John Garry wrote:
> Validate that an atomic write adheres to length/offset rules. Since we
> require extent alignment for atomic writes, this effectively also enforces
> that the BIO which iomap produces is aligned.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_file.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index d0bd9d5f596c..cecc5428fd7c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -709,11 +709,20 @@ xfs_file_dio_write(
> struct kiocb *iocb,
> struct iov_iter *from)
> {
> - struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct inode *inode = file_inode(iocb->ki_filp);
> + struct xfs_inode *ip = XFS_I(inode);
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
>
> + if (iocb->ki_flags & IOCB_ATOMIC) {
> + if (!generic_atomic_write_valid(iocb->ki_pos, from,
> + i_blocksize(inode),

a.k.a. mp->m_bsize. If you use that here, then the need for the VFS
inode goes away, too.

> + XFS_FSB_TO_B(mp, xfs_get_extsz(ip)))) {
> + return -EINVAL;
> + }
> + }

Also, I think the checks are the wrong way around here. We can only
do IOCB_ATOMIC IO on force aligned/atomic write inodes, so shouldn't
we be checking that first, then basing the rest of the checks on the
assumption that atomic writes are allowed and have been set up
correctly on the inode? i.e.

if (iocb->ki_flags & IOCB_ATOMIC) {
if (!xfs_inode_has_atomicwrites(ip))
return -EINVAL;
if (!generic_atomic_write_valid(iocb->ki_pos, from,
mp->m_bsize, ip->i_extsize))
return -EINVAL;
}

because xfs_inode_has_atomicwrites() implies ip->i_extsize has been
set to the required atomic IO size?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2024-03-05 22:18:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

On Tue, Mar 05, 2024 at 03:22:52PM +0000, John Garry wrote:
> On 05/03/2024 00:44, Dave Chinner wrote:
> > On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
...
> > IOWs, these are static geometry constraints and so should be checked
> > and rejected at the point where alignments are specified (i.e.
> > mkfs, mount and ioctls). Then the allocator can simply assume that
> > forced inode alignments are always stripe alignment compatible and
> > we don't need separate handling of two possibly incompatible
> > alignments.
>
> ok, makes sense.
>
> Please note in case missed, I am mandating extsize hint for forcealign needs
> to be a power-of-2. It just makes life easier for all the sub-extent
> zero'ing later on.

That's fine - that will need to be documented in the xfsctl man
page...

> Also we need to enforce that the AG count to be compliant with the extsize
^^^^^ size?

> hint for forcealign; but since the extsize hint for forcealign needs to be
> compliant with stripe unit, above, and stripe unit would be compliant wth AG
> count (right?), then this would be a given.

We already align AG size to stripe unit when a stripe unit is set,
and ensure that we don't place all the AG headers on the same stripe
unit.

However, if there is no stripe unit we don't align the AG to
anything. So, yes, AG sizing by mkfs will need to ensure that all
AGs are correctly aligned to the underlying storage (integer
multiple of the max atomic write size, right?)...

> > More below....
> >
> > > + } else {
> > > + args->alignment = 1;
> > > + }
> >
> > Just initialise the allocation args structure with a value of 1 like
> > we already do?
>
> It was being done in this way to have just a single place where the value is
> initialised. It can easily be kept as is.

I'd prefer it as is, because then the value is always initialised
correctly and we only override in the special cases....

> > > args.minleft = ap->minleft;
> > > @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
> > > {
> > > struct xfs_mount *mp = args->mp;
> > > struct xfs_perag *caller_pag = args->pag;
> > > + int orig_alignment = args->alignment;
> > > int error;
> > > /*
> > > @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
> > > /*
> > > * Allocation failed, so turn return the allocation args to their
> > > - * original non-aligned state so the caller can proceed on allocation
> > > - * failure as if this function was never called.
> > > + * original state so the caller can proceed on allocation failure as
> > > + * if this function was never called.
> > > */
> > > - args->alignment = 1;
> > > + args->alignment = orig_alignment;
> > > return 0;
> > > }
> >
> > As I said above, we can't set an alignment of > 1 here if we haven't
> > accounted for that alignment in args->minalignslop above. This leads
> > to unexpected ENOSPC conditions and filesystem shutdowns.
> >
> > I suspect what we need to do is get rid of the separate stripe_align
> > variable altogether and always just set args->alignment to what we
> > need the extent start alignment to be, regardless of whether it is
> > from stripe alignment or forced alignment.
>
> ok, it sounds a bit simpler at least
>
> >
> > Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> > 'stripe_align' is - the exact EOF block allocation can simply save
> > and restore the args->alignment value and use it for minalignslop
> > calculations for the initial exact block allocation.
> >
> > Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> > is true, we can abort allocation immediately, and not bother to fall
> > back on further aligned/unaligned attempts that will also fail or do
> > the wrong them.
>
> ok
>
> >
> > Similarly, if we aren't doing EOF allocation, having args->alignment
> > set means it will do the right thing for the first allocation
> > attempt. Again, if that fails, we can check if
> > xfs_inode_forcealign() is true and fail the aligned allocation
> > instead of running the low space algorithm. This now makes it clear
> > that we're failing the allocation because of the forced alignment
> > requirement, and now the low space allocation code can explicitly
> > turn off start alignment as it isn't required...
>
> are you saying that low-space allocator can set args->alignment = 1 to be
> explicit?

Yes.

-Dave.

--
Dave Chinner
[email protected]