2024-06-07 14:40:54

by John Garry

Subject: [PATCH v4 00/22] block atomic writes for xfs

This series expands atomic write support to filesystems, specifically
XFS. Extent alignment is based on the new forcealign feature.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

XFS can be formatted for atomic writes as follows:
mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1 /dev/sda

atomic-writes=1 just enables atomic writes in the superblock, but does
not auto-enable atomic writes for each file.

Per-file support can then be enabled through the xfs_io command:
$xfs_io -c "lsattr -v" filename
[extsize, force-align] filename
$xfs_io -c "extsize" filename
[16384] filename
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[extsize, force-align, atomic-writes] filename
$xfs_io -c statx filename
...
stat.stx_atomic_write_unit_min = 4096
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
...

A known issue - as reported in
https://lore.kernel.org/linux-xfs/[email protected]/T/#m7093bc85a8e0cbe13c111284768476d294aa077a
- is that forcealign is broken for !power-of-2 sizes. That needs fixing.

New in this series is a rework of the iomap extent granularity zeroing
code. In the earlier series, iomap would consider a larger block zeroing
size only when a member was set in struct iomap. Now each filesystem is
responsible for setting this size, which is i_blocksize(inode) when just
regular sub-fs block zeroing is wanted. All relevant filesystems which
use iomap are fixed up for this.

Baseline is the following series (which is based on Jens' block-6.10 branch):
https://lore.kernel.org/linux-nvme/[email protected]/T/#mb980c084be402472601831c47fb2b66d0bfa8f0e

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/forcealign_and_atomicwrites_for_v4_xfs_block_atomic_writes

Patches for this series can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.10-v7-fs-v4/

Changes since v3:
https://lore.kernel.org/linux-xfs/[email protected]/T/#m9424b3cd1ccfde795d04474fdb4456520b6b4242
- Only enforce that the forcealign extsize is a power-of-2 for atomic writes
- Re-org some validation code
- Fix xfs_bmap_process_allocated_extent() for forcealign
- Support iomap->io_block_size and make each fs support it
- Add !power-of-2 iomap support for io_block_size
- Make iomap dio iter handle atomic write failure properly by zeroing the
remaining io_block_size

Changes since v2:
https://lore.kernel.org/linux-xfs/[email protected]/
- Incorporate forcealign patches from
https://lore.kernel.org/linux-xfs/[email protected]/
- Put bdev awu min and max in buftarg
- Extra forcealign patches to deal with truncate and fallocate punch,
insert, collapse
- Add generic_atomic_write_valid_size()
- Change iomap.extent_shift -> .extent_size

Darrick J. Wong (2):
xfs: Introduce FORCEALIGN inode flag
xfs: Enable file data forcealign feature

Dave Chinner (6):
xfs: only allow minlen allocations when near ENOSPC
xfs: always tail align maxlen allocations
xfs: simplify extent allocation alignment
xfs: make EOF allocation simpler
xfs: introduce forced allocation alignment
xfs: align args->minlen for forced allocation alignment

John Garry (14):
fs: Add generic_atomic_write_valid_size()
iomap: Allow filesystems set IO block zeroing size
xfs: Use extent size granularity for iomap->io_block_size
xfs: Do not free EOF blocks for forcealign
xfs: Update xfs_inode_alloc_unitsize_fsb() for forcealign
xfs: Unmap blocks according to forcealign
xfs: Only free full extents for forcealign
xfs: Don't revert allocated offset for forcealign
fs: Add FS_XFLAG_ATOMICWRITES flag
iomap: Atomic write support
xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
xfs: Support atomic write for statx
xfs: Validate atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE

block/fops.c | 1 +
fs/btrfs/inode.c | 1 +
fs/erofs/data.c | 1 +
fs/erofs/zmap.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 1 +
fs/f2fs/data.c | 1 +
fs/fuse/dax.c | 1 +
fs/gfs2/bmap.c | 1 +
fs/hpfs/file.c | 1 +
fs/iomap/direct-io.c | 41 ++++-
fs/xfs/libxfs/xfs_alloc.c | 33 ++--
fs/xfs/libxfs/xfs_alloc.h | 3 +-
fs/xfs/libxfs/xfs_bmap.c | 308 ++++++++++++++++++----------------
fs/xfs/libxfs/xfs_format.h | 16 +-
fs/xfs/libxfs/xfs_ialloc.c | 12 +-
fs/xfs/libxfs/xfs_inode_buf.c | 105 ++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 5 +
fs/xfs/libxfs/xfs_sb.c | 4 +
fs/xfs/xfs_bmap_util.c | 14 +-
fs/xfs/xfs_buf.c | 15 +-
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_buf_mem.c | 2 +-
fs/xfs/xfs_file.c | 49 +++++-
fs/xfs/xfs_inode.c | 41 ++++-
fs/xfs/xfs_inode.h | 29 ++++
fs/xfs/xfs_ioctl.c | 83 ++++++++-
fs/xfs/xfs_iomap.c | 7 +
fs/xfs/xfs_iops.c | 28 ++++
fs/xfs/xfs_mount.h | 4 +
fs/xfs/xfs_reflink.h | 10 --
fs/xfs/xfs_super.c | 8 +
fs/xfs/xfs_trace.h | 8 +-
fs/zonefs/file.c | 2 +
include/linux/fs.h | 12 ++
include/linux/iomap.h | 2 +
include/uapi/linux/fs.h | 3 +
38 files changed, 656 insertions(+), 203 deletions(-)

--
2.31.1



2024-06-07 14:41:17

by John Garry

Subject: [PATCH v4 07/22] xfs: make EOF allocation simpler

From: Dave Chinner <[email protected]>

Currently the allocation at EOF is broken into two cases - when the
offset is zero and when the offset is non-zero. When the offset is
non-zero, we try to do exact block allocation for contiguous
extent allocation. When the offset is zero, the allocation is simply
an aligned allocation.

We want aligned allocation as the fallback when exact block
allocation fails, but that complicates the EOF allocation in that it
now has to handle two different allocation cases. The
caller also has to handle allocation when not at EOF, and for the
upcoming forced alignment changes we need that to also be aligned
allocation.

To simplify all this, pull the aligned allocation cases back into
the callers and leave the EOF allocation path for exact block
allocation only. This means that the EOF exact block allocation
fallback path is the normal aligned allocation path and that ends up
making things a lot simpler when forced alignment is introduced.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 129 +++++++++++++++----------------------
fs/xfs/libxfs/xfs_ialloc.c | 2 +-
2 files changed, 54 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7f8c8e4dd244..528e3cd81ee6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3310,12 +3310,12 @@ xfs_bmap_select_minlen(
static int
xfs_bmap_btalloc_select_lengths(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t *blen)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *pag;
xfs_agnumber_t agno, startag;
+ xfs_extlen_t blen = 0;
int error = 0;

if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
@@ -3329,19 +3329,18 @@ xfs_bmap_btalloc_select_lengths(
if (startag == NULLAGNUMBER)
startag = 0;

- *blen = 0;
for_each_perag_wrap(mp, startag, agno, pag) {
- error = xfs_bmap_longest_free_extent(pag, args->tp, blen);
+ error = xfs_bmap_longest_free_extent(pag, args->tp, &blen);
if (error && error != -EAGAIN)
break;
error = 0;
- if (*blen >= args->maxlen)
+ if (blen >= args->maxlen)
break;
}
if (pag)
xfs_perag_rele(pag);

- args->minlen = xfs_bmap_select_minlen(ap, args, *blen);
+ args->minlen = xfs_bmap_select_minlen(ap, args, blen);
return error;
}

@@ -3551,78 +3550,40 @@ xfs_bmap_exact_minlen_extent_alloc(
* If we are not low on available data blocks and we are allocating at
* EOF, optimise allocation for contiguous file extension and/or stripe
* alignment of the new extent.
- *
- * NOTE: ap->aeof is only set if the allocation length is >= the
- * stripe unit and the allocation offset is at the end of file.
*/
static int
xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t blen,
- bool ag_only)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ xfs_extlen_t alignment = args->alignment;
int error;

+ ASSERT(ap->aeof && ap->offset);
+ ASSERT(args->alignment >= 1);
+
/*
- * If there are already extents in the file, try an exact EOF block
- * allocation to extend the file as a contiguous extent. If that fails,
- * or it's the first allocation in a file, just try for a stripe aligned
- * allocation.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignment space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
- if (ap->offset) {
- xfs_extlen_t alignment = args->alignment;
-
- /*
- * Compute the alignment slop for the fallback path so we ensure
- * we account for the potential alignemnt space required by the
- * fallback paths before we modify the AGF and AGFL here.
- */
- args->alignment = 1;
- args->alignslop = alignment - args->alignment;
-
- if (!caller_pag)
- args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
- error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
- if (!caller_pag) {
- xfs_perag_put(args->pag);
- args->pag = NULL;
- }
- if (error)
- return error;
-
- if (args->fsbno != NULLFSBLOCK)
- return 0;
- /*
- * Exact allocation failed. Reset to try an aligned allocation
- * according to the original allocation specification.
- */
- args->alignment = alignment;
- args->alignslop = 0;
- }
+ args->alignment = 1;
+ args->alignslop = alignment - args->alignment;

- if (ag_only) {
- error = xfs_alloc_vextent_near_bno(args, ap->blkno);
- } else {
+ if (!caller_pag)
+ args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
+ error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
+ if (!caller_pag) {
+ xfs_perag_put(args->pag);
args->pag = NULL;
- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
- ASSERT(args->pag == NULL);
- args->pag = caller_pag;
}
- if (error)
- return error;

- if (args->fsbno != NULLFSBLOCK)
- return 0;
-
- /*
- * Aligned allocation failed, so all fallback paths from here drop the
- * start alignment requirement as we know it will not succeed.
- */
- args->alignment = 1;
- return 0;
+ /* Reset alignment to original specifications. */
+ args->alignment = alignment;
+ args->alignslop = 0;
+ return error;
}

/*
@@ -3688,12 +3649,19 @@ xfs_bmap_btalloc_filestreams(
}

args->minlen = xfs_bmap_select_minlen(ap, args, blen);
- if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);

+ /* This may be an aligned allocation attempt. */
if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);

+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+ }
+
out_low_space:
/*
* We are now done with the perag reference for the filestreams
@@ -3715,7 +3683,6 @@ xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
- xfs_extlen_t blen = 0;
int error;

ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
@@ -3726,23 +3693,33 @@ xfs_bmap_btalloc_best_length(
* the request. If one isn't found, then adjust the minimum allocation
* size to the largest space found.
*/
- error = xfs_bmap_btalloc_select_lengths(ap, args, &blen);
+ error = xfs_bmap_btalloc_select_lengths(ap, args);
if (error)
return error;

/*
- * Don't attempt optimal EOF allocation if previous allocations barely
- * succeeded due to being near ENOSPC. It is highly unlikely we'll get
- * optimal or even aligned allocations in this case, so don't waste time
- * trying.
+ * If we are in low space mode, then optimal allocation will fail so
+ * prepare for minimal allocation and run the low space algorithm
+ * immediately.
*/
- if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
- if (error || args->fsbno != NULLFSBLOCK)
- return error;
+ if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
+ ASSERT(args->fsbno == NULLFSBLOCK);
+ return xfs_bmap_btalloc_low_space(ap, args);
+ }
+
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);
+
+ /* This may be an aligned allocation attempt. */
+ if (!error && args->fsbno == NULLFSBLOCK)
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
+
+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}

- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
if (error || args->fsbno != NULLFSBLOCK)
return error;

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 9f71a9a3a65e..40a2daeea712 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
* the exact agbno requirement and increase the alignment
* instead. It is critical that the total size of the request
* (len + alignment + slop) does not increase from this point
- * on, so reset minalignslop to ensure it is not included in
+ * on, so reset alignslop to ensure it is not included in
* subsequent requests.
*/
args.alignslop = 0;
--
2.31.1


2024-06-07 14:41:54

by John Garry

Subject: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

Allow filesystems to set the io_block_size for sub-fs block size zeroing,
as in future we will want to extend this feature to support zeroing of
block sizes larger than the inode block size.

The value in io_block_size does not have to be a power-of-2, so fix up
zeroing code to handle that.

Signed-off-by: John Garry <[email protected]>
---
block/fops.c | 1 +
fs/btrfs/inode.c | 1 +
fs/erofs/data.c | 1 +
fs/erofs/zmap.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 1 +
fs/f2fs/data.c | 1 +
fs/fuse/dax.c | 1 +
fs/gfs2/bmap.c | 1 +
fs/hpfs/file.c | 1 +
fs/iomap/direct-io.c | 23 +++++++++++++++++++----
fs/xfs/xfs_iomap.c | 1 +
fs/zonefs/file.c | 2 ++
include/linux/iomap.h | 2 ++
15 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 9d6d86ebefb9..020443078630 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->addr = iomap->offset;
iomap->length = isize - iomap->offset;
iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
+ iomap->io_block_size = i_blocksize(inode);
return 0;
}

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 753db965f7c0..665811b1578b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7740,6 +7740,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
iomap->offset = start;
iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
iomap->length = len;
+ iomap->io_block_size = i_blocksize(inode);
free_extent_map(em);

return 0;
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 8be60797ea2f..ea9d2f3eadb3 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -305,6 +305,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
if (flags & IOMAP_DAX)
iomap->addr += mdev.m_dax_part_off;
}
+ iomap->io_block_size = i_blocksize(inode);
return 0;
}

diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 9b248ee5fef2..6ee89f6a078c 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -749,6 +749,7 @@ static int z_erofs_iomap_begin_report(struct inode *inode, loff_t offset,
if (iomap->offset >= inode->i_size)
iomap->length = length + offset - map.m_la;
}
+ iomap->io_block_size = i_blocksize(inode);
iomap->flags = 0;
return 0;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0caa1650cee8..7a5539a52844 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -862,6 +862,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->length = (u64)ret << blkbits;
iomap->flags |= IOMAP_F_MERGED;
}
+ iomap->io_block_size = i_blocksize(inode);

if (new)
iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e067f2dd0335..ce3269874fde 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4933,6 +4933,7 @@ static int ext4_iomap_xattr_fiemap(struct inode *inode, struct iomap *iomap)
iomap->length = length;
iomap->type = iomap_type;
iomap->flags = 0;
+ iomap->io_block_size = i_blocksize(inode);
out:
return error;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9ccf5fe0..3ec82e4d71c4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3235,6 +3235,7 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->bdev = inode->i_sb->s_bdev;
iomap->offset = (u64) map->m_lblk << blkbits;
iomap->length = (u64) map->m_len << blkbits;
+ iomap->io_block_size = i_blocksize(inode);

if ((map->m_flags & EXT4_MAP_MAPPED) &&
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b9b0debc6b3d..6c12641b9a7b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4233,6 +4233,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
iomap->addr = IOMAP_NULL_ADDR;
}
+ iomap->io_block_size = i_blocksize(inode);

if (map.m_flags & F2FS_MAP_NEW)
iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..68ddc74cb31e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -577,6 +577,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
iomap->flags = 0;
iomap->bdev = NULL;
iomap->dax_dev = fc->dax->dev;
+ iomap->io_block_size = i_blocksize(inode);

/*
* Both read/write and mmap path can race here. So we need something
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1795c4e8dbf6..8d2de42b1da9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -927,6 +927,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,

out:
iomap->bdev = inode->i_sb->s_bdev;
+ iomap->io_block_size = i_blocksize(inode);
unlock:
up_read(&ip->i_rw_mutex);
return ret;
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 1bb8d97cd9ae..5d2718faf520 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->addr = IOMAP_NULL_ADDR;
iomap->length = 1 << blkbits;
}
+ iomap->io_block_size = i_blocksize(inode);

hpfs_unlock(sb);
return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5be8d886ab4a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
{
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
- unsigned int fs_block_size = i_blocksize(inode), pad;
+ u64 io_block_size = iomap->io_block_size;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
int nr_pages, ret = 0;
size_t copied = 0;
size_t orig_count;
+ unsigned int pad;

if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,

if (need_zeroout) {
/* zero out from the start of the block to the write offset */
- pad = pos & (fs_block_size - 1);
+ if (is_power_of_2(io_block_size)) {
+ pad = pos & (io_block_size - 1);
+ } else {
+ loff_t _pos = pos;
+
+ pad = do_div(_pos, io_block_size);
+ }
+
if (pad)
iomap_dio_zero(iter, dio, pos - pad, pad);
}
@@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
/* zero out from the end of the write to the end of the block */
- pad = pos & (fs_block_size - 1);
+ if (is_power_of_2(io_block_size)) {
+ pad = pos & (io_block_size - 1);
+ } else {
+ loff_t _pos = pos;
+
+ pad = do_div(_pos, io_block_size);
+ }
+
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ iomap_dio_zero(iter, dio, pos, io_block_size - pad);
}
out:
/* Undo iter limitation to current extent */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..ecb4cae88248 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
}
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
+ iomap->io_block_size = i_blocksize(VFS_I(ip));
if (mapping_flags & IOMAP_DAX)
iomap->dax_dev = target->bt_daxdev;
else
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 3b103715acc9..bf2cc4bee309 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
iomap->length = isize - iomap->offset;
}
+ iomap->io_block_size = i_blocksize(inode);
mutex_unlock(&zi->i_truncate_mutex);

trace_zonefs_iomap_begin(inode, iomap);
@@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
iomap->type = IOMAP_MAPPED;
iomap->length = isize - iomap->offset;
}
+ iomap->io_block_size = i_blocksize(inode);
mutex_unlock(&zi->i_truncate_mutex);

trace_zonefs_iomap_begin(inode, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..d63a35b77907 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -103,6 +103,8 @@ struct iomap {
void *private; /* filesystem private */
const struct iomap_folio_ops *folio_ops;
u64 validity_cookie; /* used with .iomap_valid() */
+ /* io block zeroing size, not necessarily a power-of-2 */
+ u64 io_block_size;
};

static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
--
2.31.1


2024-06-07 14:42:08

by John Garry

Subject: [PATCH v4 01/22] fs: Add generic_atomic_write_valid_size()

Add a generic helper for filesystems to validate that an atomic write is
appropriately sized (along with the other checks).

Signed-off-by: John Garry <[email protected]>
---
include/linux/fs.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 069cbab62700..e13d34f8c24e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3645,4 +3645,16 @@ bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
return true;
}

+static inline
+bool generic_atomic_write_valid_size(loff_t pos, struct iov_iter *iter,
+ unsigned int unit_min, unsigned int unit_max)
+{
+ size_t len = iov_iter_count(iter);
+
+ if (len < unit_min || len > unit_max)
+ return false;
+
+ return generic_atomic_write_valid(pos, iter);
+}
+
#endif /* _LINUX_FS_H */
--
2.31.1


2024-06-07 14:42:13

by John Garry

Subject: [PATCH v4 03/22] xfs: Use extent size granularity for iomap->io_block_size

Currently iomap->io_block_size is set to the i_blocksize() value for the
inode.

Expand the sub-fs block size zeroing to now cover RT extents, by setting
iomap->io_block_size to xfs_inode_alloc_unitsize().

In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
this extent granularity.

In xfs_file_dio_write(), treat a write which is not aligned to the extent
size granularity as unaligned. Since the extent size granularity need not
be a power-of-2, handle that case as well.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 24 +++++++++++++++++++-----
fs/xfs/xfs_inode.c | 17 +++++++++++------
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 8 +++++++-
4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b240ea5241dc..24fe3c2e03da 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
}

/*
- * Handle block unaligned direct I/O writes
+ * Handle unaligned direct IO writes.
*
* In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
* them to be done in parallel with reads and other direct I/O writes. However,
@@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
ssize_t ret;

/*
- * Extending writes need exclusivity because of the sub-block zeroing
- * that the DIO code always does for partial tail blocks beyond EOF, so
- * don't even bother trying the fast path in this case.
+ * Extending writes need exclusivity because of the sub-block/extent
+ * zeroing that the DIO code always does for partial tail blocks
+ * beyond EOF, so don't even bother trying the fast path in this case.
*/
if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -698,11 +698,25 @@ xfs_file_dio_write(
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ bool unaligned;
+ u64 unitsize;

/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
- if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
+
+ unitsize = xfs_inode_alloc_unitsize(ip);
+ if (!is_power_of_2(unitsize)) {
+ if (isaligned_64(iocb->ki_pos, unitsize) &&
+ isaligned_64(count, unitsize))
+ unaligned = false;
+ else
+ unaligned = true;
+ } else {
+ unaligned = (iocb->ki_pos | count) & (unitsize - 1);
+ }
+
+ if (unaligned)
return xfs_file_dio_write_unaligned(ip, iocb, from);
return xfs_file_dio_write_aligned(ip, iocb, from);
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..93ad442f399b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -4264,15 +4264,20 @@ xfs_break_layouts(
return error;
}

-/* Returns the size of fundamental allocation unit for a file, in bytes. */
unsigned int
-xfs_inode_alloc_unitsize(
+xfs_inode_alloc_unitsize_fsb(
struct xfs_inode *ip)
{
- unsigned int blocks = 1;
-
if (XFS_IS_REALTIME_INODE(ip))
- blocks = ip->i_mount->m_sb.sb_rextsize;
+ return ip->i_mount->m_sb.sb_rextsize;
+
+ return 1;
+}

- return XFS_FSB_TO_B(ip->i_mount, blocks);
+/* Returns the size of fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+ struct xfs_inode *ip)
+{
+ return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 292b90b5f2ac..90d2fa837117 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+unsigned int xfs_inode_alloc_unitsize_fsb(struct xfs_inode *ip);
unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);

struct xfs_dir_update_params {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ecb4cae88248..fbe69f747e30 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -127,7 +127,7 @@ xfs_bmbt_to_iomap(
}
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
- iomap->io_block_size = i_blocksize(VFS_I(ip));
+ iomap->io_block_size = xfs_inode_alloc_unitsize(ip);
if (mapping_flags & IOMAP_DAX)
iomap->dax_dev = target->bt_daxdev;
else
@@ -577,11 +577,17 @@ xfs_iomap_write_unwritten(
xfs_fsize_t i_size;
uint resblks;
int error;
+ unsigned int rounding;

trace_xfs_unwritten_convert(ip, offset, count);

offset_fsb = XFS_B_TO_FSBT(mp, offset);
count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ rounding = xfs_inode_alloc_unitsize_fsb(ip);
+ if (rounding > 1) {
+ offset_fsb = rounddown_64(offset_fsb, rounding);
+ count_fsb = roundup_64(count_fsb, rounding);
+ }
count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb);

/*
--
2.31.1


2024-06-07 14:42:42

by John Garry

Subject: [PATCH v4 05/22] xfs: always tail align maxlen allocations

From: Dave Chinner <[email protected]>

When we do a large allocation, the core free space allocation code
assumes that args->maxlen is aligned to args->prod/args->mod. Hence,
if we get a maximum sized extent allocated, it does not do tail
alignment of the extent.

However, this assumes that nothing modifies args->maxlen between the
original allocation context setup and trimming the selected free
space extent to size. This assumption has recently been found to be
invalid - xfs_alloc_space_available() modifies args->maxlen in low
space situations - and there may be more situations we haven't yet
found like this.

Forced alignment introduces the requirement that extents are
correctly tail aligned, resulting in this occasional latent
alignment failure being reclassified from an unimportant curiosity
to a must-fix bug.

Removing the assumption about args->maxlen allocations always being
tail aligned is trivial, and should not impact anything because
args->maxlen for inodes with extent size hints configured is
already aligned. Hence all this change does is avoid weird corner
cases that would have resulted in unaligned extent sizes, by always
trimming the extent down to an aligned size.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 5855a21d4864..32f72217c126 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -432,20 +432,18 @@ xfs_alloc_compute_diff(
* Fix up the length, based on mod and prod.
* len should be k * prod + mod for some k.
* If len is too small it is returned unchanged.
- * If len hits maxlen it is left alone.
*/
-STATIC void
+static void
xfs_alloc_fix_len(
- xfs_alloc_arg_t *args) /* allocation argument structure */
+ struct xfs_alloc_arg *args)
{
- xfs_extlen_t k;
- xfs_extlen_t rlen;
+ xfs_extlen_t k;
+ xfs_extlen_t rlen = args->len;

ASSERT(args->mod < args->prod);
- rlen = args->len;
ASSERT(rlen >= args->minlen);
ASSERT(rlen <= args->maxlen);
- if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
+ if (args->prod <= 1 || rlen < args->mod ||
(args->mod == 0 && rlen < args->prod))
return;
k = rlen % args->prod;
--
2.31.1


2024-06-07 14:43:32

by John Garry

Subject: [PATCH v4 09/22] xfs: align args->minlen for forced allocation alignment

From: Dave Chinner <[email protected]>

If args->minlen is not aligned to the constraints of forced
alignment, we may do minlen allocations that are not aligned when we
approach ENOSPC. Avoid this by always aligning args->minlen
appropriately. If alignment of minlen results in a value smaller
than the alignment constraint, fail the allocation immediately.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 45 +++++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9131ba8113a6..c9cf138e13c4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3278,33 +3278,48 @@ xfs_bmap_longest_free_extent(
return 0;
}

-static xfs_extlen_t
+static int
xfs_bmap_select_minlen(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen)
{
-
/* Adjust best length for extent start alignment. */
if (blen > args->alignment)
blen -= args->alignment;

/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
- * possible that there is enough contiguous free space for this request.
+ * possible that there is enough contiguous free space for this request
+ * even if the best length is less than the minimum length we need.
+ *
+ * If the best length won't satisfy the maximum length we requested,
+ * then use it as the minimum length so we get as large an allocation
+ * as possible.
*/
if (blen < ap->minlen)
- return ap->minlen;
+ blen = ap->minlen;
+ else if (blen > args->maxlen)
+ blen = args->maxlen;

/*
- * If the best seen length is less than the request length,
- * use the best as the minimum, otherwise we've got the maxlen we
- * were asked for.
+ * If we have alignment constraints, round the minlen down to match the
+ * constraint so that alignment will be attempted. This may reduce the
+ * allocation to smaller than was requested, so clamp the minimum to
+ * ap->minlen to allow unaligned allocation to succeed. If we are forced
+ * to align the allocation, return ENOSPC at this point because we don't
+ * have enough contiguous free space to guarantee aligned allocation.
*/
- if (blen < args->maxlen)
- return blen;
- return args->maxlen;
-
+ if (args->alignment > 1) {
+ blen = rounddown(blen, args->alignment);
+ if (blen < ap->minlen) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return -ENOSPC;
+ blen = ap->minlen;
+ }
+ }
+ args->minlen = blen;
+ return 0;
}

static int
@@ -3340,8 +3355,7 @@ xfs_bmap_btalloc_select_lengths(
if (pag)
xfs_perag_rele(pag);

- args->minlen = xfs_bmap_select_minlen(ap, args, blen);
- return error;
+ return xfs_bmap_select_minlen(ap, args, blen);
}

/* Update all inode and quota accounting for the allocation we just did. */
@@ -3661,7 +3675,10 @@ xfs_bmap_btalloc_filestreams(
goto out_low_space;
}

- args->minlen = xfs_bmap_select_minlen(ap, args, blen);
+ error = xfs_bmap_select_minlen(ap, args, blen);
+ if (error)
+ goto out_low_space;
+
if (ap->aeof && ap->offset)
error = xfs_bmap_btalloc_at_eof(ap, args);

--
2.31.1


2024-06-07 14:45:19

by John Garry

[permalink] [raw]
Subject: [PATCH v4 11/22] xfs: Do not free EOF blocks for forcealign

When forcealign is enabled, we want the EOF extent to remain aligned as
well, so do not free EOF blocks.
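The effect of the roundup in the hunk below can be sketched as follows (roundup_ext is a hypothetical stand-in for the kernel's roundup_64(); rounding end_fsb up to the extent size means a partially-used allocation unit at EOF is treated as fully in use and its blocks are not freed):

```c
#include <assert.h>

/* Round a block number up to the next multiple of the extent size. */
static unsigned long long roundup_ext(unsigned long long fsb,
				      unsigned int extsize)
{
	unsigned long long rem = fsb % extsize;

	return rem ? fsb + (extsize - rem) : fsb;
}
```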

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e5d893f93522..56b80a7c0992 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -539,8 +539,13 @@ xfs_can_free_eofblocks(
* forever.
*/
end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
- if (xfs_inode_has_bigrtalloc(ip))
+
+ /* Do not free blocks when forcing extent sizes */
+ if (xfs_inode_has_forcealign(ip))
+ end_fsb = roundup_64(end_fsb, ip->i_extsize);
+ else if (xfs_inode_has_bigrtalloc(ip))
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+
last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
if (last_fsb <= end_fsb)
return false;
--
2.31.1


2024-06-07 14:46:12

by John Garry

[permalink] [raw]
Subject: [PATCH v4 13/22] xfs: Unmap blocks according to forcealign

When forcealign is enabled, blocks in an inode need to be unmapped
according to the extent alignment, as is already done for rtvol.
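The alignment-offset calculation that xfs_bunmapi_align() performs below can be sketched like this (ext_offset is a hypothetical stand-in; note the mask fast path for power-of-2 extent sizes versus a plain modulo otherwise):

```c
#include <assert.h>

/*
 * Offset of bno within its extsize-aligned granule: a mask when the
 * extent size is a power of 2, a modulo otherwise.
 */
static unsigned int ext_offset(unsigned long long bno, unsigned int extsize)
{
	if ((extsize & (extsize - 1)) == 0)	/* power of 2 */
		return bno & (extsize - 1);
	return bno % extsize;
}
```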

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c9cf138e13c4..2b6d5ebd8b4f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5380,6 +5380,20 @@ xfs_bmap_del_extent_real(
return 0;
}

+static xfs_extlen_t
+xfs_bunmapi_align(
+ struct xfs_inode *ip,
+ xfs_fsblock_t bno)
+{
+ if (xfs_inode_has_forcealign(ip)) {
+ if (is_power_of_2(ip->i_extsize))
+ return bno & (ip->i_extsize - 1);
+ return do_div(bno, ip->i_extsize);
+ }
+ ASSERT(XFS_IS_REALTIME_INODE(ip));
+ return xfs_rtb_to_rtxoff(ip->i_mount, bno);
+}
+
/*
* Unmap (remove) blocks from a file.
* If nexts is nonzero then the number of extents to remove is limited to
@@ -5402,6 +5416,7 @@ __xfs_bunmapi(
struct xfs_bmbt_irec got; /* current extent record */
struct xfs_ifork *ifp; /* inode fork pointer */
int isrt; /* freeing in rt area */
+ int isforcealign; /* freeing for inode with forcealign */
int logflags; /* transaction logging flags */
xfs_extlen_t mod; /* rt extent offset */
struct xfs_mount *mp = ip->i_mount;
@@ -5439,6 +5454,8 @@ __xfs_bunmapi(
}
XFS_STATS_INC(mp, xs_blk_unmap);
isrt = xfs_ifork_is_realtime(ip, whichfork);
+ isforcealign = (whichfork != XFS_ATTR_FORK) &&
+ xfs_inode_has_forcealign(ip);
end = start + len;

if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
@@ -5490,11 +5507,10 @@ __xfs_bunmapi(
if (del.br_startoff + del.br_blockcount > end + 1)
del.br_blockcount = end + 1 - del.br_startoff;

- if (!isrt || (flags & XFS_BMAPI_REMAP))
+ if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
goto delete;

- mod = xfs_rtb_to_rtxoff(mp,
- del.br_startblock + del.br_blockcount);
+ mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
if (mod) {
/*
* Realtime extent not lined up at the end.
@@ -5542,9 +5558,16 @@ __xfs_bunmapi(
goto nodelete;
}

- mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+ mod = xfs_bunmapi_align(ip, del.br_startblock);
if (mod) {
- xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
+ xfs_extlen_t off;
+
+ if (isforcealign) {
+ off = ip->i_extsize - mod;
+ } else {
+ ASSERT(isrt);
+ off = mp->m_sb.sb_rextsize - mod;
+ }

/*
* Realtime extent is lined up at the end but not
--
2.31.1


2024-06-07 14:46:56

by John Garry

[permalink] [raw]
Subject: [PATCH v4 04/22] xfs: only allow minlen allocations when near ENOSPC

From: Dave Chinner <[email protected]>

When we are near ENOSPC and don't have enough free
space for an args->maxlen allocation, xfs_alloc_space_available()
will trim args->maxlen to equal the available space. However, this
function has only checked that there is enough contiguous free space
for an aligned args->minlen allocation to succeed. Hence there is no
guarantee that an args->maxlen allocation will succeed, nor that the
available space will allow for correct alignment of an args->maxlen
allocation.

Further, trimming args->maxlen arbitrarily breaks an
assumption made in xfs_alloc_fix_len() that if the caller wants
aligned allocation, then args->maxlen will be set to an aligned
value. It then skips the tail alignment and so we end up with
extents that aren't aligned to extent size hint boundaries as we
approach ENOSPC.

To avoid this problem, don't reduce args->maxlen by some random,
arbitrary amount. If args->maxlen is too large for the available
space, reduce the allocation to a minlen allocation as we know we
have contiguous free space available for this to succeed and always
be correctly aligned.
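The fallback decision described above can be sketched in a few lines of userland C (clamp_maxlen is a hypothetical distillation of the xfs_alloc_space_available() change below, using the worst-case aligned length as the threshold):

```c
#include <assert.h>

/*
 * If the longest contiguous free extent cannot hold a worst-case
 * aligned maxlen allocation, fall back to a minlen allocation rather
 * than trimming maxlen to an arbitrary (possibly misaligned) size.
 */
static unsigned int clamp_maxlen(unsigned int longest, unsigned int maxlen,
				 unsigned int minlen, unsigned int alignment,
				 unsigned int slop)
{
	unsigned int alloc_len = maxlen + (alignment - 1) + slop;

	if (longest < alloc_len)
		return minlen;
	return maxlen;
}
```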

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6c55a6e88eba..5855a21d4864 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2409,14 +2409,23 @@ xfs_alloc_space_available(
if (available < (int)max(args->total, alloc_len))
return false;

+ if (flags & XFS_ALLOC_FLAG_CHECK)
+ return true;
+
/*
- * Clamp maxlen to the amount of free space available for the actual
- * extent allocation.
+ * If we can't do a maxlen allocation, then we must reduce the size of
+ * the allocation to match the available free space. We know how big
+ * the largest contiguous free space we can allocate is, so that's our
+ * upper bound. However, we don't exactly know what alignment/size
+ * constraints have been placed on the allocation, so we can't
+ * arbitrarily select some new max size. Hence make this a minlen
+ * allocation as we know that will definitely succeed and match the
+ * caller's alignment constraints.
*/
- if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) {
- args->maxlen = available;
+ alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+ if (longest < alloc_len) {
+ args->maxlen = args->minlen;
ASSERT(args->maxlen > 0);
- ASSERT(args->maxlen >= args->minlen);
}

return true;
--
2.31.1


2024-06-07 14:47:09

by John Garry

[permalink] [raw]
Subject: [PATCH v4 08/22] xfs: introduce forced allocation alignment

From: Dave Chinner <[email protected]>

When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If aligned allocation cannot be done, then the allocation
is failed rather than attempting non-aligned fallbacks.

Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.h | 1 +
fs/xfs/libxfs/xfs_bmap.c | 29 +++++++++++++++++++++++------
fs/xfs/xfs_inode.h | 5 +++++
3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index aa2c103d98f0..7de2e6f64882 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
#define XFS_ALLOC_USERDATA (1 << 0)/* allocation is for user data*/
#define XFS_ALLOC_INITIAL_USER_DATA (1 << 1)/* special case start of file */
#define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN (1 << 3)/* forced extent alignment */

/* freespace limit calculations */
unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 528e3cd81ee6..9131ba8113a6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3401,9 +3401,10 @@ xfs_bmap_alloc_account(
* Calculate the extent start alignment and the extent length adjustments that
* constrain this allocation.
*
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
*
* Low level allocation code is free to either ignore or override these values
* as required.
@@ -3416,11 +3417,18 @@ xfs_bmap_compute_alignments(
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */

- /* stripe alignment for allocation is determined by mount parameters */
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ /*
+ * Forced inode alignment takes preference over stripe alignment.
+ * Stripe alignment for allocation is determined by mount parameters.
+ */
+ if (xfs_inode_has_forcealign(ap->ip)) {
+ args->alignment = xfs_get_extsz_hint(ap->ip);
+ args->datatype |= XFS_ALLOC_FORCEALIGN;
+ } else if (mp->m_swidth && xfs_has_swalloc(mp)) {
args->alignment = mp->m_swidth;
- else if (mp->m_dalign)
+ } else if (mp->m_dalign) {
args->alignment = mp->m_dalign;
+ }

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
@@ -3607,6 +3615,11 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+ args->fsbno = NULLFSBLOCK;
+ return 0;
+ }
+
args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
@@ -3658,6 +3671,8 @@ xfs_bmap_btalloc_filestreams(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
}
@@ -3716,6 +3731,8 @@ xfs_bmap_btalloc_best_length(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 90d2fa837117..805a8cf522c6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -311,6 +311,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}

+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+ return false;
+}
+
/*
* Decide if this file is a realtime file whose data allocation unit is larger
* than a single filesystem block.
--
2.31.1


2024-06-07 14:48:00

by John Garry

[permalink] [raw]
Subject: [PATCH v4 15/22] xfs: Don't revert allocated offset for forcealign

In xfs_bmap_process_allocated_extent(), when we find that we could not
provide the requested length completely, the mapping is moved so that it
covers as much of the original request as possible.

For forcealign, this would mean ignoring the alignment guarantee, so don't
do this.
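The placement logic this patch makes conditional can be sketched as follows (place_extent is a hypothetical distillation of the offset adjustment in xfs_bmap_process_allocated_extent(); with forcealign the aligned offset found by the allocator is kept as-is):

```c
#include <assert.h>

/*
 * When a shorter-than-requested extent is allocated, move its offset to
 * cover as much of the original request as possible; with forced
 * alignment the found (aligned) offset must be kept instead.
 */
static unsigned long long place_extent(unsigned long long found_off,
				       unsigned int found_len,
				       unsigned long long orig_off,
				       unsigned int orig_len,
				       int forcealign)
{
	if (forcealign)
		return found_off;
	if (found_len <= orig_len)
		return orig_off;
	if (found_off + found_len < orig_off + orig_len)
		return orig_off + orig_len - found_len;
	return found_off;
}
```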

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2b6d5ebd8b4f..b3552cb5fc8f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3492,11 +3492,15 @@ xfs_bmap_process_allocated_extent(
* original request as possible. Free space is apparently
* very fragmented so we're unlikely to be able to satisfy the
* hints anyway.
+ * However, for an inode with forcealign, continue with the
+ * found offset as we need to honour the alignment hint.
*/
- if (ap->length <= orig_length)
- ap->offset = orig_offset;
- else if (ap->offset + ap->length < orig_offset + orig_length)
- ap->offset = orig_offset + orig_length - ap->length;
+ if (!xfs_inode_has_forcealign(ap->ip)) {
+ if (ap->length <= orig_length)
+ ap->offset = orig_offset;
+ else if (ap->offset + ap->length < orig_offset + orig_length)
+ ap->offset = orig_offset + orig_length - ap->length;
+ }
xfs_bmap_alloc_account(ap);
}

--
2.31.1


2024-06-07 14:48:25

by John Garry

[permalink] [raw]
Subject: [PATCH v4 14/22] xfs: Only free full extents for forcealign

Like we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().
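The trimming in the hunk below (start rounded up, end rounded down) can be sketched like this (trim_to_full_extents is a hypothetical helper; if the rounded bounds cross, the range contains no whole extent and nothing is freed):

```c
#include <assert.h>

/*
 * Trim a free range to whole extsize-sized extents: round the start up
 * and the end down; if they cross, nothing can be freed.
 */
static int trim_to_full_extents(unsigned long long *start,
				unsigned long long *end,
				unsigned int extsize)
{
	unsigned long long s = *start, e = *end;

	s = (s + extsize - 1) / extsize * extsize;	/* round up */
	e = e / extsize * extsize;			/* round down */
	if (s >= e)
		return 0;	/* no whole extent inside the range */
	*start = s;
	*end = e;
	return 1;
}
```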

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 56b80a7c0992..ee767a4fd63a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -842,8 +842,11 @@ xfs_free_file_space(
startoffset_fsb = XFS_B_TO_FSB(mp, offset);
endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);

- /* We can only free complete realtime extents. */
- if (xfs_inode_has_bigrtalloc(ip)) {
+ /* Free only complete extents. */
+ if (xfs_inode_has_forcealign(ip)) {
+ startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
+ endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
+ } else if (xfs_inode_has_bigrtalloc(ip)) {
startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
}
--
2.31.1


2024-06-07 14:51:53

by John Garry

[permalink] [raw]
Subject: [PATCH v4 06/22] xfs: simplify extent allocation alignment

From: Dave Chinner <[email protected]>

We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.

The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first there are
also slop variables that allow for those allocation attempts to
account for aligned allocation if they fail.

This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. Extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.

Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.
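The "slop" accounting this patch consolidates can be illustrated with a one-line sketch (worst_case_len is a hypothetical helper mirroring the alloc_len computation in xfs_alloc_space_available(): whether headroom is carried as start alignment or as alignslop for a later aligned fallback, the worst-case space requirement is the same):

```c
#include <assert.h>

/*
 * Worst-case contiguous space needed for a minlen allocation, counting
 * both the start-alignment headroom and any slop reserved so an aligned
 * fallback attempt can still succeed after an exact attempt.
 */
static unsigned int worst_case_len(unsigned int minlen,
				   unsigned int alignment,
				   unsigned int alignslop)
{
	return minlen + (alignment - 1) + alignslop;
}
```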

Signed-off-by: Dave Chinner <[email protected]>
jpg: fixup alignslop references in xfs_trace.h and xfs_ialloc.c
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 4 +-
fs/xfs/libxfs/xfs_alloc.h | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 96 +++++++++++++++++---------------------
fs/xfs/libxfs/xfs_ialloc.c | 10 ++--
fs/xfs/xfs_trace.h | 8 ++--
5 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 32f72217c126..35fbd6b19682 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2391,7 +2391,7 @@ xfs_alloc_space_available(
reservation = xfs_ag_resv_needed(pag, args->resv);

/* do we have enough contiguous free space for the allocation? */
- alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
if (longest < alloc_len)
return false;
@@ -2420,7 +2420,7 @@ xfs_alloc_space_available(
* allocation as we know that will definitely succeed and match the
* caller's alignment constraints.
*/
- alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->maxlen + (args->alignment - 1) + args->alignslop;
if (longest < alloc_len) {
args->maxlen = args->minlen;
ASSERT(args->maxlen > 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..aa2c103d98f0 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
xfs_extlen_t minleft; /* min blocks must be left after us */
xfs_extlen_t total; /* total blocks needed in xaction */
xfs_extlen_t alignment; /* align answer to multiple of this */
- xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
+ xfs_extlen_t alignslop; /* slop for alignment calcs */
xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
xfs_agblock_t max_agbno; /* ... */
xfs_extlen_t len; /* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4..7f8c8e4dd244 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3285,6 +3285,10 @@ xfs_bmap_select_minlen(
xfs_extlen_t blen)
{

+ /* Adjust best length for extent start alignment. */
+ if (blen > args->alignment)
+ blen -= args->alignment;
+
/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
* possible that there is enough contiguous free space for this request.
@@ -3300,6 +3304,7 @@ xfs_bmap_select_minlen(
if (blen < args->maxlen)
return blen;
return args->maxlen;
+
}

static int
@@ -3393,35 +3398,43 @@ xfs_bmap_alloc_account(
xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
}

-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
xfs_bmap_compute_alignments(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */
- int stripe_align = 0;

/* stripe alignment for allocation is determined by mount parameters */
if (mp->m_swidth && xfs_has_swalloc(mp))
- stripe_align = mp->m_swidth;
+ args->alignment = mp->m_swidth;
else if (mp->m_dalign)
- stripe_align = mp->m_dalign;
+ args->alignment = mp->m_dalign;

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
&ap->length))
ASSERT(0);
ASSERT(ap->length);
- }

- /* apply extent size hints if obtained earlier */
- if (align) {
args->prod = align;
div_u64_rem(ap->offset, args->prod, &args->mod);
if (args->mod)
@@ -3436,7 +3449,6 @@ xfs_bmap_compute_alignments(
args->mod = args->prod - args->mod;
}

- return stripe_align;
}

static void
@@ -3508,7 +3520,7 @@ xfs_bmap_exact_minlen_extent_alloc(
args.total = ap->total;

args.alignment = 1;
- args.minalignslop = 0;
+ args.alignslop = 0;

args.minleft = ap->minleft;
args.wasdel = ap->wasdel;
@@ -3548,7 +3560,6 @@ xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen,
- int stripe_align,
bool ag_only)
{
struct xfs_mount *mp = args->mp;
@@ -3562,23 +3573,15 @@ xfs_bmap_btalloc_at_eof(
* allocation.
*/
if (ap->offset) {
- xfs_extlen_t nextminlen = 0;
+ xfs_extlen_t alignment = args->alignment;

/*
- * Compute the minlen+alignment for the next case. Set slop so
- * that the value of minlen+alignment+slop doesn't go up between
- * the calls.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignment space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
args->alignment = 1;
- if (blen > stripe_align && blen <= args->maxlen)
- nextminlen = blen - stripe_align;
- else
- nextminlen = args->minlen;
- if (nextminlen + stripe_align > args->minlen + 1)
- args->minalignslop = nextminlen + stripe_align -
- args->minlen - 1;
- else
- args->minalignslop = 0;
+ args->alignslop = alignment - args->alignment;

if (!caller_pag)
args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3596,19 +3599,8 @@ xfs_bmap_btalloc_at_eof(
* Exact allocation failed. Reset to try an aligned allocation
* according to the original allocation specification.
*/
- args->alignment = stripe_align;
- args->minlen = nextminlen;
- args->minalignslop = 0;
- } else {
- /*
- * Adjust minlen to try and preserve alignment if we
- * can't guarantee an aligned maxlen extent.
- */
- args->alignment = stripe_align;
- if (blen > args->alignment &&
- blen <= args->maxlen + args->alignment)
- args->minlen = blen - args->alignment;
- args->minalignslop = 0;
+ args->alignment = alignment;
+ args->alignslop = 0;
}

if (ag_only) {
@@ -3626,9 +3618,8 @@ xfs_bmap_btalloc_at_eof(
return 0;

/*
- * Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * Aligned allocation failed, so all fallback paths from here drop the
+ * start alignment requirement as we know it will not succeed.
*/
args->alignment = 1;
return 0;
@@ -3636,7 +3627,9 @@ xfs_bmap_btalloc_at_eof(

/*
* We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
* allocation whilst still maintaining necessary total block reservation
* requirements.
*
@@ -3653,6 +3646,7 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3672,13 +3666,11 @@ xfs_bmap_btalloc_low_space(
static int
xfs_bmap_btalloc_filestreams(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error = 0;

-
error = xfs_filestream_select_ag(ap, args, &blen);
if (error)
return error;
@@ -3697,8 +3689,7 @@ xfs_bmap_btalloc_filestreams(

args->minlen = xfs_bmap_select_minlen(ap, args, blen);
if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- true);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);

if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3722,8 +3713,7 @@ xfs_bmap_btalloc_filestreams(
static int
xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error;
@@ -3747,8 +3737,7 @@ xfs_bmap_btalloc_best_length(
* trying.
*/
if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- false);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
if (error || args->fsbno != NULLFSBLOCK)
return error;
}
@@ -3775,27 +3764,26 @@ xfs_bmap_btalloc(
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
.alignment = 1,
- .minalignslop = 0,
+ .alignslop = 0,
};
xfs_fileoff_t orig_offset;
xfs_extlen_t orig_length;
int error;
- int stripe_align;

ASSERT(ap->length);
orig_offset = ap->offset;
orig_length = ap->length;

- stripe_align = xfs_bmap_compute_alignments(ap, &args);
+ xfs_bmap_compute_alignments(ap, &args);

/* Trim the allocation back to the maximum an AG can fit. */
args.maxlen = min(ap->length, mp->m_ag_max_usable);

if ((ap->datatype & XFS_ALLOC_USERDATA) &&
xfs_inode_is_filestream(ap->ip))
- error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_filestreams(ap, &args);
else
- error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_best_length(ap, &args);
if (error)
return error;

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..9f71a9a3a65e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
*
* For an exact allocation, alignment must be 1,
* however we need to take cluster alignment into account when
- * fixing up the freelist. Use the minalignslop field to
- * indicate that extra blocks might be required for alignment,
- * but not to use them in the actual exact allocation.
+ * fixing up the freelist. Use the alignslop field to indicate
+ * that extra blocks might be required for alignment, but not
+ * to use them in the actual exact allocation.
*/
args.alignment = 1;
- args.minalignslop = igeo->cluster_align - 1;
+ args.alignslop = igeo->cluster_align - 1;

/* Allow space for the inode btree to split. */
args.minleft = igeo->inobt_maxlevels;
@@ -783,7 +783,7 @@ xfs_ialloc_ag_alloc(
* on, so reset minalignslop to ensure it is not included in
* subsequent requests.
*/
- args.minalignslop = 0;
+ args.alignslop = 0;
}

if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 25ff6fe1eb6c..0b2a2a1379bd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1808,7 +1808,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__field(xfs_extlen_t, minleft)
__field(xfs_extlen_t, total)
__field(xfs_extlen_t, alignment)
- __field(xfs_extlen_t, minalignslop)
+ __field(xfs_extlen_t, alignslop)
__field(xfs_extlen_t, len)
__field(char, wasdel)
__field(char, wasfromfl)
@@ -1827,7 +1827,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft = args->minleft;
__entry->total = args->total;
__entry->alignment = args->alignment;
- __entry->minalignslop = args->minalignslop;
+ __entry->alignslop = args->alignslop;
__entry->len = args->len;
__entry->wasdel = args->wasdel;
__entry->wasfromfl = args->wasfromfl;
@@ -1836,7 +1836,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->highest_agno = args->tp->t_highest_agno;
),
TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
- "prod %u minleft %u total %u alignment %u minalignslop %u "
+ "prod %u minleft %u total %u alignment %u alignslop %u "
"len %u wasdel %d wasfromfl %d resv %d "
"datatype 0x%x highest_agno 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1849,7 +1849,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft,
__entry->total,
__entry->alignment,
- __entry->minalignslop,
+ __entry->alignslop,
__entry->len,
__entry->wasdel,
__entry->wasfromfl,
--
2.31.1


2024-06-07 14:53:17

by John Garry

[permalink] [raw]
Subject: [PATCH v4 16/22] xfs: Enable file data forcealign feature

From: "Darrick J. Wong" <[email protected]>

Enable the forcealign feature by adding it to the set of supported
read-only compatible superblock features.

Signed-off-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b48cd75d34a6..42e1f80206ab 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.31.1


2024-06-07 14:58:40

by John Garry

[permalink] [raw]
Subject: [PATCH v4 21/22] xfs: Validate atomic writes

Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, the
FORCEALIGN and ATOMICWRITES flags must both be set on the inode.
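The length/offset rules being validated can be sketched as follows (atomic_write_valid is a hypothetical userland distillation assumed from the series description, not the kernel's actual generic_atomic_write_valid_size(): the length must be a power-of-2 number of bytes between the minimum and maximum write units, and the file position must be naturally aligned to the length):

```c
#include <assert.h>

/*
 * Assumed atomic-write rules: power-of-2 length within
 * [unit_min, unit_max], position naturally aligned to the length.
 */
static int atomic_write_valid(unsigned long long pos, unsigned int len,
			      unsigned int unit_min, unsigned int unit_max)
{
	if (len < unit_min || len > unit_max)
		return 0;
	if (len & (len - 1))		/* not a power of 2 */
		return 0;
	if (pos & (len - 1))		/* start not aligned to length */
		return 0;
	return 1;
}
```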

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 24fe3c2e03da..eeb267ae2bf2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -695,12 +695,21 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ struct xfs_mount *mp = ip->i_mount;
bool unaligned;
u64 unitsize;

+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+ i_blocksize(inode), XFS_FSB_TO_B(mp, ip->i_extsize))) {
+ return -EINVAL;
+ }
+ }
+
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
--
2.31.1


2024-06-11 10:09:20

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 13/22] xfs: Unmap blocks according to forcealign

On 07/06/2024 15:39, John Garry wrote:
> For when forcealign is enabled, blocks in an inode need to be unmapped
> according to extent alignment, like what is already done for rtvol.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index c9cf138e13c4..2b6d5ebd8b4f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5380,6 +5380,20 @@ xfs_bmap_del_extent_real(
> return 0;
> }
>
> +static xfs_extlen_t
> +xfs_bunmapi_align(
> + struct xfs_inode *ip,
> + xfs_fsblock_t bno)
> +{
> + if (xfs_inode_has_forcealign(ip)) {
> + if (is_power_of_2(ip->i_extsize))
> + return bno & (ip->i_extsize - 1);
> + return do_div(bno, ip->i_extsize);
> + }
> + ASSERT(XFS_IS_REALTIME_INODE(ip));
> + return xfs_rtb_to_rtxoff(ip->i_mount, bno);
> +}

This following updated version of xfs_bunmapi_align() seems to fix the
issue reported in
https://lore.kernel.org/linux-xfs/[email protected]/

static xfs_extlen_t
xfs_bunmapi_align(
	struct xfs_inode	*ip,
	xfs_fsblock_t		fsbno)
{
	if (xfs_inode_has_forcealign(ip)) {
		struct xfs_mount	*mp = ip->i_mount;
		xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, fsbno);

		if (is_power_of_2(ip->i_extsize))
			return agbno & (ip->i_extsize - 1);
		return do_div(agbno, ip->i_extsize);
	}
	ASSERT(XFS_IS_REALTIME_INODE(ip));
	return xfs_rtb_to_rtxoff(ip->i_mount, fsbno);
}


> +
> /*
> * Unmap (remove) blocks from a file.
> * If nexts is nonzero then the number of extents to remove is limited to
> @@ -5402,6 +5416,7 @@ __xfs_bunmapi(
> struct xfs_bmbt_irec got; /* current extent record */
> struct xfs_ifork *ifp; /* inode fork pointer */
> int isrt; /* freeing in rt area */
> + int isforcealign; /* freeing for inode with forcealign */
> int logflags; /* transaction logging flags */
> xfs_extlen_t mod; /* rt extent offset */
> struct xfs_mount *mp = ip->i_mount;
> @@ -5439,6 +5454,8 @@ __xfs_bunmapi(
> }
> XFS_STATS_INC(mp, xs_blk_unmap);
> isrt = xfs_ifork_is_realtime(ip, whichfork);
> + isforcealign = (whichfork != XFS_ATTR_FORK) &&
> + xfs_inode_has_forcealign(ip);
> end = start + len;
>
> if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
> @@ -5490,11 +5507,10 @@ __xfs_bunmapi(
> if (del.br_startoff + del.br_blockcount > end + 1)
> del.br_blockcount = end + 1 - del.br_startoff;
>
> - if (!isrt || (flags & XFS_BMAPI_REMAP))
> + if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
> goto delete;
>
> - mod = xfs_rtb_to_rtxoff(mp,
> - del.br_startblock + del.br_blockcount);
> + mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
> if (mod) {
> /*
> * Realtime extent not lined up at the end.
> @@ -5542,9 +5558,16 @@ __xfs_bunmapi(
> goto nodelete;
> }
>
> - mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
> + mod = xfs_bunmapi_align(ip, del.br_startblock);
> if (mod) {
> - xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
> + xfs_extlen_t off;
> +
> + if (isforcealign) {
> + off = ip->i_extsize - mod;
> + } else {
> + ASSERT(isrt);
> + off = mp->m_sb.sb_rextsize - mod;
> + }
>
> /*
> * Realtime extent is lined up at the end but not


2024-06-12 21:33:12

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

On Fri, Jun 07, 2024 at 02:38:59PM +0000, John Garry wrote:
> Allow filesystems to set the io_block_size for sub-fs block size zeroing,
> as in future we will want to extend this feature to support zeroing of
> block sizes of larger than the inode block size.
>
> The value in io_block_size does not have to be a power-of-2, so fix up
> zeroing code to handle that.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> block/fops.c | 1 +
> fs/btrfs/inode.c | 1 +
> fs/erofs/data.c | 1 +
> fs/erofs/zmap.c | 1 +
> fs/ext2/inode.c | 1 +
> fs/ext4/extents.c | 1 +
> fs/ext4/inode.c | 1 +
> fs/f2fs/data.c | 1 +
> fs/fuse/dax.c | 1 +
> fs/gfs2/bmap.c | 1 +
> fs/hpfs/file.c | 1 +
> fs/iomap/direct-io.c | 23 +++++++++++++++++++----
> fs/xfs/xfs_iomap.c | 1 +
> fs/zonefs/file.c | 2 ++
> include/linux/iomap.h | 2 ++
> 15 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index 9d6d86ebefb9..020443078630 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->addr = iomap->offset;
> iomap->length = isize - iomap->offset;
> iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
> + iomap->io_block_size = i_blocksize(inode);
> return 0;
> }
>

<snip a bunch of filesystems setting io_block_size to i_blocksize>

> diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
> index 1bb8d97cd9ae..5d2718faf520 100644
> --- a/fs/hpfs/file.c
> +++ b/fs/hpfs/file.c
> @@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->addr = IOMAP_NULL_ADDR;
> iomap->length = 1 << blkbits;
> }
> + iomap->io_block_size = i_blocksize(inode);

HPFS does iomap now? Yikes.

>
> hpfs_unlock(sb);
> return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..5be8d886ab4a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> {
> const struct iomap *iomap = &iter->iomap;
> struct inode *inode = iter->inode;
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + u64 io_block_size = iomap->io_block_size;

I wonder, should iomap be nice and not require filesystems to set
io_block_size themselves unless they really need it? Anyone working on
an iomap port while this patchset is in progress may or may not remember
to add this bit if they get their port merged after atomicwrites is
merged; and you might not remember to prevent the bitrot if the reverse
order happens.

u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);

> loff_t length = iomap_length(iter);
> loff_t pos = iter->pos;
> blk_opf_t bio_opf;
> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> int nr_pages, ret = 0;
> size_t copied = 0;
> size_t orig_count;
> + unsigned int pad;
>
> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>
> if (need_zeroout) {
> /* zero out from the start of the block to the write offset */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }

Please don't opencode this twice.

static unsigned int offset_in_block(loff_t pos, u64 blocksize)
{
	if (likely(is_power_of_2(blocksize)))
		return pos & (blocksize - 1);
	return do_div(pos, blocksize);
}

pad = offset_in_block(pos, io_block_size);
if (pad)
...

Also, what happens if pos-pad points to a byte before the mapping?

> +
> if (pad)
> iomap_dio_zero(iter, dio, pos - pad, pad);
> }
> @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> if (need_zeroout ||
> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
> /* zero out from the end of the write to the end of the block */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }
> +
> if (pad)
> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
> + iomap_dio_zero(iter, dio, pos, io_block_size - pad);

What if pos + io_block_size - pad points to a byte after the end of the
mapping?

> }
> out:
> /* Undo iter limitation to current extent */
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 378342673925..ecb4cae88248 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
> }
> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
> + iomap->io_block_size = i_blocksize(VFS_I(ip));
> if (mapping_flags & IOMAP_DAX)
> iomap->dax_dev = target->bt_daxdev;
> else
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 3b103715acc9..bf2cc4bee309 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
> iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
> iomap->length = isize - iomap->offset;
> }
> + iomap->io_block_size = i_blocksize(inode);
> mutex_unlock(&zi->i_truncate_mutex);
>
> trace_zonefs_iomap_begin(inode, iomap);
> @@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
> iomap->type = IOMAP_MAPPED;
> iomap->length = isize - iomap->offset;
> }
> + iomap->io_block_size = i_blocksize(inode);
> mutex_unlock(&zi->i_truncate_mutex);
>
> trace_zonefs_iomap_begin(inode, iomap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..d63a35b77907 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -103,6 +103,8 @@ struct iomap {
> void *private; /* filesystem private */
> const struct iomap_folio_ops *folio_ops;
> u64 validity_cookie; /* used with .iomap_valid() */
> + /* io block zeroing size, not necessarily a power-of-2 */

size in bytes?

I'm not sure what "io block zeroing" means. What are you trying to
accomplish here? Let's say the fsblock size is 4k and the allocation
unit (aka the atomic write size) is 16k. Userspace wants a direct write
to file offset 8192-12287, and that space is unwritten:

uuuu
^

Currently we'd just write the 4k and run the io completion handler, so
the final state is:

uuWu

Instead, if the fs sets io_block_size to 16384, does this direct write
now amplify into a full 16k write? With the end result being:

ZZWZ

only.... I don't see the unwritten areas being converted to written?
I guess for an atomic write you'd require the user to write 0-16383?

<still confused about why we need to do this, maybe i'll figure it out
as I go along>

--D

> + u64 io_block_size;
> };
>
> static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
> --
> 2.31.1
>
>

2024-06-13 10:32:42

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

On 12/06/2024 22:32, Darrick J. Wong wrote:
>> unsigned int fs_block_size = i_blocksize(inode), pad;
>> + u64 io_block_size = iomap->io_block_size;
> I wonder, should iomap be nice and not require filesystems to set
> io_block_size themselves unless they really need it?

That's what I had in v3, like:

if (iomap->io_block_size)
	io_block_size = iomap->io_block_size;
else
	io_block_size = i_blocksize(inode);

but it was suggested to change that (to like what I have here).

> Anyone working on
> an iomap port while this patchset is in progress may or may not remember
> to add this bit if they get their port merged after atomicwrites is
> merged; and you might not remember to prevent the bitrot if the reverse
> order happens.

Sure, I get your point.

However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close
to all members of struct iomap, so we are just continuing that trend,
i.e. it is the job of the FS callback to set all these members.

>
> u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
>
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> int nr_pages, ret = 0;
>> size_t copied = 0;
>> size_t orig_count;
>> + unsigned int pad;
>>
>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>
>> if (need_zeroout) {
>> /* zero out from the start of the block to the write offset */
>> - pad = pos & (fs_block_size - 1);
>> + if (is_power_of_2(io_block_size)) {
>> + pad = pos & (io_block_size - 1);
>> + } else {
>> + loff_t _pos = pos;
>> +
>> + pad = do_div(_pos, io_block_size);
>> + }
> Please don't opencode this twice.
>
> static unsigned int offset_in_block(loff_t pos, u64 blocksize)
> {
> if (likely(is_power_of_2(blocksize)))
> return pos & (blocksize - 1);
> return do_div(pos, blocksize);
> }

ok, fine

>
> pad = offset_in_block(pos, io_block_size);
> if (pad)
> ...
>
> Also, what happens if pos-pad points to a byte before the mapping?

It's the job of the FS to map in something aligned to io_block_size.
Having said that, I don't think we are doing that for XFS (which sets
io_block_size > i_blocksize(inode)), so I need to check that.

>
>> +
>> if (pad)
>> iomap_dio_zero(iter, dio, pos - pad, pad);
>> }
>> @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> if (need_zeroout ||
>> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>> /* zero out from the end of the write to the end of the block */
>> - pad = pos & (fs_block_size - 1);
>> + if (is_power_of_2(io_block_size)) {
>> + pad = pos & (io_block_size - 1);
>> + } else {
>> + loff_t _pos = pos;
>> +
>> + pad = do_div(_pos, io_block_size);
>> + }
>> +
>> if (pad)
>> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
>> + iomap_dio_zero(iter, dio, pos, io_block_size - pad);
> What if pos + io_block_size - pad points to a byte after the end of the
> mapping?

as above, we expect this to be mapped in (so ok to zero)

>
>> }
>> out:
>> /* Undo iter limitation to current extent */
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 378342673925..ecb4cae88248 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
>> }
>> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>> + iomap->io_block_size = i_blocksize(VFS_I(ip));
>> if (mapping_flags & IOMAP_DAX)
>> iomap->dax_dev = target->bt_daxdev;
>> else
>> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
>> index 3b103715acc9..bf2cc4bee309 100644
>> --- a/fs/zonefs/file.c
>> +++ b/fs/zonefs/file.c
>> @@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
>> iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
>> iomap->length = isize - iomap->offset;
>> }
>> + iomap->io_block_size = i_blocksize(inode);
>> mutex_unlock(&zi->i_truncate_mutex);
>>
>> trace_zonefs_iomap_begin(inode, iomap);
>> @@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
>> iomap->type = IOMAP_MAPPED;
>> iomap->length = isize - iomap->offset;
>> }
>> + iomap->io_block_size = i_blocksize(inode);
>> mutex_unlock(&zi->i_truncate_mutex);
>>
>> trace_zonefs_iomap_begin(inode, iomap);
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 6fc1c858013d..d63a35b77907 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -103,6 +103,8 @@ struct iomap {
>> void *private; /* filesystem private */
>> const struct iomap_folio_ops *folio_ops;
>> u64 validity_cookie; /* used with .iomap_valid() */
>> + /* io block zeroing size, not necessarily a power-of-2 */
> size in bytes?
>
> I'm not sure what "io block zeroing" means.

Naming is hard. Essentially we are trying to reuse the sub-fs block
zeroing code for sub-extent granule writes. More below.

> What are you trying to
> accomplish here? Let's say the fsblock size is 4k and the allocation
> unit (aka the atomic write size) is 16k.

ok, so I say here that the extent granule is 16k

> Userspace wants a direct write
> to file offset 8192-12287, and that space is unwritten:
>
> uuuu
> ^
>
> Currently we'd just write the 4k and run the io completion handler, so
> the final state is:
>
> uuWu
>
> Instead, if the fs sets io_block_size to 16384, does this direct write
> now amplify into a full 16k write?

Yes, but only when the extent is newly allocated and we require zeroing.

> With the end result being:
> ZZWZ

Yes

>
> only.... I don't see the unwritten areas being converted to written?

See xfs_iomap_write_unwritten() change in the next patch

> I guess for an atomic write you'd require the user to write 0-16383?

Not exactly

>
> <still confused about why we need to do this, maybe i'll figure it out
> as I go along>

This zeroing is just really required for atomic writes. The purpose is
to zero the extent granule for any write within a newly allocated granule.

Consider we have uuWu, above. If the user then attempts to write the
full 16K as an atomic write, the iomap iter code would generate writes
for sizes 8k, 4k, and 4k, i.e. not a single 16K write. This is not
acceptable. So the idea is to zero the full extent granule when
allocated, so we have ZZWZ after the 4k write at offset 8192, above. As
such, if we then attempt this 16K atomic write, we get a single 16K BIO,
i.e. there is no unwritten extent conversion.

I am not sure whether we should be doing this only for atomic-writes
inodes, or also for forcealign-only and RT inodes.

Thanks,
John



2024-06-13 11:14:48

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 03/22] xfs: Use extent size granularity for iomap->io_block_size

On 12/06/2024 22:47, Darrick J. Wong wrote:
> On Fri, Jun 07, 2024 at 02:39:00PM +0000, John Garry wrote:
>> Currently iomap->io_block_size is set to the i_blocksize() value for the
>> inode.
>>
>> Expand the sub-fs block size zeroing to now cover RT extents, by
>> setting iomap->io_block_size to xfs_inode_alloc_unitsize().
>>
>> In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
>> this extent granularity.
>>
>> In xfs_file_dio_write(), handle a write which is not aligned to extent
>> size granularity as unaligned. Since the extent size granularity need not
>> be a power-of-2, handle this also.
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/xfs_file.c | 24 +++++++++++++++++++-----
>> fs/xfs/xfs_inode.c | 17 +++++++++++------
>> fs/xfs/xfs_inode.h | 1 +
>> fs/xfs/xfs_iomap.c | 8 +++++++-
>> 4 files changed, 38 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index b240ea5241dc..24fe3c2e03da 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
>> }
>>
>> /*
>> - * Handle block unaligned direct I/O writes
>> + * Handle unaligned direct IO writes.
>> *
>> * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
>> * them to be done in parallel with reads and other direct I/O writes. However,
>> @@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
>> ssize_t ret;
>>
>> /*
>> - * Extending writes need exclusivity because of the sub-block zeroing
>> - * that the DIO code always does for partial tail blocks beyond EOF, so
>> - * don't even bother trying the fast path in this case.
>> + * Extending writes need exclusivity because of the sub-block/extent
>> + * zeroing that the DIO code always does for partial tail blocks
>> + * beyond EOF, so don't even bother trying the fast path in this case.
>
> Hummm. So let's say the fsblock size is 4k, the rt extent size is 16k,
> and you want to write bytes 8192-12287 of a file. Currently we'd use
> xfs_file_dio_write_aligned for that, but now we'd use
> xfs_file_dio_write_unaligned? Even though we don't need zeroing or any
> of that stuff?

Right, this is something which I mentioned in response to the previous
patch.

I'm unsure whether we should do this only for atomic-writes inodes, or
also for RT and forcealign-only inodes.

I got the impression from Dave in review of the previous version of this
series that it should include RT and forcealign-only.

>
>> */
>> if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>> if (iocb->ki_flags & IOCB_NOWAIT)
>> @@ -698,11 +698,25 @@ xfs_file_dio_write(
>> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
>> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> size_t count = iov_iter_count(from);
>> + bool unaligned;
>> + u64 unitsize;
>>
>> /* direct I/O must be aligned to device logical sector size */
>> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>> return -EINVAL;
>> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>> +
>> + unitsize = xfs_inode_alloc_unitsize(ip);
>> + if (!is_power_of_2(unitsize)) {
>> + if (isaligned_64(iocb->ki_pos, unitsize) &&
>> + isaligned_64(count, unitsize))
>> + unaligned = false;
>> + else
>> + unaligned = true;
>> + } else {
>> + unaligned = (iocb->ki_pos | count) & (unitsize - 1);
>> + }
>
> Didn't I already write this?

It's from xfs_is_falloc_aligned(). Let's reuse that fully here. I did
look at doing that before, though...

>
>> + if (unaligned)
>
> if (!xfs_is_falloc_aligned(ip, iocb->ki_pos, count))
>
>> return xfs_file_dio_write_unaligned(ip, iocb, from);
>> return xfs_file_dio_write_aligned(ip, iocb, from);
>> }
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 58fb7a5062e1..93ad442f399b 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -4264,15 +4264,20 @@ xfs_break_layouts(
>> return error;
>> }
>>
>> -/* Returns the size of fundamental allocation unit for a file, in bytes. */
>
> Don't delete the comment, it has useful return type information.

It wasn't deleted, it is still below.

>
> /*
> * Returns the size of fundamental allocation unit for a file, in
> * fsblocks.
> */
>
>> unsigned int
>> -xfs_inode_alloc_unitsize(
>> +xfs_inode_alloc_unitsize_fsb(
>> struct xfs_inode *ip)
>> {
>> - unsigned int blocks = 1;
>> -
>> if (XFS_IS_REALTIME_INODE(ip))
>> - blocks = ip->i_mount->m_sb.sb_rextsize;
>> + return ip->i_mount->m_sb.sb_rextsize;
>> +
>> + return 1;
>> +}
>>
>> - return XFS_FSB_TO_B(ip->i_mount, blocks);
>> +/* Returns the size of fundamental allocation unit for a file, in bytes. */
>> +unsigned int
>> +xfs_inode_alloc_unitsize(
>> + struct xfs_inode *ip)
>> +{
>> + return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
>> }
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 292b90b5f2ac..90d2fa837117 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
>> bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
>> void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>> xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
>> +unsigned int xfs_inode_alloc_unitsize_fsb(struct xfs_inode *ip);
>> unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
>>
>> struct xfs_dir_update_params {
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ecb4cae88248..fbe69f747e30 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -127,7 +127,7 @@ xfs_bmbt_to_iomap(
>> }
>> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>> - iomap->io_block_size = i_blocksize(VFS_I(ip));
>> + iomap->io_block_size = xfs_inode_alloc_unitsize(ip);
>
> Oh, I see. So io_block_size causes iomap to write zeroes to the storage
> backing surrounding areas of the file range.
Yes

> In this case, for direct
> writes to the unwritten middle 4k of an otherwise written 16k extent,
> we'll write zeroes to 0-4k and 8k-16k even though that wasn't what the
> caller asked for?

We would only do that for a newly allocated extent. We should not
overwrite existing data.

>
> IOWs, if you start with:
>
> WWuW
>
> write to the "U", then it'll write zeroes to the "W" areas? That
> doesn't sound good...

No, that definitely should not happen.

We would only zero once, when doing a sub-extent granule write to an
unallocated extent.

In iomap_dio_bio_iter(), we only zero for IOMAP_UNWRITTEN or IOMAP_F_NEW.

>
>> if (mapping_flags & IOMAP_DAX)
>> iomap->dax_dev = target->bt_daxdev;
>> else
>> @@ -577,11 +577,17 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + unsigned int rounding;
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + rounding = xfs_inode_alloc_unitsize_fsb(ip);
>> + if (rounding > 1) {
>> + offset_fsb = rounddown_64(offset_fsb, rounding);
>> + count_fsb = roundup_64(count_fsb, rounding);
>> + }
>
> ...and then the ioend handler is supposed to be smart enough to know
> that iomap quietly wrote to other parts of the disk.

iomap_io_complete() only knows about the non-zeroing written data. I am
not changing that really.

>
> Um, does this cause unwritten extent conversion for entire rtextents
> after writeback to a rtextsize > 1fsb file?

Yes.

>
> Or am I really misunderstanding what's going on here with the io paths?

Thanks,
John