2024-06-07 14:40:54

by John Garry

Subject: [PATCH v4 00/22] block atomic writes for xfs

This series expands atomic write support to filesystems, specifically
XFS. Extent alignment is based on the new forcealign feature.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

XFS can be formatted for atomic writes as follows:
mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1 /dev/sda

atomic-writes=1 just enables atomic writes in the superblock, but does
not auto-enable atomic writes for each file.

Per-file support can then be enabled through the xfs_io command:
$xfs_io -c "lsattr -v" filename
[extsize, force-align] filename
$xfs_io -c "extsize" filename
[16384] filename
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[extsize, force-align, atomic-writes] filename
$xfs_io -c statx filename
...
stat.stx_atomic_write_unit_min = 4096
stat.stx_atomic_write_unit_max = 16384
stat.stx_atomic_write_segments_max = 1
...

A known issue - as reported in
https://lore.kernel.org/linux-xfs/[email protected]/T/#m7093bc85a8e0cbe13c111284768476d294aa077a
- is that forcealign is broken for !power-of-2 sizes. That needs fixing.

New in this series is a rework of the iomap extent granularity zeroing
code. In the earlier series, iomap would consider a larger block zeroing
size only when a member was set in struct iomap. Now each filesystem is
responsible for setting this size, which is i_blocksize(inode) when just
regular sub-fs block zeroing is wanted. All relevant filesystems which
use iomap are fixed up for this.

Baseline is the following series (which is based on Jens' block-6.10 branch):
https://lore.kernel.org/linux-nvme/[email protected]/T/#mb980c084be402472601831c47fb2b66d0bfa8f0e

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/forcealign_and_atomicwrites_for_v4_xfs_block_atomic_writes

Patches for this series can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.10-v7-fs-v4/

Changes since v3:
https://lore.kernel.org/linux-xfs/[email protected]/T/#m9424b3cd1ccfde795d04474fdb4456520b6b4242
- Only enforce that the forcealign extsize is a power-of-2 for atomic writes
- Re-org some validation code
- Fix xfs_bmap_process_allocated_extent() for forcealign
- Support iomap->io_block_size and make each fs support it
- Add !power-of-2 iomap support for io_block_size
- Make iomap dio iter handle atomic write failure properly by zeroing the
remaining io_block_size

Changes since v2:
https://lore.kernel.org/linux-xfs/[email protected]/
- Incorporate forcealign patches from
https://lore.kernel.org/linux-xfs/[email protected]/
- Put bdev awu min and max in buftarg
- Extra forcealign patches to deal with truncate and fallocate punch,
insert, collapse
- Add generic_atomic_write_valid_size()
- Change iomap.extent_shift -> .extent_size

Darrick J. Wong (2):
xfs: Introduce FORCEALIGN inode flag
xfs: Enable file data forcealign feature

Dave Chinner (6):
xfs: only allow minlen allocations when near ENOSPC
xfs: always tail align maxlen allocations
xfs: simplify extent allocation alignment
xfs: make EOF allocation simpler
xfs: introduce forced allocation alignment
xfs: align args->minlen for forced allocation alignment

John Garry (14):
fs: Add generic_atomic_write_valid_size()
iomap: Allow filesystems set IO block zeroing size
xfs: Use extent size granularity for iomap->io_block_size
xfs: Do not free EOF blocks for forcealign
xfs: Update xfs_inode_alloc_unitsize_fsb() for forcealign
xfs: Unmap blocks according to forcealign
xfs: Only free full extents for forcealign
xfs: Don't revert allocated offset for forcealign
fs: Add FS_XFLAG_ATOMICWRITES flag
iomap: Atomic write support
xfs: Support FS_XFLAG_ATOMICWRITES for forcealign
xfs: Support atomic write for statx
xfs: Validate atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE

block/fops.c | 1 +
fs/btrfs/inode.c | 1 +
fs/erofs/data.c | 1 +
fs/erofs/zmap.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 1 +
fs/f2fs/data.c | 1 +
fs/fuse/dax.c | 1 +
fs/gfs2/bmap.c | 1 +
fs/hpfs/file.c | 1 +
fs/iomap/direct-io.c | 41 ++++-
fs/xfs/libxfs/xfs_alloc.c | 33 ++--
fs/xfs/libxfs/xfs_alloc.h | 3 +-
fs/xfs/libxfs/xfs_bmap.c | 308 ++++++++++++++++++----------------
fs/xfs/libxfs/xfs_format.h | 16 +-
fs/xfs/libxfs/xfs_ialloc.c | 12 +-
fs/xfs/libxfs/xfs_inode_buf.c | 105 ++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 5 +
fs/xfs/libxfs/xfs_sb.c | 4 +
fs/xfs/xfs_bmap_util.c | 14 +-
fs/xfs/xfs_buf.c | 15 +-
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_buf_mem.c | 2 +-
fs/xfs/xfs_file.c | 49 +++++-
fs/xfs/xfs_inode.c | 41 ++++-
fs/xfs/xfs_inode.h | 29 ++++
fs/xfs/xfs_ioctl.c | 83 ++++++++-
fs/xfs/xfs_iomap.c | 7 +
fs/xfs/xfs_iops.c | 28 ++++
fs/xfs/xfs_mount.h | 4 +
fs/xfs/xfs_reflink.h | 10 --
fs/xfs/xfs_super.c | 8 +
fs/xfs/xfs_trace.h | 8 +-
fs/zonefs/file.c | 2 +
include/linux/fs.h | 12 ++
include/linux/iomap.h | 2 +
include/uapi/linux/fs.h | 3 +
38 files changed, 656 insertions(+), 203 deletions(-)

--
2.31.1



2024-06-07 14:41:17

by John Garry

Subject: [PATCH v4 07/22] xfs: make EOF allocation simpler

From: Dave Chinner <[email protected]>

Currently the allocation at EOF is broken into two cases - when the
offset is zero and when the offset is non-zero. When the offset is
non-zero, we try to do exact block allocation for contiguous
extent allocation. When the offset is zero, the allocation is simply
an aligned allocation.

We want aligned allocation as the fallback when exact block
allocation fails, but that complicates the EOF allocation in that it
now has to handle two different allocation cases. The
caller also has to handle allocation when not at EOF, and for the
upcoming forced alignment changes we need that to also be aligned
allocation.

To simplify all this, pull the aligned allocation cases back into
the callers and leave the EOF allocation path for exact block
allocation only. This means that the EOF exact block allocation
fallback path is the normal aligned allocation path and that ends up
making things a lot simpler when forced alignment is introduced.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 129 +++++++++++++++----------------------
fs/xfs/libxfs/xfs_ialloc.c | 2 +-
2 files changed, 54 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7f8c8e4dd244..528e3cd81ee6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3310,12 +3310,12 @@ xfs_bmap_select_minlen(
static int
xfs_bmap_btalloc_select_lengths(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t *blen)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *pag;
xfs_agnumber_t agno, startag;
+ xfs_extlen_t blen = 0;
int error = 0;

if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
@@ -3329,19 +3329,18 @@ xfs_bmap_btalloc_select_lengths(
if (startag == NULLAGNUMBER)
startag = 0;

- *blen = 0;
for_each_perag_wrap(mp, startag, agno, pag) {
- error = xfs_bmap_longest_free_extent(pag, args->tp, blen);
+ error = xfs_bmap_longest_free_extent(pag, args->tp, &blen);
if (error && error != -EAGAIN)
break;
error = 0;
- if (*blen >= args->maxlen)
+ if (blen >= args->maxlen)
break;
}
if (pag)
xfs_perag_rele(pag);

- args->minlen = xfs_bmap_select_minlen(ap, args, *blen);
+ args->minlen = xfs_bmap_select_minlen(ap, args, blen);
return error;
}

@@ -3551,78 +3550,40 @@ xfs_bmap_exact_minlen_extent_alloc(
* If we are not low on available data blocks and we are allocating at
* EOF, optimise allocation for contiguous file extension and/or stripe
* alignment of the new extent.
- *
- * NOTE: ap->aeof is only set if the allocation length is >= the
- * stripe unit and the allocation offset is at the end of file.
*/
static int
xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- xfs_extlen_t blen,
- bool ag_only)
+ struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ xfs_extlen_t alignment = args->alignment;
int error;

+ ASSERT(ap->aeof && ap->offset);
+ ASSERT(args->alignment >= 1);
+
/*
- * If there are already extents in the file, try an exact EOF block
- * allocation to extend the file as a contiguous extent. If that fails,
- * or it's the first allocation in a file, just try for a stripe aligned
- * allocation.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignment space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
- if (ap->offset) {
- xfs_extlen_t alignment = args->alignment;
-
- /*
- * Compute the alignment slop for the fallback path so we ensure
- * we account for the potential alignemnt space required by the
- * fallback paths before we modify the AGF and AGFL here.
- */
- args->alignment = 1;
- args->alignslop = alignment - args->alignment;
-
- if (!caller_pag)
- args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
- error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
- if (!caller_pag) {
- xfs_perag_put(args->pag);
- args->pag = NULL;
- }
- if (error)
- return error;
-
- if (args->fsbno != NULLFSBLOCK)
- return 0;
- /*
- * Exact allocation failed. Reset to try an aligned allocation
- * according to the original allocation specification.
- */
- args->alignment = alignment;
- args->alignslop = 0;
- }
+ args->alignment = 1;
+ args->alignslop = alignment - args->alignment;

- if (ag_only) {
- error = xfs_alloc_vextent_near_bno(args, ap->blkno);
- } else {
+ if (!caller_pag)
+ args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
+ error = xfs_alloc_vextent_exact_bno(args, ap->blkno);
+ if (!caller_pag) {
+ xfs_perag_put(args->pag);
args->pag = NULL;
- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
- ASSERT(args->pag == NULL);
- args->pag = caller_pag;
}
- if (error)
- return error;

- if (args->fsbno != NULLFSBLOCK)
- return 0;
-
- /*
- * Aligned allocation failed, so all fallback paths from here drop the
- * start alignment requirement as we know it will not succeed.
- */
- args->alignment = 1;
- return 0;
+ /* Reset alignment to original specifications. */
+ args->alignment = alignment;
+ args->alignslop = 0;
+ return error;
}

/*
@@ -3688,12 +3649,19 @@ xfs_bmap_btalloc_filestreams(
}

args->minlen = xfs_bmap_select_minlen(ap, args, blen);
- if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);

+ /* This may be an aligned allocation attempt. */
if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);

+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_near_bno(args, ap->blkno);
+ }
+
out_low_space:
/*
* We are now done with the perag reference for the filestreams
@@ -3715,7 +3683,6 @@ xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
- xfs_extlen_t blen = 0;
int error;

ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
@@ -3726,23 +3693,33 @@ xfs_bmap_btalloc_best_length(
* the request. If one isn't found, then adjust the minimum allocation
* size to the largest space found.
*/
- error = xfs_bmap_btalloc_select_lengths(ap, args, &blen);
+ error = xfs_bmap_btalloc_select_lengths(ap, args);
if (error)
return error;

/*
- * Don't attempt optimal EOF allocation if previous allocations barely
- * succeeded due to being near ENOSPC. It is highly unlikely we'll get
- * optimal or even aligned allocations in this case, so don't waste time
- * trying.
+ * If we are in low space mode, then optimal allocation will fail so
+ * prepare for minimal allocation and run the low space algorithm
+ * immediately.
*/
- if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
- if (error || args->fsbno != NULLFSBLOCK)
- return error;
+ if (ap->tp->t_flags & XFS_TRANS_LOWMODE) {
+ ASSERT(args->fsbno == NULLFSBLOCK);
+ return xfs_bmap_btalloc_low_space(ap, args);
+ }
+
+ if (ap->aeof && ap->offset)
+ error = xfs_bmap_btalloc_at_eof(ap, args);
+
+ /* This may be an aligned allocation attempt. */
+ if (!error && args->fsbno == NULLFSBLOCK)
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
+
+ /* Attempt non-aligned allocation if we haven't already. */
+ if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ args->alignment = 1;
+ error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}

- error = xfs_alloc_vextent_start_ag(args, ap->blkno);
if (error || args->fsbno != NULLFSBLOCK)
return error;

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 9f71a9a3a65e..40a2daeea712 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -780,7 +780,7 @@ xfs_ialloc_ag_alloc(
* the exact agbno requirement and increase the alignment
* instead. It is critical that the total size of the request
* (len + alignment + slop) does not increase from this point
- * on, so reset minalignslop to ensure it is not included in
+ * on, so reset alignslop to ensure it is not included in
* subsequent requests.
*/
args.alignslop = 0;
--
2.31.1


2024-06-07 14:41:54

by John Garry

Subject: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

Allow filesystems to set the io_block_size for sub-fs block size zeroing,
as in future we will want to extend this feature to support zeroing of
block sizes larger than the inode block size.

The value in io_block_size does not have to be a power-of-2, so fix up
zeroing code to handle that.

Signed-off-by: John Garry <[email protected]>
---
block/fops.c | 1 +
fs/btrfs/inode.c | 1 +
fs/erofs/data.c | 1 +
fs/erofs/zmap.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 1 +
fs/f2fs/data.c | 1 +
fs/fuse/dax.c | 1 +
fs/gfs2/bmap.c | 1 +
fs/hpfs/file.c | 1 +
fs/iomap/direct-io.c | 23 +++++++++++++++++++----
fs/xfs/xfs_iomap.c | 1 +
fs/zonefs/file.c | 2 ++
include/linux/iomap.h | 2 ++
15 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 9d6d86ebefb9..020443078630 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->addr = iomap->offset;
iomap->length = isize - iomap->offset;
iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
+ iomap->io_block_size = i_blocksize(inode);
return 0;
}

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 753db965f7c0..665811b1578b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7740,6 +7740,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
iomap->offset = start;
iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
iomap->length = len;
+ iomap->io_block_size = i_blocksize(inode);
free_extent_map(em);

return 0;
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 8be60797ea2f..ea9d2f3eadb3 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -305,6 +305,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
if (flags & IOMAP_DAX)
iomap->addr += mdev.m_dax_part_off;
}
+ iomap->io_block_size = i_blocksize(inode);
return 0;
}

diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 9b248ee5fef2..6ee89f6a078c 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -749,6 +749,7 @@ static int z_erofs_iomap_begin_report(struct inode *inode, loff_t offset,
if (iomap->offset >= inode->i_size)
iomap->length = length + offset - map.m_la;
}
+ iomap->io_block_size = i_blocksize(inode);
iomap->flags = 0;
return 0;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0caa1650cee8..7a5539a52844 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -862,6 +862,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->length = (u64)ret << blkbits;
iomap->flags |= IOMAP_F_MERGED;
}
+ iomap->io_block_size = i_blocksize(inode);

if (new)
iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e067f2dd0335..ce3269874fde 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4933,6 +4933,7 @@ static int ext4_iomap_xattr_fiemap(struct inode *inode, struct iomap *iomap)
iomap->length = length;
iomap->type = iomap_type;
iomap->flags = 0;
+ iomap->io_block_size = i_blocksize(inode);
out:
return error;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9ccf5fe0..3ec82e4d71c4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3235,6 +3235,7 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->bdev = inode->i_sb->s_bdev;
iomap->offset = (u64) map->m_lblk << blkbits;
iomap->length = (u64) map->m_len << blkbits;
+ iomap->io_block_size = i_blocksize(inode);

if ((map->m_flags & EXT4_MAP_MAPPED) &&
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b9b0debc6b3d..6c12641b9a7b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4233,6 +4233,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
iomap->addr = IOMAP_NULL_ADDR;
}
+ iomap->io_block_size = i_blocksize(inode);

if (map.m_flags & F2FS_MAP_NEW)
iomap->flags |= IOMAP_F_NEW;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..68ddc74cb31e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -577,6 +577,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
iomap->flags = 0;
iomap->bdev = NULL;
iomap->dax_dev = fc->dax->dev;
+ iomap->io_block_size = i_blocksize(inode);

/*
* Both read/write and mmap path can race here. So we need something
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 1795c4e8dbf6..8d2de42b1da9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -927,6 +927,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,

out:
iomap->bdev = inode->i_sb->s_bdev;
+ iomap->io_block_size = i_blocksize(inode);
unlock:
up_read(&ip->i_rw_mutex);
return ret;
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 1bb8d97cd9ae..5d2718faf520 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->addr = IOMAP_NULL_ADDR;
iomap->length = 1 << blkbits;
}
+ iomap->io_block_size = i_blocksize(inode);

hpfs_unlock(sb);
return 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5be8d886ab4a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
{
const struct iomap *iomap = &iter->iomap;
struct inode *inode = iter->inode;
- unsigned int fs_block_size = i_blocksize(inode), pad;
+ u64 io_block_size = iomap->io_block_size;
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
blk_opf_t bio_opf;
@@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
int nr_pages, ret = 0;
size_t copied = 0;
size_t orig_count;
+ unsigned int pad;

if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
!bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,

if (need_zeroout) {
/* zero out from the start of the block to the write offset */
- pad = pos & (fs_block_size - 1);
+ if (is_power_of_2(io_block_size)) {
+ pad = pos & (io_block_size - 1);
+ } else {
+ loff_t _pos = pos;
+
+ pad = do_div(_pos, io_block_size);
+ }
+
if (pad)
iomap_dio_zero(iter, dio, pos - pad, pad);
}
@@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
/* zero out from the end of the write to the end of the block */
- pad = pos & (fs_block_size - 1);
+ if (is_power_of_2(io_block_size)) {
+ pad = pos & (io_block_size - 1);
+ } else {
+ loff_t _pos = pos;
+
+ pad = do_div(_pos, io_block_size);
+ }
+
if (pad)
- iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
+ iomap_dio_zero(iter, dio, pos, io_block_size - pad);
}
out:
/* Undo iter limitation to current extent */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..ecb4cae88248 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
}
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
+ iomap->io_block_size = i_blocksize(VFS_I(ip));
if (mapping_flags & IOMAP_DAX)
iomap->dax_dev = target->bt_daxdev;
else
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 3b103715acc9..bf2cc4bee309 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
iomap->length = isize - iomap->offset;
}
+ iomap->io_block_size = i_blocksize(inode);
mutex_unlock(&zi->i_truncate_mutex);

trace_zonefs_iomap_begin(inode, iomap);
@@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
iomap->type = IOMAP_MAPPED;
iomap->length = isize - iomap->offset;
}
+ iomap->io_block_size = i_blocksize(inode);
mutex_unlock(&zi->i_truncate_mutex);

trace_zonefs_iomap_begin(inode, iomap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..d63a35b77907 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -103,6 +103,8 @@ struct iomap {
void *private; /* filesystem private */
const struct iomap_folio_ops *folio_ops;
u64 validity_cookie; /* used with .iomap_valid() */
+ /* io block zeroing size, not necessarily a power-of-2 */
+ u64 io_block_size;
};

static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
--
2.31.1


2024-06-07 14:42:08

by John Garry

Subject: [PATCH v4 01/22] fs: Add generic_atomic_write_valid_size()

Add a generic helper for filesystems to validate that an atomic write is
appropriately sized (along with the other checks).

Signed-off-by: John Garry <[email protected]>
---
include/linux/fs.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 069cbab62700..e13d34f8c24e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3645,4 +3645,16 @@ bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
return true;
}

+static inline
+bool generic_atomic_write_valid_size(loff_t pos, struct iov_iter *iter,
+ unsigned int unit_min, unsigned int unit_max)
+{
+ size_t len = iov_iter_count(iter);
+
+ if (len < unit_min || len > unit_max)
+ return false;
+
+ return generic_atomic_write_valid(pos, iter);
+}
+
#endif /* _LINUX_FS_H */
--
2.31.1


2024-06-07 14:42:13

by John Garry

Subject: [PATCH v4 03/22] xfs: Use extent size granularity for iomap->io_block_size

Currently iomap->io_block_size is set to the i_blocksize() value for the
inode.

Expand the sub-fs block size zeroing to now cover RT extents, by setting
iomap->io_block_size to xfs_inode_alloc_unitsize().

In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
this extent granularity.

In xfs_file_dio_write(), treat a write which is not aligned to the extent
size granularity as unaligned. Since the extent size granularity need not
be a power-of-2, handle that case as well.

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 24 +++++++++++++++++++-----
fs/xfs/xfs_inode.c | 17 +++++++++++------
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 8 +++++++-
4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b240ea5241dc..24fe3c2e03da 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
}

/*
- * Handle block unaligned direct I/O writes
+ * Handle unaligned direct IO writes.
*
* In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
* them to be done in parallel with reads and other direct I/O writes. However,
@@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
ssize_t ret;

/*
- * Extending writes need exclusivity because of the sub-block zeroing
- * that the DIO code always does for partial tail blocks beyond EOF, so
- * don't even bother trying the fast path in this case.
+ * Extending writes need exclusivity because of the sub-block/extent
+ * zeroing that the DIO code always does for partial tail blocks
+ * beyond EOF, so don't even bother trying the fast path in this case.
*/
if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -698,11 +698,25 @@ xfs_file_dio_write(
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ bool unaligned;
+ u64 unitsize;

/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
- if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
+
+ unitsize = xfs_inode_alloc_unitsize(ip);
+ if (!is_power_of_2(unitsize)) {
+ if (isaligned_64(iocb->ki_pos, unitsize) &&
+ isaligned_64(count, unitsize))
+ unaligned = false;
+ else
+ unaligned = true;
+ } else {
+ unaligned = (iocb->ki_pos | count) & (unitsize - 1);
+ }
+
+ if (unaligned)
return xfs_file_dio_write_unaligned(ip, iocb, from);
return xfs_file_dio_write_aligned(ip, iocb, from);
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..93ad442f399b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -4264,15 +4264,20 @@ xfs_break_layouts(
return error;
}

-/* Returns the size of fundamental allocation unit for a file, in bytes. */
unsigned int
-xfs_inode_alloc_unitsize(
+xfs_inode_alloc_unitsize_fsb(
struct xfs_inode *ip)
{
- unsigned int blocks = 1;
-
if (XFS_IS_REALTIME_INODE(ip))
- blocks = ip->i_mount->m_sb.sb_rextsize;
+ return ip->i_mount->m_sb.sb_rextsize;
+
+ return 1;
+}

- return XFS_FSB_TO_B(ip->i_mount, blocks);
+/* Returns the size of fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+ struct xfs_inode *ip)
+{
+ return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 292b90b5f2ac..90d2fa837117 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+unsigned int xfs_inode_alloc_unitsize_fsb(struct xfs_inode *ip);
unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);

struct xfs_dir_update_params {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ecb4cae88248..fbe69f747e30 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -127,7 +127,7 @@ xfs_bmbt_to_iomap(
}
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
- iomap->io_block_size = i_blocksize(VFS_I(ip));
+ iomap->io_block_size = xfs_inode_alloc_unitsize(ip);
if (mapping_flags & IOMAP_DAX)
iomap->dax_dev = target->bt_daxdev;
else
@@ -577,11 +577,17 @@ xfs_iomap_write_unwritten(
xfs_fsize_t i_size;
uint resblks;
int error;
+ unsigned int rounding;

trace_xfs_unwritten_convert(ip, offset, count);

offset_fsb = XFS_B_TO_FSBT(mp, offset);
count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ rounding = xfs_inode_alloc_unitsize_fsb(ip);
+ if (rounding > 1) {
+ offset_fsb = rounddown_64(offset_fsb, rounding);
+ count_fsb = roundup_64(count_fsb, rounding);
+ }
count_fsb = (xfs_filblks_t)(count_fsb - offset_fsb);

/*
--
2.31.1


2024-06-07 14:42:42

by John Garry

Subject: [PATCH v4 05/22] xfs: always tail align maxlen allocations

From: Dave Chinner <[email protected]>

When we do a large allocation, the core free space allocation code
assumes that args->maxlen is aligned to args->prod/args->mod. Hence,
if we get a maximum sized extent allocated, it does not do tail
alignment of the extent.

However, this assumes that nothing modifies args->maxlen between the
original allocation context setup and trimming the selected free
space extent to size. This assumption has recently been found to be
invalid - xfs_alloc_space_available() modifies args->maxlen in low
space situations - and there may be more situations we haven't yet
found like this.

Forced alignment introduces the requirement that extents are
correctly tail aligned, resulting in this occasional latent
alignment failure being reclassified from an unimportant curiosity
to a must-fix bug.

Removing the assumption about args->maxlen allocations always being
tail aligned is trivial, and should not impact anything because
args->maxlen for inodes with extent size hints configured is
already aligned. Hence all this change does is avoid weird corner
cases that would have resulted in unaligned extent sizes, by always
trimming the extent down to an aligned size.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 5855a21d4864..32f72217c126 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -432,20 +432,18 @@ xfs_alloc_compute_diff(
* Fix up the length, based on mod and prod.
* len should be k * prod + mod for some k.
* If len is too small it is returned unchanged.
- * If len hits maxlen it is left alone.
*/
-STATIC void
+static void
xfs_alloc_fix_len(
- xfs_alloc_arg_t *args) /* allocation argument structure */
+ struct xfs_alloc_arg *args)
{
- xfs_extlen_t k;
- xfs_extlen_t rlen;
+ xfs_extlen_t k;
+ xfs_extlen_t rlen = args->len;

ASSERT(args->mod < args->prod);
- rlen = args->len;
ASSERT(rlen >= args->minlen);
ASSERT(rlen <= args->maxlen);
- if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen ||
+ if (args->prod <= 1 || rlen < args->mod ||
(args->mod == 0 && rlen < args->prod))
return;
k = rlen % args->prod;
--
2.31.1


2024-06-07 14:43:32

by John Garry

Subject: [PATCH v4 09/22] xfs: align args->minlen for forced allocation alignment

From: Dave Chinner <[email protected]>

If args->minlen is not aligned to the constraints of forced
alignment, we may do minlen allocations that are not aligned when we
approach ENOSPC. Avoid this by always aligning args->minlen
appropriately. If alignment of minlen results in a value smaller
than the alignment constraint, fail the allocation immediately.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 45 +++++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9131ba8113a6..c9cf138e13c4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3278,33 +3278,48 @@ xfs_bmap_longest_free_extent(
return 0;
}

-static xfs_extlen_t
+static int
xfs_bmap_select_minlen(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen)
{
-
/* Adjust best length for extent start alignment. */
if (blen > args->alignment)
blen -= args->alignment;

/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
- * possible that there is enough contiguous free space for this request.
+ * possible that there is enough contiguous free space for this request
+ * even if the best length is less than the minimum length we need.
+ *
+ * If the best length won't satisfy the maximum length we requested,
+ * then use it as the minimum length so we get as large an allocation
+ * as possible.
*/
if (blen < ap->minlen)
- return ap->minlen;
+ blen = ap->minlen;
+ else if (blen > args->maxlen)
+ blen = args->maxlen;

/*
- * If the best seen length is less than the request length,
- * use the best as the minimum, otherwise we've got the maxlen we
- * were asked for.
+ * If we have alignment constraints, round the minlen down to match the
+ * constraint so that alignment will be attempted. This may reduce the
+ * allocation to smaller than was requested, so clamp the minimum to
+ * ap->minlen to allow unaligned allocation to succeed. If we are forced
+ * to align the allocation, return ENOSPC at this point because we don't
+ * have enough contiguous free space to guarantee aligned allocation.
*/
- if (blen < args->maxlen)
- return blen;
- return args->maxlen;
-
+ if (args->alignment > 1) {
+ blen = rounddown(blen, args->alignment);
+ if (blen < ap->minlen) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return -ENOSPC;
+ blen = ap->minlen;
+ }
+ }
+ args->minlen = blen;
+ return 0;
}

static int
@@ -3340,8 +3355,7 @@ xfs_bmap_btalloc_select_lengths(
if (pag)
xfs_perag_rele(pag);

- args->minlen = xfs_bmap_select_minlen(ap, args, blen);
- return error;
+ return xfs_bmap_select_minlen(ap, args, blen);
}

/* Update all inode and quota accounting for the allocation we just did. */
@@ -3661,7 +3675,10 @@ xfs_bmap_btalloc_filestreams(
goto out_low_space;
}

- args->minlen = xfs_bmap_select_minlen(ap, args, blen);
+ error = xfs_bmap_select_minlen(ap, args, blen);
+ if (error)
+ goto out_low_space;
+
if (ap->aeof && ap->offset)
error = xfs_bmap_btalloc_at_eof(ap, args);

--
2.31.1


2024-06-07 14:45:19

by John Garry

[permalink] [raw]
Subject: [PATCH v4 11/22] xfs: Do not free EOF blocks for forcealign

When forcealign is enabled, we want the EOF extent to remain aligned as
well, so do not free EOF blocks.
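The effect of the roundup in the hunk below can be sketched as follows (roundup_ext is a hypothetical stand-in for the kernel's roundup_64(); rounding end_fsb up to the extent size means a partially-used allocation unit at EOF is treated as fully in use and its blocks are not freed):

```c
#include <assert.h>

/* Round a block number up to the next multiple of the extent size. */
static unsigned long long roundup_ext(unsigned long long fsb,
				      unsigned int extsize)
{
	unsigned long long rem = fsb % extsize;

	return rem ? fsb + (extsize - rem) : fsb;
}
```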

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e5d893f93522..56b80a7c0992 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -539,8 +539,13 @@ xfs_can_free_eofblocks(
* forever.
*/
end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
- if (xfs_inode_has_bigrtalloc(ip))
+
+ /* Do not free blocks when forcing extent sizes */
+ if (xfs_inode_has_forcealign(ip))
+ end_fsb = roundup_64(end_fsb, ip->i_extsize);
+ else if (xfs_inode_has_bigrtalloc(ip))
end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb);
+
last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
if (last_fsb <= end_fsb)
return false;
--
2.31.1


2024-06-07 14:46:12

by John Garry

[permalink] [raw]
Subject: [PATCH v4 13/22] xfs: Unmap blocks according to forcealign

When forcealign is enabled, blocks in an inode need to be unmapped
according to the extent alignment, as is already done for rtvol.
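The alignment-offset calculation that xfs_bunmapi_align() performs below can be sketched like this (ext_offset is a hypothetical stand-in; note the mask fast path for power-of-2 extent sizes versus a plain modulo otherwise):

```c
#include <assert.h>

/*
 * Offset of bno within its extsize-aligned granule: a mask when the
 * extent size is a power of 2, a modulo otherwise.
 */
static unsigned int ext_offset(unsigned long long bno, unsigned int extsize)
{
	if ((extsize & (extsize - 1)) == 0)	/* power of 2 */
		return bno & (extsize - 1);
	return bno % extsize;
}
```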

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c9cf138e13c4..2b6d5ebd8b4f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5380,6 +5380,20 @@ xfs_bmap_del_extent_real(
return 0;
}

+static xfs_extlen_t
+xfs_bunmapi_align(
+ struct xfs_inode *ip,
+ xfs_fsblock_t bno)
+{
+ if (xfs_inode_has_forcealign(ip)) {
+ if (is_power_of_2(ip->i_extsize))
+ return bno & (ip->i_extsize - 1);
+ return do_div(bno, ip->i_extsize);
+ }
+ ASSERT(XFS_IS_REALTIME_INODE(ip));
+ return xfs_rtb_to_rtxoff(ip->i_mount, bno);
+}
+
/*
* Unmap (remove) blocks from a file.
* If nexts is nonzero then the number of extents to remove is limited to
@@ -5402,6 +5416,7 @@ __xfs_bunmapi(
struct xfs_bmbt_irec got; /* current extent record */
struct xfs_ifork *ifp; /* inode fork pointer */
int isrt; /* freeing in rt area */
+ int isforcealign; /* freeing for inode with forcealign */
int logflags; /* transaction logging flags */
xfs_extlen_t mod; /* rt extent offset */
struct xfs_mount *mp = ip->i_mount;
@@ -5439,6 +5454,8 @@ __xfs_bunmapi(
}
XFS_STATS_INC(mp, xs_blk_unmap);
isrt = xfs_ifork_is_realtime(ip, whichfork);
+ isforcealign = (whichfork != XFS_ATTR_FORK) &&
+ xfs_inode_has_forcealign(ip);
end = start + len;

if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
@@ -5490,11 +5507,10 @@ __xfs_bunmapi(
if (del.br_startoff + del.br_blockcount > end + 1)
del.br_blockcount = end + 1 - del.br_startoff;

- if (!isrt || (flags & XFS_BMAPI_REMAP))
+ if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
goto delete;

- mod = xfs_rtb_to_rtxoff(mp,
- del.br_startblock + del.br_blockcount);
+ mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
if (mod) {
/*
* Realtime extent not lined up at the end.
@@ -5542,9 +5558,16 @@ __xfs_bunmapi(
goto nodelete;
}

- mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
+ mod = xfs_bunmapi_align(ip, del.br_startblock);
if (mod) {
- xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
+ xfs_extlen_t off;
+
+ if (isforcealign) {
+ off = ip->i_extsize - mod;
+ } else {
+ ASSERT(isrt);
+ off = mp->m_sb.sb_rextsize - mod;
+ }

/*
* Realtime extent is lined up at the end but not
--
2.31.1


2024-06-07 14:46:56

by John Garry

[permalink] [raw]
Subject: [PATCH v4 04/22] xfs: only allow minlen allocations when near ENOSPC

From: Dave Chinner <[email protected]>

When we are near ENOSPC and don't have enough free
space for an args->maxlen allocation, xfs_alloc_space_available()
will trim args->maxlen to equal the available space. However, this
function has only checked that there is enough contiguous free space
for an aligned args->minlen allocation to succeed. Hence there is no
guarantee that an args->maxlen allocation will succeed, nor that the
available space will allow for correct alignment of an args->maxlen
allocation.

Further, trimming args->maxlen arbitrarily breaks an
assumption made in xfs_alloc_fix_len() that if the caller wants
aligned allocation, then args->maxlen will be set to an aligned
value. It then skips the tail alignment and so we end up with
extents that aren't aligned to extent size hint boundaries as we
approach ENOSPC.

To avoid this problem, don't reduce args->maxlen by some random,
arbitrary amount. If args->maxlen is too large for the available
space, reduce the allocation to a minlen allocation as we know we
have contiguous free space available for this to succeed and always
be correctly aligned.
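The fallback decision described above can be sketched in a few lines of userland C (clamp_maxlen is a hypothetical distillation of the xfs_alloc_space_available() change below, using the worst-case aligned length as the threshold):

```c
#include <assert.h>

/*
 * If the longest contiguous free extent cannot hold a worst-case
 * aligned maxlen allocation, fall back to a minlen allocation rather
 * than trimming maxlen to an arbitrary (possibly misaligned) size.
 */
static unsigned int clamp_maxlen(unsigned int longest, unsigned int maxlen,
				 unsigned int minlen, unsigned int alignment,
				 unsigned int slop)
{
	unsigned int alloc_len = maxlen + (alignment - 1) + slop;

	if (longest < alloc_len)
		return minlen;
	return maxlen;
}
```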

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6c55a6e88eba..5855a21d4864 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2409,14 +2409,23 @@ xfs_alloc_space_available(
if (available < (int)max(args->total, alloc_len))
return false;

+ if (flags & XFS_ALLOC_FLAG_CHECK)
+ return true;
+
/*
- * Clamp maxlen to the amount of free space available for the actual
- * extent allocation.
+ * If we can't do a maxlen allocation, then we must reduce the size of
+ * the allocation to match the available free space. We know how big
+ * the largest contiguous free space we can allocate is, so that's our
+ * upper bound. However, we don't exactly know what alignment/size
+ * constraints have been placed on the allocation, so we can't
+ * arbitrarily select some new max size. Hence make this a minlen
+ * allocation as we know that will definitely succeed and match the
+ * caller's alignment constraints.
*/
- if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) {
- args->maxlen = available;
+ alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+ if (longest < alloc_len) {
+ args->maxlen = args->minlen;
ASSERT(args->maxlen > 0);
- ASSERT(args->maxlen >= args->minlen);
}

return true;
--
2.31.1


2024-06-07 14:47:09

by John Garry

[permalink] [raw]
Subject: [PATCH v4 08/22] xfs: introduce forced allocation alignment

From: Dave Chinner <[email protected]>

When forced allocation alignment is specified, the extent will
be aligned to the extent size hint size rather than stripe
alignment. If aligned allocation cannot be done, then the allocation
is failed rather than attempting non-aligned fallbacks.

Note: none of the per-inode force align configuration is present
yet, so this just triggers off an "always false" wrapper function
for the moment.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.h | 1 +
fs/xfs/libxfs/xfs_bmap.c | 29 +++++++++++++++++++++++------
fs/xfs/xfs_inode.h | 5 +++++
3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index aa2c103d98f0..7de2e6f64882 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -66,6 +66,7 @@ typedef struct xfs_alloc_arg {
#define XFS_ALLOC_USERDATA (1 << 0)/* allocation is for user data*/
#define XFS_ALLOC_INITIAL_USER_DATA (1 << 1)/* special case start of file */
#define XFS_ALLOC_NOBUSY (1 << 2)/* Busy extents not allowed */
+#define XFS_ALLOC_FORCEALIGN (1 << 3)/* forced extent alignment */

/* freespace limit calculations */
unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 528e3cd81ee6..9131ba8113a6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3401,9 +3401,10 @@ xfs_bmap_alloc_account(
* Calculate the extent start alignment and the extent length adjustments that
* constrain this allocation.
*
- * Extent start alignment is currently determined by stripe configuration and is
- * carried in args->alignment, whilst extent length adjustment is determined by
- * extent size hints and is carried by args->prod and args->mod.
+ * Extent start alignment is currently determined by forced inode alignment or
+ * stripe configuration and is carried in args->alignment, whilst extent length
+ * adjustment is determined by extent size hints and is carried by args->prod
+ * and args->mod.
*
* Low level allocation code is free to either ignore or override these values
* as required.
@@ -3416,11 +3417,18 @@ xfs_bmap_compute_alignments(
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */

- /* stripe alignment for allocation is determined by mount parameters */
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ /*
+ * Forced inode alignment takes preference over stripe alignment.
+ * Stripe alignment for allocation is determined by mount parameters.
+ */
+ if (xfs_inode_has_forcealign(ap->ip)) {
+ args->alignment = xfs_get_extsz_hint(ap->ip);
+ args->datatype |= XFS_ALLOC_FORCEALIGN;
+ } else if (mp->m_swidth && xfs_has_swalloc(mp)) {
args->alignment = mp->m_swidth;
- else if (mp->m_dalign)
+ } else if (mp->m_dalign) {
args->alignment = mp->m_dalign;
+ }

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
@@ -3607,6 +3615,11 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ if (args->alignment > 1 && (args->datatype & XFS_ALLOC_FORCEALIGN)) {
+ args->fsbno = NULLFSBLOCK;
+ return 0;
+ }
+
args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
@@ -3658,6 +3671,8 @@ xfs_bmap_btalloc_filestreams(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
}
@@ -3716,6 +3731,8 @@ xfs_bmap_btalloc_best_length(

/* Attempt non-aligned allocation if we haven't already. */
if (!error && args->fsbno == NULLFSBLOCK && args->alignment > 1) {
+ if (args->datatype & XFS_ALLOC_FORCEALIGN)
+ return error;
args->alignment = 1;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 90d2fa837117..805a8cf522c6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -311,6 +311,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}

+static inline bool xfs_inode_has_forcealign(struct xfs_inode *ip)
+{
+ return false;
+}
+
/*
* Decide if this file is a realtime file whose data allocation unit is larger
* than a single filesystem block.
--
2.31.1


2024-06-07 14:48:00

by John Garry

[permalink] [raw]
Subject: [PATCH v4 15/22] xfs: Don't revert allocated offset for forcealign

In xfs_bmap_process_allocated_extent(), when we find that we could not
provide the requested length completely, the mapping is moved so that it
covers as much of the original request as possible.

For forcealign, this would mean ignoring the alignment guarantee, so don't
do this.
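The placement logic this patch makes conditional can be sketched as follows (place_extent is a hypothetical distillation of the offset adjustment in xfs_bmap_process_allocated_extent(); with forcealign the aligned offset found by the allocator is kept as-is):

```c
#include <assert.h>

/*
 * When a shorter-than-requested extent is allocated, move its offset to
 * cover as much of the original request as possible; with forced
 * alignment the found (aligned) offset must be kept instead.
 */
static unsigned long long place_extent(unsigned long long found_off,
				       unsigned int found_len,
				       unsigned long long orig_off,
				       unsigned int orig_len,
				       int forcealign)
{
	if (forcealign)
		return found_off;
	if (found_len <= orig_len)
		return orig_off;
	if (found_off + found_len < orig_off + orig_len)
		return orig_off + orig_len - found_len;
	return found_off;
}
```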

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_bmap.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2b6d5ebd8b4f..b3552cb5fc8f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3492,11 +3492,15 @@ xfs_bmap_process_allocated_extent(
* original request as possible. Free space is apparently
* very fragmented so we're unlikely to be able to satisfy the
* hints anyway.
+ * However, for an inode with forcealign, continue with the
+ * found offset as we need to honour the alignment hint.
*/
- if (ap->length <= orig_length)
- ap->offset = orig_offset;
- else if (ap->offset + ap->length < orig_offset + orig_length)
- ap->offset = orig_offset + orig_length - ap->length;
+ if (!xfs_inode_has_forcealign(ap->ip)) {
+ if (ap->length <= orig_length)
+ ap->offset = orig_offset;
+ else if (ap->offset + ap->length < orig_offset + orig_length)
+ ap->offset = orig_offset + orig_length - ap->length;
+ }
xfs_bmap_alloc_account(ap);
}

--
2.31.1


2024-06-07 14:48:25

by John Garry

[permalink] [raw]
Subject: [PATCH v4 14/22] xfs: Only free full extents for forcealign

Like we already do for rtvol, only free full extents for forcealign in
xfs_free_file_space().
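The trimming in the hunk below (start rounded up, end rounded down) can be sketched like this (trim_to_full_extents is a hypothetical helper; if the rounded bounds cross, the range contains no whole extent and nothing is freed):

```c
#include <assert.h>

/*
 * Trim a free range to whole extsize-sized extents: round the start up
 * and the end down; if they cross, nothing can be freed.
 */
static int trim_to_full_extents(unsigned long long *start,
				unsigned long long *end,
				unsigned int extsize)
{
	unsigned long long s = *start, e = *end;

	s = (s + extsize - 1) / extsize * extsize;	/* round up */
	e = e / extsize * extsize;			/* round down */
	if (s >= e)
		return 0;	/* no whole extent inside the range */
	*start = s;
	*end = e;
	return 1;
}
```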

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 56b80a7c0992..ee767a4fd63a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -842,8 +842,11 @@ xfs_free_file_space(
startoffset_fsb = XFS_B_TO_FSB(mp, offset);
endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);

- /* We can only free complete realtime extents. */
- if (xfs_inode_has_bigrtalloc(ip)) {
+ /* Free only complete extents. */
+ if (xfs_inode_has_forcealign(ip)) {
+ startoffset_fsb = roundup_64(startoffset_fsb, ip->i_extsize);
+ endoffset_fsb = rounddown_64(endoffset_fsb, ip->i_extsize);
+ } else if (xfs_inode_has_bigrtalloc(ip)) {
startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
}
--
2.31.1


2024-06-07 14:51:53

by John Garry

[permalink] [raw]
Subject: [PATCH v4 06/22] xfs: simplify extent allocation alignment

From: Dave Chinner <[email protected]>

We currently align extent allocation to stripe unit or stripe width.
That is specified by an external parameter to the allocation code,
which then manipulates the xfs_alloc_args alignment configuration in
interesting ways.

The args->alignment field specifies extent start alignment, but
because we may be attempting non-aligned allocation first there are
also slop variables that allow for those allocation attempts to
account for aligned allocation if they fail.

This gets much more complex as we introduce forced allocation
alignment, where extent size hints are used to generate the extent
start alignment. Extent size hints currently only affect extent
lengths (via args->prod and args->mod) and so with this change we
will have two different start alignment conditions.

Avoid this complexity by always using args->alignment to indicate
extent start alignment, and always using args->prod/mod to indicate
extent length adjustment.
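The "slop" accounting this patch consolidates can be illustrated with a one-line sketch (worst_case_len is a hypothetical helper mirroring the alloc_len computation in xfs_alloc_space_available(): whether headroom is carried as start alignment or as alignslop for a later aligned fallback, the worst-case space requirement is the same):

```c
#include <assert.h>

/*
 * Worst-case contiguous space needed for a minlen allocation, counting
 * both the start-alignment headroom and any slop reserved so an aligned
 * fallback attempt can still succeed after an exact attempt.
 */
static unsigned int worst_case_len(unsigned int minlen,
				   unsigned int alignment,
				   unsigned int alignslop)
{
	return minlen + (alignment - 1) + alignslop;
}
```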

Signed-off-by: Dave Chinner <[email protected]>
jpg: fixup alignslop references in xfs_trace.h and xfs_ialloc.c
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 4 +-
fs/xfs/libxfs/xfs_alloc.h | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 96 +++++++++++++++++---------------------
fs/xfs/libxfs/xfs_ialloc.c | 10 ++--
fs/xfs/xfs_trace.h | 8 ++--
5 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 32f72217c126..35fbd6b19682 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2391,7 +2391,7 @@ xfs_alloc_space_available(
reservation = xfs_ag_resv_needed(pag, args->resv);

/* do we have enough contiguous free space for the allocation? */
- alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->minlen + (args->alignment - 1) + args->alignslop;
longest = xfs_alloc_longest_free_extent(pag, min_free, reservation);
if (longest < alloc_len)
return false;
@@ -2420,7 +2420,7 @@ xfs_alloc_space_available(
* allocation as we know that will definitely succeed and match the
* caller's alignment constraints.
*/
- alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop;
+ alloc_len = args->maxlen + (args->alignment - 1) + args->alignslop;
if (longest < alloc_len) {
args->maxlen = args->minlen;
ASSERT(args->maxlen > 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0b956f8b9d5a..aa2c103d98f0 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -46,7 +46,7 @@ typedef struct xfs_alloc_arg {
xfs_extlen_t minleft; /* min blocks must be left after us */
xfs_extlen_t total; /* total blocks needed in xaction */
xfs_extlen_t alignment; /* align answer to multiple of this */
- xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */
+ xfs_extlen_t alignslop; /* slop for alignment calcs */
xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */
xfs_agblock_t max_agbno; /* ... */
xfs_extlen_t len; /* output: actual size of extent */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4..7f8c8e4dd244 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3285,6 +3285,10 @@ xfs_bmap_select_minlen(
xfs_extlen_t blen)
{

+ /* Adjust best length for extent start alignment. */
+ if (blen > args->alignment)
+ blen -= args->alignment;
+
/*
* Since we used XFS_ALLOC_FLAG_TRYLOCK in _longest_free_extent(), it is
* possible that there is enough contiguous free space for this request.
@@ -3300,6 +3304,7 @@ xfs_bmap_select_minlen(
if (blen < args->maxlen)
return blen;
return args->maxlen;
+
}

static int
@@ -3393,35 +3398,43 @@ xfs_bmap_alloc_account(
xfs_trans_mod_dquot_byino(ap->tp, ap->ip, fld, ap->length);
}

-static int
+/*
+ * Calculate the extent start alignment and the extent length adjustments that
+ * constrain this allocation.
+ *
+ * Extent start alignment is currently determined by stripe configuration and is
+ * carried in args->alignment, whilst extent length adjustment is determined by
+ * extent size hints and is carried by args->prod and args->mod.
+ *
+ * Low level allocation code is free to either ignore or override these values
+ * as required.
+ */
+static void
xfs_bmap_compute_alignments(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args)
{
struct xfs_mount *mp = args->mp;
xfs_extlen_t align = 0; /* minimum allocation alignment */
- int stripe_align = 0;

/* stripe alignment for allocation is determined by mount parameters */
if (mp->m_swidth && xfs_has_swalloc(mp))
- stripe_align = mp->m_swidth;
+ args->alignment = mp->m_swidth;
else if (mp->m_dalign)
- stripe_align = mp->m_dalign;
+ args->alignment = mp->m_dalign;

if (ap->flags & XFS_BMAPI_COWFORK)
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
&ap->length))
ASSERT(0);
ASSERT(ap->length);
- }

- /* apply extent size hints if obtained earlier */
- if (align) {
args->prod = align;
div_u64_rem(ap->offset, args->prod, &args->mod);
if (args->mod)
@@ -3436,7 +3449,6 @@ xfs_bmap_compute_alignments(
args->mod = args->prod - args->mod;
}

- return stripe_align;
}

static void
@@ -3508,7 +3520,7 @@ xfs_bmap_exact_minlen_extent_alloc(
args.total = ap->total;

args.alignment = 1;
- args.minalignslop = 0;
+ args.alignslop = 0;

args.minleft = ap->minleft;
args.wasdel = ap->wasdel;
@@ -3548,7 +3560,6 @@ xfs_bmap_btalloc_at_eof(
struct xfs_bmalloca *ap,
struct xfs_alloc_arg *args,
xfs_extlen_t blen,
- int stripe_align,
bool ag_only)
{
struct xfs_mount *mp = args->mp;
@@ -3562,23 +3573,15 @@ xfs_bmap_btalloc_at_eof(
* allocation.
*/
if (ap->offset) {
- xfs_extlen_t nextminlen = 0;
+ xfs_extlen_t alignment = args->alignment;

/*
- * Compute the minlen+alignment for the next case. Set slop so
- * that the value of minlen+alignment+slop doesn't go up between
- * the calls.
+ * Compute the alignment slop for the fallback path so we ensure
+ * we account for the potential alignment space required by the
+ * fallback paths before we modify the AGF and AGFL here.
*/
args->alignment = 1;
- if (blen > stripe_align && blen <= args->maxlen)
- nextminlen = blen - stripe_align;
- else
- nextminlen = args->minlen;
- if (nextminlen + stripe_align > args->minlen + 1)
- args->minalignslop = nextminlen + stripe_align -
- args->minlen - 1;
- else
- args->minalignslop = 0;
+ args->alignslop = alignment - args->alignment;

if (!caller_pag)
args->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ap->blkno));
@@ -3596,19 +3599,8 @@ xfs_bmap_btalloc_at_eof(
* Exact allocation failed. Reset to try an aligned allocation
* according to the original allocation specification.
*/
- args->alignment = stripe_align;
- args->minlen = nextminlen;
- args->minalignslop = 0;
- } else {
- /*
- * Adjust minlen to try and preserve alignment if we
- * can't guarantee an aligned maxlen extent.
- */
- args->alignment = stripe_align;
- if (blen > args->alignment &&
- blen <= args->maxlen + args->alignment)
- args->minlen = blen - args->alignment;
- args->minalignslop = 0;
+ args->alignment = alignment;
+ args->alignslop = 0;
}

if (ag_only) {
@@ -3626,9 +3618,8 @@ xfs_bmap_btalloc_at_eof(
return 0;

/*
- * Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * Aligned allocation failed, so all fallback paths from here drop the
+ * start alignment requirement as we know it will not succeed.
*/
args->alignment = 1;
return 0;
@@ -3636,7 +3627,9 @@ xfs_bmap_btalloc_at_eof(

/*
* We have failed multiple allocation attempts so now are in a low space
- * allocation situation. Try a locality first full filesystem minimum length
+ * allocation situation. We give up on any attempt at aligned allocation here.
+ *
+ * Try a locality first full filesystem minimum length
* allocation whilst still maintaining necessary total block reservation
* requirements.
*
@@ -3653,6 +3646,7 @@ xfs_bmap_btalloc_low_space(
{
int error;

+ args->alignment = 1;
if (args->minlen > ap->minlen) {
args->minlen = ap->minlen;
error = xfs_alloc_vextent_start_ag(args, ap->blkno);
@@ -3672,13 +3666,11 @@ xfs_bmap_btalloc_low_space(
static int
xfs_bmap_btalloc_filestreams(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error = 0;

-
error = xfs_filestream_select_ag(ap, args, &blen);
if (error)
return error;
@@ -3697,8 +3689,7 @@ xfs_bmap_btalloc_filestreams(

args->minlen = xfs_bmap_select_minlen(ap, args, blen);
if (ap->aeof)
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- true);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, true);

if (!error && args->fsbno == NULLFSBLOCK)
error = xfs_alloc_vextent_near_bno(args, ap->blkno);
@@ -3722,8 +3713,7 @@ xfs_bmap_btalloc_filestreams(
static int
xfs_bmap_btalloc_best_length(
struct xfs_bmalloca *ap,
- struct xfs_alloc_arg *args,
- int stripe_align)
+ struct xfs_alloc_arg *args)
{
xfs_extlen_t blen = 0;
int error;
@@ -3747,8 +3737,7 @@ xfs_bmap_btalloc_best_length(
* trying.
*/
if (ap->aeof && !(ap->tp->t_flags & XFS_TRANS_LOWMODE)) {
- error = xfs_bmap_btalloc_at_eof(ap, args, blen, stripe_align,
- false);
+ error = xfs_bmap_btalloc_at_eof(ap, args, blen, false);
if (error || args->fsbno != NULLFSBLOCK)
return error;
}
@@ -3775,27 +3764,26 @@ xfs_bmap_btalloc(
.resv = XFS_AG_RESV_NONE,
.datatype = ap->datatype,
.alignment = 1,
- .minalignslop = 0,
+ .alignslop = 0,
};
xfs_fileoff_t orig_offset;
xfs_extlen_t orig_length;
int error;
- int stripe_align;

ASSERT(ap->length);
orig_offset = ap->offset;
orig_length = ap->length;

- stripe_align = xfs_bmap_compute_alignments(ap, &args);
+ xfs_bmap_compute_alignments(ap, &args);

/* Trim the allocation back to the maximum an AG can fit. */
args.maxlen = min(ap->length, mp->m_ag_max_usable);

if ((ap->datatype & XFS_ALLOC_USERDATA) &&
xfs_inode_is_filestream(ap->ip))
- error = xfs_bmap_btalloc_filestreams(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_filestreams(ap, &args);
else
- error = xfs_bmap_btalloc_best_length(ap, &args, stripe_align);
+ error = xfs_bmap_btalloc_best_length(ap, &args);
if (error)
return error;

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..9f71a9a3a65e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -758,12 +758,12 @@ xfs_ialloc_ag_alloc(
*
* For an exact allocation, alignment must be 1,
* however we need to take cluster alignment into account when
- * fixing up the freelist. Use the minalignslop field to
- * indicate that extra blocks might be required for alignment,
- * but not to use them in the actual exact allocation.
+ * fixing up the freelist. Use the alignslop field to indicate
+ * that extra blocks might be required for alignment, but not
+ * to use them in the actual exact allocation.
*/
args.alignment = 1;
- args.minalignslop = igeo->cluster_align - 1;
+ args.alignslop = igeo->cluster_align - 1;

/* Allow space for the inode btree to split. */
args.minleft = igeo->inobt_maxlevels;
@@ -783,7 +783,7 @@ xfs_ialloc_ag_alloc(
* on, so reset minalignslop to ensure it is not included in
* subsequent requests.
*/
- args.minalignslop = 0;
+ args.alignslop = 0;
}

if (unlikely(args.fsbno == NULLFSBLOCK)) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 25ff6fe1eb6c..0b2a2a1379bd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1808,7 +1808,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__field(xfs_extlen_t, minleft)
__field(xfs_extlen_t, total)
__field(xfs_extlen_t, alignment)
- __field(xfs_extlen_t, minalignslop)
+ __field(xfs_extlen_t, alignslop)
__field(xfs_extlen_t, len)
__field(char, wasdel)
__field(char, wasfromfl)
@@ -1827,7 +1827,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft = args->minleft;
__entry->total = args->total;
__entry->alignment = args->alignment;
- __entry->minalignslop = args->minalignslop;
+ __entry->alignslop = args->alignslop;
__entry->len = args->len;
__entry->wasdel = args->wasdel;
__entry->wasfromfl = args->wasfromfl;
@@ -1836,7 +1836,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->highest_agno = args->tp->t_highest_agno;
),
TP_printk("dev %d:%d agno 0x%x agbno 0x%x minlen %u maxlen %u mod %u "
- "prod %u minleft %u total %u alignment %u minalignslop %u "
+ "prod %u minleft %u total %u alignment %u alignslop %u "
"len %u wasdel %d wasfromfl %d resv %d "
"datatype 0x%x highest_agno 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
@@ -1849,7 +1849,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
__entry->minleft,
__entry->total,
__entry->alignment,
- __entry->minalignslop,
+ __entry->alignslop,
__entry->len,
__entry->wasdel,
__entry->wasfromfl,
--
2.31.1


2024-06-07 14:53:17

by John Garry

[permalink] [raw]
Subject: [PATCH v4 16/22] xfs: Enable file data forcealign feature

From: "Darrick J. Wong" <[email protected]>

Enable the forcealign feature by adding it to the set of supported
read-only compatible superblock features.

Signed-off-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b48cd75d34a6..42e1f80206ab 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.31.1


2024-06-07 14:58:40

by John Garry

[permalink] [raw]
Subject: [PATCH v4 21/22] xfs: Validate atomic writes

Validate that an atomic write adheres to length/offset rules. Since we
require extent alignment for atomic writes, this effectively also enforces
that the BIO which iomap produces is aligned.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_dio_write(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, the
FORCEALIGN and ATOMICWRITES flags must both be set on the inode.
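The length/offset rules being validated can be sketched as follows (atomic_write_valid is a hypothetical userland distillation assumed from the series description, not the kernel's actual generic_atomic_write_valid_size(): the length must be a power-of-2 number of bytes between the minimum and maximum write units, and the file position must be naturally aligned to the length):

```c
#include <assert.h>

/*
 * Assumed atomic-write rules: power-of-2 length within
 * [unit_min, unit_max], position naturally aligned to the length.
 */
static int atomic_write_valid(unsigned long long pos, unsigned int len,
			      unsigned int unit_min, unsigned int unit_max)
{
	if (len < unit_min || len > unit_max)
		return 0;
	if (len & (len - 1))		/* not a power of 2 */
		return 0;
	if (pos & (len - 1))		/* start not aligned to length */
		return 0;
	return 1;
}
```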

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 24fe3c2e03da..eeb267ae2bf2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -695,12 +695,21 @@ xfs_file_dio_write(
struct kiocb *iocb,
struct iov_iter *from)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct xfs_inode *ip = XFS_I(inode);
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
+ struct xfs_mount *mp = ip->i_mount;
bool unaligned;
u64 unitsize;

+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+ i_blocksize(inode), XFS_FSB_TO_B(mp, ip->i_extsize))) {
+ return -EINVAL;
+ }
+ }
+
/* direct I/O must be aligned to device logical sector size */
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
--
2.31.1


2024-06-11 10:09:20

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 13/22] xfs: Unmap blocks according to forcealign

On 07/06/2024 15:39, John Garry wrote:
> For when forcealign is enabled, blocks in an inode need to be unmapped
> according to extent alignment, like what is already done for rtvol.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index c9cf138e13c4..2b6d5ebd8b4f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5380,6 +5380,20 @@ xfs_bmap_del_extent_real(
> return 0;
> }
>
> +static xfs_extlen_t
> +xfs_bunmapi_align(
> + struct xfs_inode *ip,
> + xfs_fsblock_t bno)
> +{
> + if (xfs_inode_has_forcealign(ip)) {
> + if (is_power_of_2(ip->i_extsize))
> + return bno & (ip->i_extsize - 1);
> + return do_div(bno, ip->i_extsize);
> + }
> + ASSERT(XFS_IS_REALTIME_INODE(ip));
> + return xfs_rtb_to_rtxoff(ip->i_mount, bno);
> +}

This following updated version of xfs_bunmapi_align() seems to fix the
issue reported in
https://lore.kernel.org/linux-xfs/[email protected]/

static xfs_extlen_t
xfs_bunmapi_align(
	struct xfs_inode	*ip,
	xfs_fsblock_t		fsbno)
{
	if (xfs_inode_has_forcealign(ip)) {
		struct xfs_mount	*mp = ip->i_mount;
		xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, fsbno);

		if (is_power_of_2(ip->i_extsize))
			return agbno & (ip->i_extsize - 1);
		return do_div(agbno, ip->i_extsize);
	}
	ASSERT(XFS_IS_REALTIME_INODE(ip));
	return xfs_rtb_to_rtxoff(ip->i_mount, fsbno);
}


> +
> /*
> * Unmap (remove) blocks from a file.
> * If nexts is nonzero then the number of extents to remove is limited to
> @@ -5402,6 +5416,7 @@ __xfs_bunmapi(
> struct xfs_bmbt_irec got; /* current extent record */
> struct xfs_ifork *ifp; /* inode fork pointer */
> int isrt; /* freeing in rt area */
> + int isforcealign; /* freeing for inode with forcealign */
> int logflags; /* transaction logging flags */
> xfs_extlen_t mod; /* rt extent offset */
> struct xfs_mount *mp = ip->i_mount;
> @@ -5439,6 +5454,8 @@ __xfs_bunmapi(
> }
> XFS_STATS_INC(mp, xs_blk_unmap);
> isrt = xfs_ifork_is_realtime(ip, whichfork);
> + isforcealign = (whichfork != XFS_ATTR_FORK) &&
> + xfs_inode_has_forcealign(ip);
> end = start + len;
>
> if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
> @@ -5490,11 +5507,10 @@ __xfs_bunmapi(
> if (del.br_startoff + del.br_blockcount > end + 1)
> del.br_blockcount = end + 1 - del.br_startoff;
>
> - if (!isrt || (flags & XFS_BMAPI_REMAP))
> + if ((!isrt && !isforcealign) || (flags & XFS_BMAPI_REMAP))
> goto delete;
>
> - mod = xfs_rtb_to_rtxoff(mp,
> - del.br_startblock + del.br_blockcount);
> + mod = xfs_bunmapi_align(ip, del.br_startblock + del.br_blockcount);
> if (mod) {
> /*
> * Realtime extent not lined up at the end.
> @@ -5542,9 +5558,16 @@ __xfs_bunmapi(
> goto nodelete;
> }
>
> - mod = xfs_rtb_to_rtxoff(mp, del.br_startblock);
> + mod = xfs_bunmapi_align(ip, del.br_startblock);
> if (mod) {
> - xfs_extlen_t off = mp->m_sb.sb_rextsize - mod;
> + xfs_extlen_t off;
> +
> + if (isforcealign) {
> + off = ip->i_extsize - mod;
> + } else {
> + ASSERT(isrt);
> + off = mp->m_sb.sb_rextsize - mod;
> + }
>
> /*
> * Realtime extent is lined up at the end but not


2024-06-12 21:33:12

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

On Fri, Jun 07, 2024 at 02:38:59PM +0000, John Garry wrote:
> Allow filesystems to set the io_block_size for sub-fs block size zeroing,
> as in future we will want to extend this feature to support zeroing of
> block sizes of larger than the inode block size.
>
> The value in io_block_size does not have to be a power-of-2, so fix up
> zeroing code to handle that.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> block/fops.c | 1 +
> fs/btrfs/inode.c | 1 +
> fs/erofs/data.c | 1 +
> fs/erofs/zmap.c | 1 +
> fs/ext2/inode.c | 1 +
> fs/ext4/extents.c | 1 +
> fs/ext4/inode.c | 1 +
> fs/f2fs/data.c | 1 +
> fs/fuse/dax.c | 1 +
> fs/gfs2/bmap.c | 1 +
> fs/hpfs/file.c | 1 +
> fs/iomap/direct-io.c | 23 +++++++++++++++++++----
> fs/xfs/xfs_iomap.c | 1 +
> fs/zonefs/file.c | 2 ++
> include/linux/iomap.h | 2 ++
> 15 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index 9d6d86ebefb9..020443078630 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->addr = iomap->offset;
> iomap->length = isize - iomap->offset;
> iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
> + iomap->io_block_size = i_blocksize(inode);
> return 0;
> }
>

<snip a bunch of filesystems setting io_block_size to i_blocksize>

> diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
> index 1bb8d97cd9ae..5d2718faf520 100644
> --- a/fs/hpfs/file.c
> +++ b/fs/hpfs/file.c
> @@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->addr = IOMAP_NULL_ADDR;
> iomap->length = 1 << blkbits;
> }
> + iomap->io_block_size = i_blocksize(inode);

HPFS does iomap now? Yikes.

>
> hpfs_unlock(sb);
> return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..5be8d886ab4a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> {
> const struct iomap *iomap = &iter->iomap;
> struct inode *inode = iter->inode;
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + u64 io_block_size = iomap->io_block_size;

I wonder, should iomap be nice and not require filesystems to set
io_block_size themselves unless they really need it? Anyone working on
an iomap port while this patchset is in progress may or may not remember
to add this bit if they get their port merged after atomicwrites is
merged; and you might not remember to prevent the bitrot if the reverse
order happens.

u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);

> loff_t length = iomap_length(iter);
> loff_t pos = iter->pos;
> blk_opf_t bio_opf;
> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> int nr_pages, ret = 0;
> size_t copied = 0;
> size_t orig_count;
> + unsigned int pad;
>
> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>
> if (need_zeroout) {
> /* zero out from the start of the block to the write offset */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }

Please don't opencode this twice.

static unsigned int offset_in_block(loff_t pos, u64 blocksize)
{
	if (likely(is_power_of_2(blocksize)))
		return pos & (blocksize - 1);
	return do_div(pos, blocksize);
}

pad = offset_in_block(pos, io_block_size);
if (pad)
...

Also, what happens if pos-pad points to a byte before the mapping?

> +
> if (pad)
> iomap_dio_zero(iter, dio, pos - pad, pad);
> }
> @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> if (need_zeroout ||
> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
> /* zero out from the end of the write to the end of the block */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }
> +
> if (pad)
> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
> + iomap_dio_zero(iter, dio, pos, io_block_size - pad);

What if pos + io_block_size - pad points to a byte after the end of the
mapping?

> }
> out:
> /* Undo iter limitation to current extent */
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 378342673925..ecb4cae88248 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
> }
> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
> + iomap->io_block_size = i_blocksize(VFS_I(ip));
> if (mapping_flags & IOMAP_DAX)
> iomap->dax_dev = target->bt_daxdev;
> else
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 3b103715acc9..bf2cc4bee309 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
> iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
> iomap->length = isize - iomap->offset;
> }
> + iomap->io_block_size = i_blocksize(inode);
> mutex_unlock(&zi->i_truncate_mutex);
>
> trace_zonefs_iomap_begin(inode, iomap);
> @@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
> iomap->type = IOMAP_MAPPED;
> iomap->length = isize - iomap->offset;
> }
> + iomap->io_block_size = i_blocksize(inode);
> mutex_unlock(&zi->i_truncate_mutex);
>
> trace_zonefs_iomap_begin(inode, iomap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..d63a35b77907 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -103,6 +103,8 @@ struct iomap {
> void *private; /* filesystem private */
> const struct iomap_folio_ops *folio_ops;
> u64 validity_cookie; /* used with .iomap_valid() */
> + /* io block zeroing size, not necessarily a power-of-2 */

size in bytes?

I'm not sure what "io block zeroing" means. What are you trying to
accomplish here? Let's say the fsblock size is 4k and the allocation
unit (aka the atomic write size) is 16k. Userspace wants a direct write
to file offset 8192-12287, and that space is unwritten:

uuuu
^

Currently we'd just write the 4k and run the io completion handler, so
the final state is:

uuWu

Instead, if the fs sets io_block_size to 16384, does this direct write
now amplify into a full 16k write? With the end result being:

ZZWZ

only.... I don't see the unwritten areas being converted to written?
I guess for an atomic write you'd require the user to write 0-16383?

<still confused about why we need to do this, maybe i'll figure it out
as I go along>

--D

> + u64 io_block_size;
> };
>
> static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
> --
> 2.31.1
>
>

2024-06-13 10:32:42

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

On 12/06/2024 22:32, Darrick J. Wong wrote:
>> unsigned int fs_block_size = i_blocksize(inode), pad;
>> + u64 io_block_size = iomap->io_block_size;
> I wonder, should iomap be nice and not require filesystems to set
> io_block_size themselves unless they really need it?

That's what I had in v3, like:

if (iomap->io_block_size)
	io_block_size = iomap->io_block_size;
else
	io_block_size = i_blocksize(inode);

but it was suggested to change that (to like what I have here).

> Anyone working on
> an iomap port while this patchset is in progress may or may not remember
> to add this bit if they get their port merged after atomicwrites is
> merged; and you might not remember to prevent the bitrot if the reverse
> order happens.

Sure, I get your point.

However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close
to all members of struct iomap, so we are just continuing that trend,
i.e. it is the job of the FS callback to set all these members.

>
> u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
>
>> loff_t length = iomap_length(iter);
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> int nr_pages, ret = 0;
>> size_t copied = 0;
>> size_t orig_count;
>> + unsigned int pad;
>>
>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>
>> if (need_zeroout) {
>> /* zero out from the start of the block to the write offset */
>> - pad = pos & (fs_block_size - 1);
>> + if (is_power_of_2(io_block_size)) {
>> + pad = pos & (io_block_size - 1);
>> + } else {
>> + loff_t _pos = pos;
>> +
>> + pad = do_div(_pos, io_block_size);
>> + }
> Please don't opencode this twice.
>
> static unsigned int offset_in_block(loff_t pos, u64 blocksize)
> {
> if (likely(is_power_of_2(blocksize)))
> return pos & (blocksize - 1);
> return do_div(pos, blocksize);
> }

ok, fine

>
> pad = offset_in_block(pos, io_block_size);
> if (pad)
> ...
>
> Also, what happens if pos-pad points to a byte before the mapping?

It's the job of the FS to map in something aligned to io_block_size.
Having said that, I don't think we are doing that for XFS (which sets
io_block_size > i_blocksize(inode)), so I need to check that.

>
>> +
>> if (pad)
>> iomap_dio_zero(iter, dio, pos - pad, pad);
>> }
>> @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> if (need_zeroout ||
>> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>> /* zero out from the end of the write to the end of the block */
>> - pad = pos & (fs_block_size - 1);
>> + if (is_power_of_2(io_block_size)) {
>> + pad = pos & (io_block_size - 1);
>> + } else {
>> + loff_t _pos = pos;
>> +
>> + pad = do_div(_pos, io_block_size);
>> + }
>> +
>> if (pad)
>> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
>> + iomap_dio_zero(iter, dio, pos, io_block_size - pad);
> What if pos + io_block_size - pad points to a byte after the end of the
> mapping?

as above, we expect this to be mapped in (so ok to zero)

>
>> }
>> out:
>> /* Undo iter limitation to current extent */
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 378342673925..ecb4cae88248 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -127,6 +127,7 @@ xfs_bmbt_to_iomap(
>> }
>> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>> + iomap->io_block_size = i_blocksize(VFS_I(ip));
>> if (mapping_flags & IOMAP_DAX)
>> iomap->dax_dev = target->bt_daxdev;
>> else
>> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
>> index 3b103715acc9..bf2cc4bee309 100644
>> --- a/fs/zonefs/file.c
>> +++ b/fs/zonefs/file.c
>> @@ -50,6 +50,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
>> iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
>> iomap->length = isize - iomap->offset;
>> }
>> + iomap->io_block_size = i_blocksize(inode);
>> mutex_unlock(&zi->i_truncate_mutex);
>>
>> trace_zonefs_iomap_begin(inode, iomap);
>> @@ -99,6 +100,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
>> iomap->type = IOMAP_MAPPED;
>> iomap->length = isize - iomap->offset;
>> }
>> + iomap->io_block_size = i_blocksize(inode);
>> mutex_unlock(&zi->i_truncate_mutex);
>>
>> trace_zonefs_iomap_begin(inode, iomap);
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 6fc1c858013d..d63a35b77907 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -103,6 +103,8 @@ struct iomap {
>> void *private; /* filesystem private */
>> const struct iomap_folio_ops *folio_ops;
>> u64 validity_cookie; /* used with .iomap_valid() */
>> + /* io block zeroing size, not necessarily a power-of-2 */
> size in bytes?
>
> I'm not sure what "io block zeroing" means.

Naming is hard. Essentially we are trying to reuse the sub-fs block
zeroing code for sub-extent granule writes. More below.

> What are you trying to
> accomplish here? Let's say the fsblock size is 4k and the allocation
> unit (aka the atomic write size) is 16k.

ok, so I say here that the extent granule is 16k

> Userspace wants a direct write
> to file offset 8192-12287, and that space is unwritten:
>
> uuuu
> ^
>
> Currently we'd just write the 4k and run the io completion handler, so
> the final state is:
>
> uuWu
>
> Instead, if the fs sets io_block_size to 16384, does this direct write
> now amplify into a full 16k write?

Yes, but only when the extent is newly allocated and we require zeroing.

> With the end result being:
> ZZWZ

Yes

>
> only.... I don't see the unwritten areas being converted to written?

See xfs_iomap_write_unwritten() change in the next patch

> I guess for an atomic write you'd require the user to write 0-16383?

Not exactly

>
> <still confused about why we need to do this, maybe i'll figure it out
> as I go along>

This zeroing is just really required for atomic writes. The purpose is
to zero the extent granule for any write within a newly allocated granule.

Consider we have uuWu, above. If the user then attempts to write the
full 16K as an atomic write, the iomap iter code would generate writes
for sizes 8k, 4k, and 4k, i.e. not a single 16K write. This is not
acceptable. So the idea is to zero the full extent granule when
allocated, so we have ZZWZ after the 4k write at offset 8192, above. As
such, if we then attempt this 16K atomic write, we get a single 16K BIO,
i.e. there is no unwritten extent conversion.

I am not sure whether we should be doing this only for atomic-writes
inodes, or also for forcealign-only and RT inodes.

Thanks,
John



2024-06-13 11:14:48

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v4 03/22] xfs: Use extent size granularity for iomap->io_block_size

On 12/06/2024 22:47, Darrick J. Wong wrote:
> On Fri, Jun 07, 2024 at 02:39:00PM +0000, John Garry wrote:
>> Currently iomap->io_block_size is set to the i_blocksize() value for the
>> inode.
>>
>> Expand the sub-fs block size zeroing to now cover RT extents, by
>> setting iomap->io_block_size to xfs_inode_alloc_unitsize().
>>
>> In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
>> this extent granularity.
>>
>> In xfs_file_dio_write(), handle a write which is not aligned to extent
>> size granularity as unaligned. Since the extent size granularity need not
>> be a power-of-2, handle this also.
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/xfs_file.c | 24 +++++++++++++++++++-----
>> fs/xfs/xfs_inode.c | 17 +++++++++++------
>> fs/xfs/xfs_inode.h | 1 +
>> fs/xfs/xfs_iomap.c | 8 +++++++-
>> 4 files changed, 38 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index b240ea5241dc..24fe3c2e03da 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
>> }
>>
>> /*
>> - * Handle block unaligned direct I/O writes
>> + * Handle unaligned direct IO writes.
>> *
>> * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
>> * them to be done in parallel with reads and other direct I/O writes. However,
>> @@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
>> ssize_t ret;
>>
>> /*
>> - * Extending writes need exclusivity because of the sub-block zeroing
>> - * that the DIO code always does for partial tail blocks beyond EOF, so
>> - * don't even bother trying the fast path in this case.
>> + * Extending writes need exclusivity because of the sub-block/extent
>> + * zeroing that the DIO code always does for partial tail blocks
>> + * beyond EOF, so don't even bother trying the fast path in this case.
>
> Hummm. So let's say the fsblock size is 4k, the rt extent size is 16k,
> and you want to write bytes 8192-12287 of a file. Currently we'd use
> xfs_file_dio_write_aligned for that, but now we'd use
> xfs_file_dio_write_unaligned? Even though we don't need zeroing or any
> of that stuff?

Right, this is something which I mentioned in response to the previous
patch.

I'm unsure whether we should do this only for atomic-writes inodes, or
also for RT and forcealign-only inodes.

I got the impression from Dave in review of the previous version of this
series that it should include RT and forcealign-only.

>
>> */
>> if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>> if (iocb->ki_flags & IOCB_NOWAIT)
>> @@ -698,11 +698,25 @@ xfs_file_dio_write(
>> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
>> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> size_t count = iov_iter_count(from);
>> + bool unaligned;
>> + u64 unitsize;
>>
>> /* direct I/O must be aligned to device logical sector size */
>> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>> return -EINVAL;
>> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>> +
>> + unitsize = xfs_inode_alloc_unitsize(ip);
>> + if (!is_power_of_2(unitsize)) {
>> + if (isaligned_64(iocb->ki_pos, unitsize) &&
>> + isaligned_64(count, unitsize))
>> + unaligned = false;
>> + else
>> + unaligned = true;
>> + } else {
>> + unaligned = (iocb->ki_pos | count) & (unitsize - 1);
>> + }
>
> Didn't I already write this?

It's from xfs_is_falloc_aligned(). Let's reuse that fully here. I did
look at doing that before, though...

>
>> + if (unaligned)
>
> if (!xfs_is_falloc_aligned(ip, iocb->ki_pos, count))
>
>> return xfs_file_dio_write_unaligned(ip, iocb, from);
>> return xfs_file_dio_write_aligned(ip, iocb, from);
>> }
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 58fb7a5062e1..93ad442f399b 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -4264,15 +4264,20 @@ xfs_break_layouts(
>> return error;
>> }
>>
>> -/* Returns the size of fundamental allocation unit for a file, in bytes. */
>
> Don't delete the comment, it has useful return type information.

It wasn't deleted, it is still below.

>
> /*
> * Returns the size of fundamental allocation unit for a file, in
> * fsblocks.
> */
>
>> unsigned int
>> -xfs_inode_alloc_unitsize(
>> +xfs_inode_alloc_unitsize_fsb(
>> struct xfs_inode *ip)
>> {
>> - unsigned int blocks = 1;
>> -
>> if (XFS_IS_REALTIME_INODE(ip))
>> - blocks = ip->i_mount->m_sb.sb_rextsize;
>> + return ip->i_mount->m_sb.sb_rextsize;
>> +
>> + return 1;
>> +}
>>
>> - return XFS_FSB_TO_B(ip->i_mount, blocks);
>> +/* Returns the size of fundamental allocation unit for a file, in bytes. */
>> +unsigned int
>> +xfs_inode_alloc_unitsize(
>> + struct xfs_inode *ip)
>> +{
>> + return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
>> }
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 292b90b5f2ac..90d2fa837117 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
>> bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
>> void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>> xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
>> +unsigned int xfs_inode_alloc_unitsize_fsb(struct xfs_inode *ip);
>> unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
>>
>> struct xfs_dir_update_params {
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ecb4cae88248..fbe69f747e30 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -127,7 +127,7 @@ xfs_bmbt_to_iomap(
>> }
>> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>> - iomap->io_block_size = i_blocksize(VFS_I(ip));
>> + iomap->io_block_size = xfs_inode_alloc_unitsize(ip);
>
> Oh, I see. So io_block_size causes iomap to write zeroes to the storage
> backing surrounding areas of the file range.
Yes

> In this case, for direct
> writes to the unwritten middle 4k of an otherwise written 16k extent,
> we'll write zeroes to 0-4k and 8k-16k even though that wasn't what the
> caller asked for?

We would only do that for a newly allocated extent. We should not
overwrite existing data.

>
> IOWs, if you start with:
>
> WWuW
>
> write to the "U", then it'll write zeroes to the "W" areas? That
> doesn't sound good...

No, that definitely should not happen.

We would only zero once, when doing a sub-extent granule write to an
unallocated extent.

In iomap_dio_bio_iter(), we only zero for IOMAP_UNWRITTEN or IOMAP_F_NEW.

>
>> if (mapping_flags & IOMAP_DAX)
>> iomap->dax_dev = target->bt_daxdev;
>> else
>> @@ -577,11 +577,17 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + unsigned int rounding;
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + rounding = xfs_inode_alloc_unitsize_fsb(ip);
>> + if (rounding > 1) {
>> + offset_fsb = rounddown_64(offset_fsb, rounding);
>> + count_fsb = roundup_64(count_fsb, rounding);
>> + }
>
> ...and then the ioend handler is supposed to be smart enough to know
> that iomap quietly wrote to other parts of the disk.

iomap_io_complete() only knows about the non-zeroing written data. I am
not changing that really.

>
> Um, does this cause unwritten extent conversion for entire rtextents
> after writeback to a rtextsize > 1fsb file?

Yes.

>
> Or am I really misunderstanding what's going on here with the io paths?

Thanks,
John