LinuxLists.cc - [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
> - mmap path of ext2 continues to use generic_file_vm_ops

So this means there are not space reservations taken at write fault
time. While iomap was written with the assumption those exist, I can't
instantly spot anything that relies on them - you are just a lot more
likely to hit an -ENOSPC from ->map_blocks now. Maybe we should add
an xfstests that covers the case where we use up more then the existing
free space through writable mmaps to ensure it fully works (where works
means at least as good as the old behavior)?

> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret = 0;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock(inode);
> + ret = generic_write_checks(iocb, from);
> + if (ret > 0)
> + ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> + inode_unlock(inode);
> + if (ret > 0)
> + ret = generic_write_sync(iocb, ret);
> + return ret;
> +}

Not for now, but if we end up doing a bunch of conversation of trivial
file systems, it might be worth to eventually add a wrapper for the
above code with just the iomap_ops passed in. Only once we see a few
duplicates, though.

> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> + struct inode *inode, loff_t offset)
> +{
> + if (offset >= wpc->iomap.offset &&
> + offset < wpc->iomap.offset + wpc->iomap.length)
> + return 0;
> +
> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + IOMAP_WRITE, &wpc->iomap, NULL);
> +}

Looking at gfs2 this also might become a pattern. But I'm fine with
waiting for now.

> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode)
> if (IS_DAX(inode))
> inode->i_mapping->a_ops = &ext2_dax_aops;
> else
> - inode->i_mapping->a_ops = &ext2_aops;
> + inode->i_mapping->a_ops = &ext2_file_aops;
> }

Did yo run into issues in using the iomap based aops for the other uses
of ext2_aops, or are just trying to address the users one at a time?

2023-11-21 05:56:49

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Christoph Hellwig <[email protected]> writes:

> On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
>> - mmap path of ext2 continues to use generic_file_vm_ops
>
> So this means there are not space reservations taken at write fault
> time.

Yes.

> While iomap was written with the assumption those exist, I can't
> instantly spot anything that relies on them - you are just a lot more
> likely to hit an -ENOSPC from ->map_blocks now.

Which is also true with existing code no? If the block reservation is
not done at the write fault, writeback is likely to fail due to ENOSPC?

> Maybe we should add
> an xfstests that covers the case where we use up more then the existing
> free space through writable mmaps to ensure it fully works (where works
> means at least as good as the old behavior)?
>

Sure, I can try write an fstests for the same.

Also I did find an old thread which tried implementing ->page_mkwrite in
ext2 [1]. However, it was rejected with following reason -

"""
Allocating
blocks on write fault has the unwelcome impact that the file layout is
likely going to be much worse (much more fragmented) - I remember getting
some reports about performance regressions from users back when I did a
similar change for ext3. And so I'm reluctant to change behavior of such
an old legacy filesystem as ext2...
"""

https://lore.kernel.org/linux-ext4/[email protected]/

>> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
>> + struct iov_iter *from)
>> +{
>> + ssize_t ret = 0;
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + inode_lock(inode);
>> + ret = generic_write_checks(iocb, from);
>> + if (ret > 0)
>> + ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
>> + inode_unlock(inode);
>> + if (ret > 0)
>> + ret = generic_write_sync(iocb, ret);
>> + return ret;
>> +}
>
> Not for now, but if we end up doing a bunch of conversation of trivial
> file systems, it might be worth to eventually add a wrapper for the
> above code with just the iomap_ops passed in. Only once we see a few
> duplicates, though.
>

Sure, make sense. Thanks!
I can try and check if the the wrapper helps.

>> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>> + struct inode *inode, loff_t offset)
>> +{
>> + if (offset >= wpc->iomap.offset &&
>> + offset < wpc->iomap.offset + wpc->iomap.length)
>> + return 0;
>> +
>> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
>> + IOMAP_WRITE, &wpc->iomap, NULL);
>> +}
>
> Looking at gfs2 this also might become a pattern. But I'm fine with
> waiting for now.
>

ok.

>> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode)
>> if (IS_DAX(inode))
>> inode->i_mapping->a_ops = &ext2_dax_aops;
>> else
>> - inode->i_mapping->a_ops = &ext2_aops;
>> + inode->i_mapping->a_ops = &ext2_file_aops;
>> }
>
> Did yo run into issues in using the iomap based aops for the other uses
> of ext2_aops, or are just trying to address the users one at a time?

There are problems for e.g. for dir type in ext2. It uses the pagecache
for dir. It uses buffer_heads and attaches them to folio->private.
...it uses block_write_begin/block_write_end() calls.
Look for ext4_make_empty() -> ext4_prepare_chunk ->
block_write_begin().
Now during sync/writeback of the dirty pages (ext4_handle_dirsync()), we
might take a iomap writeback path (if using ext2_file_aops for dir)
which sees folio->private assuming it is "struct iomap_folio_state".
And bad things will happen...

Now we don't have an equivalent APIs in iomap for
block_write_begin()/end() which the users can call for. Hence, Jan
suggested to lets first convert ext2 regular file path to iomap as an RFC.
I shall now check for the dir type to see what changes we might require
in ext2/iomap code.

Thanks again for your review!

-ritesh

2023-11-21 06:08:52

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
> > instantly spot anything that relies on them - you are just a lot more
> > likely to hit an -ENOSPC from ->map_blocks now.
>
> Which is also true with existing code no? If the block reservation is
> not done at the write fault, writeback is likely to fail due to ENOSPC?

Yes. Not saying you should change this, I just want to make sure the
iomap code handles this fine. I think it does, but I'd rather be sure.

> Sure, make sense. Thanks!
> I can try and check if the the wrapper helps.

Let's wait until we have a few more conversions.

> > Did yo run into issues in using the iomap based aops for the other uses
> > of ext2_aops, or are just trying to address the users one at a time?
>
> There are problems for e.g. for dir type in ext2. It uses the pagecache
> for dir. It uses buffer_heads and attaches them to folio->private.
> ...it uses block_write_begin/block_write_end() calls.
> Look for ext4_make_empty() -> ext4_prepare_chunk ->
> block_write_begin().
> Now during sync/writeback of the dirty pages (ext4_handle_dirsync()), we
> might take a iomap writeback path (if using ext2_file_aops for dir)
> which sees folio->private assuming it is "struct iomap_folio_state".
> And bad things will happen...

Oh, indeed, bufferheads again.

> Now we don't have an equivalent APIs in iomap for
> block_write_begin()/end() which the users can call for. Hence, Jan
> suggested to lets first convert ext2 regular file path to iomap as an RFC.

Yes, no problem. But maybe worth documenting in the commit log.

2023-11-21 06:15:57

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Christoph Hellwig <[email protected]> writes:

> On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
>> > instantly spot anything that relies on them - you are just a lot more
>> > likely to hit an -ENOSPC from ->map_blocks now.
>>
>> Which is also true with existing code no? If the block reservation is
>> not done at the write fault, writeback is likely to fail due to ENOSPC?
>
> Yes. Not saying you should change this, I just want to make sure the
> iomap code handles this fine. I think it does, but I'd rather be sure.
>

Sure. I can write a fstest to test the behavior.

>> Sure, make sense. Thanks!
>> I can try and check if the the wrapper helps.
>
> Let's wait until we have a few more conversions.
>

Sure.

>> > Did yo run into issues in using the iomap based aops for the other uses
>> > of ext2_aops, or are just trying to address the users one at a time?
>>
>> There are problems for e.g. for dir type in ext2. It uses the pagecache
>> for dir. It uses buffer_heads and attaches them to folio->private.
>> ...it uses block_write_begin/block_write_end() calls.
>> Look for ext4_make_empty() -> ext4_prepare_chunk ->
>> block_write_begin().
>> Now during sync/writeback of the dirty pages (ext4_handle_dirsync()), we
>> might take a iomap writeback path (if using ext2_file_aops for dir)
>> which sees folio->private assuming it is "struct iomap_folio_state".
>> And bad things will happen...
>
> Oh, indeed, bufferheads again.
>
>> Now we don't have an equivalent APIs in iomap for
>> block_write_begin()/end() which the users can call for. Hence, Jan
>> suggested to lets first convert ext2 regular file path to iomap as an RFC.
>
> Yes, no problem. But maybe worth documenting in the commit log.

Sure, I will update the commit log.

-ritesh

2023-11-22 12:31:18

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote:
> This patch converts ext2 regular file's buffered-io path to use iomap.
> - buffered-io path using iomap_file_buffered_write
> - DIO fallback to buffered-io now uses iomap_file_buffered_write
> - writeback path now uses a new aops - ext2_file_aops
> - truncate now uses iomap_truncate_page
> - mmap path of ext2 continues to use generic_file_vm_ops
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>

Looks nice and simple. Just one comment below:

> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> + return iomap_readahead(rac, &ext2_iomap_ops);
> +}

As Matthew noted, the return of void here looks strange.

> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> + struct inode *inode, loff_t offset)
> +{
> + if (offset >= wpc->iomap.offset &&
> + offset < wpc->iomap.offset + wpc->iomap.length)
> + return 0;
> +
> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + IOMAP_WRITE, &wpc->iomap, NULL);
> +}

So this will end up mapping blocks for writeback one block at a time if I'm
understanding things right because ext2_iomap_begin() never sets extent
larger than 'length' in the iomap result. Hence the wpc checking looks
pointless (and actually dangerous - see below). So this will probably get
more expensive than necessary by calling into ext2_get_blocks() for each
block? Perhaps we could first call ext2_iomap_begin() for reading with high
length to get as many mapped blocks as we can and if we find unmapped block
(should happen only for writes through mmap), we resort to what you have
here... Hum, but this will not work because of the races with truncate
which could remove blocks whose mapping we have cached in wpc. We can
safely provide a mapping under a folio only once it is locked or has
writeback bit set. XFS plays the revalidation sequence counter games
because of this so we'd have to do something similar for ext2. Not that I'd
care as much about ext2 writeback performance but it should not be that
hard and we'll definitely need some similar solution for ext4 anyway. Can
you give that a try (as a followup "performance improvement" patch).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-22 13:11:37

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> writeback bit set. XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).

Darrick has mentioned that he is looking into lifting more of the
validation sequence counter validation into iomap.

In the meantime I have a series here that at least maps multiple blocks
inside a folio in a single go, which might be worth trying with ext2 as
well:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks

2023-11-22 20:25:48

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Jan Kara <[email protected]> writes:

> On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote:
>> This patch converts ext2 regular file's buffered-io path to use iomap.
>> - buffered-io path using iomap_file_buffered_write
>> - DIO fallback to buffered-io now uses iomap_file_buffered_write
>> - writeback path now uses a new aops - ext2_file_aops
>> - truncate now uses iomap_truncate_page
>> - mmap path of ext2 continues to use generic_file_vm_ops
>>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>
> Looks nice and simple. Just one comment below:
>
>> +static void ext2_file_readahead(struct readahead_control *rac)
>> +{
>> + return iomap_readahead(rac, &ext2_iomap_ops);
>> +}
>
> As Matthew noted, the return of void here looks strange.
>

yes, I will fix it.

>> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>> + struct inode *inode, loff_t offset)
>> +{
>> + if (offset >= wpc->iomap.offset &&
>> + offset < wpc->iomap.offset + wpc->iomap.length)
>> + return 0;
>> +
>> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
>> + IOMAP_WRITE, &wpc->iomap, NULL);
>> +}
>
> So this will end up mapping blocks for writeback one block at a time if I'm
> understanding things right because ext2_iomap_begin() never sets extent
> larger than 'length' in the iomap result. Hence the wpc checking looks
> pointless (and actually dangerous - see below). So this will probably get
> more expensive than necessary by calling into ext2_get_blocks() for each
> block? Perhaps we could first call ext2_iomap_begin() for reading with high
> length to get as many mapped blocks as we can and if we find unmapped block
> (should happen only for writes through mmap), we resort to what you have
> here... Hum, but this will not work because of the races with truncate
> which could remove blocks whose mapping we have cached in wpc. We can
> safely provide a mapping under a folio only once it is locked or has
> writeback bit set. XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).
>

Yes, looking into ->map_blocks was on my todo list, since this RFC only
maps 1 block at a time which can be expensive. I missed adding that as a
comment in cover letter.

But also thanks for providing many details on the same. I am taking a
look at it and will get back with more details on how can we get this
working, as it will be anyway required for ext4 too.

-ritesh

2023-11-22 20:26:39

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Christoph Hellwig <[email protected]> writes:

> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> writeback bit set. XFS plays the revalidation sequence counter games
>> because of this so we'd have to do something similar for ext2. Not that I'd
>> care as much about ext2 writeback performance but it should not be that
>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> you give that a try (as a followup "performance improvement" patch).
>
> Darrick has mentioned that he is looking into lifting more of the
> validation sequence counter validation into iomap.
>
> In the meantime I have a series here that at least maps multiple blocks
> inside a folio in a single go, which might be worth trying with ext2 as
> well:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks

Sure, thanks for providing details. I will check this.

-ritesh

2023-11-23 04:10:00

by Darrick J. Wong

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote:
> > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> > > writeback bit set. XFS plays the revalidation sequence counter games
> > > because of this so we'd have to do something similar for ext2. Not that I'd
> > > care as much about ext2 writeback performance but it should not be that
> > > hard and we'll definitely need some similar solution for ext4 anyway. Can
> > > you give that a try (as a followup "performance improvement" patch).
> >
> > Darrick has mentioned that he is looking into lifting more of the
> > validation sequence counter validation into iomap.
>
> I think that was me, as part of aligning the writeback path with
> the ->iomap_valid() checks in the write path after we lock the folio
> we instantiated for the write.
>
> It's basically the same thing - once we have a locked folio, we have
> to check that the cached iomap is still valid before we use it for
> anything.
>
> I need to find the time to get back to that, though.

Heh, we probably both have been chatting with willy on and off about
iomap.

The particular idea I had is to add a u64 counter to address_space that
we can bump in the same places where we bump xfs_inode_fork::if_seq
right now.. ->iomap_begin would sample this address_space::i_mappingseq
counter (with locks held), and now buffered writes and writeback can
check iomap::mappingseq == address_space::i_mappingseq to decide if it's
time to revalidate.

Anyway, I'll have time to go play with that (and further purging of
function pointers) next week or whenever is "after I put out v28 of
online repair". ATM I have a rewrite of log intent recovery cooking
too, but that's going to need at least a week or two of recoveryloop
testing before I put that on the list.

--D

> -Dave.
> --
> Dave Chinner
> [email protected]
>

2023-11-23 07:02:21

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> I think that was me

No, it was Darrick. We talkd about a lot of things, but not this :)

> , as part of aligning the writeback path with
> the ->iomap_valid() checks in the write path after we lock the folio
> we instantiated for the write.
>
> It's basically the same thing - once we have a locked folio, we have
> to check that the cached iomap is still valid before we use it for
> anything.

Yes. Currently we do that in the wb ops ->map_blocks. This can get
called multiple times per folio, which is a bit silly. With the series
I just posted the link to we at least stop doing that if the folio
is mapped contiguously, which solves all practical cases, as the
sequence check is almost free compared to the actual block mapping.

For steps beyond that I'll reply to Darrick's mail.

2023-11-23 07:09:37

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> The particular idea I had is to add a u64 counter to address_space that
> we can bump in the same places where we bump xfs_inode_fork::if_seq
> right now.. ->iomap_begin would sample this address_space::i_mappingseq
> counter (with locks held), and now buffered writes and writeback can
> check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> time to revalidate.

So I think moving this to the VFS is probably a good idea, and I
actually argued for that when the sequence checking was first proposed.
We just have to be careful to be able to map things like the two
separate data and cow seq counts in XFS (or anything else complicated
in other file systems) to it.

> Anyway, I'll have time to go play with that (and further purging of
> function pointers)

Do we have anything where the function pointer overhead is actually
hurting us right now?

One thing I'd like to move to is to merge the iomap_begin and iomap_end
callbacks into one similar to willy's series from 2020. The big
benefit of that would be that (together with switching
write_cache_pages to an iterator model) that we could actually use
this single iterator callback also for writeback instead of
->map_blocks, which doesn't really work with the current begin/end
based iomap_iter as the folios actually written through
write_cache_pages might not be contiguous. Using the same mapping
callback would not only save some code duplication, but should also
allow us to nicely implement Dave's old idea to not dirty pages for
O_SYNC writes, but directly write them out. I did start prototyping
that in the last days, and iomap_begin vs map_blocks is currently
the biggest stumbling block.

2023-11-29 05:37:30

by Darrick J. Wong

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Wed, Nov 22, 2023 at 11:09:17PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> > The particular idea I had is to add a u64 counter to address_space that
> > we can bump in the same places where we bump xfs_inode_fork::if_seq
> > right now.. ->iomap_begin would sample this address_space::i_mappingseq
> > counter (with locks held), and now buffered writes and writeback can
> > check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> > time to revalidate.
>
> So I think moving this to the VFS is probably a good idea, and I
> actually argued for that when the sequence checking was first proposed.
> We just have to be careful to be able to map things like the two
> separate data and cow seq counts in XFS (or anything else complicated
> in other file systems) to it.

TBH I've been wondering what would happen if we bumped i_mappingseq on
updates of either data or cow fork instead of the shift+or'd thing that
we use now for writeback and/or pagecache write.

I suppose the nice thing about the current encodings is that we elide
revalidations when the cow fork changes but mapping isn't shared.

> > Anyway, I'll have time to go play with that (and further purging of
> > function pointers)
>
> Do we have anything where the function pointer overhead is actually
> hurting us right now?

Not that I know of, but moving to a direct call model means that the fs
would know based on the iomap_XXX_iter function signature whether or not
iomap needs a srcmap; and then it can modify its iomap_begin function
accordingly.

Right now all those rules aren't especially obvious or well documented.
Maybe I can convince myself that improved documentation will suffice to
eliminate Ted's confusion. :)

Also I haven't checked how much the indirect calls hurt.

> One thing I'd like to move to is to merge the iomap_begin and iomap_end
> callbacks into one similar to willy's series from 2020. The big

Got a link to that? I need my memory refreshed, having DROP TABLE MEM2020;
pretty please.

> benefit of that would be that (together with switching
> write_cache_pages to an iterator model) that we could actually use
> this single iterator callback also for writeback instead of
> ->map_blocks, which doesn't really work with the current begin/end
> based iomap_iter as the folios actually written through
> write_cache_pages might not be contiguous.

Ooh it'd benice to get rid of that parallel callbacks thing finally.

> Using the same mapping
> callback would not only save some code duplication, but should also
> allow us to nicely implement Dave's old idea to not dirty pages for
> O_SYNC writes, but directly write them out. I did start prototyping
> that in the last days, and iomap_begin vs map_blocks is currently
> the biggest stumbling block.

Neat! willy's been pushing me for that too.

--D

2023-11-29 06:33:08

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Tue, Nov 28, 2023 at 09:37:21PM -0800, Darrick J. Wong wrote:
> TBH I've been wondering what would happen if we bumped i_mappingseq on
> updates of either data or cow fork instead of the shift+or'd thing that
> we use now for writeback and/or pagecache write.
>
> I suppose the nice thing about the current encodings is that we elide
> revalidations when the cow fork changes but mapping isn't shared.

changed? But yes.

>
> > > Anyway, I'll have time to go play with that (and further purging of
> > > function pointers)
> >
> > Do we have anything where the function pointer overhead is actually
> > hurting us right now?
>
> Not that I know of, but moving to a direct call model means that the fs
> would know based on the iomap_XXX_iter function signature whether or not
> iomap needs a srcmap; and then it can modify its iomap_begin function
> accordingly.
>
> Right now all those rules aren't especially obvious or well documented.
> Maybe I can convince myself that improved documentation will suffice to
> eliminate Ted's confusion. :)

Well, with an iter model I think we can actually kill the srcmap
entirely, as we could be doing two separate overlapping iterations at
the same time. Which would probably be nice for large unaligned writes
as the write size won't be limited by the existing mapping only used
to read in the unaligned blocks at the beginning end end.

> > One thing I'd like to move to is to merge the iomap_begin and iomap_end
> > callbacks into one similar to willy's series from 2020. The big
>
> Got a link to that? I need my memory refreshed, having DROP TABLE MEM2020;
> pretty please.

https://lore.kernel.org/all/20200811205314.GF6107@magnolia/T/

2023-11-30 03:25:04

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Ritesh Harjani (IBM) <[email protected]> writes:

> Christoph Hellwig <[email protected]> writes:
>
>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>> writeback bit set. XFS plays the revalidation sequence counter games
>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>> care as much about ext2 writeback performance but it should not be that
>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>> you give that a try (as a followup "performance improvement" patch).
>>
>> Darrick has mentioned that he is looking into lifting more of the
>> validation sequence counter validation into iomap.
>>
>> In the meantime I have a series here that at least maps multiple blocks
>> inside a folio in a single go, which might be worth trying with ext2 as
>> well:
>>
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>
> Sure, thanks for providing details. I will check this.
>

So I took a look at this. Here is what I think -
So this is useful of-course when we have a large folio. Because
otherwise it's just one block at a time for each folio. This is not a
problem once FS buffered-io handling code moves to iomap (because we
can then enable large folio support to it).

However, this would still require us to pass a folio to ->map_blocks
call to determine the size of the folio (which I am not saying can't be
done but just stating my observations here).

Now coming to implementing validation seq check. I am hoping, it should
be straight forward at least for ext2, because it mostly just have to
protect against truncate operation (which can change the block mapping)...

...ok so here is the PoC for seq counter check for ext2. Next let me
try to see if we can lift this up from the FS side to iomap -

From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
From: "Ritesh Harjani (IBM)" <[email protected]>
Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks

This implements inode block seq counter (ib_seq) which helps in validating
whether the cached wpc (struct iomap_writepage_ctx) still has the valid
entries or not. Block mapping can get changed say due to a racing truncate.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/balloc.c | 1 +
fs/ext2/ext2.h | 6 ++++++
fs/ext2/inode.c | 51 +++++++++++++++++++++++++++++++++++++++++++-----
fs/ext2/super.c | 2 +-
4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index e124f3d709b2..e97040194da4 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
}

ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
+ ext2_inc_ib_seq(EXT2_I(inode));

do_more:
overflow = 0;
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 677a9ad45dcb..882c14d20183 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -663,6 +663,7 @@ struct ext2_inode_info {
struct rw_semaphore xattr_sem;
#endif
rwlock_t i_meta_lock;
+ unsigned int ib_seq;

/*
* truncate_mutex is for serialising ext2_truncate() against
@@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
return container_of(inode, struct ext2_inode_info, vfs_inode);
}

+static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
+{
+ WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
+}
+
/* balloc.c */
extern int ext2_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f4e0b9a93095..a5490d641c26 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
ext2_fsblk_t current_block = 0;
int ret = 0;

+ ext2_inc_ib_seq(EXT2_I(inode));
+
/*
* Here we try to allocate the requested multiple blocks at once,
* on a best-effort basis.
@@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
return mpage_writepages(mapping, wbc, ext2_get_block);
}

+struct ext2_writepage_ctx {
+ struct iomap_writepage_ctx ctx;
+ unsigned int ib_seq;
+};
+
+static inline
+struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
+{
+ return container_of(ctx, struct ext2_writepage_ctx, ctx);
+}
+
+static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
+ loff_t offset)
+{
+ if (offset < wpc->iomap.offset ||
+ offset >= wpc->iomap.offset + wpc->iomap.length)
+ return false;
+
+ if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
+ return false;
+
+ return true;
+}
+
static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
struct inode *inode, loff_t offset)
{
- if (offset >= wpc->iomap.offset &&
- offset < wpc->iomap.offset + wpc->iomap.length)
+ loff_t maxblocks = (loff_t)INT_MAX;
+ u8 blkbits = inode->i_blkbits;
+ u32 bno;
+ bool new, boundary;
+ int ret;
+
+ if (ext2_imap_valid(wpc, inode, offset))
return 0;

- return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+ EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
+
+ ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
+ &bno, &new, &boundary, 0);
+ if (ret < 0)
+ return ret;
+ /*
+ * ret can be 0 in case of a hole which is possible for mmaped writes.
+ */
+ ret = ret ? ret : 1;
+ return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
IOMAP_WRITE, &wpc->iomap, NULL);
}

@@ -984,9 +1025,9 @@ static const struct iomap_writeback_ops ext2_writeback_ops = {
static int ext2_file_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct iomap_writepage_ctx wpc = { };
+ struct ext2_writepage_ctx wpc = { };

- return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+ return iomap_writepages(mapping, wbc, &wpc.ctx, &ext2_writeback_ops);
}

static int
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 645ee6142f69..cd1d1678e35b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
#ifdef CONFIG_QUOTA
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
#endif
-
+ WRITE_ONCE(ei->ib_seq, 0);
return &ei->vfs_inode;
}

2.30.2

2023-11-30 04:23:08

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
> struct rw_semaphore xattr_sem;
> #endif
> rwlock_t i_meta_lock;
> + unsigned int ib_seq;

Surely i_blkseq? Almost everything in this struct is prefixed i_.

2023-11-30 04:30:18

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).

Yes.

> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).

XFS currently maps based on the underlyig reservation (delalloc extent)
and not the actual map size. This works because on-disk extents are
allocated as unwritten extents, and only the actual written part is
the converted. But if you only want to allocate blocks for the part
actually written you actually need to pass in the dirty range and not
just use the whole folio. This would be the incremental patch to do
that:

http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752

But unless your block allocator is very cheap doing what XFS does is
probably going to work much better.

> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap -

This looks good to me from a very superficial view. Dave is the expert
on this, though.

2023-11-30 04:38:10

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Matthew Wilcox <[email protected]> writes:

> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
>> index 677a9ad45dcb..882c14d20183 100644
>> --- a/fs/ext2/ext2.h
>> +++ b/fs/ext2/ext2.h
>> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>> struct rw_semaphore xattr_sem;
>> #endif
>> rwlock_t i_meta_lock;
>> + unsigned int ib_seq;
>
> Surely i_blkseq? Almost everything in this struct is prefixed i_.

Certainly!

-ritesh

2023-11-30 05:27:54

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Christoph Hellwig <[email protected]> writes:

> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> So I took a look at this. Here is what I think -
>> So this is useful of-course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
>
> Yes.
>
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
>
> XFS currently maps based on the underlyig reservation (delalloc extent)
> and not the actual map size. This works because on-disk extents are
> allocated as unwritten extents, and only the actual written part is
> the converted. But if you only want to allocate blocks for the part
> actually written you actually need to pass in the dirty range and not
> just use the whole folio. This would be the incremental patch to do
> that:

yes. dirty range indeed.

>
> http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752
>
> But unless your block allocator is very cheap doing what XFS does is
> probably going to work much better.
>
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap -
>
> This looks good to me from a very superficial view. Dave is the expert
> on this, though.

Sure.

2023-11-30 07:46:26

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Ritesh Harjani (IBM) <[email protected]> writes:

> Ritesh Harjani (IBM) <[email protected]> writes:
>
>> Christoph Hellwig <[email protected]> writes:
>>
>>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>>> writeback bit set. XFS plays the revalidation sequence counter games
>>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>>> care as much about ext2 writeback performance but it should not be that
>>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>>> you give that a try (as a followup "performance improvement" patch).

ok. So I am re-thinknig over this on why will a filesystem like ext2
would require sequence counter check. We don't have collapse range
or COW sort of operations, it is only the truncate which can race,
but that should be taken care by folio_lock. And even if the partial
truncate happens on a folio, since the logical to physical block mapping
never changes, it should not matter if the writeback wrote data to a
cached entry, right?

-ritesh

>>>
>>> Darrick has mentioned that he is looking into lifting more of the
>>> validation sequence counter validation into iomap.
>>>
>>> In the meantime I have a series here that at least maps multiple blocks
>>> inside a folio in a single go, which might be worth trying with ext2 as
>>> well:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>>
>> Sure, thanks for providing details. I will check this.
>>
>
> So I took a look at this. Here is what I think -
> So this is useful of-course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).
>
> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).
>
> Now coming to implementing validation seq check. I am hoping, it should
> be straight forward at least for ext2, because it mostly just have to
> protect against truncate operation (which can change the block mapping)...
>
> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap -
>
>
> From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
> From: "Ritesh Harjani (IBM)" <[email protected]>
> Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks
>
> This implements inode block seq counter (ib_seq) which helps in validating
> whether the cached wpc (struct iomap_writepage_ctx) still has the valid
> entries or not. Block mapping can get changed say due to a racing truncate.
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> fs/ext2/balloc.c | 1 +
> fs/ext2/ext2.h | 6 ++++++
> fs/ext2/inode.c | 51 +++++++++++++++++++++++++++++++++++++++++++-----
> fs/ext2/super.c | 2 +-
> 4 files changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index e124f3d709b2..e97040194da4 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
> }
>
> ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
> + ext2_inc_ib_seq(EXT2_I(inode));
>
> do_more:
> overflow = 0;
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
> struct rw_semaphore xattr_sem;
> #endif
> rwlock_t i_meta_lock;
> + unsigned int ib_seq;
>
> /*
> * truncate_mutex is for serialising ext2_truncate() against
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
> return container_of(inode, struct ext2_inode_info, vfs_inode);
> }
>
> +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
> +{
> + WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
> +}
> +
> /* balloc.c */
> extern int ext2_bg_has_super(struct super_block *sb, int group);
> extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index f4e0b9a93095..a5490d641c26 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
> ext2_fsblk_t current_block = 0;
> int ret = 0;
>
> + ext2_inc_ib_seq(EXT2_I(inode));
> +
> /*
> * Here we try to allocate the requested multiple blocks at once,
> * on a best-effort basis.
> @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
> return mpage_writepages(mapping, wbc, ext2_get_block);
> }
>
> +struct ext2_writepage_ctx {
> + struct iomap_writepage_ctx ctx;
> + unsigned int ib_seq;
> +};
> +
> +static inline
> +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
> +{
> + return container_of(ctx, struct ext2_writepage_ctx, ctx);
> +}
> +
> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
> + loff_t offset)
> +{
> + if (offset < wpc->iomap.offset ||
> + offset >= wpc->iomap.offset + wpc->iomap.length)
> + return false;
> +
> + if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
> + return false;
> +
> + return true;
> +}
> +
> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> struct inode *inode, loff_t offset)
> {
> - if (offset >= wpc->iomap.offset &&
> - offset < wpc->iomap.offset + wpc->iomap.length)
> + loff_t maxblocks = (loff_t)INT_MAX;
> + u8 blkbits = inode->i_blkbits;
> + u32 bno;
> + bool new, boundary;
> + int ret;
> +
> + if (ext2_imap_valid(wpc, inode, offset))
> return 0;
>
> - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
> +
> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> + &bno, &new, &boundary, 0);
> + if (ret < 0)
> + return ret;
> + /*
> + * ret can be 0 in case of a hole which is possible for mmaped writes.
> + */
> + ret = ret ? ret : 1;
> + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
> IOMAP_WRITE, &wpc->iomap, NULL);
> }
>
> @@ -984,9 +1025,9 @@ static const struct iomap_writeback_ops ext2_writeback_ops = {
> static int ext2_file_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - struct iomap_writepage_ctx wpc = { };
> + struct ext2_writepage_ctx wpc = { };
>
> - return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
> + return iomap_writepages(mapping, wbc, &wpc.ctx, &ext2_writeback_ops);
> }
>
> static int
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 645ee6142f69..cd1d1678e35b 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
> #ifdef CONFIG_QUOTA
> memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
> #endif
> -
> + WRITE_ONCE(ei->ib_seq, 0);
> return &ei->vfs_inode;
> }
>
>
> 2.30.2

2023-11-30 10:19:17

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <[email protected]> writes:
>
> > Ritesh Harjani (IBM) <[email protected]> writes:
> >
> >> Christoph Hellwig <[email protected]> writes:
> >>
> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >>>> care as much about ext2 writeback performance but it should not be that
> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >>>> you give that a try (as a followup "performance improvement" patch).
>
> ok. So I am re-thinknig over this on why will a filesystem like ext2
> would require sequence counter check. We don't have collapse range
> or COW sort of operations, it is only the truncate which can race,
> but that should be taken care by folio_lock. And even if the partial
> truncate happens on a folio, since the logical to physical block mapping
> never changes, it should not matter if the writeback wrote data to a
> cached entry, right?

Yes, so this is what I think I've already mentioned. As long as we map just
the block at the current offset (or a block under currently locked folio),
we are fine and we don't need any kind of sequence counter. But as soon as
we start caching any kind of mapping in iomap_writepage_ctx we need a way
to protect from races with truncate. So something like the sequence counter.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-30 11:00:19

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Jan Kara <[email protected]> writes:

> On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
>> Ritesh Harjani (IBM) <[email protected]> writes:
>>
>> > Ritesh Harjani (IBM) <[email protected]> writes:
>> >
>> >> Christoph Hellwig <[email protected]> writes:
>> >>
>> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> >>>> writeback bit set. XFS plays the revalidation sequence counter games
>> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
>> >>>> care as much about ext2 writeback performance but it should not be that
>> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> >>>> you give that a try (as a followup "performance improvement" patch).
>>
>> ok. So I am re-thinknig over this on why will a filesystem like ext2
>> would require sequence counter check. We don't have collapse range
>> or COW sort of operations, it is only the truncate which can race,
>> but that should be taken care by folio_lock. And even if the partial
>> truncate happens on a folio, since the logical to physical block mapping
>> never changes, it should not matter if the writeback wrote data to a
>> cached entry, right?
>
> Yes, so this is what I think I've already mentioned. As long as we map just
> the block at the current offset (or a block under currently locked folio),
> we are fine and we don't need any kind of sequence counter. But as soon as
> we start caching any kind of mapping in iomap_writepage_ctx we need a way
> to protect from races with truncate. So something like the sequence counter.
>

Why do we need to protect from the race with truncate, is my question here.
So, IMO, truncate will truncate the folio cache first before releasing the FS
blocks. Truncation of the folio cache and the writeback path are
protected using folio_lock()
Truncate will clear the dirty flag of the folio before
releasing the folio_lock() right, so writeback will not even proceed for
folios which are not marked dirty (even if we have a cached wpc entry for
which folio is released from folio cache).

Now coming to the stale cached wpc entry for which truncate is doing a
partial truncation. Say, truncate ended up calling
truncate_inode_partial_folio(). Now for such folio (it remains dirty
after partial truncation) (for which there is a stale cached wpc entry),
when writeback writes to the underlying stale block, there is no harm
with that right?

Also this will "only" happen for folio which was partially truncated.
So why do we need to have sequence counter for protecting against this
race is my question.

So could this be only needed when existing logical to physical block
mapping changes e.g. like COW or maybe collapse range?

-ritesh

2023-11-30 14:09:27

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> Ritesh Harjani (IBM) <[email protected]> writes:
> >>
> >> > Ritesh Harjani (IBM) <[email protected]> writes:
> >> >
> >> >> Christoph Hellwig <[email protected]> writes:
> >> >>
> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >>>> you give that a try (as a followup "performance improvement" patch).
> >>
> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
> >> would require sequence counter check. We don't have collapse range
> >> or COW sort of operations, it is only the truncate which can race,
> >> but that should be taken care by folio_lock. And even if the partial
> >> truncate happens on a folio, since the logical to physical block mapping
> >> never changes, it should not matter if the writeback wrote data to a
> >> cached entry, right?
> >
> > Yes, so this is what I think I've already mentioned. As long as we map just
> > the block at the current offset (or a block under currently locked folio),
> > we are fine and we don't need any kind of sequence counter. But as soon as
> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
> > to protect from races with truncate. So something like the sequence counter.
> >
>
> Why do we need to protect from the race with truncate, is my question here.
> So, IMO, truncate will truncate the folio cache first before releasing the FS
> blocks. Truncation of the folio cache and the writeback path are
> protected using folio_lock()
> Truncate will clear the dirty flag of the folio before
> releasing the folio_lock() right, so writeback will not even proceed for
> folios which are not marked dirty (even if we have a cached wpc entry for
> which folio is released from folio cache).
>
> Now coming to the stale cached wpc entry for which truncate is doing a
> partial truncation. Say, truncate ended up calling
> truncate_inode_partial_folio(). Now for such folio (it remains dirty
> after partial truncation) (for which there is a stale cached wpc entry),
> when writeback writes to the underlying stale block, there is no harm
> with that right?
>
> Also this will "only" happen for folio which was partially truncated.
> So why do we need to have sequence counter for protecting against this
> race is my question.

That's a very good question and it took me a while to formulate my "this
sounds problematic" feeling into a particular example :) We can still have
a race like:

write_cache_pages()
cache extent covering 0..1MB range
write page at offset 0k
truncate(file, 4k)
drops all relevant pages
frees fs blocks
pwrite(file, 4k, 4k)
creates dirty page in the page cache
writes page at offset 4k to a stale block

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-30 15:50:57

[permalink] [raw]

Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

Jan Kara <[email protected]> writes:

> On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
>> >> Ritesh Harjani (IBM) <[email protected]> writes:
>> >>
>> >> > Ritesh Harjani (IBM) <[email protected]> writes:
>> >> >
>> >> >> Christoph Hellwig <[email protected]> writes:
>> >> >>
>> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
>> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
>> >> >>>> care as much about ext2 writeback performance but it should not be that
>> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>> >> >>>> you give that a try (as a followup "performance improvement" patch).
>> >>
>> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
>> >> would require sequence counter check. We don't have collapse range
>> >> or COW sort of operations, it is only the truncate which can race,
>> >> but that should be taken care by folio_lock. And even if the partial
>> >> truncate happens on a folio, since the logical to physical block mapping
>> >> never changes, it should not matter if the writeback wrote data to a
>> >> cached entry, right?
>> >
>> > Yes, so this is what I think I've already mentioned. As long as we map just
>> > the block at the current offset (or a block under currently locked folio),
>> > we are fine and we don't need any kind of sequence counter. But as soon as
>> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
>> > to protect from races with truncate. So something like the sequence counter.
>> >
>>
>> Why do we need to protect from the race with truncate, is my question here.
>> So, IMO, truncate will truncate the folio cache first before releasing the FS
>> blocks. Truncation of the folio cache and the writeback path are
>> protected using folio_lock()
>> Truncate will clear the dirty flag of the folio before
>> releasing the folio_lock() right, so writeback will not even proceed for
>> folios which are not marked dirty (even if we have a cached wpc entry for
>> which folio is released from folio cache).
>>
>> Now coming to the stale cached wpc entry for which truncate is doing a
>> partial truncation. Say, truncate ended up calling
>> truncate_inode_partial_folio(). Now for such folio (it remains dirty
>> after partial truncation) (for which there is a stale cached wpc entry),
>> when writeback writes to the underlying stale block, there is no harm
>> with that right?
>>
>> Also this will "only" happen for folio which was partially truncated.
>> So why do we need to have sequence counter for protecting against this
>> race is my question.
>
> That's a very good question and it took me a while to formulate my "this
> sounds problematic" feeling into a particular example :) We can still have
> a race like:
>
> write_cache_pages()
> cache extent covering 0..1MB range
> write page at offset 0k
> truncate(file, 4k)
> drops all relevant pages
> frees fs blocks
> pwrite(file, 4k, 4k)
> creates dirty page in the page cache
> writes page at offset 4k to a stale block

:) Nice!
Truncate followed by an append write could cause this race with writeback
happening in parallel. And this might cause data corruption.

Thanks again for clearly explaining the race :)

So I think just having a seq. counter (no other locks required), should
be ok to prevent this race. Since mostly what we are interested in is
whether the stale cached block mapping got changed for the folio which
writeback is going to continue writing to.

i.e. the race only happens when 2 of these operation happen while we
have a cached block mapping for a folio -
- a new physical block got allocated for the same logical offset to
which the previous folio belongs to
- the folio got re-instatiated in the page cache as dirty with the new
physical block mapping at the same logical offset of the file.

Now when the writeback continues, it will iterate over the same
dirty folio in question, lock it, check for ->map_blocks which will
notice a changed seq counter and do ->get_blocks again and then
we do submit_bio() to the right physical block.

So, it should be ok, if we simply update the seq counter everytime a
block is allocated or freed for a given inode... because when we
check the seq counter, we know the folio is locked and the other
operation of re-instating a new physical block mapping for a given folio
can also only happen while under a folio lock.

OR, one other approach to this can be...

IIUC, for a new block mapping for the same folio, the folio can be made to
get invalidated once in between right? So should this be handled in folio
cache somehow to know if for a given folio the underlying mapping might
have changed for which iomap should again query ->map_blocks() ?
... can that also help against unnecessary re-validations against cached
block mappings?

Thoughts?

-ritesh

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2023-11-30 16:02:10