2022-03-17 21:49:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote:
> On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <[email protected]> wrote:
> >
> > So how about we do something like this:
> >
> > - Make folio_start_writeback() and set_page_writeback() return void,
> > fixing up AFS and NFS.
> > - Add a folio_wait_start_writeback() to use in the VFS
> > - Remove the calls to set_page_writeback() in the filesystems
>
> That sounds lovely, but it does worry me a bit. Not just the odd
> 'keepwrite' thing, but also the whole ordering between the folio bit
> and the tagging bits. Does the ordering possibly matter?

I wouldn't change the ordering of setting the xarray bits and the
writeback flag; they'd just be set a little earlier. It'd all be done
while the page was still locked. But you're right, there's lots of
subtle interactions here.

> That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in
> only one place (the folio version isn't used at all):
>
> ext4_writepage():
>
> ext4_walk_page_buffers() fails:
> redirty_page_for_writepage(wbc, page);
> keep_towrite = true;
> ext4_bio_write_page().
>
> which just looks odd. Why does it even try to continue to do the
> writepage when the page buffer thing has failed?
>
> In the regular write path (ie ext4_write_begin()), a
> ext4_walk_page_buffers() failure is fatal or causes a retry). Why is
> ext4_writepage() any different? Particularly since it wants to keep
> the page dirty, then trying to do the writeback just seems wrong.
>
> So this code is all a bit odd, I suspect there are decades of "people
> continued to do what they historically did" changes, and it is all
> worrisome.

I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in
ordered mode"). Fortunately, we have a documented test for this,
generic/127, so we'll know if we've broken it.


2022-03-17 23:05:29

by Dave Chinner

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Thu, Mar 17, 2022 at 09:16:20PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote:
> > On Thu, Mar 17, 2022 at 8:04 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > So how about we do something like this:
> > >
> > > - Make folio_start_writeback() and set_page_writeback() return void,
> > > fixing up AFS and NFS.
> > > - Add a folio_wait_start_writeback() to use in the VFS
> > > - Remove the calls to set_page_writeback() in the filesystems
> >
> > That sounds lovely, but it does worry me a bit. Not just the odd
> > 'keepwrite' thing, but also the whole ordering between the folio bit
> > and the tagging bits. Does the ordering possibly matter?
>
> I wouldn't change the ordering of setting the xarray bits and the
> writeback flag; they'd just be set a little earlier. It'd all be done
> while the page was still locked. But you're right, there's lots of
> subtle interactions here.
>
> > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in
> > only one place (the folio version isn't used at all):
> >
> > ext4_writepage():
> >
> > ext4_walk_page_buffers() fails:
> > redirty_page_for_writepage(wbc, page);
> > keep_towrite = true;
> > ext4_bio_write_page().
> >
> > which just looks odd. Why does it even try to continue to do the
> > writepage when the page buffer thing has failed?
> >
> > In the regular write path (ie ext4_write_begin()), a
> > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is
> > ext4_writepage() any different? Particularly since it wants to keep
> > the page dirty, then trying to do the writeback just seems wrong.
> >
> > So this code is all a bit odd, I suspect there are decades of "people
> > continued to do what they historically did" changes, and it is all
> > worrisome.
>
> I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in
> ordered mode"). Fortunately, we have a documented test for this,
> generic/127, so we'll know if we've broken it.

Looks like a footgun. ext4 needs the keepwrite stuff for block size <
page size, in the case where a page has both written and
delalloc/unwritten buffers on it. In that case ext4_writepage tries
to write just the written blocks and leave the dealloc/unwritten
buffers alone because it can't do allocation in ->writepage context.

I say footgun, because the nested ->writepage call that needs
keepwrite comes a from journal stop context in the high level
->writepages context that is doing allocation that will allow the
entire page to be written. i.e. it seems a bit silly to be
triggering partial page writeback that skips delalloc/unwritten
extents, but then needs special awareness of higher level IO that is
in progress that is currently doing the allocation that will allow
all the delalloc/unwritten extents on the page to also be written
back...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-03-18 17:41:22

by Jan Kara

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Thu 17-03-22 21:16:20, Matthew Wilcox wrote:
> On Thu, Mar 17, 2022 at 12:26:35PM -0700, Linus Torvalds wrote:
> > That whole "xyz_writeback_keepwrite()" thing seems odd. It's used in
> > only one place (the folio version isn't used at all):
> >
> > ext4_writepage():
> >
> > ext4_walk_page_buffers() fails:
> > redirty_page_for_writepage(wbc, page);
> > keep_towrite = true;
> > ext4_bio_write_page().
> >
> > which just looks odd. Why does it even try to continue to do the
> > writepage when the page buffer thing has failed?
> >
> > In the regular write path (ie ext4_write_begin()), a
> > ext4_walk_page_buffers() failure is fatal or causes a retry). Why is
> > ext4_writepage() any different? Particularly since it wants to keep
> > the page dirty, then trying to do the writeback just seems wrong.
> >
> > So this code is all a bit odd, I suspect there are decades of "people
> > continued to do what they historically did" changes, and it is all
> > worrisome.
>
> I found the commit: 1c8349a17137 ("ext4: fix data integrity sync in
> ordered mode"). Fortunately, we have a documented test for this,
> generic/127, so we'll know if we've broken it.

I agree with Dave that 'keep_towrite' thing is kind of self-inflicted
damage on the ext4 side (we need to write out some blocks underlying the
page but cannot write all from the transaction commit code, so we need to
keep xarray tags intact so that data integrity sync cannot miss the page).
Also it is no longer needed in the current default ext4 setup. But if you
have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered'
mount options, the hack is still needed AFAIK and we don't have a
reasonable way around it.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-03-21 07:24:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <[email protected]> wrote:
>
> I agree with Dave that 'keep_towrite' thing is kind of self-inflicted
> damage on the ext4 side (we need to write out some blocks underlying the
> page but cannot write all from the transaction commit code, so we need to
> keep xarray tags intact so that data integrity sync cannot miss the page).
> Also it is no longer needed in the current default ext4 setup. But if you
> have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered'
> mount options, the hack is still needed AFAIK and we don't have a
> reasonable way around it.

I assume you meant 'dioread_lock'.

Which seems to be the default (even if 'data=ordered' is not).

Anyway - if it's not a problem for any current default setting, maybe
the solution is to warn about this case and turn it off?

IOW, we could simply warn about "data=ordered is no longer supported"
and turn it into data=journal.

Obviously *only* do this for the case of "blocksize < PAGE_SIZE".

If this ext4 thing is (a) obsolete and (b) causes VFS-level problems
that nobody else has, I really think we'd be much better off disabling
it than trying to work with it.

Linus

2022-03-21 08:23:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Fri, Mar 18, 2022 at 11:56:04AM -0700, Linus Torvalds wrote:
> On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <[email protected]> wrote:
> >
> > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted
> > damage on the ext4 side (we need to write out some blocks underlying the
> > page but cannot write all from the transaction commit code, so we need to
> > keep xarray tags intact so that data integrity sync cannot miss the page).
> > Also it is no longer needed in the current default ext4 setup. But if you
> > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered'
> > mount options, the hack is still needed AFAIK and we don't have a
> > reasonable way around it.
>
> I assume you meant 'dioread_lock'.
>
> Which seems to be the default (even if 'data=ordered' is not).

That's not quite right. data=ordered is the default, and that has
been the case since ext3 was introduced.

"dioread_lock" was the default in the days of ext3; "dioread_nolock"
was added to allow parallel Direct I/O reads (hence "nolock"). A
while back, we tried to make dioread_nolock the default since it tends
improve performance for workloads that have a mix of writes that
should be affected by fsyncs, and those that shouldn't.

Howevver, we had to revert that change when blocksize < pagesize to
work around a problem found on Power machines where "echo 3 >
drop_caches" on the host appears to cause file system corruptions on
the guest. (Commit 626b035b816b "ext4: don't set dioread_nolock by
default for blocksize < pagesize").

> IOW, we could simply warn about "data=ordered is no longer supported"
> and turn it into data=journal.
>
> Obviously *only* do this for the case of "blocksize < PAGE_SIZE".

Actiavelly, we've discussed a number of times doing the reverse ---
removing data=journal entirely, since it's (a) rarely used, and (b)
data=journal disables a bunch of optimizations, including
preallocation, and so its performance is pretty awful for most
workloads. The main reason why haven't until now is that we believe
there is a small number of people who do find data=journal useful for
their workload, and at least _so_ far it's not been that hard to keep
it limping along --- although in most cases, data=journal doesn't get
supported for new features or performance optimizations, and it
definitely does note get as much testing.


So the thing that I've been waiting to do for a while is to replace
the whole data=ordered vs data=writeback and dioread_nolock and
dioread_lock is a complete reworking of the ext4 buffered writeback
path, where we write the data blocks *first*, and only then update the
ext4 metadata.

Historically, going as far back as ext2, we've always allocated data
blocks and updatted the metadata blocks, and only then updated the
buffer or page cache for the data blocks. All of the complexities
around data=ordered, data=writeback, dioread_nolock, etc., is because
we haven't done the fundamental work of reversing the order in which
we do buffered writeback. What we *should* be doing is:

*) Determining where the new allocated data blockblocks should be, and
preventing those blocks from being used for any other purposes, but
*not* updating the file system metadata to reflect that change.

*) Submit the data block write

*) On write completion, update the metadata blocks in a kernel thread.

Over time, we've been finding more and more reasons why I need to do
this work, so it's something I'm going to have to prioritize in the
next few months.

Cheerse,

- Ted

2022-03-30 21:19:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()

On Sat, Mar 19, 2022 at 12:23:04PM -0400, Theodore Ts'o wrote:
> So the thing that I've been waiting to do for a while is to replace
> the whole data=ordered vs data=writeback and dioread_nolock and
> dioread_lock is a complete reworking of the ext4 buffered writeback
> path, where we write the data blocks *first*, and only then update the
> ext4 metadata.

> *) Determining where the new allocated data blockblocks should be, and
> preventing those blocks from being used for any other purposes, but
> *not* updating the file system metadata to reflect that change.
>
> *) Submit the data block write
>
> *) On write completion, update the metadata blocks in a kernel thread.

I think that would be easily done by switching to the iomap buffered
I/O code, which is very much built around that model.