2023-03-08 14:25:39

by Jan Kara

[permalink] [raw]
Subject: [PATCH] ext4: Fix warnings when freezing filesystem with journaled data

Test generic/390 in data=journal mode often triggers a warning that
ext4_do_writepages() tries to start a transaction on frozen filesystem.
This happens because although all dirty data is properly written, jbd2
checkpointing code writes data through submit_bh() and as a result only
buffer dirty bits are cleared but page dirty bits stay set. Later when
the filesystem is frozen, writeback code comes, tries to write
supposedly dirty pages and the warning triggers. Fix the problem by
calling sync_filesystem() once more after flushing the whole journal to
clear stray page dirty bits.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 15 ++++++++++++++-
fs/ext4/super.c | 11 +++++++++++
2 files changed, 25 insertions(+), 1 deletion(-)

This patch fixes warnings for generic/390 test. Admittedly it is a bit of a
hack and the right fix is to change jbd2 code to avoid leaving stray page dirty
bits but that is actually surprisingly difficult to do due to locking
constraints without regressing metadata intensive workloads. Applies on top of
my data=journal cleanup series.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4a45d320fda8..d86efa3d959d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2410,6 +2410,7 @@ static int mpage_journal_page_buffers(handle_t *handle,
static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
struct address_space *mapping = mpd->inode->i_mapping;
+ struct super_block *sb = mpd->inode->i_sb;
struct folio_batch fbatch;
unsigned int nr_folios;
pgoff_t index = mpd->first_page;
@@ -2427,7 +2428,15 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
else
tag = PAGECACHE_TAG_DIRTY;

- if (ext4_should_journal_data(mpd->inode)) {
+ /*
+ * Start a transaction for writeback of journalled data. We don't start
+ * start the transaction if the filesystem is frozen. In that case we
+ * should not have any dirty data to write anymore but possibly there
+ * are stray page dirty bits left by the checkpointing code so this
+ * loop clears them.
+ */
+ if (ext4_should_journal_data(mpd->inode) &&
+ sb->s_writers.frozen >= SB_FREEZE_FS) {
handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,
bpp);
if (IS_ERR(handle))
@@ -2520,12 +2529,16 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
*/
if (!mpd->can_map) {
if (ext4_page_nomap_can_writeout(&folio->page)) {
+ WARN_ON_ONCE(sb->s_writers.frozen ==
+ SB_FREEZE_COMPLETE);
err = mpage_submit_page(mpd, &folio->page);
if (err < 0)
goto out;
}
/* Pending dirtying of journalled data? */
if (PageChecked(&folio->page)) {
+ WARN_ON_ONCE(sb->s_writers.frozen >=
+ SB_FREEZE_FS);
err = mpage_journal_page_buffers(handle,
mpd, &folio->page);
if (err < 0)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 88f7b8a88c76..8cdf1a4e0011 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6259,6 +6259,17 @@ static int ext4_freeze(struct super_block *sb)
if (error < 0)
goto out;

+ /*
+ * Do another sync. We really should not have any dirty data
+ * anymore but our checkpointing code does not clear page dirty
+ * bits due to locking constraints so writeback still can get
+ * started for inodes with journalled data which triggers
+ * annoying warnings.
+ */
+ error = sync_filesystem(sb);
+ if (error < 0)
+ goto out;
+
/* Journal blocked and flushed, clear needs_recovery flag. */
ext4_clear_feature_journal_needs_recovery(sb);
if (ext4_orphan_file_empty(sb))
--
2.35.3



2023-03-15 16:17:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix warnings when freezing filesystem with journaled data

On Wed, 8 Mar 2023 15:25:28 +0100, Jan Kara wrote:
> Test generic/390 in data=journal mode often triggers a warning that
> ext4_do_writepages() tries to start a transaction on frozen filesystem.
> This happens because although all dirty data is properly written, jbd2
> checkpointing code writes data through submit_bh() and as a result only
> buffer dirty bits are cleared but page dirty bits stay set. Later when
> the filesystem is frozen, writeback code comes, tries to write
> supposedly dirty pages and the warning triggers. Fix the problem by
> calling sync_filesystem() once more after flushing the whole journal to
> clear stray page dirty bits.
>
> [...]

Applied, thanks!

[1/1] ext4: Fix warnings when freezing filesystem with journaled data
commit: 18b7f4107219ef12898b2ee77b8fb6de8887d1b7

Best regards,
--
Theodore Ts'o <[email protected]>

2023-03-19 18:36:25

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix warnings when freezing filesystem with journaled data

Hi Jan,

On Wed, Mar 08, 2023 at 03:25:28PM +0100, Jan Kara wrote:
> Test generic/390 in data=journal mode often triggers a warning that
> ext4_do_writepages() tries to start a transaction on frozen filesystem.
> This happens because although all dirty data is properly written, jbd2
> checkpointing code writes data through submit_bh() and as a result only
> buffer dirty bits are cleared but page dirty bits stay set. Later when
> the filesystem is frozen, writeback code comes, tries to write
> supposedly dirty pages and the warning triggers. Fix the problem by
> calling sync_filesystem() once more after flushing the whole journal to
> clear stray page dirty bits.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/inode.c | 15 ++++++++++++++-
> fs/ext4/super.c | 11 +++++++++++
> 2 files changed, 25 insertions(+), 1 deletion(-)
>
> This patch fixes warnings for generic/390 test. Admittedly it is a bit of a
> hack and the right fix is to change jbd2 code to avoid leaving stray page dirty
> bits but that is actually surprisingly difficult to do due to locking
> constraints without regressing metadata intensive workloads. Applies on top of
> my data=journal cleanup series.
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4a45d320fda8..d86efa3d959d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2410,6 +2410,7 @@ static int mpage_journal_page_buffers(handle_t *handle,
> static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> {
> struct address_space *mapping = mpd->inode->i_mapping;
> + struct super_block *sb = mpd->inode->i_sb;
> struct folio_batch fbatch;
> unsigned int nr_folios;
> pgoff_t index = mpd->first_page;
> @@ -2427,7 +2428,15 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> else
> tag = PAGECACHE_TAG_DIRTY;
>
> - if (ext4_should_journal_data(mpd->inode)) {
> + /*
> + * Start a transaction for writeback of journalled data. We don't start
> + * start the transaction if the filesystem is frozen. In that case we
> + * should not have any dirty data to write anymore but possibly there
> + * are stray page dirty bits left by the checkpointing code so this
> + * loop clears them.
> + */
> + if (ext4_should_journal_data(mpd->inode) &&
> + sb->s_writers.frozen >= SB_FREEZE_FS) {
> handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,
> bpp);
> if (IS_ERR(handle))
> @@ -2520,12 +2529,16 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> */
> if (!mpd->can_map) {
> if (ext4_page_nomap_can_writeout(&folio->page)) {
> + WARN_ON_ONCE(sb->s_writers.frozen ==
> + SB_FREEZE_COMPLETE);
> err = mpage_submit_page(mpd, &folio->page);
> if (err < 0)
> goto out;
> }
> /* Pending dirtying of journalled data? */
> if (PageChecked(&folio->page)) {
> + WARN_ON_ONCE(sb->s_writers.frozen >=
> + SB_FREEZE_FS);
> err = mpage_journal_page_buffers(handle,
> mpd, &folio->page);
> if (err < 0)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 88f7b8a88c76..8cdf1a4e0011 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -6259,6 +6259,17 @@ static int ext4_freeze(struct super_block *sb)
> if (error < 0)
> goto out;
>
> + /*
> + * Do another sync. We really should not have any dirty data
> + * anymore but our checkpointing code does not clear page dirty
> + * bits due to locking constraints so writeback still can get
> + * started for inodes with journalled data which triggers
> + * annoying warnings.
> + */
> + error = sync_filesystem(sb);
> + if (error < 0)
> + goto out;
> +
> /* Journal blocked and flushed, clear needs_recovery flag. */
> ext4_clear_feature_journal_needs_recovery(sb);
> if (ext4_orphan_file_empty(sb))

This patch causes a NULL dereference of 'handle' in mpage_journal_page_buffers()
when running the fsverity stress test generic/579 with data journaling enabled.
To reproduce:

kvm-xfstests -c ext4/data_journal generic/579

First, isn't the condition 'ext4_should_journal_data(mpd->inode) &&
sb->s_writers.frozen >= SB_FREEZE_FS' wrong? Shouldn't it check
'sb->s_writers.frozen < SB_FREEZE_FS' instead?

That fixes the crash by itself. However, the other reason this happens is that
the PG_checked bit is being used by both data=journal (to mark DMA-pinned dirty
pages?) and by fsverity (to mark Merkle tree pages that have been verified).

These two uses of PG_checked shouldn't actually collide, since fsverity files
are read-only, and FS_IOC_ENABLE_VERITY calls filemap_write_and_wait() before
finishing enabling fsverity. PG_checked isn't used for fsverity until after
fsverity has finished being enabled, at which point there should no longer be
any dirty pages. But, they are in fact colliding somehow. Is it expected that
dirty pages are still left after filemap_write_and_wait()?

- Eric

2023-03-21 12:06:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix warnings when freezing filesystem with journaled data

Hi Eric,

On Sun 19-03-23 11:36:17, Eric Biggers wrote:
> On Wed, Mar 08, 2023 at 03:25:28PM +0100, Jan Kara wrote:
> > Test generic/390 in data=journal mode often triggers a warning that
> > ext4_do_writepages() tries to start a transaction on frozen filesystem.
> > This happens because although all dirty data is properly written, jbd2
> > checkpointing code writes data through submit_bh() and as a result only
> > buffer dirty bits are cleared but page dirty bits stay set. Later when
> > the filesystem is frozen, writeback code comes, tries to write
> > supposedly dirty pages and the warning triggers. Fix the problem by
> > calling sync_filesystem() once more after flushing the whole journal to
> > clear stray page dirty bits.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/inode.c | 15 ++++++++++++++-
> > fs/ext4/super.c | 11 +++++++++++
> > 2 files changed, 25 insertions(+), 1 deletion(-)
> >
> > This patch fixes warnings for generic/390 test. Admittedly it is a bit of a
> > hack and the right fix is to change jbd2 code to avoid leaving stray page dirty
> > bits but that is actually surprisingly difficult to do due to locking
> > constraints without regressing metadata intensive workloads. Applies on top of
> > my data=journal cleanup series.
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 4a45d320fda8..d86efa3d959d 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2410,6 +2410,7 @@ static int mpage_journal_page_buffers(handle_t *handle,
> > static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> > {
> > struct address_space *mapping = mpd->inode->i_mapping;
> > + struct super_block *sb = mpd->inode->i_sb;
> > struct folio_batch fbatch;
> > unsigned int nr_folios;
> > pgoff_t index = mpd->first_page;
> > @@ -2427,7 +2428,15 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> > else
> > tag = PAGECACHE_TAG_DIRTY;
> >
> > - if (ext4_should_journal_data(mpd->inode)) {
> > + /*
> > + * Start a transaction for writeback of journalled data. We don't start
> > + * start the transaction if the filesystem is frozen. In that case we
> > + * should not have any dirty data to write anymore but possibly there
> > + * are stray page dirty bits left by the checkpointing code so this
> > + * loop clears them.
> > + */
> > + if (ext4_should_journal_data(mpd->inode) &&
> > + sb->s_writers.frozen >= SB_FREEZE_FS) {
> > handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,
> > bpp);
> > if (IS_ERR(handle))
> > @@ -2520,12 +2529,16 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> > */
> > if (!mpd->can_map) {
> > if (ext4_page_nomap_can_writeout(&folio->page)) {
> > + WARN_ON_ONCE(sb->s_writers.frozen ==
> > + SB_FREEZE_COMPLETE);
> > err = mpage_submit_page(mpd, &folio->page);
> > if (err < 0)
> > goto out;
> > }
> > /* Pending dirtying of journalled data? */
> > if (PageChecked(&folio->page)) {
> > + WARN_ON_ONCE(sb->s_writers.frozen >=
> > + SB_FREEZE_FS);
> > err = mpage_journal_page_buffers(handle,
> > mpd, &folio->page);
> > if (err < 0)
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 88f7b8a88c76..8cdf1a4e0011 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -6259,6 +6259,17 @@ static int ext4_freeze(struct super_block *sb)
> > if (error < 0)
> > goto out;
> >
> > + /*
> > + * Do another sync. We really should not have any dirty data
> > + * anymore but our checkpointing code does not clear page dirty
> > + * bits due to locking constraints so writeback still can get
> > + * started for inodes with journalled data which triggers
> > + * annoying warnings.
> > + */
> > + error = sync_filesystem(sb);
> > + if (error < 0)
> > + goto out;
> > +
> > /* Journal blocked and flushed, clear needs_recovery flag. */
> > ext4_clear_feature_journal_needs_recovery(sb);
> > if (ext4_orphan_file_empty(sb))
>
> This patch causes a NULL dereference of 'handle' in mpage_journal_page_buffers()
> when running the fsverity stress test generic/579 with data journaling enabled.
> To reproduce:
>
> kvm-xfstests -c ext4/data_journal generic/579
>
> First, isn't the condition 'ext4_should_journal_data(mpd->inode) &&
> sb->s_writers.frozen >= SB_FREEZE_FS' wrong? Shouldn't it check
> 'sb->s_writers.frozen < SB_FREEZE_FS' instead?

Indeed, I wonder how this has passed all the other testing I and Ted did :)
I'll send a fix.

> That fixes the crash by itself. However, the other reason this happens is that
> the PG_checked bit is being used by both data=journal (to mark DMA-pinned dirty
> pages?) and by fsverity (to mark Merkle tree pages that have been verified).

Yup.

> These two uses of PG_checked shouldn't actually collide, since fsverity files
> are read-only, and FS_IOC_ENABLE_VERITY calls filemap_write_and_wait() before
> finishing enabling fsverity. PG_checked isn't used for fsverity until after
> fsverity has finished being enabled, at which point there should no longer be
> any dirty pages. But, they are in fact colliding somehow. Is it expected that
> dirty pages are still left after filemap_write_and_wait()?

Sadly filemap_write_and_wait() is not enough to clean pages from a file
with journalled data. For couple of reasons:

1) Pages can be part of the running transaction so until we commit that
(which filemap_write_and_wait() does not do, only whole ext4_sync_file()
does that) they are not really stable.

2) Once the transaction commits, pages get marked as dirty for
checkpointing code to pick them up.

So to fully clean journaled data pages you need a sequence like:

filemap_write_and_wait() (to add PageChecked pages to the journal)
ext4_force_commit() (to commit the transaction with the pages)
filemap_write_and_wait() (to clean pages that need checkpointing)

I know this sucks and there are quite a few places that need special
handling of journalled data because of this. Ideally we'd hide all these
details in ext4_writepages() but the hard part is how to do this without
making performance go completely out of the window. But I think now that
the journalled data writeout path is simplified, we can handle it
reasonably.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-03-23 14:23:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix warnings when freezing filesystem with journaled data

Jan, have you had a chance to look into finding a fixup to the problem
that Eric identified? I'm not sure how I had missed this in my
testing as well. :-/

If fixing this is going to take a while, I can drop this patch for
now, especially since you've pointed out that the warning is a false
positive.

Thanks,

- Ted