2023-03-29 15:53:44

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/13 v1] ext4: Make ext4_writepages() write all journalled data

Hello,

These patches modify ext4 so that whenever inode with journalled data is
written back we make sure we writeout all the dirty data the inode has.
Previously this was not the case as pages with journalled data that were still
part of running (or committing) transaction appeared as clean to the writeback
code. As a result we can remove workarounds for inodes with journalled data
from multiple places.

In particular, we make sure a page is marked as dirty as soon as it has any
data to write (even though it is part of the running transaction) and in
ext4_writepages() we make sure appropriate transaction is committed (in
WB_SYNC_ALL mode) and then go writing back pages (effectively performing a
checkpoint for the inode). Thus after WB_SYNC_ALL writeback all pages for the
inode are guaranteed to be clean (providing there were no new pages dirtied)
the same way this is guaranteed for non-journaled data.

The downside of this change is that workloads that use journalled data and
frequently redirty pages and call fsync(2) will see more IO happening (more
pages will get checkpointed). But given low amount of users of data=journal
mode and their nature, I deem the code simplification worth it.

Honza


2023-03-29 15:53:45

by Jan Kara

[permalink] [raw]
Subject: [PATCH 01/13] jdb2: Don't refuse invalidation of already invalidated buffers

When invalidating buffers under the partial tail page,
jbd2_journal_invalidate_folio() returns -EBUSY if the buffer is part of
the committing transaction as we cannot safely modify buffer state.
However if the buffer is already invalidated (due to previous
invalidation attempts from ext4_wait_for_tail_page_commit()), there's
nothing to do and there's no point in returning -EBUSY. This fixes
occasional warnings from ext4_journalled_invalidate_folio() triggered by
generic/051 fstest when blocksize < pagesize.

Fixes: 53e872681fed ("ext4: fix deadlock in journal_unmap_buffer()")
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/transaction.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 15de1385012e..18611241f451 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2387,6 +2387,9 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
spin_unlock(&jh->b_state_lock);
write_unlock(&journal->j_state_lock);
jbd2_journal_put_journal_head(jh);
+ /* Already zapped buffer? Nothing to do... */
+ if (!bh->b_bdev)
+ return 0;
return -EBUSY;
}
/*
--
2.35.3

2023-03-29 15:53:44

by Jan Kara

[permalink] [raw]
Subject: [PATCH 03/13] ext4: Keep pages with journalled data dirty

Currently we clear page dirty bit when we checkpoint some buffers from a
page with journalled data or when we perform delayed dirtying of a page
in ext4_writepages(). In a quest to simplify handling of journalled data
we want to keep page dirty as long as it has either buffers to
checkpoint or journalled dirty data. So make sure to keep page dirty in
ext4_writepages() if it still has journalled data attached to it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 1 -
fs/ext4/page-io.c | 6 ++++--
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 73e24e61fdd2..78e29da70af7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2387,7 +2387,6 @@ static int mpage_journal_page_buffers(handle_t *handle,
int len;

ClearPageChecked(page);
- clear_page_dirty_for_io(page);
mpd->wbc->nr_to_write--;

if (page->index == size >> PAGE_SHIFT &&
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 8703fd732abb..23b29a50b159 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -484,9 +484,11 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
* to redirty the page and keep TOWRITE tag so that
* racing WB_SYNC_ALL writeback does not skip the page.
* This happens e.g. when doing writeout for
- * transaction commit.
+ * transaction commit or when journalled data is not
+ * yet committed.
*/
- if (buffer_dirty(bh)) {
+ if (buffer_dirty(bh) ||
+ (buffer_jbd(bh) && buffer_jbddirty(bh))) {
if (!PageDirty(page))
redirty_page_for_writepage(wbc, page);
keep_towrite = true;
--
2.35.3

2023-03-29 15:53:45

by Jan Kara

[permalink] [raw]
Subject: [PATCH 06/13] ext4: Drop special handling of journalled data from ext4_sync_file()

Now that ext4_writepages() make sure all pages with journalled data are
stable on disk, we don't need special handling of journalled data in
ext4_sync_file().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/fsync.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 027a7d7037a0..f65fdb27ce14 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -153,23 +153,12 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
goto out;

/*
- * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
* Metadata is in the journal, we wait for proper transaction to
* commit here.
- *
- * data=journal:
- * filemap_fdatawrite won't do anything (the buffers are clean).
- * ext4_force_commit will write the file data into the journal and
- * will wait on that.
- * filemap_fdatawait() will encounter a ton of newly-dirtied pages
- * (they were dirtied by commit). But that's OK - the blocks are
- * safe in-journal, which is all fsync() needs to ensure.
*/
if (!sbi->s_journal)
ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
- else if (ext4_should_journal_data(inode))
- ret = ext4_force_commit(inode->i_sb);
else
ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

--
2.35.3

2023-03-29 15:53:49

by Jan Kara

[permalink] [raw]
Subject: [PATCH 11/13] ext4: Simplify handling of journalled data in ext4_bmap()

Now that ext4_writepages() gets journalled data into its final location
we just use filemap_write_and_wait() instead of special handling of
journalled data in ext4_bmap(). We can also drop EXT4_STATE_JDATA flag
as it is not used anymore.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 1 -
fs/ext4/inode.c | 44 +++++---------------------------------------
2 files changed, 5 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9b2cfc32cf78..7a6e03da5a3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1887,7 +1887,6 @@ static inline void ext4_simulate_fail_bh(struct super_block *sb,
* Inode dynamic state flags
*/
enum {
- EXT4_STATE_JDATA, /* journaled data exists */
EXT4_STATE_NEW, /* inode is newly created */
EXT4_STATE_XATTR, /* has in-inode xattrs */
EXT4_STATE_NO_EXPAND, /* No space for expansion */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94bfb12aaa9e..3c6bce0afb20 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1408,7 +1408,6 @@ static int ext4_journalled_write_end(struct file *file,
}
if (!verity)
size_changed = ext4_update_inode_size(inode, pos + copied);
- ext4_set_inode_state(inode, EXT4_STATE_JDATA);
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
unlock_page(page);
put_page(page);
@@ -2334,8 +2333,6 @@ static int ext4_journal_page_buffers(handle_t *handle, struct page *page,
ret = err;
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;

- ext4_set_inode_state(inode, EXT4_STATE_JDATA);
-
return ret;
}

@@ -3079,9 +3076,7 @@ int ext4_alloc_da_blocks(struct inode *inode)
static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
{
struct inode *inode = mapping->host;
- journal_t *journal;
sector_t ret = 0;
- int err;

inode_lock_shared(inode);
/*
@@ -3091,45 +3086,16 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
goto out;

if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
- test_opt(inode->i_sb, DELALLOC)) {
+ (test_opt(inode->i_sb, DELALLOC) ||
+ ext4_should_journal_data(inode))) {
/*
- * With delalloc we want to sync the file
- * so that we can make sure we allocate
- * blocks for file
+ * With delalloc or journalled data we want to sync the file so
+ * that we can make sure we allocate blocks for file and data
+ * is in place for the user to see it
*/
filemap_write_and_wait(mapping);
}

- if (EXT4_JOURNAL(inode) &&
- ext4_test_inode_state(inode, EXT4_STATE_JDATA)) {
- /*
- * This is a REALLY heavyweight approach, but the use of
- * bmap on dirty files is expected to be extremely rare:
- * only if we run lilo or swapon on a freshly made file
- * do we expect this to happen.
- *
- * (bmap requires CAP_SYS_RAWIO so this does not
- * represent an unprivileged user DOS attack --- we'd be
- * in trouble if mortal users could trigger this path at
- * will.)
- *
- * NB. EXT4_STATE_JDATA is not set on files other than
- * regular files. If somebody wants to bmap a directory
- * or symlink and gets confused because the buffer
- * hasn't yet been flushed to disk, they deserve
- * everything they get.
- */
-
- ext4_clear_inode_state(inode, EXT4_STATE_JDATA);
- journal = EXT4_JOURNAL(inode);
- jbd2_journal_lock_updates(journal);
- err = jbd2_journal_flush(journal, 0);
- jbd2_journal_unlock_updates(journal);
-
- if (err)
- goto out;
- }
-
ret = iomap_bmap(mapping, block, &ext4_iomap_ops);

out:
--
2.35.3

2023-03-29 15:53:55

by Jan Kara

[permalink] [raw]
Subject: [PATCH 09/13] ext4: Drop special handling of journalled data from ext4_evict_inode()

Now that ext4_writepages() makes sure journalled data is on stable
storage, write_inode_now() call in iput_final() is enough to make
pagecache pages with journalled data really clean (data committed and
checkpointed). So we can drop special handling of journalled data in
ext4_evict_inode().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 27 ---------------------------
1 file changed, 27 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3ab2d56b6840..94bfb12aaa9e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -179,33 +179,6 @@ void ext4_evict_inode(struct inode *inode)
if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
ext4_evict_ea_inode(inode);
if (inode->i_nlink) {
- /*
- * When journalling data dirty buffers are tracked only in the
- * journal. So although mm thinks everything is clean and
- * ready for reaping the inode might still have some pages to
- * write in the running transaction or waiting to be
- * checkpointed. Thus calling jbd2_journal_invalidate_folio()
- * (via truncate_inode_pages()) to discard these buffers can
- * cause data loss. Also even if we did not discard these
- * buffers, we would have no way to find them after the inode
- * is reaped and thus user could see stale data if he tries to
- * read them before the transaction is checkpointed. So be
- * careful and force everything to disk here... We use
- * ei->i_datasync_tid to store the newest transaction
- * containing inode's data.
- *
- * Note that directories do not have this problem because they
- * don't use page cache.
- */
- if (inode->i_ino != EXT4_JOURNAL_INO &&
- ext4_should_journal_data(inode) &&
- S_ISREG(inode->i_mode) && inode->i_data.nrpages) {
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
- tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
-
- jbd2_complete_transaction(journal, commit_tid);
- filemap_write_and_wait(&inode->i_data);
- }
truncate_inode_pages_final(&inode->i_data);

goto no_delete;
--
2.35.3

2023-03-29 15:53:55

by Jan Kara

[permalink] [raw]
Subject: [PATCH 08/13] ext4: Fix special handling of journalled data from extent zeroing

The handling of journalled data in ext4_zero_range() is incomplete. We
do not need to commit running transaction but we rather need to
checkpoint pages with journalled data. If we don't, journal tail can be
advanced beyond transaction containing the journalled data and if we
then crash before committing the transaction doing the zeroing we will
have inconsistent (too old) data in the file. Make sure file pages with
journalled data are properly checkpointed before removing them from the
page cache.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0b622ae29a73..e79c767cc5e0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4526,13 +4526,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,

trace_ext4_zero_range(inode, offset, len, mode);

- /* Call ext4_force_commit to flush all data in case of data=journal. */
- if (ext4_should_journal_data(inode)) {
- ret = ext4_force_commit(inode->i_sb);
- if (ret)
- return ret;
- }
-
/*
* Round up offset. This is not fallocate, we need to zero out
* blocks, so convert interior block aligned part of the range to
@@ -4616,6 +4609,20 @@ static long ext4_zero_range(struct file *file, loff_t offset,
filemap_invalidate_unlock(mapping);
goto out_mutex;
}
+
+ /*
+ * For journalled data we need to write (and checkpoint) pages
+ * before discarding page cache to avoid inconsitent data on
+ * disk in case of crash before zeroing trans is committed.
+ */
+ if (ext4_should_journal_data(inode)) {
+ ret = filemap_write_and_wait_range(mapping, start, end);
+ if (ret) {
+ filemap_invalidate_unlock(mapping);
+ goto out_mutex;
+ }
+ }
+
/* Now release the pages and zero block aligned part of pages */
truncate_pagecache_range(inode, start, end - 1);
inode->i_mtime = inode->i_ctime = current_time(inode);
--
2.35.3

2023-03-29 15:53:56

by Jan Kara

[permalink] [raw]
Subject: [PATCH 07/13] ext4: Drop special handling of journalled data from extent shifting operations

Now that filemap_write_and_wait() makes sure pages with journalled data
are safely on disk, ext4_collapse_range() and ext4_insert_range() do
not need special handling of journalled data.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 14 --------------
1 file changed, 14 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3559ea6b0781..0b622ae29a73 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5290,13 +5290,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
punch_start = offset >> EXT4_BLOCK_SIZE_BITS(sb);
punch_stop = (offset + len) >> EXT4_BLOCK_SIZE_BITS(sb);

- /* Call ext4_force_commit to flush all data in case of data=journal. */
- if (ext4_should_journal_data(inode)) {
- ret = ext4_force_commit(inode->i_sb);
- if (ret)
- return ret;
- }
-
inode_lock(inode);
/*
* There is no need to overlap collapse range with EOF, in which case
@@ -5443,13 +5436,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
offset_lblk = offset >> EXT4_BLOCK_SIZE_BITS(sb);
len_lblk = len >> EXT4_BLOCK_SIZE_BITS(sb);

- /* Call ext4_force_commit to flush all data in case of data=journal */
- if (ext4_should_journal_data(inode)) {
- ret = ext4_force_commit(inode->i_sb);
- if (ret)
- return ret;
- }
-
inode_lock(inode);
/* Currently just for extent based files */
if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
--
2.35.3

2023-03-29 15:54:12

by Jan Kara

[permalink] [raw]
Subject: [PATCH 02/13] ext4: Mark pages with journalled data dirty

Currently pages with journalled data written by write(2) or modified by
block zeroing during truncate(2) are not marked as dirty. They are
dirtied only once the transaction commits. This however makes writeback
code think inode has no pages to write and so ext4_writepages() is not
called to make pages with journalled data persistent. Mark pages with
journalled data dirty (similarly as it happens for writes through mmap)
so that writeback code knows about them and ext4_writepages() can do
what it needs to to the inode.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dbcc8b48c7ba..73e24e61fdd2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1003,6 +1003,18 @@ int ext4_walk_page_buffers(handle_t *handle, struct inode *inode,
return ret;
}

+/*
+ * Helper for handling dirtying of journalled data. We also mark the folio as
+ * dirty so that writeback code knows about this page (and inode) contains
+ * dirty data. ext4_writepages() then commits appropriate transaction to
+ * make data stable.
+ */
+static int ext4_dirty_journalled_data(handle_t *handle, struct buffer_head *bh)
+{
+ folio_mark_dirty(bh->b_folio);
+ return ext4_handle_dirty_metadata(handle, NULL, bh);
+}
+
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh)
{
@@ -1025,7 +1037,7 @@ int do_journal_get_write_access(handle_t *handle, struct inode *inode,
ret = ext4_journal_get_write_access(handle, inode->i_sb, bh,
EXT4_JTR_NONE);
if (!ret && dirty)
- ret = ext4_handle_dirty_metadata(handle, NULL, bh);
+ ret = ext4_dirty_journalled_data(handle, bh);
return ret;
}

@@ -1270,7 +1282,7 @@ static int write_end_fn(handle_t *handle, struct inode *inode,
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
set_buffer_uptodate(bh);
- ret = ext4_handle_dirty_metadata(handle, NULL, bh);
+ ret = ext4_dirty_journalled_data(handle, bh);
clear_buffer_meta(bh);
clear_buffer_prio(bh);
return ret;
@@ -1353,7 +1365,7 @@ static int ext4_write_end(struct file *file,
/*
* This is a private version of page_zero_new_buffers() which doesn't
* set the buffer to be dirty, since in data=journalled mode we need
- * to call ext4_handle_dirty_metadata() instead.
+ * to call ext4_dirty_journalled_data() instead.
*/
static void ext4_journalled_zero_new_buffers(handle_t *handle,
struct inode *inode,
@@ -3732,7 +3744,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
BUFFER_TRACE(bh, "zeroed end of block");

if (ext4_should_journal_data(inode)) {
- err = ext4_handle_dirty_metadata(handle, inode, bh);
+ err = ext4_dirty_journalled_data(handle, bh);
} else {
err = 0;
mark_buffer_dirty(bh);
--
2.35.3

2023-03-29 15:54:42

by Jan Kara

[permalink] [raw]
Subject: [PATCH 12/13] ext4: Update comment in mpage_prepare_extent_to_map()

Since filemap_write_and_wait() is now enough to get journalled data to
final location update the comment in mpage_prepare_extent_to_map().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c6bce0afb20..25a9e7586c50 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2490,11 +2490,10 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
* Just submit the page. For data=journal mode we
* first handle writeout of the page for checkpoint and
* only after that handle delayed page dirtying. This
- * is crutial so that forcing a transaction commit and
- * then calling filemap_write_and_wait() guarantees
- * current state of data is in its final location. Such
- * sequence is used for example by insert/collapse
- * range operations before discarding the page cache.
+ * makes sure current data is checkpointed to the final
+ * location before possibly journalling it again which
+ * is desirable when the page is frequently dirtied
+ * through a pin.
*/
if (!mpd->can_map) {
WARN_ON_ONCE(sb->s_writers.frozen ==
--
2.35.3

2023-03-29 15:54:42

by Jan Kara

[permalink] [raw]
Subject: [PATCH 04/13] ext4: Clear dirty bit from pages without data to write

With journalled data it can happen that checkpointing code will write
out page contents without clearing the page dirty bit. The logic in
ext4_page_nomap_can_writeout() then results in us never calling
mpage_submit_page() and thus clearing the dirty bit. Drop the
optimization with ext4_page_nomap_can_writeout() and just always call to
mpage_submit_page(). ext4_bio_write_page() knows when to redirty the
page and the additional clearing & setting of page dirty bit for ordered
mode writeout is not that expensive to jump through the hoops for it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 25 +++++--------------------
1 file changed, 5 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 78e29da70af7..85299c90b0f7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2342,19 +2342,6 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
MAX_WRITEPAGES_EXTENT_LEN + bpp - 1, bpp);
}

-/* Return true if the page needs to be written as part of transaction commit */
-static bool ext4_page_nomap_can_writeout(struct page *page)
-{
- struct buffer_head *bh, *head;
-
- bh = head = page_buffers(page);
- do {
- if (buffer_dirty(bh) && buffer_mapped(bh) && !buffer_delay(bh))
- return true;
- } while ((bh = bh->b_this_page) != head);
- return false;
-}
-
static int ext4_journal_page_buffers(handle_t *handle, struct page *page,
int len)
{
@@ -2539,13 +2526,11 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
* range operations before discarding the page cache.
*/
if (!mpd->can_map) {
- if (ext4_page_nomap_can_writeout(&folio->page)) {
- WARN_ON_ONCE(sb->s_writers.frozen ==
- SB_FREEZE_COMPLETE);
- err = mpage_submit_page(mpd, &folio->page);
- if (err < 0)
- goto out;
- }
+ WARN_ON_ONCE(sb->s_writers.frozen ==
+ SB_FREEZE_COMPLETE);
+ err = mpage_submit_page(mpd, &folio->page);
+ if (err < 0)
+ goto out;
/* Pending dirtying of journalled data? */
if (PageChecked(&folio->page)) {
WARN_ON_ONCE(sb->s_writers.frozen >=
--
2.35.3

2023-03-29 15:54:42

by Jan Kara

[permalink] [raw]
Subject: [PATCH 13/13] Revert "ext4: Fix warnings when freezing filesystem with journaled data"

After making ext4_writepages() properly clean all pages there is no need
for special treatment of filesystem freezing. Revert commit
e6c28a26b799c7640b77daff3e4a67808c74381c.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 15 +--------------
fs/ext4/super.c | 11 -----------
2 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 25a9e7586c50..5161221193f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2379,7 +2379,6 @@ static int mpage_journal_page_buffers(handle_t *handle,
static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
struct address_space *mapping = mpd->inode->i_mapping;
- struct super_block *sb = mpd->inode->i_sb;
struct folio_batch fbatch;
unsigned int nr_folios;
pgoff_t index = mpd->first_page;
@@ -2399,15 +2398,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)

mpd->map.m_len = 0;
mpd->next_page = index;
- /*
- * Start a transaction for writeback of journalled data. We don't start
- * the transaction if the filesystem is frozen. In that case we
- * should not have any dirty data to write anymore but possibly there
- * are stray page dirty bits left by the checkpointing code so this
- * loop clears them.
- */
- if (ext4_should_journal_data(mpd->inode) &&
- sb->s_writers.frozen < SB_FREEZE_FS) {
+ if (ext4_should_journal_data(mpd->inode)) {
handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,
bpp);
if (IS_ERR(handle))
@@ -2496,15 +2487,11 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
* through a pin.
*/
if (!mpd->can_map) {
- WARN_ON_ONCE(sb->s_writers.frozen ==
- SB_FREEZE_COMPLETE);
err = mpage_submit_page(mpd, &folio->page);
if (err < 0)
goto out;
/* Pending dirtying of journalled data? */
if (PageChecked(&folio->page)) {
- WARN_ON_ONCE(sb->s_writers.frozen >=
- SB_FREEZE_FS);
err = mpage_journal_page_buffers(handle,
mpd, &folio->page);
if (err < 0)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9f8dd0d6e46..b5b4734fc1b7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6293,17 +6293,6 @@ static int ext4_freeze(struct super_block *sb)
if (error < 0)
goto out;

- /*
- * Do another sync. We really should not have any dirty data
- * anymore but our checkpointing code does not clear page dirty
- * bits due to locking constraints so writeback still can get
- * started for inodes with journalled data which triggers
- * annoying warnings.
- */
- error = sync_filesystem(sb);
- if (error < 0)
- goto out;
-
/* Journal blocked and flushed, clear needs_recovery flag. */
ext4_clear_feature_journal_needs_recovery(sb);
if (ext4_orphan_file_empty(sb))
--
2.35.3

2023-03-29 15:54:47

by Jan Kara

[permalink] [raw]
Subject: [PATCH 05/13] ext4: Commit transaction before writing back pages in data=journal mode

When journalling data we currently just walk over pages, journal those
that are marked for delayed dirtying (only pinned pages dirtied behing
our back these days) and checkpoint other dirty pages. Because some
pages may be part of running transaction the result is that after
filemap_write_and_wait() we are not guaranteed pages are stable on disk.
Thus places that want to flush current pagecache content need to jump
through hoops to make sure journalled data is not lost. This is
manageable in cases completely controlled by ext4 (such as extent
shifting operations or inode eviction) but it gets ugly for stuff like
fsverity. Furthermore it is rather error prone as people often do not
realize journalled data needs special handling.

So change ext4_writepages() to commit transaction with inode's data
before going through the writeback loop in WB_SYNC_ALL mode. As a result
filemap_write_and_wait() is now really getting pages to stable storage
and makes pagecache pages safe to reclaim. Consequently we can remove
the special handling of journalled data from several places in follow up
patches.

Note that this will make fsync(2) for journalled data more expensive as
we will end up not only committing the transaction we need but also
checkpointing the data (which we may have previously skipped if the data
was part of the running transaction). If we really cared, we would need
to introduce special VFS function for writing out & invalidating page
cache for a range, use ->launder_page callback to perform checkpointing,
and use it from all the places that need this functionality. But at this
point I'm not convinced the complexity is worth it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 85299c90b0f7..3ab2d56b6840 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1562,6 +1562,7 @@ struct mpage_da_data {
struct ext4_io_submit io_submit; /* IO submission data */
unsigned int do_map:1;
unsigned int scanned_until_end:1;
+ unsigned int journalled_more_data:1;
};

static void mpage_release_unused_pages(struct mpage_da_data *mpd,
@@ -2539,6 +2540,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
mpd, &folio->page);
if (err < 0)
goto out;
+ mpd->journalled_more_data = 1;
}
mpage_page_done(mpd, &folio->page);
} else {
@@ -2628,10 +2630,23 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)

/*
* data=journal mode does not do delalloc so we just need to writeout /
- * journal already mapped buffers
+ * journal already mapped buffers. On the other hand we need to commit
+ * transaction to make data stable. We expect all the data to be
+ * already in the journal (the only exception are DMA pinned pages
+ * dirtied behind our back) so we commit transaction here and run the
+ * writeback loop to checkpoint them. The checkpointing is not actually
+ * necessary to make data persistent *but* quite a few places (extent
+ * shifting operations, fsverity, ...) depend on being able to drop
+ * pagecache pages after calling filemap_write_and_wait() and for that
+ * checkpointing needs to happen.
*/
- if (ext4_should_journal_data(inode))
+ if (ext4_should_journal_data(inode)) {
mpd->can_map = 0;
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ ext4_fc_commit(sbi->s_journal,
+ EXT4_I(inode)->i_datasync_tid);
+ }
+ mpd->journalled_more_data = 0;

if (ext4_should_dioread_nolock(inode)) {
/*
@@ -2812,6 +2827,13 @@ static int ext4_writepages(struct address_space *mapping,

percpu_down_read(&EXT4_SB(sb)->s_writepages_rwsem);
ret = ext4_do_writepages(&mpd);
+ /*
+ * For data=journal writeback we could have come across pages marked
+ * for delayed dirtying (PageChecked) which were just added to the
+ * running transaction. Try once more to get them to stable storage.
+ */
+ if (!ret && mpd.journalled_more_data)
+ ret = ext4_do_writepages(&mpd);
percpu_up_read(&EXT4_SB(sb)->s_writepages_rwsem);

return ret;
--
2.35.3

2023-03-29 15:54:47

by Jan Kara

[permalink] [raw]
Subject: [PATCH 10/13] ext4: Drop special handling of journalled data from ext4_quota_on()

Now that ext4_writepages() makes sure all journalled data is committed
and checkpointed, sync_filesystem() call done by dquot_quota_on() is
enough for quota IO to see uptodate data. So drop special handling of
journalled data from ext4_quota_on().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 17 -----------------
1 file changed, 17 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f226f8ab469b..b9f8dd0d6e46 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6881,23 +6881,6 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
sb_dqopt(sb)->flags &= ~DQUOT_NOLIST_DIRTY;
}

- /*
- * When we journal data on quota file, we have to flush journal to see
- * all updates to the file when we bypass pagecache...
- */
- if (EXT4_SB(sb)->s_journal &&
- ext4_should_journal_data(d_inode(path->dentry))) {
- /*
- * We don't need to lock updates but journal_flush() could
- * otherwise be livelocked...
- */
- jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
- err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
- if (err)
- return err;
- }
-
lockdep_set_quota_inode(path->dentry->d_inode, I_DATA_SEM_QUOTA);
err = dquot_quota_on(sb, type, format_id, path);
if (!err) {
--
2.35.3

2023-03-30 00:12:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/13] ext4: Drop special handling of journalled data from ext4_sync_file()

On Wed, Mar 29, 2023 at 05:49:37PM +0200, Jan Kara wrote:
> /*
> - * data=writeback,ordered:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> * Metadata is in the journal, we wait for proper transaction to
> * commit here.

Nit: without the list, the two space indent looks a bit weird here.

> if (!sbi->s_journal)
> ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
> - else if (ext4_should_journal_data(inode))
> - ret = ext4_force_commit(inode->i_sb);
> else
> ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

Also if there is not journale the above comment doesn't make much sense.
But I'm really not sure the comment adds any value to start with, so
maybe just drop it entirely?

2023-03-30 08:24:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 06/13] ext4: Drop special handling of journalled data from ext4_sync_file()

On Wed 29-03-23 17:05:45, Christoph Hellwig wrote:
> On Wed, Mar 29, 2023 at 05:49:37PM +0200, Jan Kara wrote:
> > /*
> > - * data=writeback,ordered:
> > * The caller's filemap_fdatawrite()/wait will sync the data.
> > * Metadata is in the journal, we wait for proper transaction to
> > * commit here.
>
> Nit: without the list, the two space indent looks a bit weird here.
>
> > if (!sbi->s_journal)
> > ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
> > - else if (ext4_should_journal_data(inode))
> > - ret = ext4_force_commit(inode->i_sb);
> > else
> > ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
>
> Also if there is not journale the above comment doesn't make much sense.
> But I'm really not sure the comment adds any value to start with, so
> maybe just drop it entirely?

Hum, right. I'll just drop the whole comment. Thanks for the suggestion.

Honza


--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-04-15 03:21:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/13 v1] ext4: Make ext4_writepages() write all journalled data


On Wed, 29 Mar 2023 17:49:31 +0200, Jan Kara wrote:
> These patches modify ext4 so that whenever inode with journalled data is
> written back we make sure we writeout all the dirty data the inode has.
> Previously this was not the case as pages with journalled data that were still
> part of running (or committing) transaction appeared as clean to the writeback
> code. As a result we can remove workarounds for inodes with journalled data
> from multiple places.
>
> [...]

Applied, thanks!

[01/13] jdb2: Don't refuse invalidation of already invalidated buffers
commit: bd159398a2d2234de07d310132865706964aaaa7
[02/13] ext4: Mark pages with journalled data dirty
commit: d84c9ebdac1e39bc7b036c0c829ee8c1956edabc
[03/13] ext4: Keep pages with journalled data dirty
commit: 265e72efa99fcc0959f8d33d346a7e0f2e3fe201
[04/13] ext4: Clear dirty bit from pages without data to write
commit: 5e1bdea6391d09fde424a1406a04e01b208a04d2
[05/13] ext4: Commit transaction before writing back pages in data=journal mode
commit: 1f1a55f0bf069799edd5f21a99ac1e3b10ebafee
[06/13] ext4: Drop special handling of journalled data from ext4_sync_file()
commit: e360c6ed727401ad1a3b5080f06d2ec3a4b75f5b
[07/13] ext4: Drop special handling of journalled data from extent shifting operations
commit: c000dfec7e88cee660cbc594c9716ecc979dc1f1
[08/13] ext4: Fix special handling of journalled data from extent zeroing
commit: 783ae448b7a21ca59ffe5bc261c17d9c3ebfe2ad
[09/13] ext4: Drop special handling of journalled data from ext4_evict_inode()
commit: 56c2a0e3d90d3822fab157883957523e327bc9ae
[10/13] ext4: Drop special handling of journalled data from ext4_quota_on()
commit: 7c375870fdc5f50a001f8265cd8744a78d2d43ab
[11/13] ext4: Simplify handling of journalled data in ext4_bmap()
commit: 951cafa6b80e55b966047b0c9cc5564df8b92145
[12/13] ext4: Update comment in mpage_prepare_extent_to_map()
commit: ab382539adcb43f52d984abf58d8e3459cd707a2
[13/13] Revert "ext4: Fix warnings when freezing filesystem with journaled data"
commit: d0ab8368c175f7b5ef0851283a2ba362a9ab327a

Best regards,
--
Theodore Ts'o <[email protected]>