Subject: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

Ted Ts'o explains, in the linux-ext4 thread [1], that currently
data=journal + journal_checksum + mmap is not handled correctly,
and outlines the changes to fix it (+ items/breaks for clarity):

"""
The fix is going to have to involve
- fixing __ext4_journalled_writepage() to call set_page_writeback()
before it unlocks the page,
- adding a list of pages under data=journalled writeback
which is attached to the transaction handle,
- have the jbd2 commit hook call end_page_writeback()
on all of these pages,
- and then in the places where ext4 calls wait_for_stable_page()
or grab_cache_page_write_begin(), we need to add:

if (ext4_should_journal_data(inode))
wait_on_page_writeback(page);

It's all relatively straightforward except for the part
where we have to attach a list of pages to the currently
running transaction. That will require adding some
plumbing into the jbd2 layer.
"""

This is my first adventure into ext4, and after enough struggle
(and try harder!) with the first deadlock described in PATCH 02,
and address it to find the other deadlock (also described there),
and address it, I guess it wasn't that straighforward (for me :)
but absolutely very good learning!

(Granted, I now understand that the fix outlined is indeed the
way it is supposed to work and done in general, weren't it for
the need to unlock_page() before ext4_journal_start(), as that
compromised the base of writeback patterns as far as I learned.
Hope I didn't get that too wrong.)


Summary:
-------

The patchset is a bit long with 11 patches, but I tried to get
changes tiny to help with review, and better document how each
of them work, why and how this or that is done. It's RFC as I
would like to ask for suggestions/feedback, if at all possible.

Patch 01 and 02 implement the outlined fix, with a few changes
(fix first deadlock; use existing plumbing in jbd2 as the list.)

Patch 03 fix a seconds-delay on msync().

Patch 04 introduces helpers to handle the second deadlock.

Patch 05-11 handle the second deadlock (three of these patches,
namely 07, 09 and 10 are changes not specific for data=journal,
affecting other journaling modes, so it's not on their subject.)

The order of the patches intentionally allow the issues on 03
and 05-11 to occur (while putting the core patches first), so
to allow issues to be reproduced/regression tested one by one,
as needed. It can be changed, of course, so to enable actual
writeback changes in the last patch (when issues are patched.)


Testing:
-------

This has been built and regression tested on next-20200417.
(Also rebased and build tested on next-20200423 / "today").

On xfstests (commit b2faf204) quick group (and excluding
generic 430/431/434 which always hung): no regressions w/
data=ordered (default) nor data=journal,journal_checksum.

With data=ordered: (on both original and patched kernel)

Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570

With data=journal,journal_checksum: (likewise)

Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570

The test-case for the problem (and deadlocks) and further
stress testing is stress-ng (with 512 workers on 16 vCPUs)

$ sudo mount -o data=journal,journal_checksum $DEV $MNT
$ cd $MNT
$ sudo stress-ng --mmap 512 --mmap-file --timeout 1w

To reproduce the problem (without patchset), run it a bit
and crash the kernel (to cause unclean shutdown) w/ sysrq,
and mount the device again (it should fail / need e2fsck):

Original:

[ 27.660063] JBD2: Invalid checksum recovering data block 79449 in log
[ 27.792371] JBD2: recovery failed
[ 27.792854] EXT4-fs (vdc): error loading journal
mount: /tmp/ext4: can't read superblock on /dev/vdc.

Patched:

[ 33.111230] EXT4-fs (vdc): 512 orphan inodes deleted
[ 33.111961] EXT4-fs (vdc): recovery complete
[ 33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum


RFC / Questions:
---------------

0) Usage of ext4_inode_info.i_datasync_tid for checks

We rely on the struct ext4_inode_info.i_datasync_tid field
(set by __ext4_journalled_writepage() and others) to check
it against the running transaction. Of course, other sites
set it too, and it might be that some of our checks return
false positives then (should be fine, just less efficient.)

To avoid such false positives, we could add another field
to that structure, exclusively for this, but that is more
8 bytes (pointer) for inodes and even on non-data=journal
cases.. so it didn't seem good enough reason, but if that
is better/worth it for efficiency reasons (speed, in this
case, vs. memory consumption) we could do it.

Maybe there are other ideas/better ways to do it?

1) Usage of ext4_force_commit() in ext4_writepages()

Patch 03 describes/fixes an issue where the underlying problem is,
if __ext4_journalled_writepage() does set_page_writeback() but no
journal commit is triggered, wait_on_page_writeback() may wait up
to seconds until the periodic journal commit happens.

The solution there, to fix the impact on msync(), is to just call
ext4_force_commit() (as it's done anyway in ext4_sync_file()), on
ext4_writepages().

Is that a good enough solution? Other ideas?

2) Similar issue (unhandled) in ext4_writepage()

The other, related question is, what about direct callers of
ext4_writepage() that obviously do not use ext4_writepages() ?
(e.g., pageout() and writeout(); write_one_page() not used.)

Those are memory-cleasing writeback, which should not wait,
however, as mentioned in that patch, if its writeback goes
on for seconds and an data-integrity writeback/system call
comes in, it is delayed/wait_on_page_writeback() that long.

So, ideally, we should be trying to kick a journal commit?

It looks like ext4_handle_sync() is not the answer, since
it waits for commit to finish, and pageout() is called on
a list of pages by shrinking. So, not effective to block
on each one of them.

We might not want to start anything right now, actually,
since the memory-cleasing writeback can be happening on
memory pressure scenarios, right? But would need to do
something, to ensure that future wait_on_page_writeback()
do not wait too long.

Maybe the answer is something similar to jbd2 sync transaction
batching (used by ext4_handle_sync()), but in *async* fashion,
say, possibly implemented/batching in the jbd2 worker thread.
Is that reasonable?

...

Any comments/feedback/reviews are very appreciated.

Thank you in advance,
Mauricio

[1] https://lore.kernel.org/linux-ext4/[email protected]/

Mauricio Faria de Oliveira (11):
ext4: data=journal: introduce struct/kmem_cache
ext4_journalled_wb_page/_cachep
ext4: data=journal: handle page writeback in
__ext4_journalled_writepage()
ext4: data=journal: call ext4_force_commit() in ext4_writepages() for
msync()
ext4: data=journal: introduce helpers for journalled writeback
deadlock
ext4: data=journal: prevent journalled writeback deadlock in
__ext4_journalled_writepage()
ext4: data=journal: prevent journalled writeback deadlock in
ext4_write_begin()
ext4: grab page before starting transaction handle in
ext4_convert_inline_data_to_extent()
ext4: data=journal: prevent journalled writeback deadlock in
ext4_convert_inline_data_to_extent()
ext4: grab page before starting transaction handle in
ext4_try_to_write_inline_data()
ext4: deduplicate code with error legs in
ext4_try_to_write_inline_data()
ext4: data=journal: prevent journalled writeback deadlock in
ext4_try_to_write_inline_data()

fs/ext4/ext4_jbd2.h | 88 +++++++++++++++++++++++++
fs/ext4/inline.c | 153 +++++++++++++++++++++++++++++++-------------
fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++++++++++--
fs/ext4/page-io.c | 11 ++++
4 files changed, 341 insertions(+), 48 deletions(-)

--
2.20.1


Subject: [RFC PATCH 01/11] ext4: data=journal: introduce struct/kmem_cache ext4_journalled_wb_page/_cachep

The 'struct ext4_journalled_wb_page' is a journal callback entry;
use 'kmem_cache ext4_journalled_wb_page_cachep' to alloc/free it.

It will be used to end page writeback on transaction commit,
only for writeback of mmaped pagecache in data=journal mode
(i.e., ext4_writepage() -> __ext4_journalled_writepage()).

The next patch will add the associated set page writeback
and add the callback entry to the list in the transaction;
and process that list in the ext4 journal commit callback.

This builds on top of the existing journal commit callback
functions (not used anymore), using t_private_list in jbd2.
(introduced with commit 18aadd47f884 ("ext4: expand commit
callback and"), and not used in commit a015434480dc ("ext4:
send parallel discards on commit completions").)

P.S.: ext4_journal_commit_callback() does have spinlocking
with s_md_lock that is not needed for this particular case
(which seems to be the only one), as it has no concurrency
on transaction commit time/only during transaction running
time (currently only done in __ext4_journalled_writepage()).

But no changes for now, just keep it as it is/has been, so
it remains general in case other usages arise that need it.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/ext4_jbd2.h | 16 ++++++++++++++++
fs/ext4/page-io.c | 11 +++++++++++
2 files changed, 27 insertions(+)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4b9002f0e84c..9ea8ee583931 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -209,6 +209,22 @@ static inline bool ext4_journal_callback_try_del(handle_t *handle,
return deleted;
}

+/* The struct ext4_journalled_wb_page is a journal callback entry to
+ * end page writeback on transaction commit in the data=journal mode.
+ *
+ * This is used for writeback of mmaped pagecache in data=journal mode
+ * (see ext4_writepage() -> __ext4_journalled_writepage() for details.)
+ */
+struct ext4_journalled_wb_page {
+ /* First member must be the journal callback entry */
+ struct ext4_journal_cb_entry ejwp_jce;
+
+ /* Pointer to page marked for writeback */
+ struct page *ejwp_page;
+};
+
+extern struct kmem_cache *ext4_journalled_wb_page_cachep;
+
int
ext4_mark_iloc_dirty(handle_t *handle,
struct inode *inode,
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index de6fe969f773..67fabeef6bf2 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -32,6 +32,7 @@

static struct kmem_cache *io_end_cachep;
static struct kmem_cache *io_end_vec_cachep;
+struct kmem_cache *ext4_journalled_wb_page_cachep;

int __init ext4_init_pageio(void)
{
@@ -44,6 +45,15 @@ int __init ext4_init_pageio(void)
kmem_cache_destroy(io_end_cachep);
return -ENOMEM;
}
+
+ ext4_journalled_wb_page_cachep = KMEM_CACHE(ext4_journalled_wb_page,
+ SLAB_RECLAIM_ACCOUNT);
+ if (ext4_journalled_wb_page_cachep == NULL) {
+ kmem_cache_destroy(io_end_cachep);
+ kmem_cache_destroy(io_end_vec_cachep);
+ return -ENOMEM;
+ }
+
return 0;
}

@@ -51,6 +61,7 @@ void ext4_exit_pageio(void)
{
kmem_cache_destroy(io_end_cachep);
kmem_cache_destroy(io_end_vec_cachep);
+ kmem_cache_destroy(ext4_journalled_wb_page_cachep);
}

struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end)
--
2.20.1

Subject: [RFC PATCH 02/11] ext4: data=journal: handle page writeback in __ext4_journalled_writepage()

The idea is to have __ext4_journalled_writepage() to call
set_page_writeback(), then add the page to a page list in
the transaction (pages under data=journal writeback), and
end_page_writeback() on those pages on transaction commit.

And also wait_on_page_writeback() for data=journal after
wait_for_stable_page() and grab_cache_page_write_begin().

(The credits for understanding the problem reported and
suggesting a fix go to Ted Ts'o, on the linux-ext4 list [1])

Now __ext4_journalled_writepage() goes for new adventures.

So, upfront, do not inline it so it shows in stack traces,
not into a big ext4_writepage() function. And remove its
forward declaration (unneeded.)

Still upfront,

Due to a couple of reasons the simple path to set_page_writeback()
before unlock_page(), then wait_on_page_writeback() where needed,
is not possible, as it might deadlock. So there are adaptations.

1) It's not possible to set_page_writeback() before unlock_page()
(needed for ext4_journal_start())

This deadlocks with common 'lock_page(); wait_on_page_writeback();'
pattern seen in write_cache_pages()/grab_cache_page_write_begin()
for example (the latter for the case of stable pages), and more.

That happens because we have to lock_page() again in order to
end_page_writeback() later, but the other task is holding the
page lock, and waiting for this task to finish page writeback.

Therefore, our (contained) option left is set_page_writeback()
after we lock_page() after ext4_journal_start().

Unfortunately, it turns out to create another problem/deadlock
for ourselves, now for wait_on_page_writeback().

2) It's not possible to just wait_on_page_writeback() holding
a journal transaction handle.

This deadlocks with __ext4_journalled_writepage() if the same
transaction is used by the two tasks to set/wait on writeback.

That happens because if set_page_writeback() occurs then the
other task gets a transaction handle in the same transaction
and goes wait_on_page_writeback(), it still holds the handle,
which prevents transaction commit, thus end_page_writeback().

Unfortunately the options to address this problem are not at
all welcoming in all aspects, so the simplest one won: retry.

(another approach is another synchronization mechanism, e.g.,
possibly a page bit, which is overkill for a particular case;
another, a reader-writer semaphore on the inode would do too,
but it's coarse grained, as an inode/mapping maps many pages,
and not all pages could possibly be under writeback.)

This problem is addressed with more patches on each caller.
Regressions until those are applied are unlikely to happen
because data=journal is not default, thus likely debugging
or testing by someone intentionally looking at this anyway.

...

So, back to the patch:

Add the functions to initialize a 'struct ext4_journalled_wb_page'
and its callback to end_page_writeback() and free the used struct.

Change the function to allocate an entry (not failing immediately
to give the chance of successful return in case of page truncated),
set page as writeback then add it to the list in the transaction,
so to end page writeback on transaction commit.

Now pages in the mmaped pagecache that are set dirty ('checked')
and have writeback triggered (e.g., msync() or memory cleansing)
can have the writeback bit set and waited on, until the journal
commit callback entry have it clear later.

Now that we set_page_writeback(), some places need this snippet:

if (ext4_should_journal_data(inode))
wait_on_page_writeback(page)

But due to deadlock #2 above, most of those will be done shortly
on next patches. One place is not affected/no handle held during
wait_on_page_writeback(): ext4_page_mkwrite() so address it here.

[1] https://lore.kernel.org/linux-ext4/[email protected]/

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inode.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 77 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a4aae6acdcb..d385a11ba31e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -137,7 +137,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode,

static void ext4_invalidatepage(struct page *page, unsigned int offset,
unsigned int length);
-static int __ext4_journalled_writepage(struct page *page, unsigned int len);
static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
int pextents);
@@ -1839,8 +1838,27 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

-static int __ext4_journalled_writepage(struct page *page,
- unsigned int len)
+static void __ext4_journalled_wb_page_init(struct ext4_journalled_wb_page *ejwp,
+ struct page *page)
+{
+ INIT_LIST_HEAD(&ejwp->ejwp_jce.jce_list);
+ ejwp->ejwp_page = page;
+}
+
+static void __ext4_journalled_wb_end(struct super_block *sb,
+ struct ext4_journal_cb_entry *jce,
+ int error)
+{
+ struct ext4_journalled_wb_page *ejwp =
+ (struct ext4_journalled_wb_page *) jce;
+
+ end_page_writeback(ejwp->ejwp_page);
+ kmem_cache_free(ext4_journalled_wb_page_cachep, ejwp);
+}
+
+static noinline int __ext4_journalled_writepage(struct page *page,
+ unsigned int len,
+ struct writeback_control *wbc)
{
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
@@ -1849,6 +1867,7 @@ static int __ext4_journalled_writepage(struct page *page,
int ret = 0, err = 0;
int inline_data = ext4_has_inline_data(inode);
struct buffer_head *inode_bh = NULL;
+ struct ext4_journalled_wb_page *ejwp = NULL;

ClearPageChecked(page);

@@ -1873,8 +1892,24 @@ static int __ext4_journalled_writepage(struct page *page,
* out from under us.
*/
get_page(page);
+
+ /*
+ * Do not set_page_writeback() before first unlocking the page, or
+ * we can deadlock with any "lock_page(); wait_on_page_writeback();"
+ * (example: write_cache_pages() and grab_cache_page_write_begin())
+ * because we need to lock_page() again so to end_page_writeback().
+ */
unlock_page(page);

+ /*
+ * Allocate callback entry before we start the journal.
+ *
+ * Don't return immediately on error, because the case
+ * of page got truncated does not need this allocation
+ * (but is checked after we start the journal.)
+ */
+ ejwp = kmem_cache_alloc(ext4_journalled_wb_page_cachep, GFP_KERNEL);
+
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
@@ -1891,8 +1926,42 @@ static int __ext4_journalled_writepage(struct page *page,
ext4_journal_stop(handle);
ret = 0;
goto out;
+ } else if (!ejwp) {
+ /* Or ejwp alloc failed */
+ ext4_journal_stop(handle);
+ ret = -ENOMEM;
+ goto out;
}

+ /*
+ * In case writeback began while the page was unlocked.
+ *
+ * This can happen if another writeback task has locked
+ * the page in the window between we unlocked/locked it
+ * again, and made it to set_page_writeback() before us.
+ */
+ if (PageWriteback(page)) {
+ /* Memory cleansing writeback; just let them do it. */
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ ext4_journal_stop(handle);
+ ret = 0;
+ goto out;
+ }
+
+ /* Data integrity writeback; have to wait and do it. */
+ wait_on_page_writeback(page);
+ }
+
+ /*
+ * Add page to journalled writeback page list in transaction.
+ * The commit callback will end_page_writeback() and free ejwp.
+ */
+ set_page_writeback(page);
+ __ext4_journalled_wb_page_init(ejwp, page);
+ ext4_journal_callback_add(handle, __ext4_journalled_wb_end,
+ (struct ext4_journal_cb_entry *) ejwp);
+ ejwp = NULL;
+
if (inline_data) {
ret = ext4_mark_inode_dirty(handle, inode);
} else {
@@ -1916,6 +1985,8 @@ static int __ext4_journalled_writepage(struct page *page,
out:
unlock_page(page);
out_no_pagelock:
+ if (ejwp)
+ kmem_cache_free(ext4_journalled_wb_page_cachep, ejwp);
brelse(inode_bh);
return ret;
}
@@ -2027,7 +2098,7 @@ static int ext4_writepage(struct page *page,
* It's mmapped pagecache. Add buffers and journal it. There
* doesn't seem much point in redirtying the page here.
*/
- return __ext4_journalled_writepage(page, len);
+ return __ext4_journalled_writepage(page, len, wbc);

ext4_io_submit_init(&io_submit, wbc);
io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
@@ -5985,6 +6056,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
ext4_bh_unmapped)) {
/* Wait so that we don't change page under IO */
wait_for_stable_page(page);
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
goto out;
}
--
2.20.1

Subject: [RFC PATCH 03/11] ext4: data=journal: call ext4_force_commit() in ext4_writepages() for msync()

The data-integrity syscalls (not memory-cleansing writeback)
use file_write_and_wait_range() that wait_on_page_writeback()
for every page in the range after do_writepages().

If any of these pages is mmap'ed pagecache, i.e., goes into
__ext4_journalled_writepage(), with the last couple patches
end_page_writeback() will be done on (or, not be done until)
transaction commit, which can take seconds (commit interval,
max commit age), which delays msync().

Let's fix this so that msync() syscall should just return
quickly without a delay of up to a few seconds by default.

For data=journal the next thing these syscalls do anyway is
ext4_force_commit() (see ext4_sync_file()), which is needed
for the buffered pagecache, as __filemap_fdatawrite_range()
doesn't do anything: the buffers are clean, so it returns
early without calling do_writepages() / ext4_write_pages().
So it's not possible to just move/replace that call here.

(This is better/more correct than to use ext4_handle_sync()
for mmap'ed pagecache, which triggers one commit per page,
as synchronous transaction batching in jbd2 targets other,
concurrent tasks, but in this case one single task writes
all pages back serially.)

Now for memory-cleansing writeback, even though it is not
supposed to wait, we should not wait for seconds either,
as it could delay an upcoming data-integrity syscall in
write_cache_pages() on a call to wait_on_page_writeback().
(another fix is needed for such calls to ext4_writepage()).

So just do not check for wbc->sync_mode to cover it too.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inode.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d385a11ba31e..574a062b8bcd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2709,7 +2709,37 @@ static int ext4_writepages(struct address_space *mapping,
goto out_writepages;

if (ext4_should_journal_data(inode)) {
+ journal_t *journal = sbi->s_journal;
+
ret = generic_writepages(mapping, wbc);
+ /*
+ * On the data-integrity syscalls, file_write_and_wait_range()
+ * will wait on page writeback after calling ext4_writepages().
+ * For mmaped pagecache that only ends on transaction commit,
+ * which may take up to commit interval (seconds!) to happen.
+ *
+ * So, ensure that ext4_force_commit() happens before return,
+ * and after all pages in the range are set_page_writeback(),
+ * but only if needed (i.e. check for datasync transaction
+ * set in the inode by __ext4_journalled_writepage().)
+ *
+ * Do it for memory-cleasing writeback too, because it might
+ * delay another data-integrity syscall in write_cache_pages()
+ * on wait_on_page_writeback().
+ */
+ if (!ret && journal) {
+ bool force_commit = false;
+
+ read_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction &&
+ journal->j_running_transaction->t_tid ==
+ EXT4_I(inode)->i_datasync_tid)
+ force_commit = true;
+ read_unlock(&journal->j_state_lock);
+
+ if (force_commit)
+ ret = ext4_force_commit(inode->i_sb);
+ }
goto out_writepages;
}

--
2.20.1

Subject: [RFC PATCH 04/11] ext4: data=journal: introduce helpers for journalled writeback deadlock

This patch introduces the helper functions ext4_check_journalled_writeback(),
and ext4_start_commit_datasync(), to check for, and prevent the deadlock #2
(detailed in a previous commit.)

The former checks the transaction associated with the handle (parameter)
is also the transaction stored in the inode for datasync'ing operations
(set by __ext4_journalled_writepage()) if the page (parameter) is under
writeback (set by that function too.)

This patch also documents the steps to prevent the deadlock, if needed
(i.e., helper function returns true) which consist in a retry strategy,
using the latter helper.

The check may return false positives: i_datasync_tid and PageWriteback
are set by other functions than __ext4_journalled_writepage(); but not
false negatives.

So the code may unnecessarily stop/commit/start on false-positives,
but it does prevent deadlocks so it's reasonable cost-benefit case.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/ext4_jbd2.h | 72 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 9ea8ee583931..fca6551dbf09 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -225,6 +225,78 @@ struct ext4_journalled_wb_page {

extern struct kmem_cache *ext4_journalled_wb_page_cachep;

+/*
+ * ext4_check_journalled_writeback(handle, page).
+ * See __ext4_journalled_writepage().
+ *
+ * This function can be used to check for a potential deadlock if this task has
+ * a handle on a transaction and has to wait_on_page_writeback() when it's held.
+ * (NOTE: this affects grab_cache_page_write_begin() after ext4_journal_start())
+ *
+ * The deadlock occurs if another task has set_page_writeback() for the _same_
+ * page on the _same_ transaction, and this task calls wait_on_page_writeback().
+ *
+ * The held handle blocks the transaction commit, and thus end_page_writeback(),
+ * blocking this task in wait_on_page_writeback(); and only ext4_journal_stop()
+ * could unblock the commit, but it is not reached.
+ *
+ * If the function returns true, prevent the deadlock:
+ *
+ * 1) Stop handle to safely wait_on_page_writeback()
+ * 2) Start commiting the transaction (non-blocking)
+ * saved in inode by __ext4_journalled_writepage()
+ * 3) Call wait_on_page_writeback()
+ * 4) Retry
+ *
+ * For example,
+ *
+ * retry:
+ * // done before ext4_journal_start()
+ * page = grab_cache_page_write_begin(mapping, ...);
+ * if (ext4_should_journal_data(inode))
+ * wait_on_page_writeback(page);
+ * unlock_page(page);
+ *
+ * handle = ext4_journal_start(inode, ...);
+ *
+ * lock_page(page);
+ *
+ * // new code
+ * if (ext4_check_journalled_writeback(handle, page) {
+ * unlock_page(page);
+ * put_page(page);
+ * ext4_journal_stop(handle);
+ * ext4_start_commit_datasync(inode);
+ * goto retry;
+ * }
+ *
+ * Unfortunately the check may return false positives (e.g., non-mmaped/buffered
+ * pagecache that is under writeback that turned out to have same i_datasync_tid)
+ * and thus stop/commit/start unnecessarily. But since it can prevent deadlocks
+ * and only affects the data=journal mode, it seems a reasonable cost/benefit.
+ */
+static inline int ext4_should_journal_data(struct inode *inode);
+
+static inline bool ext4_check_journalled_writeback(handle_t *handle,
+ struct page *page)
+{
+ struct inode *inode = page->mapping->host;
+
+ BUG_ON(!ext4_should_journal_data(inode));
+
+ return (PageWriteback(page) &&
+ handle->h_transaction->t_tid == EXT4_I(inode)->i_datasync_tid);
+}
+
+static inline int ext4_start_commit_datasync(struct inode *inode)
+{
+ BUG_ON(!ext4_should_journal_data(inode));
+
+ /* Start commit associated with datasync transaction (non-blocking.) */
+ return jbd2_log_start_commit(EXT4_JOURNAL(inode),
+ EXT4_I(inode)->i_datasync_tid);
+}
+
int
ext4_mark_iloc_dirty(handle_t *handle,
struct inode *inode,
--
2.20.1

Subject: [RFC PATCH 05/11] ext4: data=journal: prevent journalled writeback deadlock in __ext4_journalled_writepage()

This patch checks for and prevents the deadlock with a page in
journalled writeback in __ext4_journalled_writepage().

It turns out that __ext4_journalled_writepage() may race with itself
in another task (e.g., two tasks set the same mmap'ed page dirty and
called msync(), one after the other), so we have to check/handle it.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inode.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 574a062b8bcd..401313be8a5b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1910,6 +1910,7 @@ static noinline int __ext4_journalled_writepage(struct page *page,
*/
ejwp = kmem_cache_alloc(ext4_journalled_wb_page_cachep, GFP_KERNEL);

+retry_journal:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
@@ -1948,7 +1949,20 @@ static noinline int __ext4_journalled_writepage(struct page *page,
goto out;
}

- /* Data integrity writeback; have to wait and do it. */
+ /*
+ * Data integrity writeback; have to wait and do it.
+ *
+ * Check for deadlock with page in journalled writeback
+ * (i.e. another task running/ran this function as well)
+ */
+ if (ext4_check_journalled_writeback(handle, page)) {
+ get_page(page);
+ unlock_page(page);
+ ext4_journal_stop(handle);
+ ext4_start_commit_datasync(inode);
+ wait_on_page_writeback(page);
+ goto retry_journal;
+ }
wait_on_page_writeback(page);
}

--
2.20.1

Subject: [RFC PATCH 06/11] ext4: data=journal: prevent journalled writeback deadlock in ext4_write_begin()

This patch checks for and prevents the deadlock with a page in
journalled writeback in ext4_write_begin().

Finally, add wait_on_page_writeback() if data=journal mode.

Note: similar changes are not needed in ext4_da_write_begin()
as delayed allocation is not applicable to data journalling
(different struct address_space_operations).

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inode.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 401313be8a5b..f58c426aaca1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1149,6 +1149,8 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);
unlock_page(page);

retry_journal:
@@ -1165,9 +1167,19 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
put_page(page);
ext4_journal_stop(handle);
goto retry_grab;
+ } else if (ext4_should_journal_data(inode) &&
+ ext4_check_journalled_writeback(handle, page)) {
+ /* Or transaction may block page writeback */
+ unlock_page(page);
+ put_page(page);
+ ext4_journal_stop(handle);
+ ext4_start_commit_datasync(inode);
+ goto retry_grab;
}
/* In case writeback began while the page was unlocked */
wait_for_stable_page(page);
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);

#ifdef CONFIG_FS_ENCRYPTION
if (ext4_should_dioread_nolock(inode))
--
2.20.1

Subject: [RFC PATCH 08/11] ext4: data=journal: prevent journalled writeback deadlock in ext4_convert_inline_data_to_extent()

This patch checks for and prevents the deadlock with a page in
journalled writeback in ext4_convert_inline_data_to_extent().

Finally, add wait_on_page_writeback() if data=journal mode.

Note: similar changes are not needed in ext4_da_convert_inline_data_to_extent()
as delayed allocation is not applicable to data journalling
(different struct address_space_operations).

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inline.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 5fd275098d10..10665b066bb7 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -562,6 +562,8 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
handle = NULL;
goto out;
}
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);
unlock_page(page);

retry_journal:
@@ -581,9 +583,19 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
put_page(page);
ext4_journal_stop(handle);
goto retry_grab;
+ } else if (ext4_should_journal_data(inode) &&
+ ext4_check_journalled_writeback(handle, page)) {
+ /* Or transaction may block page writeback */
+ unlock_page(page);
+ put_page(page);
+ ext4_journal_stop(handle);
+ ext4_start_commit_datasync(inode);
+ goto retry_grab;
}
/* In case writeback began while the page was unlocked */
wait_for_stable_page(page);
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);

ext4_write_lock_xattr(inode, &no_expand);
sem_held = 1;
--
2.20.1

Subject: [RFC PATCH 10/11] ext4: deduplicate code with error legs in ext4_try_to_write_inline_data()

Make the error legs similar to ext4_convert_inline_data_to_extent(),
now that we have even more "unlock_page(); put_page();" statements.

While in there, do similarly for the semaphore and label "convert:".

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inline.c | 56 +++++++++++++++++++++---------------------------
1 file changed, 25 insertions(+), 31 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 9bb85e06854d..3d370b04a740 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -684,9 +684,13 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
handle_t *handle;
struct page *page;
struct ext4_iloc iloc;
+ bool convert = false;
+ bool sem_held = false;

- if (pos + len > ext4_get_max_inline_size(inode))
- goto convert;
+ if (pos + len > ext4_get_max_inline_size(inode)) {
+ convert = true;
+ goto out_convert;
+ }

ret = ext4_get_inode_loc(inode, &iloc);
if (ret)
@@ -717,6 +721,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
put_page(page);
ret = PTR_ERR(handle);
handle = NULL;
+ page = NULL;
goto out;
}

@@ -732,58 +737,47 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
wait_for_stable_page(page);

ret = ext4_prepare_inline_data(handle, inode, pos + len);
- if (ret && ret != -ENOSPC) {
- unlock_page(page);
- put_page(page);
+ if (ret) {
+ /* If don't have space in inline inode, convert it to extent. */
+ convert = (ret == -ENOSPC);
goto out;
}

- /* We don't have space in inline inode, so convert it to extent. */
- if (ret == -ENOSPC) {
- unlock_page(page);
- put_page(page);
- ext4_journal_stop(handle);
- brelse(iloc.bh);
- goto convert;
- }
-
ret = ext4_journal_get_write_access(handle, iloc.bh);
- if (ret) {
- unlock_page(page);
- put_page(page);
+ if (ret)
goto out;
- }

*pagep = page;
down_read(&EXT4_I(inode)->xattr_sem);
+ sem_held = true;
if (!ext4_has_inline_data(inode)) {
ret = 0;
- unlock_page(page);
- put_page(page);
- goto out_up_read;
+ goto out;
}

if (!PageUptodate(page)) {
ret = ext4_read_inline_page(inode, page);
- if (ret < 0) {
- unlock_page(page);
- put_page(page);
- goto out_up_read;
- }
+ if (ret < 0)
+ goto out;
}

ret = 1;
handle = NULL;
-out_up_read:
- up_read(&EXT4_I(inode)->xattr_sem);
out:
+ if (page && (ret != 1)) {
+ unlock_page(page);
+ put_page(page);
+ }
+ if (sem_held)
+ up_read(&EXT4_I(inode)->xattr_sem);
if (handle && (ret != 1))
ext4_journal_stop(handle);
brelse(iloc.bh);
+out_convert:
+ if (convert)
+ return ext4_convert_inline_data_to_extent(mapping,
+ inode, flags);
return ret;
-convert:
- return ext4_convert_inline_data_to_extent(mapping,
- inode, flags);
}

int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
--
2.20.1

Subject: [RFC PATCH 07/11] ext4: grab page before starting transaction handle in ext4_convert_inline_data_to_extent()

Since even just grab_cache_page_write_begin() might deadlock with
page writeback from __ext4_journalled_writepage() if stable pages
are required (as it does "lock_page(); wait_for_stable_page();")
and the handle to the same/running transaction is currently held,
it _cannot_ be called between ext4_journal_start/stop() as usual,
we have to be careful.

We have two options:

1) open-code the first part of grab_cache_page_write_begin()
(before wait_for_stable_pages()) just before calling it,
then check for deadlock and retry (a bit ugly but valid.)

2) move it before ext4_journal_start() to avoid the deadlock.

Option 2 has been done as optimization to ext4_write_begin()
in commit 47564bfb95bf ("ext4: grab page before starting
transaction handle in write_begin()"), and can similarly
apply to this case.

Since it sounds more official, counts as optimization that
prevents long transaction hold time, and isn't ugly, do it.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inline.c | 48 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index f35e289e17aa..5fd275098d10 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -548,23 +548,42 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
if (ret)
return ret;

-retry:
+ /*
+ * grab_cache_page_write_begin() can take a long time if the
+ * system is thrashing due to memory pressure, or if the page
+ * is being written back. So grab it first before we start
+ * the transaction handle. This also allows us to allocate
+ * the page (if needed) without using GFP_NOFS.
+ */
+retry_grab:
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page) {
+ ret = -ENOMEM;
+ handle = NULL;
+ goto out;
+ }
+ unlock_page(page);
+
+retry_journal:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
if (IS_ERR(handle)) {
+ put_page(page);
ret = PTR_ERR(handle);
handle = NULL;
+ page = NULL;
goto out;
}

- /* We cannot recurse into the filesystem as the transaction is already
- * started */
- flags |= AOP_FLAG_NOFS;
-
- page = grab_cache_page_write_begin(mapping, 0, flags);
- if (!page) {
- ret = -ENOMEM;
- goto out;
+ lock_page(page);
+ if (page->mapping != mapping) {
+ /* The page got truncated from under us */
+ unlock_page(page);
+ put_page(page);
+ ext4_journal_stop(handle);
+ goto retry_grab;
}
+ /* In case writeback began while the page was unlocked */
+ wait_for_stable_page(page);

ext4_write_lock_xattr(inode, &no_expand);
sem_held = 1;
@@ -600,8 +619,6 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,

if (ret) {
unlock_page(page);
- put_page(page);
- page = NULL;
ext4_orphan_add(handle, inode);
ext4_write_unlock_xattr(inode, &no_expand);
sem_held = 0;
@@ -616,10 +633,13 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
*/
if (inode->i_nlink)
ext4_orphan_del(NULL, inode);
- }

- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry_journal;
+ put_page(page);
+ page = NULL;
+ }

if (page)
block_commit_write(page, from, to);
--
2.20.1

Subject: [RFC PATCH 11/11] ext4: data=journal: prevent journalled writeback deadlock in ext4_try_to_write_inline_data()

This patch checks for and prevents the deadlock with a page in
journalled writeback in ext4_try_to_write_inline_data().

Finally, add wait_on_page_writeback() if data=journal mode.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inline.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 3d370b04a740..c55a11f515d3 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -710,6 +710,8 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
handle = NULL;
goto out;
}
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);
unlock_page(page);

/*
@@ -732,9 +734,18 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
put_page(page);
ext4_journal_stop(handle);
goto retry_grab;
+ } else if (ext4_should_journal_data(inode) &&
+ ext4_check_journalled_writeback(handle, page)) {
+ unlock_page(page);
+ put_page(page);
+ ext4_journal_stop(handle);
+ ext4_start_commit_datasync(inode);
+ goto retry_grab;
}
/* In case writeback began while the page was unlocked */
wait_for_stable_page(page);
+ if (ext4_should_journal_data(inode))
+ wait_on_page_writeback(page);

ret = ext4_prepare_inline_data(handle, inode, pos + len);
if (ret) {
--
2.20.1

Subject: [RFC PATCH 09/11] ext4: grab page before starting transaction handle in ext4_try_to_write_inline_data()

Since even just grab_cache_page_write_begin() might deadlock with
page writeback from __ext4_journalled_writepage() if stable pages
are required (as it does "lock_page(); wait_for_stable_page();")
and the handle to the same/running transaction is currently held,
it _cannot_ be called between ext4_journal_start/stop() as usual,
we have to be careful.

We have two options:

1) open-code the first part of grab_cache_page_write_begin()
(before wait_for_stable_pages()) just before calling it,
then check for deadlock and retry (a bit ugly but valid.)

2) move it before ext4_journal_start() to avoid the deadlock.

Option 2 has been done as optimization to ext4_write_begin()
in commit 47564bfb95bf ("ext4: grab page before starting
transaction handle in write_begin()"), and can similarly
apply to this case.

Since it sounds more official, counts as optimization that
prevents long transaction hold time, and isn't ugly, do it.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
fs/ext4/inline.c | 46 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 10665b066bb7..9bb85e06854d 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -692,37 +692,65 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
if (ret)
return ret;

+ /*
+ * grab_cache_page_write_begin() can take a long time if the
+ * system is thrashing due to memory pressure, or if the page
+ * is being written back. So grab it first before we start
+ * the transaction handle. This also allows us to allocate
+ * the page (if needed) without using GFP_NOFS.
+ */
+retry_grab:
+ page = grab_cache_page_write_begin(mapping, 0, flags);
+ if (!page) {
+ ret = -ENOMEM;
+ handle = NULL;
+ goto out;
+ }
+ unlock_page(page);
+
/*
* The possible write could happen in the inode,
* so try to reserve the space in inode first.
*/
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
+ put_page(page);
ret = PTR_ERR(handle);
handle = NULL;
goto out;
}

+ lock_page(page);
+ if (page->mapping != mapping) {
+ /* The page got truncated from under us */
+ unlock_page(page);
+ put_page(page);
+ ext4_journal_stop(handle);
+ goto retry_grab;
+ }
+ /* In case writeback began while the page was unlocked */
+ wait_for_stable_page(page);
+
ret = ext4_prepare_inline_data(handle, inode, pos + len);
- if (ret && ret != -ENOSPC)
+ if (ret && ret != -ENOSPC) {
+ unlock_page(page);
+ put_page(page);
goto out;
+ }

/* We don't have space in inline inode, so convert it to extent. */
if (ret == -ENOSPC) {
+ unlock_page(page);
+ put_page(page);
ext4_journal_stop(handle);
brelse(iloc.bh);
goto convert;
}

ret = ext4_journal_get_write_access(handle, iloc.bh);
- if (ret)
- goto out;
-
- flags |= AOP_FLAG_NOFS;
-
- page = grab_cache_page_write_begin(mapping, 0, flags);
- if (!page) {
- ret = -ENOMEM;
+ if (ret) {
+ unlock_page(page);
+ put_page(page);
goto out;
}

--
2.20.1

Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

Hi,

On Thu, Apr 23, 2020 at 8:37 PM Mauricio Faria de Oliveira
<[email protected]> wrote:
[snip]
> Summary:
> -------
>
> The patchset is a bit long with 11 patches, but I tried to get
> changes tiny to help with review, and better document how each
> of them work, why and how this or that is done. It's RFC as I
> would like to ask for suggestions/feedback, if at all possible.

If at all possible, may this patchset have at least a cursory look?

I'm aware it's been a busy period for some of you, so I just wanted
to friendly ping on it, in case this got buried deep under other stuff.

Thanks!


>
> Patch 01 and 02 implement the outlined fix, with a few changes
> (fix first deadlock; use existing plumbing in jbd2 as the list.)
>
> Patch 03 fix a seconds-delay on msync().
>
> Patch 04 introduces helpers to handle the second deadlock.
>
> Patch 05-11 handle the second deadlock (three of these patches,
> namely 07, 09 and 10 are changes not specific for data=journal,
> affecting other journaling modes, so it's not on their subject.)
>
> The order of the patches intentionally allow the issues on 03
> and 05-11 to occur (while putting the core patches first), so
> to allow issues to be reproduced/regression tested one by one,
> as needed. It can be changed, of course, so to enable actual
> writeback changes in the last patch (when issues are patched.)
>
>
> Testing:
> -------
>
> This has been built and regression tested on next-20200417.
> (Also rebased and build tested on next-20200423 / "today").
>
> On xfstests (commit b2faf204) quick group (and excluding
> generic 430/431/434 which always hung): no regressions w/
> data=ordered (default) nor data=journal,journal_checksum.
>
> With data=ordered: (on both original and patched kernel)
>
> Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570
>
> With data=journal,journal_checksum: (likewise)
>
> Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570
>
> The test-case for the problem (and deadlocks) and further
> stress testing is stress-ng (with 512 workers on 16 vCPUs)
>
> $ sudo mount -o data=journal,journal_checksum $DEV $MNT
> $ cd $MNT
> $ sudo stress-ng --mmap 512 --mmap-file --timeout 1w
>
> To reproduce the problem (without patchset), run it a bit
> and crash the kernel (to cause unclean shutdown) w/ sysrq,
> and mount the device again (it should fail / need e2fsck):
>
> Original:
>
> [ 27.660063] JBD2: Invalid checksum recovering data block 79449 in log
> [ 27.792371] JBD2: recovery failed
> [ 27.792854] EXT4-fs (vdc): error loading journal
> mount: /tmp/ext4: can't read superblock on /dev/vdc.
>
> Patched:
>
> [ 33.111230] EXT4-fs (vdc): 512 orphan inodes deleted
> [ 33.111961] EXT4-fs (vdc): recovery complete
> [ 33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum
>
>
> RFC / Questions:
> ---------------
>
> 0) Usage of ext4_inode_info.i_datasync_tid for checks
>
> We rely on the struct ext4_inode_info.i_datasync_tid field
> (set by __ext4_journalled_writepage() and others) to check
> it against the running transaction. Of course, other sites
> set it too, and it might be that some of our checks return
> false positives then (should be fine, just less efficient.)
>
> To avoid such false positives, we could add another field
> to that structure, exclusively for this, but that is more
> 8 bytes (pointer) for inodes and even on non-data=journal
> cases.. so it didn't seem good enough reason, but if that
> is better/worth it for efficiency reasons (speed, in this
> case, vs. memory consumption) we could do it.
>
> Maybe there are other ideas/better ways to do it?
>
> 1) Usage of ext4_force_commit() in ext4_writepages()
>
> Patch 03 describes/fixes an issue where the underlying problem is,
> if __ext4_journalled_writepage() does set_page_writeback() but no
> journal commit is triggered, wait_on_page_writeback() may wait up
> to seconds until the periodic journal commit happens.
>
> The solution there, to fix the impact on msync(), is to just call
> ext4_force_commit() (as it's done anyway in ext4_sync_file()), on
> ext4_writepages().
>
> Is that a good enough solution? Other ideas?
>
> 2) Similar issue (unhandled) in ext4_writepage()
>
> The other, related question is, what about direct callers of
> ext4_writepage() that obviously do not use ext4_writepages() ?
> (e.g., pageout() and writeout(); write_one_page() not used.)
>
> Those are memory-cleasing writeback, which should not wait,
> however, as mentioned in that patch, if its writeback goes
> on for seconds and an data-integrity writeback/system call
> comes in, it is delayed/wait_on_page_writeback() that long.
>
> So, ideally, we should be trying to kick a journal commit?
>
> It looks like ext4_handle_sync() is not the answer, since
> it waits for commit to finish, and pageout() is called on
> a list of pages by shrinking. So, not effective to block
> on each one of them.
>
> We might not want to start anything right now, actually,
> since the memory-cleasing writeback can be happening on
> memory pressure scenarios, right? But would need to do
> something, to ensure that future wait_on_page_writeback()
> do not wait too long.
>
> Maybe the answer is something similar to jbd2 sync transaction
> batching (used by ext4_handle_sync()), but in *async* fashion,
> say, possibly implemented/batching in the jbd2 worker thread.
> Is that reasonable?
>
> ...
>
> Any comments/feedback/reviews are very appreciated.
>
> Thank you in advance,
> Mauricio
>
> [1] https://lore.kernel.org/linux-ext4/[email protected]/
>
> Mauricio Faria de Oliveira (11):
> ext4: data=journal: introduce struct/kmem_cache
> ext4_journalled_wb_page/_cachep
> ext4: data=journal: handle page writeback in
> __ext4_journalled_writepage()
> ext4: data=journal: call ext4_force_commit() in ext4_writepages() for
> msync()
> ext4: data=journal: introduce helpers for journalled writeback
> deadlock
> ext4: data=journal: prevent journalled writeback deadlock in
> __ext4_journalled_writepage()
> ext4: data=journal: prevent journalled writeback deadlock in
> ext4_write_begin()
> ext4: grab page before starting transaction handle in
> ext4_convert_inline_data_to_extent()
> ext4: data=journal: prevent journalled writeback deadlock in
> ext4_convert_inline_data_to_extent()
> ext4: grab page before starting transaction handle in
> ext4_try_to_write_inline_data()
> ext4: deduplicate code with error legs in
> ext4_try_to_write_inline_data()
> ext4: data=journal: prevent journalled writeback deadlock in
> ext4_try_to_write_inline_data()
>
> fs/ext4/ext4_jbd2.h | 88 +++++++++++++++++++++++++
> fs/ext4/inline.c | 153 +++++++++++++++++++++++++++++++-------------
> fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++++++++++--
> fs/ext4/page-io.c | 11 ++++
> 4 files changed, 341 insertions(+), 48 deletions(-)
>
> --
> 2.20.1
>


--
Mauricio Faria de Oliveira

2020-05-17 07:47:12

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

On May 15, 2020, at 12:39 PM, Mauricio Faria de Oliveira <[email protected]> wrote:
>
> Hi,
>
> On Thu, Apr 23, 2020 at 8:37 PM Mauricio Faria de Oliveira
> <[email protected]> wrote:
> [snip]
>> Summary:
>> -------
>>
>> The patchset is a bit long with 11 patches, but I tried to get
>> changes tiny to help with review, and better document how each
>> of them work, why and how this or that is done. It's RFC as I
>> would like to ask for suggestions/feedback, if at all possible.
>
> If at all possible, may this patchset have at least a cursory look?
>
> I'm aware it's been a busy period for some of you, so I just wanted
> to friendly ping on it, in case this got buried deep under other stuff.


Hi Mauricio,
thank you for submitting the patch. I've always thought data=journal
is a useful feature, especially for the future with host-managed SMR
(which seems is being added to drives whether we want it or not), and
hybrid use cases (NVMe journal, large HDD size which may also be SMR).

That said, diving into the ext4/jbd2 code on your first adventure here
is certainly ambitious and it is great that you have taken this fix on.

I'd say that Jan is the resident expert on jbd2, so hopefully he will
have a chance to look at this patch series.

Cheers, Andreas

>> Patch 01 and 02 implement the outlined fix, with a few changes
>> (fix first deadlock; use existing plumbing in jbd2 as the list.)
>>
>> Patch 03 fix a seconds-delay on msync().
>>
>> Patch 04 introduces helpers to handle the second deadlock.
>>
>> Patch 05-11 handle the second deadlock (three of these patches,
>> namely 07, 09 and 10 are changes not specific for data=journal,
>> affecting other journaling modes, so it's not on their subject.)
>>
>> The order of the patches intentionally allow the issues on 03
>> and 05-11 to occur (while putting the core patches first), so
>> to allow issues to be reproduced/regression tested one by one,
>> as needed. It can be changed, of course, so to enable actual
>> writeback changes in the last patch (when issues are patched.)
>>
>>
>> Testing:
>> -------
>>
>> This has been built and regression tested on next-20200417.
>> (Also rebased and build tested on next-20200423 / "today").
>>
>> On xfstests (commit b2faf204) quick group (and excluding
>> generic 430/431/434 which always hung): no regressions w/
>> data=ordered (default) nor data=journal,journal_checksum.
>>
>> With data=ordered: (on both original and patched kernel)
>>
>> Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570
>>
>> With data=journal,journal_checksum: (likewise)
>>
>> Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570
>>
>> The test-case for the problem (and deadlocks) and further
>> stress testing is stress-ng (with 512 workers on 16 vCPUs)
>>
>> $ sudo mount -o data=journal,journal_checksum $DEV $MNT
>> $ cd $MNT
>> $ sudo stress-ng --mmap 512 --mmap-file --timeout 1w
>>
>> To reproduce the problem (without patchset), run it a bit
>> and crash the kernel (to cause unclean shutdown) w/ sysrq,
>> and mount the device again (it should fail / need e2fsck):
>>
>> Original:
>>
>> [ 27.660063] JBD2: Invalid checksum recovering data block 79449 in log
>> [ 27.792371] JBD2: recovery failed
>> [ 27.792854] EXT4-fs (vdc): error loading journal
>> mount: /tmp/ext4: can't read superblock on /dev/vdc.
>>
>> Patched:
>>
>> [ 33.111230] EXT4-fs (vdc): 512 orphan inodes deleted
>> [ 33.111961] EXT4-fs (vdc): recovery complete
>> [ 33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum
>>
>>
>> RFC / Questions:
>> ---------------
>>
>> 0) Usage of ext4_inode_info.i_datasync_tid for checks
>>
>> We rely on the struct ext4_inode_info.i_datasync_tid field
>> (set by __ext4_journalled_writepage() and others) to check
>> it against the running transaction. Of course, other sites
>> set it too, and it might be that some of our checks return
>> false positives then (should be fine, just less efficient.)
>>
>> To avoid such false positives, we could add another field
>> to that structure, exclusively for this, but that is more
>> 8 bytes (pointer) for inodes and even on non-data=journal
>> cases.. so it didn't seem good enough reason, but if that
>> is better/worth it for efficiency reasons (speed, in this
>> case, vs. memory consumption) we could do it.
>>
>> Maybe there are other ideas/better ways to do it?
>>
>> 1) Usage of ext4_force_commit() in ext4_writepages()
>>
>> Patch 03 describes/fixes an issue where the underlying problem is,
>> if __ext4_journalled_writepage() does set_page_writeback() but no
>> journal commit is triggered, wait_on_page_writeback() may wait up
>> to seconds until the periodic journal commit happens.
>>
>> The solution there, to fix the impact on msync(), is to just call
>> ext4_force_commit() (as it's done anyway in ext4_sync_file()), on
>> ext4_writepages().
>>
>> Is that a good enough solution? Other ideas?
>>
>> 2) Similar issue (unhandled) in ext4_writepage()
>>
>> The other, related question is, what about direct callers of
>> ext4_writepage() that obviously do not use ext4_writepages() ?
>> (e.g., pageout() and writeout(); write_one_page() not used.)
>>
>> Those are memory-cleasing writeback, which should not wait,
>> however, as mentioned in that patch, if its writeback goes
>> on for seconds and an data-integrity writeback/system call
>> comes in, it is delayed/wait_on_page_writeback() that long.
>>
>> So, ideally, we should be trying to kick a journal commit?
>>
>> It looks like ext4_handle_sync() is not the answer, since
>> it waits for commit to finish, and pageout() is called on
>> a list of pages by shrinking. So, not effective to block
>> on each one of them.
>>
>> We might not want to start anything right now, actually,
>> since the memory-cleasing writeback can be happening on
>> memory pressure scenarios, right? But would need to do
>> something, to ensure that future wait_on_page_writeback()
>> do not wait too long.
>>
>> Maybe the answer is something similar to jbd2 sync transaction
>> batching (used by ext4_handle_sync()), but in *async* fashion,
>> say, possibly implemented/batching in the jbd2 worker thread.
>> Is that reasonable?
>>
>> ...
>>
>> Any comments/feedback/reviews are very appreciated.
>>
>> Thank you in advance,
>> Mauricio
>>
>> [1] https://lore.kernel.org/linux-ext4/[email protected]/
>>
>> Mauricio Faria de Oliveira (11):
>> ext4: data=journal: introduce struct/kmem_cache
>> ext4_journalled_wb_page/_cachep
>> ext4: data=journal: handle page writeback in
>> __ext4_journalled_writepage()
>> ext4: data=journal: call ext4_force_commit() in ext4_writepages() for
>> msync()
>> ext4: data=journal: introduce helpers for journalled writeback
>> deadlock
>> ext4: data=journal: prevent journalled writeback deadlock in
>> __ext4_journalled_writepage()
>> ext4: data=journal: prevent journalled writeback deadlock in
>> ext4_write_begin()
>> ext4: grab page before starting transaction handle in
>> ext4_convert_inline_data_to_extent()
>> ext4: data=journal: prevent journalled writeback deadlock in
>> ext4_convert_inline_data_to_extent()
>> ext4: grab page before starting transaction handle in
>> ext4_try_to_write_inline_data()
>> ext4: deduplicate code with error legs in
>> ext4_try_to_write_inline_data()
>> ext4: data=journal: prevent journalled writeback deadlock in
>> ext4_try_to_write_inline_data()
>>
>> fs/ext4/ext4_jbd2.h | 88 +++++++++++++++++++++++++
>> fs/ext4/inline.c | 153 +++++++++++++++++++++++++++++++-------------
>> fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++++++++++--
>> fs/ext4/page-io.c | 11 ++++
>> 4 files changed, 341 insertions(+), 48 deletions(-)
>>
>> --
>> 2.20.1
>>
>
>
> --
> Mauricio Faria de Oliveira


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-06-10 16:33:56

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

Hello!

Firstly, thanks for the patches and I'm sorry that it took me so long to
get to this.

On Thu 23-04-20 20:36:54, Mauricio Faria de Oliveira wrote:
> Ted Ts'o explains, in the linux-ext4 thread [1], that currently
> data=journal + journal_checksum + mmap is not handled correctly,
> and outlines the changes to fix it (+ items/breaks for clarity):
>
> """
> The fix is going to have to involve
> - fixing __ext4_journalled_writepage() to call set_page_writeback()
> before it unlocks the page,
> - adding a list of pages under data=journalled writeback
> which is attached to the transaction handle,
> - have the jbd2 commit hook call end_page_writeback()
> on all of these pages,
> - and then in the places where ext4 calls wait_for_stable_page()
> or grab_cache_page_write_begin(), we need to add:
>
> if (ext4_should_journal_data(inode))
> wait_on_page_writeback(page);
>
> It's all relatively straightforward except for the part
> where we have to attach a list of pages to the currently
> running transaction. That will require adding some
> plumbing into the jbd2 layer.
> """

So I was pondering about this general design for some time. First let me
state here how I see the problem you are hitting in data=journal mode:

The journalling machinery expects the buffer contents cannot change from
the moment transaction commit starts (more exactly from the moment
transaction exists LOCKED state) until the moment transaction commit
finishes. Places like jbd2_journal_get_write_access() are there exactly to
ascertain this - they copy the "to be committed" contents into a bounce
buffer (jh->b_frozen_data) or wait for commit to finish (if it's too late
for copying and the data is already in flight). data=journal mode breaks
this assumption because although ext4_page_mkwrite() calls
do_journal_get_write_access() for each page buffer (and thus makes sure
there's no commit with these buffers running at that moment), the commit
can start the moment we call ext4_journal_stop() in ext4_page_mkwrite() and
then this commit runs while buffers in the committing transaction are
writeably mapped to userspace.

However I don't think Ted's solution actually fully solves the above
problem. Think for example about the following scenario:

fd = open('file');
addr = mmap(fd);
addr[0] = 'a';
-> caused page fault, ext4_page_mkwrite() is called, page is
dirtied, all buffers are added to running transaction
pwrite(fd, buf, 1, 1);
-> page dirtied again, all buffer are dirtied in the running
transaction

jbd2 thread commits transaction now
-> the problem with committing buffers that are writeably mapped is still
there and your patches wouldn't solve it because
ext4_journalled_writepage() didn't get called at all.

Also, as you found out, leaving pages under writeback while we didn't even
start transaction commit that will end writeback on them is going to cause
odd stalls in various unexpected places.

So I'd suggest a somewhat different solution. Instead of trying to mark,
when page is under writeback, do it the other way around and make jbd2 kick
ext4 on transaction commit to writeprotect journalled pages. Then, because
of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
cannot be writeably mapped into page tables until transaction commit
finishes (or we'll copy data to commit to b_frozen_data).

Now let me explain a bit more the "make jbd2 kick ext4 on transaction
commit to writeprotect journalled pages" part. I think we could mostly
reuse the data=ordered machinery for that. data=ordered machinery tracks
with each running transaction also a list of inodes for which we need to
flush data pages before doing commit of metadata into the journal. Now we
could use the same mechanism for data=journal mode - we'd track in the
inode list inodes with writeably mapped pages and on transaction commit we
need to writeprotect these pages using clear_page_dirty_for_io(). Probably
the best place to add inode to this list is ext4_journalled_writepage().
To do the commit handling we probably need to introduce callbacks that jbd2
will call instead of journal_submit_inode_data_buffers() in
journal_submit_data_buffers() and instead of
filemap_fdatawait_range_keep_errors() in
journal_finish_inode_data_buffers(). In data=ordered mode ext4 will just do
what jbd2 does in its callback, when inode's data should be journalled, the
callback will use write_cache_pages() to iterate and writeprotect all dirty
pages. The writepage callback just returns AOP_WRITEPAGE_ACTIVATE, some
care needs to be taken to redirty a page in the writepage callback if not
all its underlying buffers are part of the committing transaction or if
some buffer already has b_next_transaction set (since in that case the page
was already dirtied also against the following transaction). But it should
all be reasonably doable.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

Hi Jan,

On Wed, Jun 10, 2020 at 10:21 AM Jan Kara <[email protected]> wrote:
>
> Hello!
>
> Firstly, thanks for the patches and I'm sorry that it took me so long to
> get to this.
>

No worries at all. Thank you very much for reviewing, and explaining your
understanding of the problem and why this patchset doesn't fully address it.

I believe I understood the rationale and the pieces involved in the solution
you suggested (thanks for the level of detail), and should be able to work
on it and send something within a few weeks.

cheers,
Mauricio

> On Thu 23-04-20 20:36:54, Mauricio Faria de Oliveira wrote:
> > Ted Ts'o explains, in the linux-ext4 thread [1], that currently
> > data=journal + journal_checksum + mmap is not handled correctly,
> > and outlines the changes to fix it (+ items/breaks for clarity):
> >
> > """
> > The fix is going to have to involve
> > - fixing __ext4_journalled_writepage() to call set_page_writeback()
> > before it unlocks the page,
> > - adding a list of pages under data=journalled writeback
> > which is attached to the transaction handle,
> > - have the jbd2 commit hook call end_page_writeback()
> > on all of these pages,
> > - and then in the places where ext4 calls wait_for_stable_page()
> > or grab_cache_page_write_begin(), we need to add:
> >
> > if (ext4_should_journal_data(inode))
> > wait_on_page_writeback(page);
> >
> > It's all relatively straightforward except for the part
> > where we have to attach a list of pages to the currently
> > running transaction. That will require adding some
> > plumbing into the jbd2 layer.
> > """
>
> So I was pondering about this general design for some time. First let me
> state here how I see the problem you are hitting in data=journal mode:
>
> The journalling machinery expects the buffer contents cannot change from
> the moment transaction commit starts (more exactly from the moment
> transaction exists LOCKED state) until the moment transaction commit
> finishes. Places like jbd2_journal_get_write_access() are there exactly to
> ascertain this - they copy the "to be committed" contents into a bounce
> buffer (jh->b_frozen_data) or wait for commit to finish (if it's too late
> for copying and the data is already in flight). data=journal mode breaks
> this assumption because although ext4_page_mkwrite() calls
> do_journal_get_write_access() for each page buffer (and thus makes sure
> there's no commit with these buffers running at that moment), the commit
> can start the moment we call ext4_journal_stop() in ext4_page_mkwrite() and
> then this commit runs while buffers in the committing transaction are
> writeably mapped to userspace.
>
> However I don't think Ted's solution actually fully solves the above
> problem. Think for example about the following scenario:
>
> fd = open('file');
> addr = mmap(fd);
> addr[0] = 'a';
> -> caused page fault, ext4_page_mkwrite() is called, page is
> dirtied, all buffers are added to running transaction
> pwrite(fd, buf, 1, 1);
> -> page dirtied again, all buffer are dirtied in the running
> transaction
>
> jbd2 thread commits transaction now
> -> the problem with committing buffers that are writeably mapped is still
> there and your patches wouldn't solve it because
> ext4_journalled_writepage() didn't get called at all.
>
> Also, as you found out, leaving pages under writeback while we didn't even
> start transaction commit that will end writeback on them is going to cause
> odd stalls in various unexpected places.
>
> So I'd suggest a somewhat different solution. Instead of trying to mark,
> when page is under writeback, do it the other way around and make jbd2 kick
> ext4 on transaction commit to writeprotect journalled pages. Then, because
> of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
> cannot be writeably mapped into page tables until transaction commit
> finishes (or we'll copy data to commit to b_frozen_data).
>
> Now let me explain a bit more the "make jbd2 kick ext4 on transaction
> commit to writeprotect journalled pages" part. I think we could mostly
> reuse the data=ordered machinery for that. data=ordered machinery tracks
> with each running transaction also a list of inodes for which we need to
> flush data pages before doing commit of metadata into the journal. Now we
> could use the same mechanism for data=journal mode - we'd track in the
> inode list inodes with writeably mapped pages and on transaction commit we
> need to writeprotect these pages using clear_page_dirty_for_io(). Probably
> the best place to add inode to this list is ext4_journalled_writepage().
> To do the commit handling we probably need to introduce callbacks that jbd2
> will call instead of journal_submit_inode_data_buffers() in
> journal_submit_data_buffers() and instead of
> filemap_fdatawait_range_keep_errors() in
> journal_finish_inode_data_buffers(). In data=ordered mode ext4 will just do
> what jbd2 does in its callback, when inode's data should be journalled, the
> callback will use write_cache_pages() to iterate and writeprotect all dirty
> pages. The writepage callback just returns AOP_WRITEPAGE_ACTIVATE, some
> care needs to be taken to redirty a page in the writepage callback if not
> all its underlying buffers are part of the committing transaction or if
> some buffer already has b_next_transaction set (since in that case the page
> was already dirtied also against the following transaction). But it should
> all be reasonably doable.
>
> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



--
Mauricio Faria de Oliveira