Subject: [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit

Hey Jan,

This series implements your suggestions for the RFC PATCH v2 set,
which we mostly talked about in cover letter [1] and PATCH 3 [2].
(I added Suggested-by: tags, by the way, for due credit.)

It looks almost good on fstests: zero regressions on data=ordered,
and only one regression in data=journal (generic/347); I'll check.
(That's with ext4; I'll check ocfs2, but it's only a minor change.)

However, there's an issue I have to check with you about, that we
exposed from the original kernel. It's described below, but other
than this I _guess_ this should be close if you don't spot errors.

Thanks!
Mauricio

...

Testing with 'stress-ng --mmap <N> --mmap-file' runs well for days,
but it occasionally hits:

JBD2: Spotted dirty metadata buffer (dev = vdc, blocknr = 74775).
There's a risk of filesystem corruption in case of system crash.

I added dump_stack() in warn_dirty_buffer(), and it usually comes
from jbd2_journal_file_buffer(, BJ_Forget) in the commit function.
When filing from BJ_Shadow to BJ_Forget.. so it possibly happened
while BH_Shadow was still set!

So I instrumented [test_]set_buffer_dirty() macros to dump_stack()
if BH_Shadow is set (i.e. buffer being set dirty during write-out.)

This showed that the occasional BH_Dirty setter with BH_Shadow set
is block_page_mkwrite() in ext4_page_mkwrite(). And it seems right,
because it's called before do_journal_get_write_access() (where we
check for/wait on BH_Shadow.)

ext4_page_mkwrite():

err = block_page_mkwrite(vma, vmf, get_block);
if (!err && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
PAGE_SIZE, NULL, do_journal_get_write_access)) {

The patches didn't directly break this, they only allow this code
to run more often as we disabled an early-return optimization for
the case 'all buffers mapped' in data=journal (question 2 in [1]):

ext4_page_mkwrite():

* Return if we have all the buffers mapped.
...
- if (page_has_buffers(page)) {
+ if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {


In order to confirm it, I built the unpatched v5.9-rc4 kernel, with
just the change to disable that optimization in data=journal -- and
it hit that occasionally too ("JBD2: Spotted dirty metadata buffer".)

I was naive enough to mindlessly try to swap the order of those two
statements, in hopes that do_journal_get_write_access() should wait
for BH_Shadow to clear, and then we just call block_page_mkwrite().
BUT it trips into the BUG() check in page_buffers() in the former.

I still have to dig more about it, but if you have something quick
in mind, I'd appreciate any comments/feedback about it, if it gets
more complex than I can see.

Thanks again!

[1] https://lore.kernel.org/linux-ext4/[email protected]/
[2] https://lore.kernel.org/linux-ext4/[email protected]/

Mauricio Faria de Oliveira (3):
jbd2: introduce/export functions
jbd2_journal_submit|finish_inode_data_buffers()
jbd2, ext4, ocfs2: introduce/use journal callbacks
j_submit|finish_inode_data_buffers()
ext4: data=journal: write-protect pages on
j_submit_inode_data_buffers()

fs/ext4/inode.c | 29 +++++++++++-----
fs/ext4/super.c | 82 ++++++++++++++++++++++++++++++++++++++++++++
fs/jbd2/commit.c | 58 +++++++++++++++++--------------
fs/jbd2/journal.c | 2 ++
fs/ocfs2/super.c | 15 ++++++++
include/linux/jbd2.h | 29 +++++++++++++++-
6 files changed, 181 insertions(+), 34 deletions(-)

--
2.17.1


Subject: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()

Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().

The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.

Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 14 ++++++++++++++
fs/jbd2/commit.c | 30 ++++++++++++++++++------------
fs/ocfs2/super.c | 15 +++++++++++++++
include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
4 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ea425b49b345..7303839d7ad9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
spin_unlock(&sbi->s_md_lock);
}

+static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
static bool system_going_down(void)
{
return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4646,6 +4656,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ sbi->s_journal->j_submit_inode_data_buffers =
+ ext4_journal_submit_inode_data_buffers;
+ sbi->s_journal->j_finish_inode_data_buffers =
+ ext4_journal_finish_inode_data_buffers;

no_journal:
if (!test_opt(sb, NO_MBCACHE)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c17cda96926e..23d3fcc11b97 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -200,6 +200,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
.range_end = dirty_end,
};

+ /*
+ * submit the inode data buffers. We use writepage
+ * instead of writepages. Because writepages can do
+ * block allocation with delalloc. We need to write
+ * only allocated blocks here.
+ */
ret = generic_writepages(mapping, &wbc);
return ret;
}
@@ -224,16 +230,13 @@ static int journal_submit_data_buffers(journal_t *journal,
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
- /*
- * submit the inode data buffers. We use writepage
- * instead of writepages. Because writepages can do
- * block allocation with delalloc. We need to write
- * only allocated blocks here.
- */
+ /* submit the inode data buffers. */
trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
- err = jbd2_journal_submit_inode_data_buffers(jinode);
- if (!ret)
- ret = err;
+ if (journal->j_submit_inode_data_buffers) {
+ err = journal->j_submit_inode_data_buffers(jinode);
+ if (!ret)
+ ret = err;
+ }
spin_lock(&journal->j_list_lock);
J_ASSERT(jinode->i_transaction == commit_transaction);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
@@ -273,9 +276,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
- err = jbd2_journal_finish_inode_data_buffers(jinode);
- if (!ret)
- ret = err;
+ /* wait for the inode data buffers writeout. */
+ if (journal->j_finish_inode_data_buffers) {
+ err = journal->j_finish_inode_data_buffers(jinode);
+ if (!ret)
+ ret = err;
+ }
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
smp_mb();
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 1d91dd1e8711..f4e62aafc89c 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2010,6 +2010,16 @@ static int ocfs2_journal_addressable(struct ocfs2_super *osb)
return status;
}

+static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ocfs2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
static int ocfs2_initialize_super(struct super_block *sb,
struct buffer_head *bh,
int sector_size,
@@ -2211,6 +2221,11 @@ static int ocfs2_initialize_super(struct super_block *sb,
}
osb->journal = journal;
journal->j_osb = osb;
+ journal->j_journal->j_submit_inode_data_buffers =
+ ocfs2_journal_submit_inode_data_buffers;
+ journal->j_journal->j_finish_inode_data_buffers =
+ ocfs2_journal_finish_inode_data_buffers;
+

atomic_set(&journal->j_num_trans, 0);
init_rwsem(&journal->j_trans_barrier);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2865a5475888..4aaa408c0ca7 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,9 @@ struct transaction_s
struct journal_head *t_shadow_list;

/*
- * List of inodes whose data we've modified in data=ordered mode.
+ * List of inodes associated with the transaction; e.g., ext4 uses
+ * this to track inodes in data=ordered and data=journal mode that
+ * need special handling on transaction commit; also used by ocfs2.
* [j_list_lock]
*/
struct list_head t_inode_list;
@@ -1111,6 +1113,27 @@ struct journal_s
void (*j_commit_callback)(journal_t *,
transaction_t *);

+ /**
+ * @j_submit_inode_data_buffers:
+ *
+ * This function is called for all inodes associated with the
+ * committing transaction marked with JI_WRITE_DATA flag
+ * before we start to write out the transaction to the journal.
+ */
+ int (*j_submit_inode_data_buffers)
+ (struct jbd2_inode *);
+
+ /**
+ * @j_finish_inode_data_buffers:
+ *
+ * This function is called for all inodes associated with the
+ * committing transaction marked with JI_WAIT_DATA flag
+ * after we have written the transaction to the journal
+ * but before we write out the commit block.
+ */
+ int (*j_finish_inode_data_buffers)
+ (struct jbd2_inode *);
+
/*
* Journal statistics
*/
--
2.17.1

2020-09-16 20:53:46

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()

On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> Introduce journal callbacks to allow different behaviors
> for an inode in journal_submit|finish_inode_data_buffers().
>
> The existing users of the current behavior (ext4, ocfs2)
> are adapted to use the previously exported functions
> that implement the current behavior.
>
> Users are callers of jbd2_journal_inode_ranged_write|wait(),
> which adds the inode to the transaction's inode list with
> the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
>
> Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> ---
> fs/ext4/super.c | 14 ++++++++++++++
> fs/jbd2/commit.c | 30 ++++++++++++++++++------------
> fs/ocfs2/super.c | 15 +++++++++++++++
> include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
> 4 files changed, 71 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ea425b49b345..7303839d7ad9 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
> spin_unlock(&sbi->s_md_lock);
> }
>
> +static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> + return jbd2_journal_submit_inode_data_buffers(jinode);
> +}
> +
> +static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> + return jbd2_journal_finish_inode_data_buffers(jinode);
> +}
> +

No need for these ext4 wrappers. They just obfuscate code... Ditto for
ocfs2 below.

> @@ -1111,6 +1113,27 @@ struct journal_s
> void (*j_commit_callback)(journal_t *,
> transaction_t *);
>
> + /**
> + * @j_submit_inode_data_buffers:
> + *
> + * This function is called for all inodes associated with the
> + * committing transaction marked with JI_WRITE_DATA flag
> + * before we start to write out the transaction to the journal.
> + */
> + int (*j_submit_inode_data_buffers)
> + (struct jbd2_inode *);
> +
> + /**
> + * @j_finish_inode_data_buffers:
> + *
> + * This function is called for all inodes associated with the
> + * committing transaction marked with JI_WAIT_DATA flag
> + * after we have written the transaction to the journal
> + * but before we write out the commit block.
> + */
> + int (*j_finish_inode_data_buffers)
> + (struct jbd2_inode *);
> +

Having the callbacks in the journal_s will not work if we have inodes with
data journalling on a filesystem mounted in data=ordered mode. The
callbacks really need to be a per-inode thing so I'd add them to struct
jbd2_inode.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-16 20:59:23

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()

On Wed 16-09-20 18:22:40, Jan Kara wrote:
> On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> > @@ -1111,6 +1113,27 @@ struct journal_s
> > void (*j_commit_callback)(journal_t *,
> > transaction_t *);
> >
> > + /**
> > + * @j_submit_inode_data_buffers:
> > + *
> > + * This function is called for all inodes associated with the
> > + * committing transaction marked with JI_WRITE_DATA flag
> > + * before we start to write out the transaction to the journal.
> > + */
> > + int (*j_submit_inode_data_buffers)
> > + (struct jbd2_inode *);
> > +
> > + /**
> > + * @j_finish_inode_data_buffers:
> > + *
> > + * This function is called for all inodes associated with the
> > + * committing transaction marked with JI_WAIT_DATA flag
> > + * after we have written the transaction to the journal
> > + * but before we write out the commit block.
> > + */
> > + int (*j_finish_inode_data_buffers)
> > + (struct jbd2_inode *);
> > +
>
> Having the callbacks in the journal_s will not work if we have inodes with
> data journalling on a filesystem mounted in data=ordered mode. The
> callbacks really need to be a per-inode thing so I'd add them to struct
> jbd2_inode.

Oh, now I see that you properly handle this in the callbacks. So I retract
this objection. So feel free to add:

Reviewed-by: Jan Kara <[email protected]>

after removing those pointless wrappers.

Hoonza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-16 21:03:19

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit

Hi Mauricio!

On Thu 10-09-20 16:31:24, Mauricio Faria de Oliveira wrote:
> This series implements your suggestions for the RFC PATCH v2 set,
> which we mostly talked about in cover letter [1] and PATCH 3 [2].
> (I added Suggested-by: tags, by the way, for due credit.)
>
> It looks almost good on fstests: zero regressions on data=ordered,
> and only one regression in data=journal (generic/347); I'll check.
> (That's with ext4; I'll check ocfs2, but it's only a minor change.)

Glad to hear that!

> Testing with 'stress-ng --mmap <N> --mmap-file' runs well for days,
> but it occasionally hits:
>
> JBD2: Spotted dirty metadata buffer (dev = vdc, blocknr = 74775).
> There's a risk of filesystem corruption in case of system crash.
>
> I added dump_stack() in warn_dirty_buffer(), and it usually comes
> from jbd2_journal_file_buffer(, BJ_Forget) in the commit function.
> When filing from BJ_Shadow to BJ_Forget.. so it possibly happened
> while BH_Shadow was still set!
>
> So I instrumented [test_]set_buffer_dirty() macros to dump_stack()
> if BH_Shadow is set (i.e. buffer being set dirty during write-out.)
>
> This showed that the occasional BH_Dirty setter with BH_Shadow set
> is block_page_mkwrite() in ext4_page_mkwrite(). And it seems right,
> because it's called before do_journal_get_write_access() (where we
> check for/wait on BH_Shadow.)
>
> ext4_page_mkwrite():
>
> err = block_page_mkwrite(vma, vmf, get_block);
> if (!err && ext4_should_journal_data(inode)) {
> if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> PAGE_SIZE, NULL, do_journal_get_write_access)) {
>
> The patches didn't directly break this, they only allow this code
> to run more often as we disabled an early-return optimization for
> the case 'all buffers mapped' in data=journal (question 2 in [1]):
>
> ext4_page_mkwrite():
>
> * Return if we have all the buffers mapped.
> ...
> - if (page_has_buffers(page)) {
> + if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
>
>
> In order to confirm it, I built the unpatched v5.9-rc4 kernel, with
> just the change to disable that optimization in data=journal -- and
> it hit that occasionally too ("JBD2: Spotted dirty metadata buffer".)
>
> I was naive enough to mindlessly try to swap the order of those two
> statements, in hopes that do_journal_get_write_access() should wait
> for BH_Shadow to clear, and then we just call block_page_mkwrite().
> BUT it trips into the BUG() check in page_buffers() in the former.

Yeah, you're right that code is wrong for data=journal case. We cannot
really use block_page_mkwrite() for it - we rather need to use there
something like:

__block_write_begin(page, pos, len, ext4_get_block);
ext4_walk_page_buffers(handle, page_buffers(page),
from, to, NULL,
do_journal_get_write_access);
ext4_walk_page_buffers(handle, page_buffers(page), from, to, NULL,
write_end_fn);

or something like that, I guess you get the idea...

Actually, I think data=journal mode should get it's own .page_mkwrite
handler because the sharing between data=journal handling and the other
cases is pretty minimal.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR