Subject: [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit

Hi Jan,

Upfront, thanks again for taking the time about a month ago to provide
your understanding of this problem and suggestions on how to implement
this fix [1].

I'd like to ask for your comments on the patchset so far, and have a few questions.

First, there is a test case to reproduce the issue:

Essentially, it enlarges the race window during journal commit, between
buffer checksum calculation and commit to disk, for userspace to modify
mmap()ed buffers, then panic afterwards. Later, journal recovery fails
due to invalid checksum.

Patch 1 is the kernel hack to allow it.
The last 2 files in the series are the a setup script to create/mount
a loopdev w/ ext4 data=journal filesystem, and the C test case for it.

Test is essentially:

fd = open("file");
addr = mmap(fd);
pwrite(fd, "a", 1, 0);
addr[0] = 'a';
press_enter(); // when asked for.
addr[0] = 'b';

Usage:

$ gcc -o test test.c
$ ./ext4.sh && cd /ext4 && ~/test

Press enter to change buffer contents
# wait for periodic journal commit
[ 26.286197] TESTCASE: Cookie found. Waiting 5 seconds for changes.
# now press enter

Buffer contents changed
Press enter to change buffer contents
# just wait
[ 31.370558] TESTCASE: Cookie eaten. Resumed.
[ 31.375906] Kernel panic - not syncing: TESTCASE: checksum changed; commit record done; panic!
[ 31.379799] CPU: 1 PID: 649 Comm: jbd2/loop0-8 Not tainted 5.8.0.mfo.1 #111
...
[ 31.400643] Rebooting in 5 seconds..

$ ./ext4.sh mount
...
+ sudo mount -o data=journal /dev/loop0 /ext4
[ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
[ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log
[ 27.342716] JBD2: recovery failed
[ 27.343316] EXT4-fs (loop0): error loading journal
mount: /ext4: can't read superblock on /dev/loop0.


Second, this is a recap of your solution implementation suggestion [1], with line breaks added for clarity:

"""
So I'd suggest a somewhat different solution.

Instead of trying to mark, when page is under writeback, do it the other way around
and make jbd2 kick ext4 on transaction commit to writeprotect journalled pages.

Then, because of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
cannot be writeably mapped into page tables until transaction commit finishes
(or we'll copy data to commit to b_frozen_data).

Now let me explain a bit more the "make jbd2 kick ext4 on transaction
commit to writeprotect journalled pages" part.

I think we could mostly reuse the data=ordered machinery for that.

data=ordered machinery tracks with each running transaction also a list of inodes
for which we need to flush data pages before doing commit of metadata into the journal.

Now we could use the same mechanism for data=journal mode -
we'd track in the inode list inodes with writeably mapped pages
and on transaction commit we need to writeprotect these pages using clear_page_dirty_for_io().

Probably the best place to add inode to this list is ext4_journalled_writepage().

To do the commit handling we probably need to introduce callbacks that jbd2 will call
instead of journal_submit_inode_data_buffers() in journal_submit_data_buffers()
and instead of filemap_fdatawait_range_keep_errors() in journal_finish_inode_data_buffers().

In data=ordered mode ext4 will just do what jbd2 does in its callback,

when inode's data should be journalled, the callback will
use write_cache_pages() to iterate and writeprotect all dirty pages.

The writepage callback just returns AOP_WRITEPAGE_ACTIVATE,

some care needs to be taken to redirty a page in the writepage callback
A) if not all its underlying buffers are part of the committing transaction
B) or if some buffer already has b_next_transaction set
(since in that case the page was already dirtied also against the following transaction).

But it should all be reasonably doable.
"""

[1] https://lore.kernel.org/linux-ext4/[email protected]/

If I understood it correctly:
- Patch 2 implements the callback mechanism.
- Patch 3 implements the write-protect code.
- Patch 4 has changes to ext4_page_mkwrite() discussed below.
- Patch 5 introduces debug messages for RFC discussion only.

Third, my changes/observations/questions:

1) > Probably the best place to add inode to this list is ext4_journalled_writepage().

I think we also need it in ext4_page_mkwrite(), right? See the test case, for example:

fd = open("file");
addr = mmap(fd);
pwrite(fd, "a", 1, 0);
addr[0] = 'a';
press_enter(); // when asked for.
addr[0] = 'b';

Here pwrite() led to ext4_write_begin() which added the inode to a transaction (not to the transaction's inode list).

Now the jbd2 thread commits the transaction, before any writeback work or msync() call.
So __ext4_journalled_writepage() didn't get called; the page is still writeably mapped.

And it could be changed / problem reproduced with the test case.
Added some debug messages and line breaks for clarity:

$ cd /ext4 && ~/test
<...>
[ 20.668889] TESTDBG: ext4_write_begin() :: journal started: inode ffff8f72eda5fca8, txn ffff8f72f5ea0300
[ 20.670593] TESTDBG: do_get_write_access() :: entry for bh: ffff8f72edae2d68, offset: 0000, page ffffe50288a0a200, inode: ffff8f72eda5fca8
[ 20.672018] TESTDBG: ext4_journalled_write_end() :: journal stopped: inode ffff8f72eda5fca8

[ 20.673769] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff8f72eda5fca8
[ 20.675016] TESTDBG: ext4_bh_tdbg() :: bh: ffff8f72edae2d68, data offset: 0000, page: ffffe50288a0a200, inode: ffff8f72eda5fca8
[ 20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8

Press enter to change buffer contents
[ 26.250842] TESTDBG: journal_submit_data_buffers() :: entry for transaction: 0xffff8f72f5ea0300
<...>
[ 26.283521] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff8f72edae2d68, offset: 0000, page: ffffe50288a0a200, inode: ffff8f72eda5fca8
[ 26.286197] TESTCASE: Cookie found. Waiting 5 seconds for changes.
# Press enter

Buffer contents changed
Press enter to change buffer contents
[ 31.370558] TESTCASE: Cookie eaten. Resumed.
[ 31.375906] Kernel panic - not syncing: TESTCASE: checksum changed; commit record done; panic!

So, we should have ext4_page_mkwrite() add the inode to the transaction's inode list to fix this case, IIUIC.


2) Do not return early in ext4_page_mkwrite() 'if we have all the buffers mapped'?

There's this check in ext4_page_mkwrite():

/*
* Return if we have all the buffers mapped. This avoids the need to do
* journal_start/journal_stop which can block and take a long time
*/

Corresponding to this line in the debug messages:

[ 20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8

That returns early, *before* do_journal_get_write_access(), which is needed in data=journal,
regardless of whether all buffers are mapped: we could exit ext4_page_mkwrite() then change
buffers under commit during the race window.

So, it seems we should ignore that for data=journal, which allows us to ext4_journal_start()
and get a transaction handle, used to add the inode to the transaction's inode list (per #1.)

This is Patch 6.

With the changes from 1) and 2), the inode is added to the transaction's inode list,
and the callback does write-protect the page correctly.

When trying to change the buffer contents later (pressing enter) there _is_ a wait in
ext4_page_mkwrite() as expected! But it's first at the file_update_time() because it
calls do_get_write_access() for that inode, which is also in the transaction.

I presume that a commit can start after file_update_time() and before ext4_journal_start()
in ext4_page_mkwrite(), so we'd still have to keep these changes to ext4_page_mkwrite() ?

$ cd /ext4 && ~/test
<...>
[ 142.264078] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35fa90, offset: 0000, page ffffd06548c1c400, inode: ffff98e136849118

[ 142.266110] TESTDBG: ext4_write_begin() :: journal started: inode ffff98e12f2c0508, txn ffff98e12f1ec200
[ 142.268063] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35f6e8, offset: 0000, page ffffd06548ccc100, inode: ffff98e12f2c0508
[ 142.269728] TESTDBG: ext4_journalled_write_end() :: journal stopped: inode ffff98e12f2c0508

[ 142.271894] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff98e12f2c0508
[ 142.273498] TESTDBG: ext4_bh_tdbg() :: bh: ffff98e12f35f6e8, data offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
[ 142.276779] TESTDBG: ext4_page_mkwrite() :: before djgwa(), for inode ffff98e12f2c0508
[ 142.276781] TESTDBG: ext4_page_mkwrite() :: after djgwa(), for inode ffff98e12f2c0508
[ 142.278273] TESTDBG: ext4_page_mkwrite() :: Added inode to txn list: inode ffff98e12f2c0508, txn = ffff98e12f1ec200, err = 0
Press enter to change buffer contents
# Waiting...
[ 148.140148] TESTDBG: journal_submit_data_buffers() :: entry for transaction: 0xffff98e12f1ec200
[ 148.145770] TESTDBG: journal_submit_data_buffers() :: txn list has inode ffff98e12f2c0508 (write data flag: 0x2)
[ 148.148453] TESTDBG: ext4_journalled_submit_inode_data_buffers() :: entry for inode: ffff98e12f2c0508
[ 148.148462] TESTDBG: ext4_journalled_writepage_callback() :: entry for bh ffff98e12f35f6e8, page ffffd06548ccc100, inode: ffff98e12f2c0508
[ 148.150920] TESTDBG: ext4_journalled_writepage_callback() :: redirty for bh ffff98e12f35f680, jh, 0000000000000000, txn 0000000000000000, next_txn 0000000000000000
<...>
[ 148.168741] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff98e12f35fa90, offset: 0000, page: ffffd06548c1c400, inode: ffff98e136849118
[ 148.170897] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff98e12f35f6e8, offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
[ 148.173067] TESTCASE: Cookie found. Waiting 5 seconds for changes.
# Press enter.

[ 149.553705] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff98e12f2c0508
[ 149.553719] TESTDBG: do_get_write_access() :: entry for bh: ffff98e132905d00, offset: 0000, page ffffd06548c7f580, inode: ffff98e136849118
[ 153.259932] TESTCASE: Cookie eaten. Resumed.
# Note that dgwa() above waited the 5 seconds to move on, but for another inode (ext4_page_mkwrite() -> file_update_time()), not the file's inode

[ 153.264679] TESTDBG: ext4_journalled_finish_inode_data_buffers() :: entry for inode: ffff98e12f2c0508
[ 153.264885] TESTDBG: do_get_write_access() :: entry for bh: ffff98e132905d00, offset: 0000, page ffffd06548c7f580, inode: ffff98e136849118
[ 153.267647] TESTDBG: ext4_bh_tdbg() :: bh: ffff98e12f35f6e8, data offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
[ 153.275150] TESTDBG: ext4_page_mkwrite() :: before djgwa(), for inode ffff98e12f2c0508
[ 153.275153] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35f6e8, offset: 0000, page ffffd06548ccc100, inode: ffff98e12f2c0508
[ 153.277669] TESTDBG: ext4_page_mkwrite() :: after djgwa(), for inode ffff98e12f2c0508
Buffer contents changed
Press enter to change buffer contents
[ 153.280907] TESTDBG: ext4_page_mkwrite() :: Added inode to txn list: inode ffff98e12f2c0508, txn = ffff98e135db5d00, err = 0


3) When checking to redirty the page in the writepage callback,
does a buffer without a journal head means we should redirty
the page? (for the reason it's not part of the committing txn)

For example, in this case, the writepage callback is called,
checks the first buffer head is part of the committing txn,
but the next one has no journal head, so it's not part of any
transaction (including the committing txn, of course) so it
does redirty the page.

[ 148.148462] TESTDBG: ext4_journalled_writepage_callback() :: entry for bh ffff98e12f35f6e8, page ffffd06548ccc100, inode: ffff98e12f2c0508
[ 148.150920] TESTDBG: ext4_journalled_writepage_callback() :: redirty for bh ffff98e12f35f680, jh, 0000000000000000, txn 0000000000000000, next_txn 0000000000000000

4) Should we clear the PageChecked bit?

We don't clear the PageChecked bit, thus later when the writeback
work kicks in, __ext4_journalled_writepage() adds the page to the
transaction's inode list again.

Since the page is clean, it just goes into the submit data buffers
callback, but write_cache_pages() returns before the writepage callback.

Should we try to provent that by, say, clearing the pagechecked bit
in case we don't have to redirty the page (in the writepage callback) ?

5) Inodes with inline data. I haven't checked in detail yet, but
would appreciate if you have a brief explanation about them, and
if they need special handling since they apparently don't get
do_journal_get_write_access() called (eg, see __ext4_journalled_writepage()).

Thank you very much,
Mauricio

Mauricio Faria de Oliveira (5):
jbd2: test case for ext4 data=journal/mmap() journal corruption
jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
ext4: data=journal: write-protect pages on submit inode data buffers
callback
ext4: data=journal: add inode to transaction inode list in
ext4_page_mkwrite()
ext4/jbd2: debugging messages

fs/ext4/inode.c | 42 ++++++++++++++++++++++--
fs/ext4/super.c | 75 +++++++++++++++++++++++++++++++++++++++++++
fs/jbd2/commit.c | 55 ++++++++++++++++++++++++++++---
fs/jbd2/journal.c | 5 +++
fs/jbd2/transaction.c | 4 +++
include/linux/jbd2.h | 21 +++++++++++-
6 files changed, 194 insertions(+), 8 deletions(-)

--
2.17.1


Subject: [RFC PATCH v2/TEST CASE]

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/mman.h>

int main() {
int fd, rc;
char *addr;
const int PAGE_SIZE = sysconf(_SC_PAGESIZE);

rc = unlink("file");
if (rc < 0 && errno != ENOENT ) {
perror("unlink");
return 1;
}

fd = open("file", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
if (fd < 0) {
perror("open");
return 1;
}

addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (addr < 0) {
perror("mmap");
return 1;
}

rc = pwrite(fd, "a", 1, 0);
if (rc < 0) {
perror("pwrite");
close(fd);
return 1;
}

addr[0] = 'B';
addr[1] = 'U';
addr[2] = 'G';

while (1) {
printf("Press enter to change buffer contents\n");
getchar();
addr[3]++;
printf("Buffer contents changed\n");
}

// This is not reached.
rc = munmap(addr, PAGE_SIZE);
if (rc < 0) {
perror("munmap");
return 1;
}

close(fd);
return 0;
}

Subject: [RFC PATCH v2 4/5] ext4: data=journal: add inode to transaction inode list in ext4_page_mkwrite()

Since we only add the inode to the transaction's inode list in
__ext4_journalled_writepage(), we depend on msync() or writeback work
(which call it) for the write-protect mechanism to work.

This test snippet shows that, as pwrite() gets the inode into a
transaction (!= than into transaction's inode list), and addr[]
write access gets the page writeably mapped.

fd = open("file");
addr = mmap(fd);
pwrite(fd, "a", 1, 0); // journals inode via ext4_write_begin()
addr[0] = 'a'; // page is writeably mapped to user space.
// periodic journal commit / jbd2 thread runs now.
// __ext4_journalled_writepage() was not called yet.

Now it's possible for a subsequent addr[] write access to race
with the commit function, and possibly hit the window to cause
invalid checksums.
---
fs/ext4/inode.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 978ccde8454f..ce5464f92a7e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6008,9 +6008,10 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
len = PAGE_SIZE;
/*
* Return if we have all the buffers mapped. This avoids the need to do
- * journal_start/journal_stop which can block and take a long time
+ * journal_start/journal_stop which can block and take a long time. But
+ * not on data journalling, as we have to add the inode to the txn list.
*/
- if (page_has_buffers(page)) {
+ if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
if (!ext4_walk_page_buffers(NULL, page_buffers(page),
0, len, NULL,
ext4_bh_unmapped)) {
@@ -6043,6 +6044,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
goto out;
}
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+ if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE)) {
+ unlock_page(page);
+ ret = VM_FAULT_SIGBUS;
+ ext4_journal_stop(handle);
+ goto out;
+ }
}
ext4_journal_stop(handle);
if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
--
2.17.1

2020-08-19 09:30:47

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit

Hi Mauricio!

On Sun 09-08-20 22:02:03, Mauricio Faria de Oliveira wrote:
> I'd like to ask for your comments on the patchset so far, and have a few
> questions.

Thanks for taking time to write these patches! I have commented on the
patches and now let's address the remaining questions :)

> Third, my changes/observations/questions:
>
> 1) > Probably the best place to add inode to this list is
> ext4_journalled_writepage().
>
> I think we also need it in ext4_page_mkwrite(), right? See the test
> case, for example:
>
> fd = open("file");
> addr = mmap(fd);
> pwrite(fd, "a", 1, 0);
> addr[0] = 'a';
> press_enter(); // when asked for.
> addr[0] = 'b';
>
> Here pwrite() led to ext4_write_begin() which added the inode to a
> transaction (not to the transaction's inode list).

Hum, right. Good spotting, I didn't think of this possibility. Actually,
when we have to do this in ext4_page_mkwrite() we don't need
ext4_journalled_writepage() at all which is what I've suggested when
commenting on your patch 3/5 anyway to simplify things :).

> 2) Do not return early in ext4_page_mkwrite() 'if we have all the buffers mapped'?
>
> There's this check in ext4_page_mkwrite():
>
> /*
> * Return if we have all the buffers mapped. This avoids the need to do
> * journal_start/journal_stop which can block and take a long time
> */
>
> Corresponding to this line in the debug messages:
>
> [ 20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8
>
> That returns early, *before* do_journal_get_write_access(), which is
> needed in data=journal, regardless of whether all buffers are mapped: we
> could exit ext4_page_mkwrite() then change buffers under commit during
> the race window.
>
> So, it seems we should ignore that for data=journal, which allows us to
> ext4_journal_start() and get a transaction handle, used to add the inode
> to the transaction's inode list (per #1.)

Yes, correct.

> With the changes from 1) and 2), the inode is added to the transaction's
> inode list, and the callback does write-protect the page correctly.
>
> When trying to change the buffer contents later (pressing enter) there
> _is_ a wait in ext4_page_mkwrite() as expected! But it's first at the
> file_update_time() because it calls do_get_write_access() for that inode,
> which is also in the transaction.
>
> I presume that a commit can start after file_update_time() and before
> ext4_journal_start() in ext4_page_mkwrite(), so we'd still have to keep
> these changes to ext4_page_mkwrite() ?

Yes, relying on file_update_time() waiting is definitely too fragile.

> 3) When checking to redirty the page in the writepage callback,
> does a buffer without a journal head means we should redirty
> the page? (for the reason it's not part of the committing txn)

This is going to change after the simplifications I've suggested so there's
no need to worry about this now.

> 4) Should we clear the PageChecked bit?
>
> We don't clear the PageChecked bit, thus later when the writeback
> work kicks in, __ext4_journalled_writepage() adds the page to the
> transaction's inode list again.
>
> Since the page is clean, it just goes into the submit data buffers
> callback, but write_cache_pages() returns before the writepage callback.
>
> Should we try to provent that by, say, clearing the pagechecked bit
> in case we don't have to redirty the page (in the writepage callback) ?

I guess it's difficult to determine when PageChecked bit can be safely
cleared. And I plan to do away with PageChecked bit usage completely in my
cleanup so this should go away anyway...

> 5) Inodes with inline data. I haven't checked in detail yet, but
> would appreciate if you have a brief explanation about them, and
> if they need special handling since they apparently don't get
> do_journal_get_write_access() called (eg, see __ext4_journalled_writepage()).

I think I've commented about this in a particular patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR