2015-06-10 18:28:37

by Eric Whitney

[permalink] [raw]
Subject: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations

Online defrag is not currently implemented for journaled data and
bigalloc ext4 file systems. The xfstests that exercise online defrag
do not verify that the test file system supports that feature before
executing, resulting in spurious test failures. (In the case of
ext4/307, the test reports success but actually fails because the
e4compact test component is ignoring failures.)

For the time being, simply exclude these tests to eliminate the test
failure noise.

Signed-off-by: Eric Whitney <[email protected]>
---
kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude | 6 ++++++
kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude | 6 ++++++
kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude | 6 ++++++
3 files changed, 18 insertions(+)
create mode 100644 kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude
create mode 100644 kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude

diff --git a/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude b/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude
new file mode 100644
index 0000000..b925016
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/conf/bigalloc.exclude
@@ -0,0 +1,6 @@
+ext4/301
+ext4/302
+ext4/303
+ext4/304
+ext4/307
+ext4/308
diff --git a/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude b/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude
new file mode 100644
index 0000000..b925016
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/conf/bigalloc_1k.exclude
@@ -0,0 +1,6 @@
+ext4/301
+ext4/302
+ext4/303
+ext4/304
+ext4/307
+ext4/308
diff --git a/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude b/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude
index e2bcac3..5d3dcae 100644
--- a/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude
+++ b/kvm-xfstests/test-appliance/files/root/conf/data_journal.exclude
@@ -1 +1,7 @@
+ext4/301
+ext4/302
+ext4/303
+ext4/304
+ext4/307
+ext4/308
generic/068
--
2.1.4



2015-06-10 19:28:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations

On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote:
> Online defrag is not currently implemented for journaled data and
> bigalloc ext4 file systems. The xfstests that exercise online defrag
> do not verify that the test file system supports that feature before
> executing, resulting in spurious test failures. (In the case of
> ext4/307, the test reports success but actually fails because the
> e4compact test component is ignoring failures.)
>
> For the time being, simply exclude these tests to eliminate the test
> failure noise.
>
> Signed-off-by: Eric Whitney <[email protected]>

Thanks, applied. BTW, I had since removed data_journal.exclude
because generic/068 should be passing with data journalling enabled.

- Ted

2015-06-11 20:00:33

by Eric Whitney

[permalink] [raw]
Subject: Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations

I haven't tried to run generic/068 in quite a while, but do see it's passing
now. I'll close the related bugzillas. Do you know which patch fixed this?

Thanks,
Eric


* Theodore Ts'o <[email protected]>:
> On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote:
> > Online defrag is not currently implemented for journaled data and
> > bigalloc ext4 file systems. The xfstests that exercise online defrag
> > do not verify that the test file system supports that feature before
> > executing, resulting in spurious test failures. (In the case of
> > ext4/307, the test reports success but actually fails because the
> > e4compact test component is ignoring failures.)
> >
> > For the time being, simply exclude these tests to eliminate the test
> > failure noise.
> >
> > Signed-off-by: Eric Whitney <[email protected]>
>
> Thanks, applied. BTW, I had since removed data_journal.exclude
> because generic/068 should be passing with data journalling enabled.
>
> - Ted

2015-06-12 00:24:04

by Eric Whitney

[permalink] [raw]
Subject: Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations

Spoke too soon - hit what looks like the old problem on ARM running 4.1-rc7
on the ninth trial run. Various components of the test are left in the 'D'
state after a familiar oops - kernel BUG at fs/jbd2/transaction.c:2150 while
truncating.

I'll see if I can still get the problem to appear on x86_64.

Eric

* Eric Whitney <[email protected]>:
> I haven't tried to run generic/068 in quite a while, but do see it's passing
> now. I'll close the related bugzillas. Do you know which patch fixed this?
>
> Thanks,
> Eric
>
>
> * Theodore Ts'o <[email protected]>:
> > On Wed, Jun 10, 2015 at 02:13:19PM -0400, Eric Whitney wrote:
> > > Online defrag is not currently implemented for journaled data and
> > > bigalloc ext4 file systems. The xfstests that exercise online defrag
> > > do not verify that the test file system supports that feature before
> > > executing, resulting in spurious test failures. (In the case of
> > > ext4/307, the test reports success but actually fails because the
> > > e4compact test component is ignoring failures.)
> > >
> > > For the time being, simply exclude these tests to eliminate the test
> > > failure noise.
> > >
> > > Signed-off-by: Eric Whitney <[email protected]>
> >
> > Thanks, applied. BTW, I had since removed data_journal.exclude
> > because generic/068 should be passing with data journalling enabled.
> >
> > - Ted

2015-06-15 01:14:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] xfstests-bld: exclude ext4 defrag tests from unsupported configurations

On Thu, Jun 11, 2015 at 08:24:21PM -0400, Eric Whitney wrote:
> Spoke too soon - hit what looks like the old problem on ARM running 4.1-rc7
> on the ninth trial run. Various components of the test are left in the 'D'
> state after a familiar oops - kernel BUG at fs/jbd2/transaction.c:2150 while
> truncating.
>
> I'll see if I can still get the problem to appear on x86_64.

I was able to reproduce this on x86_64, after running a kernel build
on the host as an antogonist. I'm guessing that the race window is
pretty narrow, so that's why it wasn't triggering for me easily until
I had some additional I/O traffic to slow things down further.

After some investigation, it looks like this bug has been around since
2008(!). I'll send out a fix.

- Ted

2015-06-15 01:23:50

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock. However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.

This fixes a number of different crashes or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434
ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0

- and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
...
Call Trace:
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
[<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
[<c027d883>] ? lock_buffer+0x36/0x36
[<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
[<c0229139>] do_invalidatepage+0x22/0x26
[<c0229198>] truncate_inode_page+0x5b/0x85
[<c022934b>] truncate_inode_pages_range+0x156/0x38c
[<c0229592>] truncate_inode_pages+0x11/0x15
[<c022962d>] truncate_pagecache+0x55/0x71
[<c02b913b>] ext4_setattr+0x4a9/0x560
[<c01ca542>] ? current_kernel_time+0x10/0x44
[<c026c4d8>] notify_change+0x1c7/0x2be
[<c0256a00>] do_truncate+0x65/0x85
[<c0226f31>] ? file_ra_state_init+0x12/0x29

- and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
...
Call Trace:
[<c01b879f>] ? console_unlock+0x3a1/0x3ce
[<c082cbb4>] dump_stack+0x48/0x60
[<c0178b65>] warn_slowpath_common+0x89/0xa0
[<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c0178bef>] warn_slowpath_null+0x14/0x18
[<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
[<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
[<c02b2f44>] write_end_fn+0x40/0x53
[<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
[<c02b59e7>] ext4_writepage+0x354/0x3b8
[<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
[<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b5a5b>] __writepage+0x10/0x2e
[<c0225956>] write_cache_pages+0x22d/0x32c
[<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
[<c02b6ee8>] ext4_writepages+0x102/0x607
[<c019adfe>] ? sched_clock_local+0x10/0x10e
[<c01a8a7c>] ? __lock_is_held+0x2e/0x44
[<c01a8ad5>] ? lock_is_held+0x43/0x51
[<c0226dff>] do_writepages+0x1c/0x29
[<c0276bed>] __writeback_single_inode+0xc3/0x545
[<c0277c07>] writeback_sb_inodes+0x21f/0x36d
...

Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
---
fs/ext4/inode.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0554b0b..263a46c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page,
ext4_walk_page_buffers(handle, page_bufs, 0, len,
NULL, bget_one);
}
- /* As soon as we unlock the page, it can go away, but we have
- * references to buffers so we are safe */
+ /*
+ * We need to release the page lock before we start the
+ * journal, so grab a reference so the page won't disappear
+ * out from under us.
+ */
+ get_page(page);
unlock_page(page);

handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out;
+ put_page(page);
+ goto out_no_pagelock;
}
-
BUG_ON(!ext4_handle_valid(handle));

+ lock_page(page);
+ put_page(page);
+ if (page->mapping != mapping) {
+ /* The page got truncated from under us */
+ ext4_journal_stop(handle);
+ ret = 0;
+ goto out;
+ }
+
if (inline_data) {
BUFFER_TRACE(inode_bh, "get write access");
ret = ext4_journal_get_write_access(handle, inode_bh);
@@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page,
NULL, bput_one);
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
out:
+ unlock_page(page);
+out_no_pagelock:
brelse(inode_bh);
return ret;
}
--
2.3.0

2015-06-15 12:33:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Sun 14-06-15 21:23:50, Ted Tso wrote:
> The commit cf108bca465d: "ext4: Invert the locking order of page_lock
> and transaction start" caused __ext4_journalled_writepage() to drop
> the page lock before the page was written back, as part of changing
> the locking order to jbd2_journal_start -> page_lock. However, this
> introduced a potential race if there was a truncate racing with the
> data=journalled writeback mode.
>
> Fix this by grabbing the page lock after starting the journal handle,
> and then checking to see if page had gotten truncated out from under
> us.
>
> This fixes a number of different crashes or BUG_ON's when running
> xfstests generic/086 in data=journalled mode, including:
>
> jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434
> ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0
>
> - and -
>
> kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
Yeah, that's nasty. Thanks for debugging this! However I think your fix
reintroduces the original deadlock issues. do_journal_get_write_access()
can end up blocking waiting for jbd2 thread to finish a commit while jbd2
thread may be blocked waiting for the page to be unlocked.

After some thought I don't think the deadlock is real since
do_journal_get_write_access() will currently only block if a buffer is
under writeout to the journal and at that point we don't wait for page
locks anymore. Also ext4_write_begin() does the same in data=journal mode
and we haven't observed deadlocks so far. But still things look really
fragile here.

A clean fix for these problems would be to implement
ext4_journalled_writepages() which will start a transaction and then
writeback a bunch of pages. Similarly for write_begin() case we could start
the transaction in ext4_write() (and loop there since a single write may
need to be split among several transactions). However this is relatively
extensive work given how rarely the code is used...

So for now, feel free to add:
Acked-by: Jan Kara <[email protected]>

to the patch.

Honza

> ---
> fs/ext4/inode.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0554b0b..263a46c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page,
> ext4_walk_page_buffers(handle, page_bufs, 0, len,
> NULL, bget_one);
> }
> - /* As soon as we unlock the page, it can go away, but we have
> - * references to buffers so we are safe */
> + /*
> + * We need to release the page lock before we start the
> + * journal, so grab a reference so the page won't disappear
> + * out from under us.
> + */
> + get_page(page);
> unlock_page(page);
>
> handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> ext4_writepage_trans_blocks(inode));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> - goto out;
> + put_page(page);
> + goto out_no_pagelock;
> }
> -
> BUG_ON(!ext4_handle_valid(handle));
>
> + lock_page(page);
> + put_page(page);
> + if (page->mapping != mapping) {
> + /* The page got truncated from under us */
> + ext4_journal_stop(handle);
> + ret = 0;
> + goto out;
> + }
> +
> if (inline_data) {
> BUFFER_TRACE(inode_bh, "get write access");
> ret = ext4_journal_get_write_access(handle, inode_bh);
> @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page,
> NULL, bput_one);
> ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> out:
> + unlock_page(page);
> +out_no_pagelock:
> brelse(inode_bh);
> return ret;
> }
> --
> 2.3.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-15 13:06:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote:
> Yeah, that's nasty. Thanks for debugging this! However I think your fix
> reintroduces the original deadlock issues. do_journal_get_write_access()
> can end up blocking waiting for jbd2 thread to finish a commit while jbd2
> thread may be blocked waiting for the page to be unlocked.
>
> After some thought I don't think the deadlock is real since
> do_journal_get_write_access() will currently only block if a buffer is
> under writeout to the journal and at that point we don't wait for page
> locks anymore. Also ext4_write_begin() does the same in data=journal mode
> and we haven't observed deadlocks so far. But still things look really
> fragile here.

The reason why there are no deadlocks is the writeback in the commit
thread happens when the inode gets written back --- but that only
happens for data=ordered inodes, not data=journalled mode. I was a
little worried about what might happen when after the 'j' chattr
attribute gets set on an inode, and the inode was still on the ordered
flush list.

Hmm... I think we could also maybe fix this by having
ext4_change_inode_journal_flag() force a journal commit before setting
the JOURNAL_DATA flag. If we did that, we could just avoid dropping
the page_lock in __ext4_journalled_writepage() altogether.

What do you think?

- Ted

2015-06-15 17:03:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Mon 15-06-15 09:06:11, Ted Tso wrote:
> On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote:
> > Yeah, that's nasty. Thanks for debugging this! However I think your fix
> > reintroduces the original deadlock issues. do_journal_get_write_access()
> > can end up blocking waiting for jbd2 thread to finish a commit while jbd2
> > thread may be blocked waiting for the page to be unlocked.
> >
> > After some thought I don't think the deadlock is real since
> > do_journal_get_write_access() will currently only block if a buffer is
> > under writeout to the journal and at that point we don't wait for page
> > locks anymore. Also ext4_write_begin() does the same in data=journal mode
> > and we haven't observed deadlocks so far. But still things look really
> > fragile here.
>
> The reason why there are no deadlocks is the writeback in the commit
> thread happens when the inode gets written back --- but that only
> happens for data=ordered inodes, not data=journalled mode. I was a
> little worried about what might happen when after the 'j' chattr
> attribute gets set on an inode, and the inode was still on the ordered
> flush list.
>
> Hmm... I think we could also maybe fix this by having
> ext4_change_inode_journal_flag() force a journal commit before setting
> the JOURNAL_DATA flag. If we did that, we could just avoid dropping
> the page_lock in __ext4_journalled_writepage() altogether.
>
> What do you think?
I think that fully switching lock ordering for data=journal mode back to
page lock -> transaction start (which is what you effectively do when you
never drop page lock in ->writepage) is rather error prone. We'd have to be
careful to avoid lock inversion also for places like ->write_begin,
->releasepage, ->invalidatepage etc. For example ext4_write_begin() will
currently call lock_page() with transaction started which could deadlock
against journalled writepage you suggest. So effectively we'd have to
completely separate aops for data=journal mode. Doable but I'm not sure
it's worth it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-15 17:59:59

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Jun 15, 2015, at 7:06 AM, Theodore Ts'o <[email protected]> wrote:
>
> On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote:
>> Yeah, that's nasty. Thanks for debugging this! However I think your fix
>> reintroduces the original deadlock issues. do_journal_get_write_access()
>> can end up blocking waiting for jbd2 thread to finish a commit while jbd2
>> thread may be blocked waiting for the page to be unlocked.
>>
>> After some thought I don't think the deadlock is real since
>> do_journal_get_write_access() will currently only block if a buffer is
>> under writeout to the journal and at that point we don't wait for page
>> locks anymore. Also ext4_write_begin() does the same in data=journal mode
>> and we haven't observed deadlocks so far. But still things look really
>> fragile here.
>
> The reason why there are no deadlocks is the writeback in the commit
> thread happens when the inode gets written back --- but that only
> happens for data=ordered inodes, not data=journalled mode. I was a
> little worried about what might happen when after the 'j' chattr
> attribute gets set on an inode, and the inode was still on the ordered
> flush list.
>
> Hmm... I think we could also maybe fix this by having
> ext4_change_inode_journal_flag() force a journal commit before setting
> the JOURNAL_DATA flag. If we did that, we could just avoid dropping
> the page_lock in __ext4_journalled_writepage() altogether.

That is already being done:

int ext4_change_inode_journal_flag(struct inode *inode, int val)
{
:
:

jbd2_journal_lock_updates(journal);

:
:


/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
*
* This locks out any further updates from being started, and blocks
* until all existing updates have completed, returning only once the
* journal is in a quiescent state with no updates running.
*
* The journal lock should not be held on entry.
*/


Cheers, Andreas

2015-06-15 19:37:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Mon, Jun 15, 2015 at 07:03:25PM +0200, Jan Kara wrote:
> I think that fully switching lock ordering for data=journal mode back to
> page lock -> transaction start (which is what you effectively do when you
> never drop page lock in ->writepage) is rather error prone. We'd have to be
> careful to avoid lock inversion also for places like ->write_begin,
> ->releasepage, ->invalidatepage etc. For example ext4_write_begin() will
> currently call lock_page() with transaction started which could deadlock
> against journalled writepage you suggest. So effectively we'd have to
> completely separate aops for data=journal mode. Doable but I'm not sure
> it's worth it.

Good point. OTOH, things are kind of fragile with respect to
ext4_write_begin() in data=journal mode, as you've pointed out. But
it's not causing any problems as far as we know, so so I'll use the
patch which you ack'ed and maybe later on, we can figure out a better
way to clean this up.

Thanks,

- Ted

2015-06-15 20:39:07

by Eric Whitney

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

* Theodore Ts'o <[email protected]>:
> The commit cf108bca465d: "ext4: Invert the locking order of page_lock
> and transaction start" caused __ext4_journalled_writepage() to drop
> the page lock before the page was written back, as part of changing
> the locking order to jbd2_journal_start -> page_lock. However, this
> introduced a potential race if there was a truncate racing with the
> data=journalled writeback mode.
>
> Fix this by grabbing the page lock after starting the journal handle,
> and then checking to see if page had gotten truncated out from under
> us.
>
> This fixes a number of different crashes or BUG_ON's when running
> xfstests generic/086 in data=journalled mode, including:
>
> jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434
> ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0
>
> - and -
>
> kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
> ...
> Call Trace:
> [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
> [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
> [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
> [<c027d883>] ? lock_buffer+0x36/0x36
> [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
> [<c0229139>] do_invalidatepage+0x22/0x26
> [<c0229198>] truncate_inode_page+0x5b/0x85
> [<c022934b>] truncate_inode_pages_range+0x156/0x38c
> [<c0229592>] truncate_inode_pages+0x11/0x15
> [<c022962d>] truncate_pagecache+0x55/0x71
> [<c02b913b>] ext4_setattr+0x4a9/0x560
> [<c01ca542>] ? current_kernel_time+0x10/0x44
> [<c026c4d8>] notify_change+0x1c7/0x2be
> [<c0256a00>] do_truncate+0x65/0x85
> [<c0226f31>] ? file_ra_state_init+0x12/0x29
>
> - and -
>
> WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
> irty_metadata+0x14a/0x1ae()
> ...
> Call Trace:
> [<c01b879f>] ? console_unlock+0x3a1/0x3ce
> [<c082cbb4>] dump_stack+0x48/0x60
> [<c0178b65>] warn_slowpath_common+0x89/0xa0
> [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
> [<c0178bef>] warn_slowpath_null+0x14/0x18
> [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
> [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
> [<c02b2f44>] write_end_fn+0x40/0x53
> [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
> [<c02b59e7>] ext4_writepage+0x354/0x3b8
> [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
> [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
> [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
> [<c02b5a5b>] __writepage+0x10/0x2e
> [<c0225956>] write_cache_pages+0x22d/0x32c
> [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
> [<c02b6ee8>] ext4_writepages+0x102/0x607
> [<c019adfe>] ? sched_clock_local+0x10/0x10e
> [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
> [<c01a8ad5>] ? lock_is_held+0x43/0x51
> [<c0226dff>] do_writepages+0x1c/0x29
> [<c0276bed>] __writeback_single_inode+0xc3/0x545
> [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
> ...
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> Cc: [email protected]
> ---
> fs/ext4/inode.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0554b0b..263a46c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page,
> ext4_walk_page_buffers(handle, page_bufs, 0, len,
> NULL, bget_one);
> }
> - /* As soon as we unlock the page, it can go away, but we have
> - * references to buffers so we are safe */
> + /*
> + * We need to release the page lock before we start the
> + * journal, so grab a reference so the page won't disappear
> + * out from under us.
> + */
> + get_page(page);
> unlock_page(page);
>
> handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> ext4_writepage_trans_blocks(inode));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> - goto out;
> + put_page(page);
> + goto out_no_pagelock;
> }
> -
> BUG_ON(!ext4_handle_valid(handle));
>
> + lock_page(page);
> + put_page(page);
> + if (page->mapping != mapping) {
> + /* The page got truncated from under us */
> + ext4_journal_stop(handle);
> + ret = 0;
> + goto out;
> + }
> +
> if (inline_data) {
> BUFFER_TRACE(inode_bh, "get write access");
> ret = ext4_journal_get_write_access(handle, inode_bh);
> @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page,
> NULL, bput_one);
> ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> out:
> + unlock_page(page);
> +out_no_pagelock:
> brelse(inode_bh);
> return ret;
> }
> --
> 2.3.0
>

This patch looks promising. I'm running a 1000 trial stress test on a
Pandaboard where I've generally been able to force a couple of manifestations
of this bug to appear within 5 to 10 runs. Applied to 4.1-rc7, it's passed
135 trials cleanly. The full series will complete sometime tomorrow.

I was also able to reproduce the problem on x86_64 pretty consistently in
four runs or less on 4.1-rc7; I'm planning a stress test there as well once
-rc8 regression is complete.

Thanks!
Eric

2015-06-16 12:57:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

On Mon 15-06-15 11:59:59, Andreas Dilger wrote:
> On Jun 15, 2015, at 7:06 AM, Theodore Ts'o <[email protected]> wrote:
> >
> > On Mon, Jun 15, 2015 at 02:33:52PM +0200, Jan Kara wrote:
> >> Yeah, that's nasty. Thanks for debugging this! However I think your fix
> >> reintroduces the original deadlock issues. do_journal_get_write_access()
> >> can end up blocking waiting for jbd2 thread to finish a commit while jbd2
> >> thread may be blocked waiting for the page to be unlocked.
> >>
> >> After some thought I don't think the deadlock is real since
> >> do_journal_get_write_access() will currently only block if a buffer is
> >> under writeout to the journal and at that point we don't wait for page
> >> locks anymore. Also ext4_write_begin() does the same in data=journal mode
> >> and we haven't observed deadlocks so far. But still things look really
> >> fragile here.
> >
> > The reason why there are no deadlocks is the writeback in the commit
> > thread happens when the inode gets written back --- but that only
> > happens for data=ordered inodes, not data=journalled mode. I was a
> > little worried about what might happen when after the 'j' chattr
> > attribute gets set on an inode, and the inode was still on the ordered
> > flush list.
> >
> > Hmm... I think we could also maybe fix this by having
> > ext4_change_inode_journal_flag() force a journal commit before setting
> > the JOURNAL_DATA flag. If we did that, we could just avoid dropping
> > the page_lock in __ext4_journalled_writepage() altogether.
>
> That is already being done:
>
> int ext4_change_inode_journal_flag(struct inode *inode, int val)
> {
> :
> :
>
> jbd2_journal_lock_updates(journal);
>
> :
> :
>
>
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> *
> * This locks out any further updates from being started, and blocks
> * until all existing updates have completed, returning only once the
> * journal is in a quiescent state with no updates running.
> *
> * The journal lock should not be held on entry.
> */
Well, jbd2_journal_lock_updates() makes sure there are no handles for the
running transaction but it doesn't ensure current transaction is committed
(which is what Ted wanted because he wanted to avoid inode being both
ordered and journaled in the same transaction).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-16 19:29:20

by Eric Whitney

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix race between truncate and __ext4_journalled_writepage()

* Eric Whitney <[email protected]>:
> * Theodore Ts'o <[email protected]>:
> > The commit cf108bca465d: "ext4: Invert the locking order of page_lock
> > and transaction start" caused __ext4_journalled_writepage() to drop
> > the page lock before the page was written back, as part of changing
> > the locking order to jbd2_journal_start -> page_lock. However, this
> > introduced a potential race if there was a truncate racing with the
> > data=journalled writeback mode.
> >
> > Fix this by grabbing the page lock after starting the journal handle,
> > and then checking to see if page had gotten truncated out from under
> > us.
> >
> > This fixes a number of different crashes or BUG_ON's when running
> > xfstests generic/086 in data=journalled mode, including:
> >
> > jbd2_journal_dirty_metadata: vdc-8: bad jh for block 84434: transaction (ec90434
> > ransaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0
> >
> > - and -
> >
> > kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
> > ...
> > Call Trace:
> > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
> > [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
> > [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
> > [<c027d883>] ? lock_buffer+0x36/0x36
> > [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
> > [<c0229139>] do_invalidatepage+0x22/0x26
> > [<c0229198>] truncate_inode_page+0x5b/0x85
> > [<c022934b>] truncate_inode_pages_range+0x156/0x38c
> > [<c0229592>] truncate_inode_pages+0x11/0x15
> > [<c022962d>] truncate_pagecache+0x55/0x71
> > [<c02b913b>] ext4_setattr+0x4a9/0x560
> > [<c01ca542>] ? current_kernel_time+0x10/0x44
> > [<c026c4d8>] notify_change+0x1c7/0x2be
> > [<c0256a00>] do_truncate+0x65/0x85
> > [<c0226f31>] ? file_ra_state_init+0x12/0x29
> >
> > - and -
> >
> > WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
> > irty_metadata+0x14a/0x1ae()
> > ...
> > Call Trace:
> > [<c01b879f>] ? console_unlock+0x3a1/0x3ce
> > [<c082cbb4>] dump_stack+0x48/0x60
> > [<c0178b65>] warn_slowpath_common+0x89/0xa0
> > [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
> > [<c0178bef>] warn_slowpath_null+0x14/0x18
> > [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
> > [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
> > [<c02b2f44>] write_end_fn+0x40/0x53
> > [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
> > [<c02b59e7>] ext4_writepage+0x354/0x3b8
> > [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
> > [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
> > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
> > [<c02b5a5b>] __writepage+0x10/0x2e
> > [<c0225956>] write_cache_pages+0x22d/0x32c
> > [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
> > [<c02b6ee8>] ext4_writepages+0x102/0x607
> > [<c019adfe>] ? sched_clock_local+0x10/0x10e
> > [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
> > [<c01a8ad5>] ? lock_is_held+0x43/0x51
> > [<c0226dff>] do_writepages+0x1c/0x29
> > [<c0276bed>] __writeback_single_inode+0xc3/0x545
> > [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
> > ...
> >
> > Signed-off-by: Theodore Ts'o <[email protected]>
> > Cc: [email protected]
> > ---
> > fs/ext4/inode.c | 23 +++++++++++++++++++----
> > 1 file changed, 19 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 0554b0b..263a46c 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1701,19 +1701,32 @@ static int __ext4_journalled_writepage(struct page *page,
> > ext4_walk_page_buffers(handle, page_bufs, 0, len,
> > NULL, bget_one);
> > }
> > - /* As soon as we unlock the page, it can go away, but we have
> > - * references to buffers so we are safe */
> > + /*
> > + * We need to release the page lock before we start the
> > + * journal, so grab a reference so the page won't disappear
> > + * out from under us.
> > + */
> > + get_page(page);
> > unlock_page(page);
> >
> > handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> > ext4_writepage_trans_blocks(inode));
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > - goto out;
> > + put_page(page);
> > + goto out_no_pagelock;
> > }
> > -
> > BUG_ON(!ext4_handle_valid(handle));
> >
> > + lock_page(page);
> > + put_page(page);
> > + if (page->mapping != mapping) {
> > + /* The page got truncated from under us */
> > + ext4_journal_stop(handle);
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > if (inline_data) {
> > BUFFER_TRACE(inode_bh, "get write access");
> > ret = ext4_journal_get_write_access(handle, inode_bh);
> > @@ -1739,6 +1752,8 @@ static int __ext4_journalled_writepage(struct page *page,
> > NULL, bput_one);
> > ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> > out:
> > + unlock_page(page);
> > +out_no_pagelock:
> > brelse(inode_bh);
> > return ret;
> > }
> > --
> > 2.3.0
> >
>
> This patch looks promising. I'm running a 1000 trial stress test on a
> Pandaboard where I've generally been able to force a couple of manifestations
> of this bug to appear within 5 to 10 runs. Applied to 4.1-rc7, it's passed
> 135 trials cleanly. The full series will complete sometime tomorrow.
>
> I was also able to reproduce the problem on x86_64 pretty consistently in
> four runs or less on 4.1-rc7; I'm planning a stress test there as well once
> -rc8 regression is complete.
>

1000 consecutive runs of generic/068 on the data_journal test case have
completed successfully on an x86_64 guest and a Pandaboard running a
4.1-rc7 kernel with this patch. That's a couple of orders of magnitude more
runs than have previously completed without hangs on these test system
configurations, so

Tested-by: Eric Whitney <[email protected]>

if useful at this point.

Eric