2013-03-20 01:29:24

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a
regression when running xfstest #270 when the file system is mounted
with dioread_nolock.

The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
this guarantees that last io_end structure has been freed, but it does
not guarantee that the workqueue structure, which was moved into the
inode by commit 84c17543ab56, is actually finished. Once
ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
will allow ext4_ioend_wait() to return on CPU #2, at which point the
evict_inode() codepath can race against the workqueue code on CPU #1
accessing EXT4_I(inode)->i_unwritten_work to find the next item of
work to do.

Fix this by calling flush_work() if the work structure is still
pending in ext$_ioend_wait(). Also, move the call to
ext4_ioend_wait() until after truncate_inode_pages() and
filemap_write_and_wait() are called, to make sure all dirty pages have
been written back and flushed from the page cache first.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
*pdpt = 0000000030bc3001 *pde = 0000000000000000
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in:
Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
EIP is at cwq_activate_delayed_work+0x3b/0x7e
EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
Stack:
f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
Call Trace:
[<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
[<c01def1d>] process_one_work+0x5d8/0x637
[<c040a10b>] ? ext4_end_bio+0x300/0x300
[<c01e3105>] worker_thread+0x249/0x3ef
[<c01ea317>] kthread+0xd8/0xeb
[<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
[<c023a370>] ? trace_hardirqs_on+0x27/0x37
[<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
[<c01ea23f>] ? __init_kthread_worker+0x71/0x71
Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
6c c1 00 31 c9 83
EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
CR2: 0000000000000000
---[ end trace a1923229da53d8a4 ]---

Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 4 ++--
fs/ext4/page-io.c | 10 ++++++++++
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 65bbc93..15e0da1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -185,8 +185,6 @@ void ext4_evict_inode(struct inode *inode)

trace_ext4_evict_inode(inode);

- ext4_ioend_wait(inode);
-
if (inode->i_nlink) {
/*
* When journalling data dirty buffers are tracked only in the
@@ -216,6 +214,7 @@ void ext4_evict_inode(struct inode *inode)
filemap_write_and_wait(&inode->i_data);
}
truncate_inode_pages(&inode->i_data, 0);
+ ext4_ioend_wait(inode);
goto no_delete;
}

@@ -225,6 +224,7 @@ void ext4_evict_inode(struct inode *inode)
if (ext4_should_order_data(inode))
ext4_begin_ordered_truncate(inode, 0);
truncate_inode_pages(&inode->i_data, 0);
+ ext4_ioend_wait(inode);

if (is_bad_inode(inode))
goto no_delete;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 809b310..8a004a4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -50,11 +50,21 @@ void ext4_exit_pageio(void)
kmem_cache_destroy(io_page_cachep);
}

+/*
+ * Called by ext4_evict_inode() to make sure there are no pending I/O
+ * completion work left to do.
+ */
void ext4_ioend_wait(struct inode *inode)
{
wait_queue_head_t *wq = ext4_ioend_wq(inode);

wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
+ /*
+ * We need to make sure the work structure is finished before
+ * we let the inode get destroyed by ext4_evict_inode()
+ */
+ if (work_pending(&EXT4_I(inode)->i_unwritten_work))
+ flush_work(&EXT4_I(inode)->i_unwritten_work);
}

static void put_io_page(struct ext4_io_page *io_page)
--
1.7.12.rc0.22.gcdd159b



2013-03-20 01:38:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

I found one more regression failure while running tests on the dev
branch. It doesn't fail every single time you run xfstests #270; in
my test setup, the test failure happens approximatrely one in three
times. Fortunately, "kvm-xfstests -c dioread_nolock
270,270,270,270,270,270,270,270,270" made it easy to bisect the
failure to commit 84c17543ab56 (ext4: move work from io_end to inode).

I've added this commit to fix it, and restarted running the regression
tests. Hopefully this will be the last fix up required before I send
a pull request to Linus....

- Ted

On Tue, Mar 19, 2013 at 09:29:19PM -0400, Theodore Ts'o wrote:
> Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a
> regression when running xfstest #270 when the file system is mounted
> with dioread_nolock.
>
> The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
> this guarantees that last io_end structure has been freed, but it does
> not guarantee that the workqueue structure, which was moved into the
> inode by commit 84c17543ab56, is actually finished. Once
> ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
> will allow ext4_ioend_wait() to return on CPU #2, at which point the
> evict_inode() codepath can race against the workqueue code on CPU #1
> accessing EXT4_I(inode)->i_unwritten_work to find the next item of
> work to do.
>
> Fix this by calling flush_work() if the work structure is still
> pending in ext$_ioend_wait(). Also, move the call to
> ext4_ioend_wait() until after truncate_inode_pages() and
> filemap_write_and_wait() are called, to make sure all dirty pages have
> been written back and flushed from the page cache first.
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
> *pdpt = 0000000030bc3001 *pde = 0000000000000000
> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> Modules linked in:
> Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
> EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
> EIP is at cwq_activate_delayed_work+0x3b/0x7e
> EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
> ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
> Stack:
> f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
> f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
> f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
> Call Trace:
> [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
> [<c01def1d>] process_one_work+0x5d8/0x637
> [<c040a10b>] ? ext4_end_bio+0x300/0x300
> [<c01e3105>] worker_thread+0x249/0x3ef
> [<c01ea317>] kthread+0xd8/0xeb
> [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
> [<c023a370>] ? trace_hardirqs_on+0x27/0x37
> [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
> [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
> Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
> 6c c1 00 31 c9 83
> EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
> CR2: 0000000000000000
> ---[ end trace a1923229da53d8a4 ]---
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: Jan Kara <[email protected]>
...

2013-03-20 13:22:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Tue 19-03-13 21:29:19, Ted Tso wrote:
> Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a
> regression when running xfstest #270 when the file system is mounted
> with dioread_nolock.
>
> The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
> this guarantees that last io_end structure has been freed, but it does
> not guarantee that the workqueue structure, which was moved into the
> inode by commit 84c17543ab56, is actually finished. Once
> ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
> will allow ext4_ioend_wait() to return on CPU #2, at which point the
> evict_inode() codepath can race against the workqueue code on CPU #1
> accessing EXT4_I(inode)->i_unwritten_work to find the next item of
> work to do.
>
> Fix this by calling flush_work() if the work structure is still
> pending in ext$_ioend_wait(). Also, move the call to
> ext4_ioend_wait() until after truncate_inode_pages() and
> filemap_write_and_wait() are called, to make sure all dirty pages have
> been written back and flushed from the page cache first.
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
> *pdpt = 0000000030bc3001 *pde = 0000000000000000
> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> Modules linked in:
> Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
> EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
> EIP is at cwq_activate_delayed_work+0x3b/0x7e
> EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
> ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
> Stack:
> f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
> f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
> f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
> Call Trace:
> [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
> [<c01def1d>] process_one_work+0x5d8/0x637
> [<c040a10b>] ? ext4_end_bio+0x300/0x300
> [<c01e3105>] worker_thread+0x249/0x3ef
> [<c01ea317>] kthread+0xd8/0xeb
> [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
> [<c023a370>] ? trace_hardirqs_on+0x27/0x37
> [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
> [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
> Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
> 6c c1 00 31 c9 83
> EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
> CR2: 0000000000000000
> ---[ end trace a1923229da53d8a4 ]---
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: Jan Kara <[email protected]>
Good catch. Thanks for fixing this. Just one question below:

> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 809b310..8a004a4 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -50,11 +50,21 @@ void ext4_exit_pageio(void)
> kmem_cache_destroy(io_page_cachep);
> }
>
> +/*
> + * Called by ext4_evict_inode() to make sure there are no pending I/O
> + * completion work left to do.
> + */
> void ext4_ioend_wait(struct inode *inode)
> {
> wait_queue_head_t *wq = ext4_ioend_wq(inode);
>
> wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
> + /*
> + * We need to make sure the work structure is finished before
> + * we let the inode get destroyed by ext4_evict_inode()
> + */
> + if (work_pending(&EXT4_I(inode)->i_unwritten_work))
> + flush_work(&EXT4_I(inode)->i_unwritten_work);
> }
Won't it be more logical to use cancel_work_sync() here?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-20 13:37:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Wed, Mar 20, 2013 at 02:22:23PM +0100, Jan Kara wrote:
> > + if (work_pending(&EXT4_I(inode)->i_unwritten_work))
> > + flush_work(&EXT4_I(inode)->i_unwritten_work);
> > }
> Won't it be more logical to use cancel_work_sync() here?

Hmm.... yes, probably, but then ext4_ioend_wait() can only be safely
used by ext4_evict_inode(). I'll make the change, but I'll also make
a comment to this effect. (No one else is using it now, but if there
was ever a need to use it while the inode was in use, using
cancel_work_sync() would be highly dagernous/racy. That being said, I
can't really think of a good use case other than evict_inode path, so
it seems fine to make this change.)

- Ted

2013-03-20 13:42:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Wed 20-03-13 09:37:51, Ted Tso wrote:
> On Wed, Mar 20, 2013 at 02:22:23PM +0100, Jan Kara wrote:
> > > + if (work_pending(&EXT4_I(inode)->i_unwritten_work))
> > > + flush_work(&EXT4_I(inode)->i_unwritten_work);
> > > }
> > Won't it be more logical to use cancel_work_sync() here?
>
> Hmm.... yes, probably, but then ext4_ioend_wait() can only be safely
> used by ext4_evict_inode(). I'll make the change, but I'll also make
> a comment to this effect. (No one else is using it now, but if there
> was ever a need to use it while the inode was in use, using
> cancel_work_sync() would be highly dagernous/racy. That being said, I
> can't really think of a good use case other than evict_inode path, so
> it seems fine to make this change.)
Yeah, we can possibly rename the function or maybe even just inline it in
ext4_evict_inode?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-20 13:51:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Wed, Mar 20, 2013 at 02:42:26PM +0100, Jan Kara wrote:
> Yeah, we can possibly rename the function or maybe even just inline it in
> ext4_evict_inode?

Good idea, I'll rename it to ext4_ioend_shutdown(). Given that the
rest of the logic is in page_io.c, it seemed better to leave that
function there than to inline it in ext4_evict_inode().

- Ted

2013-03-20 14:15:29

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On 3/19/13 8:29 PM, Theodore Ts'o wrote:
> Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a
> regression when running xfstest #270 when the file system is mounted
> with dioread_nolock.


As an aside, is there any reason to have "dioread_nolock" as an option
at this point? If it works now, would you ever *not* want it?

(granted it doesn't work with some journaling options etc, but that
behavior could be automatic, w/o the need for special mount options).

-Eric

2013-03-20 14:45:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
>
> As an aside, is there any reason to have "dioread_nolock" as an option
> at this point? If it works now, would you ever *not* want it?
>
> (granted it doesn't work with some journaling options etc, but that
> behavior could be automatic, w/o the need for special mount options).

The primary restriction is that diread_nolock doesn't work when fs
block size != page size. If your proposal is that we automatically
enable diread_nolock when we can use it safely, that's definitely
something to consider for the next merge window.

My long range plan/hope is that we eventually be able to use the
extent status tree so that we do allocating writes, we first (a)
allocate the blocks, and mark them as in use as far as the mballoc
data structures are concerned, but we do _not_ mark them as in use in
the on-disk allocation bitmaps, then (b) we write the data blocks, and
then triggered by the block I/O completion, (c) in a single journal
trnasaction, we update the allocation bitmaps, update the inode's
extent tree, and update the inode's i_size field.

This is different from the dioread_nolock approach in that we're not
initially inserting the blocks in the extent tree as uninitialized,
and then convert the extent tree entries from uninit to init after the
I/O completion.

If we get to this long-term nirvana, then (1) we can eliminate the
data=writeback vs data=ordered distiction, since we'll have the safety
benefits of data=ordered while still having the performance
characteristics of data=writeback, and (2) we can eliminate
diread_nolock, since this approach should also obviate needing to take
the read lock on the direct I/O read path. I also think this approach
in the long term will be simpler and faster, since we don't have
modify the extent tree, and start a journal transaction, before we
write the data blocks.

- Ted

2013-03-20 20:13:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Wed 20-03-13 10:45:23, Ted Tso wrote:
> On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> >
> > As an aside, is there any reason to have "dioread_nolock" as an option
> > at this point? If it works now, would you ever *not* want it?
> >
> > (granted it doesn't work with some journaling options etc, but that
> > behavior could be automatic, w/o the need for special mount options).
>
> The primary restriction is that diread_nolock doesn't work when fs
> block size != page size. If your proposal is that we automatically
> enable diread_nolock when we can use it safely, that's definitely
> something to consider for the next merge window.
>
> My long range plan/hope is that we eventually be able to use the
> extent status tree so that we do allocating writes, we first (a)
> allocate the blocks, and mark them as in use as far as the mballoc
> data structures are concerned, but we do _not_ mark them as in use in
> the on-disk allocation bitmaps, then (b) we write the data blocks, and
> then triggered by the block I/O completion, (c) in a single journal
> trnasaction, we update the allocation bitmaps, update the inode's
> extent tree, and update the inode's i_size field.
>
> This is different from the dioread_nolock approach in that we're not
> initially inserting the blocks in the extent tree as uninitialized,
> and then convert the extent tree entries from uninit to init after the
> I/O completion.
>
> If we get to this long-term nirvana, then (1) we can eliminate the
> data=writeback vs data=ordered distiction, since we'll have the safety
> benefits of data=ordered while still having the performance
> characteristics of data=writeback, and (2) we can eliminate
> diread_nolock, since this approach should also obviate needing to take
> the read lock on the direct I/O read path.
But this will be somewhat tricky because when we have racing buffered
write and DIO read to the same block, we have to make sure that DIO read
ignores the information in the extent status tree because data isn't
written to the blocks yet. Umm, maybe we could just mark the extent as
unwritten in the extent status tree (without having anything on disk) and
this should make DIO read work. That sounds like a nice optimization.

> I also think this approach
> in the long term will be simpler and faster, since we don't have
> modify the extent tree, and start a journal transaction, before we
> write the data blocks.
Yeah, it should be faster because we will need to perform some extent ops
only in memory and not on disk.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-26 05:37:11

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

Sorry for the late reply.

On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> >
> > As an aside, is there any reason to have "dioread_nolock" as an option
> > at this point? If it works now, would you ever *not* want it?
> >
> > (granted it doesn't work with some journaling options etc, but that
> > behavior could be automatic, w/o the need for special mount options).
>
> The primary restriction is that diread_nolock doesn't work when fs
> block size != page size. If your proposal is that we automatically
> enable diread_nolock when we can use it safely, that's definitely
> something to consider for the next merge window.

Yes, I also think we can automatically enable dioread_nolock because it
brings us some benefits.

BTW, I think there is an minor improvement for dio overwrite codepath
with indirect-based file. We don't need to take i_mutex in this
condition just as we have done for extent-based file. If a user mounts
a ext2/3 file system with a ext4 kernel modules, he/she could get a
lower latency. But it seems that it would break dio semantic in ext2/3.
Currently in ext2/3 if we issue a overwrite dio and then issue a read
dio. We will always read the latest data because we wait on i_mutex
lock. But after parallelizing overwite dio, this semantic might breaks.
I re-read this doc but it seems that it doesn't describe this case. Do
we need to keep this semantic?

>
> My long range plan/hope is that we eventually be able to use the
> extent status tree so that we do allocating writes, we first (a)
> allocate the blocks, and mark them as in use as far as the mballoc
> data structures are concerned, but we do _not_ mark them as in use in
> the on-disk allocation bitmaps, then (b) we write the data blocks, and
> then triggered by the block I/O completion, (c) in a single journal
> trnasaction, we update the allocation bitmaps, update the inode's
> extent tree, and update the inode's i_size field.
>
> This is different from the dioread_nolock approach in that we're not
> initially inserting the blocks in the extent tree as uninitialized,
> and then convert the extent tree entries from uninit to init after the
> I/O completion.

Yes, this approach is better. I am happy to work on this.

Regards,
- Zheng

2013-03-26 05:46:24

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Tue, Mar 26, 2013 at 01:52:51PM +0800, Zheng Liu wrote:
> Sorry for the late reply.
>
> On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> > >
> > > As an aside, is there any reason to have "dioread_nolock" as an option
> > > at this point? If it works now, would you ever *not* want it?
> > >
> > > (granted it doesn't work with some journaling options etc, but that
> > > behavior could be automatic, w/o the need for special mount options).
> >
> > The primary restriction is that diread_nolock doesn't work when fs
> > block size != page size. If your proposal is that we automatically
> > enable diread_nolock when we can use it safely, that's definitely
> > something to consider for the next merge window.
>
> Yes, I also think we can automatically enable dioread_nolock because it
> brings us some benefits.
>
> BTW, I think there is an minor improvement for dio overwrite codepath
> with indirect-based file. We don't need to take i_mutex in this
> condition just as we have done for extent-based file. If a user mounts
> a ext2/3 file system with a ext4 kernel modules, he/she could get a
> lower latency. But it seems that it would break dio semantic in ext2/3.
> Currently in ext2/3 if we issue a overwrite dio and then issue a read
> dio. We will always read the latest data because we wait on i_mutex
> lock. But after parallelizing overwite dio, this semantic might breaks.
> I re-read this doc but it seems that it doesn't describe this case.
^^^

Ah, sorry, I forgot to paste the link here.
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics

Regards,
- Zheng

2013-03-26 20:34:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Tue 26-03-13 13:52:51, Zheng Liu wrote:
> Sorry for the late reply.
>
> On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> > >
> > > As an aside, is there any reason to have "dioread_nolock" as an option
> > > at this point? If it works now, would you ever *not* want it?
> > >
> > > (granted it doesn't work with some journaling options etc, but that
> > > behavior could be automatic, w/o the need for special mount options).
> >
> > The primary restriction is that diread_nolock doesn't work when fs
> > block size != page size. If your proposal is that we automatically
> > enable diread_nolock when we can use it safely, that's definitely
> > something to consider for the next merge window.
>
> Yes, I also think we can automatically enable dioread_nolock because it
> brings us some benefits.
But isn't there also some overhead due to buffered writes having to go
through uninit->init conversion? Plus there's this potential deadlock in
dioread_nolock code (http://www.spinics.net/lists/linux-ext4/msg36569.html)
which I'm not sure how to fix yet...

> BTW, I think there is an minor improvement for dio overwrite codepath
> with indirect-based file. We don't need to take i_mutex in this
> condition just as we have done for extent-based file. If a user mounts
> a ext2/3 file system with a ext4 kernel modules, he/she could get a
> lower latency. But it seems that it would break dio semantic in ext2/3.
> Currently in ext2/3 if we issue a overwrite dio and then issue a read
> dio. We will always read the latest data because we wait on i_mutex
> lock. But after parallelizing overwite dio, this semantic might breaks.
> I re-read this doc but it seems that it doesn't describe this case. Do
> we need to keep this semantic?
I'm not sure but also I don't think it's important to optimize that
special case.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-27 02:57:59

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Tue, Mar 26, 2013 at 09:34:03PM +0100, Jan Kara wrote:
> On Tue 26-03-13 13:52:51, Zheng Liu wrote:
> > Sorry for the late reply.
> >
> > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> > > >
> > > > As an aside, is there any reason to have "dioread_nolock" as an option
> > > > at this point? If it works now, would you ever *not* want it?
> > > >
> > > > (granted it doesn't work with some journaling options etc, but that
> > > > behavior could be automatic, w/o the need for special mount options).
> > >
> > > The primary restriction is that diread_nolock doesn't work when fs
> > > block size != page size. If your proposal is that we automatically
> > > enable diread_nolock when we can use it safely, that's definitely
> > > something to consider for the next merge window.
> >
> > Yes, I also think we can automatically enable dioread_nolock because it
> > brings us some benefits.
> But isn't there also some overhead due to buffered writes having to go
> through uninit->init conversion?

Yeah, in my test, the IOPS will decrease after dioread_nolock enables.
But the latency of dio will also descrease. Honestly I don't test
buffered IO. So I will test this case and post the result later. IMO,
this is a tradeoff that we want to improve latency or get a better
throughput.


> Plus there's this potential deadlock in
> dioread_nolock code (http://www.spinics.net/lists/linux-ext4/msg36569.html)
> which I'm not sure how to fix yet...

Yes, we need to fix this bug first.

>
> > BTW, I think there is an minor improvement for dio overwrite codepath
> > with indirect-based file. We don't need to take i_mutex in this
> > condition just as we have done for extent-based file. If a user mounts
> > a ext2/3 file system with a ext4 kernel modules, he/she could get a
> > lower latency. But it seems that it would break dio semantic in ext2/3.
> > Currently in ext2/3 if we issue a overwrite dio and then issue a read
> > dio. We will always read the latest data because we wait on i_mutex
> > lock. But after parallelizing overwite dio, this semantic might breaks.
> > I re-read this doc but it seems that it doesn't describe this case. Do
> > we need to keep this semantic?
> I'm not sure but also I don't think it's important to optimize that
> special case.

Thanks for the comment. I am really not sure whether it is worth. Let
me test the performance w/ and w/o dioread_nolock first. :-)

Regards,
- Zheng

2013-03-29 07:17:12

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix ext4_evict_inode() racing against workqueue processing code

On Tue, Mar 26, 2013 at 09:34:03PM +0100, Jan Kara wrote:
> On Tue 26-03-13 13:52:51, Zheng Liu wrote:
> > Sorry for the late reply.
> >
> > On Wed, Mar 20, 2013 at 10:45:23AM -0400, Theodore Ts'o wrote:
> > > On Wed, Mar 20, 2013 at 09:14:42AM -0500, Eric Sandeen wrote:
> > > >
> > > > As an aside, is there any reason to have "dioread_nolock" as an option
> > > > at this point? If it works now, would you ever *not* want it?
> > > >
> > > > (granted it doesn't work with some journaling options etc, but that
> > > > behavior could be automatic, w/o the need for special mount options).
> > >
> > > The primary restriction is that diread_nolock doesn't work when fs
> > > block size != page size. If your proposal is that we automatically
> > > enable diread_nolock when we can use it safely, that's definitely
> > > something to consider for the next merge window.
> >
> > Yes, I also think we can automatically enable dioread_nolock because it
> > brings us some benefits.
> But isn't there also some overhead due to buffered writes having to go
> through uninit->init conversion?

Hi Jan,

Here is my test result. I use fio to do some performance tests. The
test environment is a Dell desktop with a Intel(R) Core(TM)2 Duo CPU
E8400 @ 3.00GHz, 4G memory, 1 Intel 320 SSD. I test three cases that
are sync dio read/write, aio dio read/write, and sync buffered io.
The result is as below. Due to too many number we post some key data
here.

>From the result we can see that after dioread_nolock enables the max
latency can be reduced dramatically for sync dio and sync bio, and other
value doesn't be affected too much. If I miss some important test case,
please let me know.

Thanks,
- Zheng

The kernel version is 3.9.0-rc4+.

= sync dio =
== fio config file ==
[global]
ioengine=psync
direct=1
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=16
fallocate=0

[reader]
rw=randread
numjobs=8

[writer]
rw=randwrite
numjobs=2

== result ==
=== w/o dioread_nolock ===
read : io=5142.1MB, bw=43885KB/s, iops=10971 , runt=120002msec
clat (usec): min=123 , max=679911 , avg=727.21, stdev=3986.35
lat (usec): min=123 , max=679911 , avg=727.40, stdev=3986.35
write: io=542548KB, bw=4521.2KB/s, iops=1130 , runt=120003msec
clat (usec): min=136 , max=2743.3K, avg=1766.78, stdev=14844.10
lat (usec): min=137 , max=2743.3K, avg=1767.14, stdev=14844.10

=== w/ dioread_nolock ===
read : io=5072.2MB, bw=43282KB/s, iops=10820 , runt=120002msec
clat (usec): min=158 , max=291847 , avg=737.38, stdev=3944.14
lat (usec): min=158 , max=291848 , avg=737.56, stdev=3944.14
write: io=589752KB, bw=4914.6KB/s, iops=1228 , runt=120001msec
clat (usec): min=138 , max=2048.3K, avg=1625.15, stdev=11344.36
lat (usec): min=139 , max=2048.3K, avg=1625.50, stdev=11344.35

= aio dio =
== fio config file ==
[global]
ioengine=libaio
direct=1
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=64
fallocate=0

[reader]
rw=randread
numjobs=8

[writer]
rw=randwrite
numjobs=2

== result ==
=== w/o dioread_nolock ===
read : io=4865.4MB, bw=41510KB/s, iops=10377 , runt=120014msec
slat (usec): min=3 , max=320273 , avg=700.79, stdev=6070.93
clat (usec): min=232 , max=958452 , avg=48627.21, stdev=53296.33
lat (usec): min=248 , max=958471 , avg=49328.36, stdev=53905.27
write: io=684100KB, bw=5700.3KB/s, iops=1425 , runt=120013msec
slat (usec): min=5 , max=349019 , avg=1339.75, stdev=9538.63
clat (usec): min=140 , max=1060.1K, avg=88459.40, stdev=95903.63
lat (usec): min=211 , max=1060.1K, avg=89799.48, stdev=97025.95

=== w/ dioread_nolock ===
read : io=4869.1MB, bw=41544KB/s, iops=10385 , runt=120037msec
slat (usec): min=4 , max=322951 , avg=700.33, stdev=5987.29
clat (usec): min=229 , max=703241 , avg=48584.27, stdev=52570.02
lat (usec): min=248 , max=703247 , avg=49284.94, stdev=53185.97
write: io=698856KB, bw=5821.1KB/s, iops=1455 , runt=120038msec
slat (usec): min=7 , max=367042 , avg=1297.87, stdev=9316.21
clat (usec): min=178 , max=968434 , avg=86625.80, stdev=92500.21
lat (usec): min=284 , max=1020.8K, avg=87924.01, stdev=93554.86

= sync buffered io =
== fio config file ==
[global]
ioengine=psync
direct=0
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=16
fallocate=0

[reader]
rw=randread
numjobs=8

[writer]
rw=randwrite
numjobs=2

== result ==
=== w/o dioread_nolock ===
read : io=7577.6MB, bw=64661KB/s, iops=16165 , runt=120001msec
clat (usec): min=1 , max=3107.8K, avg=493.15, stdev=5368.87
lat (usec): min=1 , max=3107.8K, avg=493.31, stdev=5368.87
write: io=764256KB, bw=6350.2KB/s, iops=1587 , runt=120353msec
clat (usec): min=2 , max=42443K, avg=1257.41, stdev=138832.28
lat (usec): min=2 , max=42443K, avg=1257.57, stdev=138832.29

=== w/ dioread_nolock ===
read : io=7613.1MB, bw=64972KB/s, iops=16243 , runt=120001msec
clat (usec): min=1 , max=1126.4K, avg=490.79, stdev=3742.64
lat (usec): min=1 , max=1126.4K, avg=490.95, stdev=3742.64
write: io=763428KB, bw=6351.1KB/s, iops=1587 , runt=120189msec
clat (usec): min=2 , max=45783K, avg=1258.23, stdev=163258.23
lat (usec): min=2 , max=45783K, avg=1258.37, stdev=163258.23