2011-08-10 10:51:19

by Michael Tokarev

[permalink] [raw]
Subject: DIO process stuck apparently due to dioread_nolock (3.0)

Hello.

For a few days I've been evaluating various options for using
storage. I'm interested in concurrent direct I/O
(oracle rdbms workload).

I noticed that somehow, ext4fs in mixed read-write
test greatly prefers writes over reads - writes go
at full speed while reads are almost non-existent.

Sandeen on IRC pointed me at dioread_nolock mount
option, which I tried with great results, except for
one "but".

There's a deadlock somewhere, which I can't trigger
"on demand" - I can't hit the right condition. It
happened twice in a row already, each time after the
same scenario (more about that later).

When it happens, a process doing direct AIO stalls
infinitely, with the following backtrace:

[87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
[87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
[87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
[87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
[87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
[87550.760245] Call Trace:
[87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
[87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
[87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
[87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

At this point, the process in question can't be killed or
stopped. Yes it's oracle DB, and I can kill all other processes
of this instance (this one is lgwr, aka log writer), but the stuck
process will continue to be stuck, so it is not an inter-process
deadlock.

echo "w" > /proc/sysrq-trigger shows only that process, with the
same stack trace.

This is 3.0.1 kernel from kernel.org (amd64 arch). The system is
a relatively large box (IBM System x3850 X5). So far, I've seen
this issue twice, and each time in the following scenario:

I copy an oracle database from another machine to a filesystem
mounted with dioread_nolock, and right after the copy completes,
I start the database. And immediately when Oracle opens its
DB ("Database opened") I see a stuck lgwr process like the above.

So I suspect it happens when there are some unwritten files
in buffer/page cache and some process tries to do direct
writes.

I haven't seen this happening without dioread_nolock, but since
I don't have an easy reproducer I can't say this mount option
is a requirement. So far, I was able to trigger it only after
a large db copy; with a small database I created in order to try
to reproduce it, the issue does not happen.

And sure thing, when it happens, the only way to clean up is
to forcibly reboot the machine (echo b > sysrq-trigger).

I'll continue experiments in the hope of finding an easier reproducer,
but the problem is that I have little time left before the machine
in question goes into production. So if anyone has hints
for this issue, please share.. ;)

Thank you!

/mjt


2011-08-11 11:59:46

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hello,

On Wed 10-08-11 14:51:17, Michael Tokarev wrote:
> For a few days I'm evaluating various options to use
> storage. I'm interested in concurrent direct I/O
> (oracle rdbms workload).
>
> I noticed that somehow, ext4fs in mixed read-write
> test greatly prefers writes over reads - writes goes
> at full speed while reads are almost non-existent.
>
> Sandeen on IRC pointed me at dioread_nolock mount
> option, which I tried with great results, if not
> one "but".
>
> There's a deadlock somewhere, which I can't trigger
> "on demand" - I can't hit the right condition. It
> happened twice in a row already, each time after the
> same scenario (more about that later).
>
> When it happens, a process doing direct AIO stalls
> infinitely, with the following backtrace:
>
> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
> [87550.760245] Call Trace:
> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
Hmm, the stack trace does not quite make sense to me - the part between
__do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
blocked in ext4_file_write() but I don't see any place there where we would
allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
on inode ..." in the kernel log?

> At this point, the process in question can't be killed or
> stopped. Yes it's oracle DB, and I can kill all other processes
> of this instance (this one is lgwr, aka log writer), but the stuck
> process will continue to be stuck, so it is not an inter-process
> deadlock.
>
> echo "w" > /proc/sysrq-trigger shows only that process, with the
> same stack trace.
>
> This is 3.0.1 kernel from kernel.org (amd64 arch). The system is
> a relatively large box (IBM System x3850 X5). So far, I've seen
> this issue twice, and each time in the following scenario:
>
> I copy an oracle database from another machine to filesystem
> mounted with dioread_nolock, and right after the copy completes,
> I start the database. And immediately when Oracle opens its
> DB ("Database opened") I see stuck lgwr process like above.
>
> So I suspect it happens when there are some unwritten files
> in buffer/page cache and some process tries to do direct
> writes.
>
> I haven't seen this happening without dioread_nolock, but since
> I don't have an easy reproducer I can't say this mount option
> is a requiriment. So far, I was able to trigger it only after
> large db copy, with small database I created in order to try
> to reproduce it the issue does not happen.
>
> And sure thing, when it happens, the only way to clean up is
> to forcible reboot the machine (echo b > sysrq-trigger).
>
> I'll continue experiments in a hope to find an easier reproducer,
> but the problem is that I've little time left before the machine
> in question will go into production. So if anyone have hints
> for this issue, please share.. ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-11 12:21:44

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

11.08.2011 15:59, Jan Kara wrote:
> Hello,
>
> On Wed 10-08-11 14:51:17, Michael Tokarev wrote:
>> For a few days I'm evaluating various options to use
>> storage. I'm interested in concurrent direct I/O
>> (oracle rdbms workload).
>>
>> I noticed that somehow, ext4fs in mixed read-write
>> test greatly prefers writes over reads - writes goes
>> at full speed while reads are almost non-existent.
>>
>> Sandeen on IRC pointed me at dioread_nolock mount
>> option, which I tried with great results, if not
>> one "but".
>>
>> There's a deadlock somewhere, which I can't trigger
>> "on demand" - I can't hit the right condition. It
>> happened twice in a row already, each time after the
>> same scenario (more about that later).
>>
>> When it happens, a process doing direct AIO stalls
>> infinitely, with the following backtrace:
>>
>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
>> [87550.760245] Call Trace:
>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> Hmm, the stack trace does not quite make sense to me - the part between
> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
> blocked in ext4_file_write() but I don't see any place there where we would
> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
> on inode ..." in the kernel log?

Yes, there are warnings about unaligned DIO, referring to this same
process actually. Oracle does an almost good job of aligning writes
(usually it does I/O in its own blocks, which are 4Kb by default but
are set to something larger - like 16Kb - for larger databases),
except for a few cases, and the lgwr process is one of them (*) - it
writes logfiles using 512b blocks. This is okay for a raw device
with 512-byte blocks, but ext4 expects 4k writes at minimum.

(*) another case is writing to the control file, which is also done in
512-byte chunks.
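
For illustration, here is a minimal sketch of that access pattern: one
512-byte O_DIRECT write queued through libaio, roughly what lgwr does with
its redo logs. This is not Oracle's actual code - the file name, offset
and (minimal) error handling are made up, and it assumes libaio is
available - but on an ext4 filesystem with 4k blocks this is exactly the
kind of submission that produces the "Unaligned AIO/DIO on inode ..."
warning seen in the logs in this thread.

/* Sketch only: a 512-byte-aligned direct AIO write, smaller than the
 * filesystem block size.  Build with -laio (assumes libaio-dev). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	fd = open("redo01.log", O_RDWR | O_DIRECT);	/* hypothetical file */
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;

	/* 512 bytes, 512-byte aligned: fine for a 512-byte-sector raw
	 * device, but unaligned with respect to ext4's 4k blocks. */
	if (posix_memalign(&buf, 512, 512))
		return 1;
	memset(buf, 0, 512);

	io_prep_pwrite(&cb, fd, buf, 512, 512);	/* 512 bytes at offset 512 */
	if (io_submit(ctx, 1, cbs) != 1)	/* io_submit() is where the stuck
						 * process sits in the traces above */
		return 1;
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}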

>> At this point, the process in question can't be killed or
>> stopped. Yes it's oracle DB, and I can kill all other processes
>> of this instance (this one is lgwr, aka log writer), but the stuck
>> process will continue to be stuck, so it is not an inter-process
>> deadlock.
>>
>> echo "w" > /proc/sysrq-trigger shows only that process, with the
>> same stack trace.

Just for the record, this is from echoing "w" into sysrq-trigger:

[ 4691.585450] SysRq : Show Blocked State
[ 4691.585493] task PC stack pid father
[ 4691.585752] oracle D 0000000000000000 0 23357 1 0x00000004
[ 4691.585804] ffff88127719e080 0000000000000082 ffff881200000000 ffff881279791040
[ 4691.585883] ffff8803caf03fd8 ffff8803caf03fd8 ffff8803caf03fd8 ffff88127719e080
[ 4691.585961] ffffea0046db2510 ffffffff00000000 ffff8820431b24a8 ffffffff810eeda2
[ 4691.586039] Call Trace:
[ 4691.586079] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
[ 4691.586140] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 4691.586185] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 4691.586232] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 4691.586275] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 4691.586315] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 4691.586354] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 4691.586398] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

It is almost, but not quite, the same as above.

Thank you,

/mjt

>> This is 3.0.1 kernel from kernel.org (amd64 arch). The system is
>> a relatively large box (IBM System x3850 X5). So far, I've seen
>> this issue twice, and each time in the following scenario:
>>
>> I copy an oracle database from another machine to filesystem
>> mounted with dioread_nolock, and right after the copy completes,
>> I start the database. And immediately when Oracle opens its
>> DB ("Database opened") I see stuck lgwr process like above.
>>
>> So I suspect it happens when there are some unwritten files
>> in buffer/page cache and some process tries to do direct
>> writes.
>>
>> I haven't seen this happening without dioread_nolock, but since
>> I don't have an easy reproducer I can't say this mount option
>> is a requiriment. So far, I was able to trigger it only after
>> large db copy, with small database I created in order to try
>> to reproduce it the issue does not happen.
>>
>> And sure thing, when it happens, the only way to clean up is
>> to forcible reboot the machine (echo b > sysrq-trigger).
>>
>> I'll continue experiments in a hope to find an easier reproducer,
>> but the problem is that I've little time left before the machine
>> in question will go into production. So if anyone have hints
>> for this issue, please share.. ;)
>
> Honza


2011-08-11 14:01:03

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hello,

On Thu 11-08-11 16:21:42, Michael Tokarev wrote:
> 11.08.2011 15:59, Jan Kara wrote:
> > On Wed 10-08-11 14:51:17, Michael Tokarev wrote:
> >> For a few days I'm evaluating various options to use
> >> storage. I'm interested in concurrent direct I/O
> >> (oracle rdbms workload).
> >>
> >> I noticed that somehow, ext4fs in mixed read-write
> >> test greatly prefers writes over reads - writes goes
> >> at full speed while reads are almost non-existent.
> >>
> >> Sandeen on IRC pointed me at dioread_nolock mount
> >> option, which I tried with great results, if not
> >> one "but".
> >>
> >> There's a deadlock somewhere, which I can't trigger
> >> "on demand" - I can't hit the right condition. It
> >> happened twice in a row already, each time after the
> >> same scenario (more about that later).
> >>
> >> When it happens, a process doing direct AIO stalls
> >> infinitely, with the following backtrace:
> >>
> >> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
> >> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
> >> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
> >> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
> >> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
> >> [87550.760245] Call Trace:
> >> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> >> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
> >> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
> >> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> >> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> >> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> >> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> >> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> >> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> >> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> > Hmm, the stack trace does not quite make sense to me - the part between
> > __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
> > blocked in ext4_file_write() but I don't see any place there where we would
> > allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
> > on inode ..." in the kernel log?
>
> Yes, there are warnings about unaligned DIO, referring to this same
> process actually. Oracle does almost good job at aligning writes
> (usually it does i/o by its blocks which are 4Kb by default but
> are set to something larger - like 16Kb - for larger database).
> Except of a few cases, and lgwr process is one of them (*) - it
> writes logfiles using 512b blocks. This is okay for a raw device
> with 512bytes blocks, but ext4 expects 4k writes at min.
>
> (*) another case is writing to control file, which is also done in
> 512byte chunks.
Ah, OK. I think the code in ext4_end_io_nolock() handling the wait queue
might be racy. The waitqueue_active() check is missing a barrier, I think.
Does the attached (untested) patch fix the issue for you?
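
To spell out the suspected race, the sketch below shows the generic
waitqueue_active() pattern, not the literal ext4_end_io_nolock() code;
the "pending" condition and the wait queue "wq" are illustrative
assumptions only.

/* Waiter side (e.g. the task waiting for outstanding AIO): */
DEFINE_WAIT(wait);
for (;;) {
	prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
	if (!pending)			/* hypothetical condition */
		break;
	schedule();			/* relies on being woken up */
}
finish_wait(&wq, &wait);

/* Completion side: */
pending = 0;				/* make the condition false ... */
/*
 * ... but without a full barrier between this store and the
 * waitqueue_active() check, the waker can observe an empty queue while
 * a waiter that has already queued itself still observes the old
 * "pending" value: wake_up_all() is skipped and the waiter sleeps
 * forever.
 */
smp_mb();				/* the barrier suspected to be missing */
if (waitqueue_active(&wq))
	wake_up_all(&wq);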

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext4-Fix-missed-wakeups.patch (1.36 kB)

2011-08-11 20:05:12

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

11.08.2011 18:01, Jan Kara wrote:
[]
>>>> When it happens, a process doing direct AIO stalls
>>>> infinitely, with the following backtrace:
>>>>
>>>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
>>>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
>>>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
>>>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
>>>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
>>>> [87550.760245] Call Trace:
>>>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
>>>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
>>>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>>> Hmm, the stack trace does not quite make sense to me - the part between
>>> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
>>> blocked in ext4_file_write() but I don't see any place there where we would
>>> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
>>> on inode ..." in the kernel log?
>>
>> Yes, there are warnings about unaligned DIO, referring to this same
>> process actually. Oracle does almost good job at aligning writes
>> (usually it does i/o by its blocks which are 4Kb by default but
>> are set to something larger - like 16Kb - for larger database).
>> Except of a few cases, and lgwr process is one of them (*) - it
>> writes logfiles using 512b blocks. This is okay for a raw device
>> with 512bytes blocks, but ext4 expects 4k writes at min.
>>
>> (*) another case is writing to control file, which is also done in
>> 512byte chunks.
> Ah, OK. I think the code ext4_end_io_nolock() handling the wait queue
> might be racy. waitqueue_active() check is missing a barrier I think.
> Does attached (untested) patch fix the issue for you?

With this patch applied, I can reproduce the problem easily too:

[ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
[ 1469.734114] SysRq : Show Blocked State
[ 1469.734157] task PC stack pid father
[ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
[ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
[ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
[ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
[ 1469.734760] Call Trace:
[ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
[ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

which is exactly the same as with the previous "w" into sysrq-trigger.

So it's something else... :(

Thank you!

/mjt

2011-08-12 02:46:16

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Michael,

Could you try your test with the patch I just posted:
http://marc.info/?l=linux-ext4&m=131311627101278&w=2
and see whether it fixes the problem?

Thanks!

Jiaying

On Thu, Aug 11, 2011 at 1:05 PM, Michael Tokarev <[email protected]> wrote:
> 11.08.2011 18:01, Jan Kara wrote:
> []
>>>>> When it happens, a process doing direct AIO stalls
>>>>> infinitely, with the following backtrace:
>>>>>
>>>>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
>>>>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
>>>>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
>>>>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
>>>>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
>>>>> [87550.760245] Call Trace:
>>>>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>>>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
>>>>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
>>>>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>>>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>>>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>>>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>>>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>>>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>>>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>>>> Hmm, the stack trace does not quite make sense to me - the part between
>>>> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
>>>> blocked in ext4_file_write() but I don't see any place there where we would
>>>> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
>>>> on inode ..." in the kernel log?
>>>
>>> Yes, there are warnings about unaligned DIO, referring to this same
>>> process actually. Oracle does almost good job at aligning writes
>>> (usually it does i/o by its blocks which are 4Kb by default but
>>> are set to something larger - like 16Kb - for larger database).
>>> Except of a few cases, and lgwr process is one of them (*) - it
>>> writes logfiles using 512b blocks. This is okay for a raw device
>>> with 512bytes blocks, but ext4 expects 4k writes at min.
>>>
>>> (*) another case is writing to control file, which is also done in
>>> 512byte chunks.
>> Ah, OK. I think the code ext4_end_io_nolock() handling the wait queue
>> might be racy. waitqueue_active() check is missing a barrier I think.
>> Does attached (untested) patch fix the issue for you?
>
> With this patch applied, I can reproduce the problem easily too:
>
> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
> [ 1469.734114] SysRq : Show Blocked State
> [ 1469.734157] task PC stack pid father
> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> [ 1469.734760] Call Trace:
> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> which is exactly the same as with previous "w" into sysrq-trigger.
>
> So it's something else... :(
>
> Thank you!
>
> /mjt
>

2011-08-12 06:23:44

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

12.08.2011 06:46, Jiaying Zhang wrote:
> Hi Michael,
>
> Could you try your test with the patch I just posted:
> http://marc.info/?l=linux-ext4&m=131311627101278&w=2
> and see whether it fixes the problem?

No, it does not. I'm now able to trigger it more or
less reliably - I need to decompress a relatively
small (about 70Gb) oracle database and try to start
it (this requires a rebuild of initrd and reboot of course --
the whole thing takes about 15 minutes) - and I see this:

[ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor.
[ 960.915602] SysRq : Show Blocked State
[ 960.915650] task PC stack pid father
[ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000
[ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0
[ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040
[ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6
[ 960.918045] Call Trace:
[ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48
[ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4]
[ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

(Inode 5767175 is one of the Oracle redo logs.)

Jan, I lied to you initially - I didn't even test your first patch,
because I loaded the wrong initrd. With it applied, the situation
still does not improve, and it looks exactly the same as with this new
patch by Jiaying Zhang:

[ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor.
[ 422.967323] SysRq : Show Blocked State
[ 422.967494] task PC stack pid father
[ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000
[ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080
[ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810
[ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6
[ 422.969287] Call Trace:
[ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48
[ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4]
[ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 422.970312] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

Note that in both these cases I now see a slightly different
backtrace -- both mention ext4_llseek and abort_exclusive_wait,
but the rest is different:

>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>> [ 1469.734114] SysRq : Show Blocked State
>> [ 1469.734157] task PC stack pid father
>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>> [ 1469.734760] Call Trace:
>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

As Jan already pointed out, this place looks bogus, and
the same can be said about the new backtrace. So I wonder
if there's some stack corruption going on there as well.

Btw, does ext4_llseek() look sane here? Note it's called from
aio_submit() -- does it _ever_ implement SEEKs?

Maybe some debugging is necessary here?

Thank you!

/mjt

2011-08-12 07:07:46

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

12.08.2011 10:23, Michael Tokarev wrote:
> 12.08.2011 06:46, Jiaying Zhang wrote:
>> Hi Michael,
>>
>> Could you try your test with the patch I just posted:
>> http://marc.info/?l=linux-ext4&m=131311627101278&w=2
>> and see whether it fixes the problem?
>
> No it does not. I'm now able to trigger it more or
> less reliable - I need to decompress a relatively
> small (about 70Gb) oracle database and try to start
> it (this requires a rebuild of initrd and reboot ofcourse --
> whole thing takes about 15 minutes) - and I see this:

And I forgot to note an important (in my opinion anyway)
detail: I can only trigger the issue after copying the
database files, i.e., on a hot cache. After a reboot, when
the files are not in cache, I can start this database
without any issues whatsoever, and it passes all our
stress tests just fine. So it appears the problem only
happens when the cache is hot - maybe DIO is fighting
with the in-memory cache of the file or some such, or maybe
with a not-yet-written file (since the amount of memory on
this machine is large, the whole database with all its files
fits in memory). Next time I'll try to ensure only the
(problematic and small) logfiles are hot in the cache.

Thanks,

/mjt

> [ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor.
> [ 960.915602] SysRq : Show Blocked State
> [ 960.915650] task PC stack pid father
> [ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000
> [ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0
> [ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040
> [ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6
> [ 960.918045] Call Trace:
> [ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4]
> [ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> (Inum 5767175 is one of oracle redologs).
>
> Jan, I lied to you initially - I didn't even test your first patch,
> because I loaded the wrong initrd. With it applied, situation does
> not improve still, and it looks exactly the same as with this new
> patch by Jiaying Zhang:
>
> [ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor.
> [ 422.967323] SysRq : Show Blocked State
> [ 422.967494] task PC stack pid father
> [ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000
> [ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080
> [ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810
> [ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6
> [ 422.969287] Call Trace:
> [ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4]
> [ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 422.970312] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> Note in both this cases, I now see slightly different
> backtrace -- both mentions ext4_llseek and abort_exclusive_wait,
> but the rest is different:
>
>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>>> [ 1469.734114] SysRq : Show Blocked State
>>> [ 1469.734157] task PC stack pid father
>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>>> [ 1469.734760] Call Trace:
>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> As Jan already pointed out, this place looks bogus, and
> the same can be said about the new backtrace. So I wonder
> if there's some stack corruption going on there as well.
>
> Btw, does ext4_llseek() look sane here? Note it's called from
> aio_submit() -- does it _ever_ implement SEEKs?
>
> Maybe some debugging is neecessary here?
>
> Thank you!
>
> /mjt

2011-08-12 13:07:31

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Fri 12-08-11 10:23:41, Michael Tokarev wrote:
> 12.08.2011 06:46, Jiaying Zhang wrote:
> > Hi Michael,
> >
> > Could you try your test with the patch I just posted:
> > http://marc.info/?l=linux-ext4&m=131311627101278&w=2
> > and see whether it fixes the problem?
>
> No it does not. I'm now able to trigger it more or
> less reliable - I need to decompress a relatively
> small (about 70Gb) oracle database and try to start
> it (this requires a rebuild of initrd and reboot ofcourse --
> whole thing takes about 15 minutes) - and I see this:
>
> [ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor.
> [ 960.915602] SysRq : Show Blocked State
> [ 960.915650] task PC stack pid father
> [ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000
> [ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0
> [ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040
> [ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6
> [ 960.918045] Call Trace:
> [ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4]
> [ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> (Inum 5767175 is one of oracle redologs).
>
> Jan, I lied to you initially - I didn't even test your first patch,
> because I loaded the wrong initrd. With it applied, situation does
> not improve still, and it looks exactly the same as with this new
> patch by Jiaying Zhang:
>
> [ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor.
> [ 422.967323] SysRq : Show Blocked State
> [ 422.967494] task PC stack pid father
> [ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000
> [ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080
> [ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810
> [ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6
> [ 422.969287] Call Trace:
> [ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4]
> [ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 422.970312] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> Note in both this cases, I now see slightly different
> backtrace -- both mentions ext4_llseek and abort_exclusive_wait,
> but the rest is different:
>
> >> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
> >> [ 1469.734114] SysRq : Show Blocked State
> >> [ 1469.734157] task PC stack pid father
> >> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
> >> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> >> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> >> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> >> [ 1469.734760] Call Trace:
> >> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> >> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> >> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> >> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> >> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> >> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> >> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> >> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> As Jan already pointed out, this place looks bogus, and
> the same can be said about the new backtrace. So I wonder
> if there's some stack corruption going on there as well.
Probably not, just garbage on the stack confuses the stack unwinder. Note
that e.g. the ext4_llseek() address is at the very end of the function, as is
the abort_exclusive_wait() one, which probably means these functions have
already finished and just left their addresses on the stack. Could you
disassemble your ext4_file_write() function and check where offset 0x20d is?
Sadly you won't see where calls are going when ext4 is compiled as a module
(the module is linked when loaded into the kernel), so it might be easier to
compile ext4 into the kernel and disassemble the function in the vmlinux file.

> Btw, does ext4_llseek() look sane here? Note it's called from
> aio_submit() -- does it _ever_ implement SEEKs?
>
> Maybe some debugging is neecessary here?
Yes. The trouble is I'm not completely sure where we are hanging yet.
We should know more from your disassembly. But you can try running with
the attached debug patch - maybe it shows something interesting.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
ext4-debug.diff (485.00 B)

2011-08-12 15:55:04

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

12.08.2011 17:07, Jan Kara wrote:
[]
>>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>>>> [ 1469.734114] SysRq : Show Blocked State
>>>> [ 1469.734157] task PC stack pid father
>>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>>>> [ 1469.734760] Call Trace:
>>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
[]
> Probably not, just garbage on stack confuses the stack unwinder. Note
> that e.g. ext4_llseek() is at the end of function as well as
> abort_exclusive_wait() which probably means these functions have already
> finished and just left their addresses on stack. Could you disasseble
> your ext4_file_write() function and check where offset 0x20d is? Sadly you
> won't see where calls are going when ext4 is compiled as a module (module
> is linked when loaded into kernel) so it might be easier to compile ext4
> into the kernel and disassemble the function in vmlinux file.
>
>> Btw, does ext4_llseek() look sane here? Note it's called from
>> aio_submit() -- does it _ever_ implement SEEKs?
>>
>> Maybe some debugging is neecessary here?
> Yes. The trouble is I'm not completely sure where we are hanging yet.
> We should know more from your disassembly. But you can try running with
> attached debug patch - maybe it shows something interesting.

With ext4 built-in and your patch applied:

[ 429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor.
[ 429.061669] Going to wait for 18446744073709551199 aios
[ 437.717942] SysRq : Show Blocked State
[ 437.718109] task PC stack pid father
[ 437.718528] oracle D 0000000000000000 0 3844 1 0x00000000
[ 437.718767] ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0
[ 437.719156] ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080
[ 437.719546] 0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420
[ 437.719935] Call Trace:
[ 437.720043] [<ffffffff8139b976>] ? printk+0x43/0x48
[ 437.720155] [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290
[ 437.720267] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 437.720376] [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120
[ 437.720485] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 437.720593] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 437.720699] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 437.720811] [<ffffffff813a3152>] ? system_call_fastpath+0x16/0x1b

Trying to find how to disassemble things now...
And yes, 18446744073709551199 aios sounds like quite a lot ;)
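
For the record, assuming the debug message prints the raw counter as an
unsigned 64-bit value, that figure decodes to a small negative number:

	2^64 - 18446744073709551199 = 18446744073709551616 - 18446744073709551199 = 417

i.e. the counter is really -417 - it has wrapped below zero, which is what
the follow-ups below conclude as well.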

Thanks,

/mjt

2011-08-12 17:01:51

by Eric Sandeen

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 8/12/11 10:55 AM, Michael Tokarev wrote:
> 12.08.2011 17:07, Jan Kara wrote:
> []
>>>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>>>>> [ 1469.734114] SysRq : Show Blocked State
>>>>> [ 1469.734157] task PC stack pid father
>>>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>>>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>>>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>>>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>>>>> [ 1469.734760] Call Trace:
>>>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> []
>> Probably not, just garbage on stack confuses the stack unwinder. Note
>> that e.g. ext4_llseek() is at the end of function as well as
>> abort_exclusive_wait() which probably means these functions have already
>> finished and just left their addresses on stack. Could you disasseble
>> your ext4_file_write() function and check where offset 0x20d is? Sadly you
>> won't see where calls are going when ext4 is compiled as a module (module
>> is linked when loaded into kernel) so it might be easier to compile ext4
>> into the kernel and disassemble the function in vmlinux file.
>>
>>> Btw, does ext4_llseek() look sane here? Note it's called from
>>> aio_submit() -- does it _ever_ implement SEEKs?
>>>
>>> Maybe some debugging is neecessary here?
>> Yes. The trouble is I'm not completely sure where we are hanging yet.
>> We should know more from your disassembly. But you can try running with
>> attached debug patch - maybe it shows something interesting.
>
> With ext4 built-in and your patch applied:
>
> [ 429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor.
> [ 429.061669] Going to wait for 18446744073709551199 aios
> [ 437.717942] SysRq : Show Blocked State
> [ 437.718109] task PC stack pid father
> [ 437.718528] oracle D 0000000000000000 0 3844 1 0x00000000
> [ 437.718767] ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0
> [ 437.719156] ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080
> [ 437.719546] 0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420
> [ 437.719935] Call Trace:
> [ 437.720043] [<ffffffff8139b976>] ? printk+0x43/0x48
> [ 437.720155] [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290
> [ 437.720267] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 437.720376] [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120
> [ 437.720485] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 437.720593] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 437.720699] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 437.720811] [<ffffffff813a3152>] ? system_call_fastpath+0x16/0x1b
>
> Trying to find how to disassemble things now...
> And yes, 18446744073709551199 aios sounds quite alot ;)

Looks like it went negative.

I see that in one case we set EXT4_IO_END_UNWRITTEN, but don't increment the counter.
We decrement the counter for every EXT4_IO_END_UNWRITTEN completion, I think.

I'm not quite sure if that was intentional or not, but it might be a place to start.
I haven't thought hard about this, I'm in the middle of something else right now,
but this looks like it's a problem in my code from that unaligned AIO patch,
perhaps...

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e5191f..7366488 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3640,6 +3640,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)

io_end->flag = EXT4_IO_END_UNWRITTEN;
inode = io_end->inode;
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);

/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);


> Thanks,
>
> /mjt


2011-08-12 17:34:48

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

12.08.2011 21:01, Eric Sandeen wrote:
[]
>> And yes, 18446744073709551199 aios sounds quite alot ;)
>
> looks like it went negative.
>
> I see that in one case we set EXT4_IO_END_UNWRITTEN, but don't increment the counter.
> We decrement the counter for every EXT4_IO_END_UNWRITTEN completion, I think.
>
> I'm not quite sure if that was intentional or not, but it might be a place to start.
> I haven't though hard about this, in the middle of something else right now,
> but this looks like it's a probllem in my code from that unaligned AIO patch,
> perhaps...
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3e5191f..7366488 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3640,6 +3640,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>
> io_end->flag = EXT4_IO_END_UNWRITTEN;
> inode = io_end->inode;
> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>
> /* Add the io_end to per-inode completed io list*/
> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);

This patch does not change things (unfortunately I forgot to reapply
the debugging patch posted by Jan, so I don't know how many aiocbs it
wants to wait for).

But reverting e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d
(ext4: serialize unaligned asynchronous DIO) helps; things
go fine without that patch.

And this refcounting - I can't see why in my case the problem only
happens with a hot cache and only with the dioread_nolock mount option
(so far anyway - I wasn't able to trigger it without at least
one of the two conditions).

BTW, ext4_end_io_buffer_write() is declared twice in fs/ext4/inode.c

Thanks,

/mjt

2011-08-12 21:19:33

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Fri 12-08-11 19:55:00, Michael Tokarev wrote:
> 12.08.2011 17:07, Jan Kara wrote:
> []
> >>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
> >>>> [ 1469.734114] SysRq : Show Blocked State
> >>>> [ 1469.734157] task PC stack pid father
> >>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
> >>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> >>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> >>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> >>>> [ 1469.734760] Call Trace:
> >>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> >>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> >>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> >>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> >>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> >>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> >>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> >>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> []
> > Probably not, just garbage on stack confuses the stack unwinder. Note
> > that e.g. ext4_llseek() is at the end of function as well as
> > abort_exclusive_wait() which probably means these functions have already
> > finished and just left their addresses on stack. Could you disasseble
> > your ext4_file_write() function and check where offset 0x20d is? Sadly you
> > won't see where calls are going when ext4 is compiled as a module (module
> > is linked when loaded into kernel) so it might be easier to compile ext4
> > into the kernel and disassemble the function in vmlinux file.
> >
> >> Btw, does ext4_llseek() look sane here? Note it's called from
> >> aio_submit() -- does it _ever_ implement SEEKs?
> >>
> >> Maybe some debugging is neecessary here?
> > Yes. The trouble is I'm not completely sure where we are hanging yet.
> > We should know more from your disassembly. But you can try running with
> > attached debug patch - maybe it shows something interesting.
>
> With ext4 built-in and your patch applied:
>
> [ 429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor.
> [ 429.061669] Going to wait for 18446744073709551199 aios
> [ 437.717942] SysRq : Show Blocked State
> [ 437.718109] task PC stack pid father
> [ 437.718528] oracle D 0000000000000000 0 3844 1 0x00000000
> [ 437.718767] ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0
> [ 437.719156] ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080
> [ 437.719546] 0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420
> [ 437.719935] Call Trace:
> [ 437.720043] [<ffffffff8139b976>] ? printk+0x43/0x48
> [ 437.720155] [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290
> [ 437.720267] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 437.720376] [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120
> [ 437.720485] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 437.720593] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 437.720699] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 437.720811] [<ffffffff813a3152>] ? system_call_fastpath+0x16/0x1b
>
> Trying to find how to disassemble things now...
> And yes, 18446744073709551199 aios sounds quite alot ;)
I think you don't have to. My stab-in-the-dark patch is showing that we
underflow the number of outstanding aiodio calls - we think there are -417
of these calls outstanding. Thus we wait indefinitely. So now we just have
to find where the accounting is wrong.
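
To make the failure mode concrete, here is a sketch of the accounting as
described in this thread. The counter name (i_aiodio_unwritten) and the
EXT4_I() accessor come from the patches quoted later; the wait queue "wq"
and the surrounding code are illustrative assumptions, not the literal
3.0 source.

/* Submission of unwritten-extent AIO: supposed to increment ... */
atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);

/* ... completion of an EXT4_IO_END_UNWRITTEN io_end: decrements. */
atomic_dec(&EXT4_I(inode)->i_aiodio_unwritten);

/* An unaligned DIO waits for all outstanding unwritten AIO to drain. */
wait_event(*wq,
	   atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0);

/*
 * If a decrement ever runs without a matching increment, the counter
 * drops below zero, the "== 0" condition can never become true again,
 * and the waiter above sleeps in D state forever - the hang reported at
 * the top of the thread.
 */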

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-13 16:09:24

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

From: Tao Ma <[email protected]>

Hi Michael,
could you please check whether this patch works for you?


Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
From: Tao Ma <[email protected]>
Date: Sat, 13 Aug 2011 23:50:28 +0800
Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.

Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten should
be done simultaneously, since ext4_end_io_nolock always clears the flag and
decreases the counter at the same time.

We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so
it will go negative and cause some process to wait forever.

Part of the patch came from Eric in his e-mail.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2

Reported-by: Michael Tokarev <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Tao Ma <[email protected]>
---
fs/ext4/inode.c | 9 ++++++++-
fs/ext4/page-io.c | 6 ++++--
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d47264c..40c0b10 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
goto out;
}

- io_end->flag = EXT4_IO_END_UNWRITTEN;
+ /*
+ * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
+ * but being more careful is always safe for the future change.
+ */
inode = io_end->inode;
+ if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ io_end->flag |= EXT4_IO_END_UNWRITTEN;
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ }

/* Add the io_end to per-inode completed io list*/
spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 430c401..78839af 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -334,8 +334,10 @@ submit_and_retry:
if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
(io_end->pages[io_end->num_io_pages-1] != io_page))
goto submit_and_retry;
- if (buffer_uninit(bh))
- io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
+ if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ io_end->flag |= EXT4_IO_END_UNWRITTEN;
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ }
io->io_end->size += bh->b_size;
io->io_next_block++;
ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
--
1.7.0.4


2011-08-14 20:57:14

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

13.08.2011 20:02, Tao Ma wrote:
> From: Tao Ma <[email protected]>
>
> Hi Michael,
> could you please check whether this patch works for you?

With this patch applied to 3.0.1 I can't trigger the issue anymore,
after several attempts -- the system just works as it should.
I'm not sure whether this is the right fix or my testcase just isn't
as good as it could be... ;)

> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
> From: Tao Ma <[email protected]>
> Date: Sat, 13 Aug 2011 23:50:28 +0800
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>
> Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
> should be done simultaneously, since ext4_end_io_nolock always clears the
> flag and decreases the counter at the same time.
>
> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so
> it will go negative and cause some processes to wait forever.
>
> Part of the patch came from Eric in his e-mail.
> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>
> Reported-by: Michael Tokarev<[email protected]>
> Signed-off-by: Eric Sandeen <[email protected]>
> Signed-off-by: Tao Ma <[email protected]>

You can add my Tested-By too ;)

Thank you!

And please excuse the long delay in replying - I was AFK the whole weekend.

/mjt

> fs/ext4/inode.c | 9 ++++++++-
> fs/ext4/page-io.c | 6 ++++--
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d47264c..40c0b10 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> goto out;
> }
>
> - io_end->flag = EXT4_IO_END_UNWRITTEN;
> + /*
> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
> + * but being more careful is always safe for the future change.
> + */
> inode = io_end->inode;
> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> + }
>
> /* Add the io_end to per-inode completed io list*/
> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 430c401..78839af 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -334,8 +334,10 @@ submit_and_retry:
> if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
> (io_end->pages[io_end->num_io_pages-1] != io_page))
> goto submit_and_retry;
> - if (buffer_uninit(bh))
> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> + }
> io->io_end->size += bh->b_size;
> io->io_next_block++;
> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));


2011-08-14 21:07:28

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

15.08.2011 00:57, Michael Tokarev wrote:
> 13.08.2011 20:02, Tao Ma wrote:
>> From: Tao Ma <[email protected]>
>>
>> Hi Michael,
>> could you please check whether this patch works for you?
>
> With this patch applied to 3.0.1 I can't trigger the issue anymore,
> after several attempts -- the system just works as it should.
> I'm not sure whether this is the right fix or my testcase just isn't
> as good as it could be... ;)

Well, I found a way to trigger data corruption with this patch
applied. I guess it's not the fault of this patch, but of some
deeper problem instead.

The sequence is my usual one: copy an oracle database from another
place and start it. When oracle starts doing its direct I/O
against its redologs, we had the problem which is now solved. But
now I do the following: I shut down the database, rename the current
redologs out of the way, and copy them back into place as new files.
And start the database again.

This time, oracle complains that the redologs contain garbage.
I can reboot the machine now and compare the old (renamed) redologs
with the copies - they're indeed different.

My guess is that the copy is done from the pagecache - from the old
contents of the files, somehow ignoring the (direct) writes
performed by the initial database open. But that copy is somehow
damaged now too, since even the file identification is now different.

Is this new issue something that dioread_nolock is supposed to create?
I mean, it isn't entirely clear what it is supposed to do - it looks
somewhat hackish - but without it performance is quite bad.

Thanks,

/mjt

2011-08-15 02:36:27

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 08/15/2011 05:07 AM, Michael Tokarev wrote:
> 15.08.2011 00:57, Michael Tokarev wrote:
>> 13.08.2011 20:02, Tao Ma wrote:
>>> From: Tao Ma <[email protected]>
>>>
>>> Hi Michael,
>>> could you please check whether this patch works for you?
>>
>> With this patch applied to 3.0.1 I can't trigger the issue anymore,
>> after several attempts -- the system just works as it should.
>> I'm not sure whether this is the right fix or my testcase just isn't
>> as good as it could be... ;)
Thanks for the test.
>
> Well, I found a way to trigger data corruption with this patch
> applied. I guess it's not the fault of this patch, but of some
> deeper problem instead.
>
> The sequence is my usual one: copy an oracle database from another
> place and start it. When oracle starts doing its direct I/O
> against its redologs, we had the problem which is now solved. But
> now I do the following: I shut down the database, rename the current
> redologs out of the way, and copy them back into place as new files.
> And start the database again.
>
> This time, oracle complains that the redologs contain garbage.
> I can reboot the machine now and compare the old (renamed) redologs
> with the copies - they're indeed different.
>
> My guess is that the copy is done from the pagecache - from the old
> contents of the files, somehow ignoring the (direct) writes
> performed by the initial database open. But that copy is somehow
> damaged now too, since even the file identification is now different.
>
> Is this new issue something that dioread_nolock is supposed to create?
> I mean, it isn't entirely clear what it is supposed to do - it looks
> somewhat hackish - but without it performance is quite bad.
So could I generalize your sequence as below:
1. copy a large file to a new ext4 volume
2. do some direct i/o read/write to this file (bs=512)
3. rename it.
4. cp this back to the original file
5. do direct i/o read/write (bs=512) now and the file is actually corrupted.

You used to hit the problem in step 2, and my patch resolved it. Now
you hit a problem in step 5. Right?

Thanks
Tao

2011-08-15 08:00:53

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

15.08.2011 06:36, Tao Ma wrote:
> On 08/15/2011 05:07 AM, Michael Tokarev wrote:
[]
>> Well, I found a way to trigger data corruption with this patch
>> applied. I guess it's not the fault of this patch, but of some
>> deeper problem instead.
>>
>> The sequence is my usual one: copy an oracle database from another
>> place and start it. When oracle starts doing its direct I/O
>> against its redologs, we had the problem which is now solved. But
>> now I do the following: I shut down the database, rename the current
>> redologs out of the way, and copy them back into place as new files.
>> And start the database again.
>>
>> This time, oracle complains that the redologs contain garbage.
>> I can reboot the machine now and compare the old (renamed) redologs
>> with the copies - they're indeed different.
>>
>> My guess is that the copy is done from the pagecache - from the old
>> contents of the files, somehow ignoring the (direct) writes
>> performed by the initial database open. But that copy is somehow
>> damaged now too, since even the file identification is now different.
>>
>> Is this new issue something that dioread_nolock is supposed to create?
>> I mean, it isn't entirely clear what it is supposed to do - it looks
>> somewhat hackish - but without it performance is quite bad.
> So could I generalize your sequence like below:
> 1. copy a large file to a new ext4 volume
> 2. do some direct i/o read/write to this file(bs=512)
> 3. rename it.
> 4. cp this back to the original file
> 5. do direct i/o read/write(bs=512) now and the file is actually corrupted.
>
> You used to meet with problem in step 2, and my patch resolved it. Now
> you met with problems in step 5. Right?

SQL> shutdown immediate; -- shuts down the database cleanly

$ mkdir tmp
$ mv redo* tmp/
$ cp -p tmp/* .
-- this makes the redolog files hot in the page cache, possibly not even written to disk yet.

SQL> startup
Database mounted.
-- now open and read our redologs...
-- at this point, without the patch, it hangs.
ORA-00316: log 1 of thread 1, type in header is not log file
ORA-00312: online log 1 thread 1: '.../redo1.odf'

$ mv -f tmp/* .

SQL> alter database open; -- this will try to open files again and read them again
Database altered. -- and now we're fine.

This is my small(ish) testcase so far. Only the redologs
need to be in hot cache in order to trigger the issue.
Oracle does direct I/O in 512-byte blocks on these redo*
files.

The rename and the new directory are just to keep the pieces
of the database in the right place.

There's even more fun. I once managed to get old content in
the copied files, but I can't repeat it. I made a copy as
before, sync(1)ed everything, started the database - it was
ok. Next I shut it down and rebooted (why drop_caches does
not really work is another big question). And now oracle
complains that the redologs contain a previous sequence number.
(To clarify: there's a sequence number in each oracle db
which is incremented each time something happens with the
database, including startup. So on startup, each file in
the database gets a new - and the same - sequence number.)
So it looked like, despite oracle doing direct writes to
record the new sequence number, previously cached data got
written to the file.

Now I'm not really sure what's going on; it is somewhat
inconsistent. Before, it used to hang after the "Database
mounted" message, when it tried to write to the redologs --
now that hang is gone.

But now I see some apparent data corruption - again, with hot
cache only - but I don't actually understand when it happens.

I'm trying to narrow it down further.

Thank you!

/mjt

2011-08-15 08:56:19

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

15.08.2011 12:00, Michael Tokarev wrote:
[....]

So, it looks like this (starting with cold cache):

1. rename the redologs and copy them over - this will
make a hot copy of the redologs
2. start up oracle - it will complain that the redologs aren't
redologs, the header is corrupt
3. shut down oracle, start it up again - it will succeed.

If between 1 and 2 you issue sync(1), everything will work.
When shutting down, oracle calls fsync(), so that's like
sync(1) again.

If some time passes between 1 and 2, everything will
work too.

Without dioread_nolock I can't trigger the problem no matter
how hard I try.


A smaller test case. I used the redo1.odf file (one of the
redologs) as a test file; any will work.

$ cp -p redo1.odf temp
$ dd if=temp of=foo iflag=direct count=20

Now the first 512 bytes of "foo" will contain all zeros, while
the beginning of redo1.odf is _not_ zeros.

Again, without dioread_nolock it works as expected.
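
The same pattern in a tiny C program, for anyone who prefers that to
cp+dd (a sketch; the file name is made up, and it simply does a buffered
write followed by an O_DIRECT read of the first sector):

/* gcc -D_GNU_SOURCE -o diotest diotest.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "testfile";          /* made-up name */
        char pattern[512];
        void *buf;
        int fd;

        memset(pattern, 0xab, sizeof(pattern));

        /* buffered write, like cp(1) filling the page cache */
        fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0 || write(fd, pattern, sizeof(pattern)) != (ssize_t)sizeof(pattern)) {
                perror("buffered write");
                return 1;
        }
        close(fd);

        /* O_DIRECT read of the first sector, like dd iflag=direct */
        if (posix_memalign(&buf, 512, 512)) {
                perror("posix_memalign");
                return 1;
        }
        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0 || read(fd, buf, 512) != 512) {
                perror("direct read");
                return 1;
        }
        close(fd);

        /* when the race hits, buf comes back as zeros even though the
         * buffered write above put 0xab bytes there */
        puts(memcmp(buf, pattern, 512) ? "MISMATCH (stale/zero data)" : "match");
        free(buf);
        return 0;
}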


And the most important note: without the patch there's no
data corruption like that. But instead, there is the
lockup... ;)

Thank you,

/mjt

2011-08-15 09:03:14

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

15.08.2011 12:56, Michael Tokarev wrote:
> 15.08.2011 12:00, Michael Tokarev wrote:
> [....]
>
> So, it looks like this (starting with cold cache):
>
> 1. rename the redologs and copy them over - this will
> make a hot copy of redologs
> 2. startup oracle - it will complain that the redologs aren't
> redologs, the header is corrupt
> 3. shut down oracle, start it up again - it will succeed.
>
> If between 1 and 2 you'll issue sync(1) everything will work.
> When shutting down, oracle calls fsync(), so that's like
> sync(1) again.
>
> If there will be some time between 1. and 2., everything
> will work too.
>
> Without dioread_nolock I can't trigger the problem no matter
> how I tried.
>
>
> A smaller test case. I used redo1.odf file (one of the
> redologs) as a test file, any will work.
>
> $ cp -p redo1.odf temp
> $ dd if=temp of=foo iflag=direct count=20
>
> Now, first 512bytes of "foo" will contain all zeros, while
> the beginning of redo1.odf is _not_ zeros.
>
> Again, without dioread_nolock it works as expected.
>
>
> And the most important note: without the patch there's no
> data corruption like that. But instead, there is the
> lockup... ;)

Actually I can reproduce this data corruption without the
patch too, just not that easily. The Oracle testcase (with
copying the redologs over) does that nicely. So that's a
separate bug which was here before.

> Thank you,
>
> /mjt

2011-08-15 10:28:56

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 08/15/2011 05:03 PM, Michael Tokarev wrote:
> 15.08.2011 12:56, Michael Tokarev wrote:
>> 15.08.2011 12:00, Michael Tokarev wrote:
>> [....]
>>
>> So, it looks like this (starting with cold cache):
>>
>> 1. rename the redologs and copy them over - this will
>> make a hot copy of redologs
>> 2. startup oracle - it will complain that the redologs aren't
>> redologs, the header is corrupt
>> 3. shut down oracle, start it up again - it will succeed.
>>
>> If between 1 and 2 you'll issue sync(1) everything will work.
>> When shutting down, oracle calls fsync(), so that's like
>> sync(1) again.
>>
>> If there will be some time between 1. and 2., everything
>> will work too.
>>
>> Without dioread_nolock I can't trigger the problem no matter
>> how I tried.
>>
>>
>> A smaller test case. I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>> $ cp -p redo1.odf temp
>> $ dd if=temp of=foo iflag=direct count=20
>>
>> Now, first 512bytes of "foo" will contain all zeros, while
>> the beginning of redo1.odf is _not_ zeros.
>>
>> Again, without dioread_nolock it works as expected.
>>
>>
>> And the most important note: without the patch there's no
>> data corruption like that. But instead, there is the
>> lockup... ;)
>
> Actually I can reproduce this data corruption without the
> patch too, just not that easily. Oracle testcase (with
> copying redologs over) does that nicely. So that's a
> separate bug which was here before.
Cool, thanks for the test.

Btw, I can reproduce the bug with
$ cp -p redo1.odf temp
$ dd if=temp of=foo iflag=direct count=20
It isn't that easy, but I did encounter it once during more than 20 tries;
hope I can get something out soon.

Thanks
Tao

2011-08-15 16:08:49

by Eric Sandeen

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 8/13/11 11:02 AM, Tao Ma wrote:
> From: Tao Ma <[email protected]>
>
> Hi Michael,
> could you please check whether this patch works for you?

Thanks, this is the patch I meant to send, I guess ;) I think
I missed the "|=" case before :(

Given that every place we set it requires a test and an increment
if not set, maybe a helper to do all that is in order?

-Eric
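
A minimal sketch of what such a helper could look like (hypothetical
name, just to illustrate the idea - not a patch):

/*
 * Set the unwritten flag on an io_end and account for it exactly once,
 * so the flag and i_aiodio_unwritten can no longer get out of sync.
 */
static inline void ext4_set_io_unwritten_flag(struct inode *inode,
                                              ext4_io_end_t *io_end)
{
        if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
                io_end->flag |= EXT4_IO_END_UNWRITTEN;
                atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
        }
}

Both places touched by the patch quoted below could then just call this
helper instead of open-coding the test-and-increment.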

>
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
> From: Tao Ma <[email protected]>
> Date: Sat, 13 Aug 2011 23:50:28 +0800
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>
> Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
> should be done simultaneously, since ext4_end_io_nolock always clears the
> flag and decreases the counter at the same time.
>
> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so
> it will go negative and cause some processes to wait forever.
>
> Part of the patch came from Eric in his e-mail.
> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>
> Reported-by: Michael Tokarev<[email protected]>
> Signed-off-by: Eric Sandeen <[email protected]>
> Signed-off-by: Tao Ma <[email protected]>
> ---
> fs/ext4/inode.c | 9 ++++++++-
> fs/ext4/page-io.c | 6 ++++--
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d47264c..40c0b10 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> goto out;
> }
>
> - io_end->flag = EXT4_IO_END_UNWRITTEN;
> + /*
> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
> + * but being more careful is always safe for the future change.
> + */
> inode = io_end->inode;
> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> + }
>
> /* Add the io_end to per-inode completed io list*/
> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 430c401..78839af 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -334,8 +334,10 @@ submit_and_retry:
> if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
> (io_end->pages[io_end->num_io_pages-1] != io_page))
> goto submit_and_retry;
> - if (buffer_uninit(bh))
> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> + }
> io->io_end->size += bh->b_size;
> io->io_next_block++;
> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));


2011-08-15 23:53:38

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Michael,

On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
> 15.08.2011 12:00, Michael Tokarev wrote:
> [....]
>
> So, it looks like this (starting with cold cache):
>
> 1. rename the redologs and copy them over - this will
> ? make a hot copy of redologs
> 2. startup oracle - it will complain that the redologs aren't
> ? redologs, the header is corrupt
> 3. shut down oracle, start it up again - it will succeed.
>
> If between 1 and 2 you'll issue sync(1) everything will work.
> When shutting down, oracle calls fsync(), so that's like
> sync(1) again.
>
> If there will be some time between 1. and 2., everything
> will work too.
>
> Without dioread_nolock I can't trigger the problem no matter
> how I tried.
>
>
> A smaller test case. ?I used redo1.odf file (one of the
> redologs) as a test file, any will work.
>
> ?$ cp -p redo1.odf temp
> ?$ dd if=temp of=foo iflag=direct count=20
Isn't this the expected behavior here? When doing
'cp -p redo1.odf temp', data is copied to temp through a
buffered write, but there is no guarantee when the data will
actually be written to disk. Then with 'dd if=temp of=foo
iflag=direct count=20', data is read directly from disk.
Very likely, the written data hasn't been flushed to disk
yet, so ext4 returns zeros in this case.

Jiaying
>
> Now, first 512bytes of "foo" will contain all zeros, while
> the beginning of redo1.odf is _not_ zeros.
>
> Again, without dioread_nolock it works as expected.
>
>
> And the most important note: without the patch there's no
> data corruption like that. ?But instead, there is the
> lockup... ;)
>
> Thank you,
>
> /mjt

2011-08-16 04:12:21

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <[email protected]>
>>
>> Hi Michael,
>> could you please check whether this patch works for you?
>
> Thanks, this is the patch I meant to send, I guess ;) I think
> I missed the "|=" case before :(
>
> given that every place we set it requires a test and an increment
> if not set, maybe a helper to do all that is in order?
Sure, I can work out a patch like that. Thanks for the review.

Thanks
Tao
>
> -Eric
>
>>
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
>> From: Tao Ma <[email protected]>
>> Date: Sat, 13 Aug 2011 23:50:28 +0800
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>>
>> Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
>> should be done simultaneously, since ext4_end_io_nolock always clears the
>> flag and decreases the counter at the same time.
>>
>> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so
>> it will go negative and cause some processes to wait forever.
>>
>> Part of the patch came from Eric in his e-mail.
>> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>>
>> Reported-by: Michael Tokarev<[email protected]>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> Signed-off-by: Tao Ma <[email protected]>
>> ---
>> fs/ext4/inode.c | 9 ++++++++-
>> fs/ext4/page-io.c | 6 ++++--
>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d47264c..40c0b10 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>> goto out;
>> }
>>
>> - io_end->flag = EXT4_IO_END_UNWRITTEN;
>> + /*
>> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
>> + * but being more careful is always safe for the future change.
>> + */
>> inode = io_end->inode;
>> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> + }
>>
>> /* Add the io_end to per-inode completed io list*/
>> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>> index 430c401..78839af 100644
>> --- a/fs/ext4/page-io.c
>> +++ b/fs/ext4/page-io.c
>> @@ -334,8 +334,10 @@ submit_and_retry:
>> if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>> (io_end->pages[io_end->num_io_pages-1] != io_page))
>> goto submit_and_retry;
>> - if (buffer_uninit(bh))
>> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> + }
>> io->io_end->size += bh->b_size;
>> io->io_next_block++;
>> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
>


2011-08-16 04:15:59

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Jiaying,
On 08/16/2011 07:53 AM, Jiaying Zhang wrote:
> Hi Michael,
>
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
>> 15.08.2011 12:00, Michael Tokarev wrote:
>> [....]
>>
>> So, it looks like this (starting with cold cache):
>>
>> 1. rename the redologs and copy them over - this will
>> make a hot copy of redologs
>> 2. startup oracle - it will complain that the redologs aren't
>> redologs, the header is corrupt
>> 3. shut down oracle, start it up again - it will succeed.
>>
>> If between 1 and 2 you'll issue sync(1) everything will work.
>> When shutting down, oracle calls fsync(), so that's like
>> sync(1) again.
>>
>> If there will be some time between 1. and 2., everything
>> will work too.
>>
>> Without dioread_nolock I can't trigger the problem no matter
>> how I tried.
>>
>>
>> A smaller test case. I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>> $ cp -p redo1.odf temp
>> $ dd if=temp of=foo iflag=direct count=20
> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.
Sorry, but it doesn't sound correct to me.
Say we use a buffered write to a file and then a direct I/O read; what
we expect (or at least what Michael expects) is that we read the updated
data, not the stale data. I thought of a tiny race window in ext4 here,
but I need to do some tests to verify it and then fix it.

Thanks
Tao

2011-08-16 06:15:41

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Eric,
On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <[email protected]>
>>
>> Hi Michael,
>> could you please check whether this patch works for you?
>
> Thanks, this is the patch I meant to send, I guess ;) I think
> I missed the "|=" case before :(
>
> given that every place we set it requires a test and an increment
> if not set, maybe a helper to do all that is in order?
I have decided to leave it to the next merge window. So I will re-send
this patch as a fix and then create another patch for the helper
generalization.

Thanks
Tao
>
> -Eric
>
>>
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
>> From: Tao Ma <[email protected]>
>> Date: Sat, 13 Aug 2011 23:50:28 +0800
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>>
>> Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
>> should be done simultaneously, since ext4_end_io_nolock always clears the
>> flag and decreases the counter at the same time.
>>
>> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so
>> it will go negative and cause some processes to wait forever.
>>
>> Part of the patch came from Eric in his e-mail.
>> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>>
>> Reported-by: Michael Tokarev<[email protected]>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> Signed-off-by: Tao Ma <[email protected]>
>> ---
>> fs/ext4/inode.c | 9 ++++++++-
>> fs/ext4/page-io.c | 6 ++++--
>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d47264c..40c0b10 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>> goto out;
>> }
>>
>> - io_end->flag = EXT4_IO_END_UNWRITTEN;
>> + /*
>> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
>> + * but being more careful is always safe for the future change.
>> + */
>> inode = io_end->inode;
>> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> + }
>>
>> /* Add the io_end to per-inode completed io list*/
>> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>> index 430c401..78839af 100644
>> --- a/fs/ext4/page-io.c
>> +++ b/fs/ext4/page-io.c
>> @@ -334,8 +334,10 @@ submit_and_retry:
>> if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>> (io_end->pages[io_end->num_io_pages-1] != io_page))
>> goto submit_and_retry;
>> - if (buffer_uninit(bh))
>> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> + io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> + }
>> io->io_end->size += bh->b_size;
>> io->io_next_block++;
>> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
>


2011-08-16 08:38:16

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

16.08.2011 03:53, Jiaying Zhang wrote:
> Hi Michael,
>
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
[]
>> A smaller test case. I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>> $ cp -p redo1.odf temp
>> $ dd if=temp of=foo iflag=direct count=20

> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.

The problem is 3-faced (at least ;)

First of all, it is _not_ expected behaviour. When you
think about it, maybe it becomes "more expected", but at
first it looks like something Really Wrong (tm). It could
be made "more expected" by mentioning in various manpages
and whatnot all the possible consequences of mixing direct
and buffered I/O. So far that hasn't been done. I can
understand (and sort of expect), say, a buffered write being
invisible to a concurrent direct read while they're going
on at the same time. But here, the file has been closed and
re-opened between the writes and the reads.

I agree that it's difficult to keep both pieces - direct
and buffered I/O - in sync; there have been numerous efforts
to synchronize them, with varying success and usually a huge
amount of work. Maybe if it had been noted initially that
direct I/O _is_ incompatible with buffered I/O, things
wouldn't be that bad now.

Next, this problem does not happen without the mentioned
dioread_nolock option (which - as far as I can see - is
supposed to be the default (or only) way to handle this
in the future). I can't trigger any of the issues
described in this thread without dioread_nolock.

So that makes this yet another "corner case" of sorts
(like the famous non-fs-buffer-aligned direct write past the
end of a file, or like mmapped I/O mixed with direct I/O, and
so on), but since most other such corner cases are fixed
now, this one just needs to be fixed too.

And third, this is a race condition: it does not happen all
the time, or even most of the time; it happens "sometimes",
which makes it more like a bug than not.

Thanks,

/mjt

2011-08-16 13:53:40

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
> > 15.08.2011 12:00, Michael Tokarev wrote:
> > [....]
> >
> > So, it looks like this (starting with cold cache):
> >
> > 1. rename the redologs and copy them over - this will
> > ? make a hot copy of redologs
> > 2. startup oracle - it will complain that the redologs aren't
> > ? redologs, the header is corrupt
> > 3. shut down oracle, start it up again - it will succeed.
> >
> > If between 1 and 2 you'll issue sync(1) everything will work.
> > When shutting down, oracle calls fsync(), so that's like
> > sync(1) again.
> >
> > If there will be some time between 1. and 2., everything
> > will work too.
> >
> > Without dioread_nolock I can't trigger the problem no matter
> > how I tried.
> >
> >
> > A smaller test case. ?I used redo1.odf file (one of the
> > redologs) as a test file, any will work.
> >
> > ?$ cp -p redo1.odf temp
> > ?$ dd if=temp of=foo iflag=direct count=20
> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.
No it's not. Buffered and direct IO is supposed to work correctly
(although not fast) together. In this particular case we take care to flush
dirty data from page cache before performing direct IO read... But
something is broken in this path obviously.

I don't have time to dig into this in detail now but what seems to be the
problem is that with dioread_nolock option, we don't acquire i_mutex for
direct IO reads anymore. Thus these reads can compete with
ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
i_mutex so without dioread_nolock the race cannot happen).

Hmm, the new writepages code seems to be broken in combination with direct
IO. Direct IO code expects that when filemap_write_and_wait() finishes,
data is on disk but with new bio submission code this is not true because
we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
ext4_end_io_buffer_write() but do extent conversion only after that in
convert workqueue. So the race seems to be there all the time, just without
dioread_nolock it's much smaller.

Fixing this is going to be non-trivial - I'm not sure we can really move
clearing of PageWriteback bit to conversion workqueue. I think we already
tried that once but it caused deadlocks for some reason...

> > Now, first 512bytes of "foo" will contain all zeros, while
> > the beginning of redo1.odf is _not_ zeros.
> >
> > Again, without dioread_nolock it works as expected.
> >
> >
> > And the most important note: without the patch there's no
> > data corruption like that. ?But instead, there is the
> > lockup... ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-16 15:03:59

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 08/16/2011 09:53 PM, Jan Kara wrote:
> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>> [....]
>>>
>>> So, it looks like this (starting with cold cache):
>>>
>>> 1. rename the redologs and copy them over - this will
>>> make a hot copy of redologs
>>> 2. startup oracle - it will complain that the redologs aren't
>>> redologs, the header is corrupt
>>> 3. shut down oracle, start it up again - it will succeed.
>>>
>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>> When shutting down, oracle calls fsync(), so that's like
>>> sync(1) again.
>>>
>>> If there will be some time between 1. and 2., everything
>>> will work too.
>>>
>>> Without dioread_nolock I can't trigger the problem no matter
>>> how I tried.
>>>
>>>
>>> A smaller test case. I used redo1.odf file (one of the
>>> redologs) as a test file, any will work.
>>>
>>> $ cp -p redo1.odf temp
>>> $ dd if=temp of=foo iflag=direct count=20
>> Isn't this the expected behavior here? When doing
>> 'cp -p redo1.odf temp', data is copied to temp through
>> buffer write, but there is no guarantee when data will be
>> actually written to disk. Then with 'dd if=temp of=foo
>> iflag=direct count=20', data is read directly from disk.
>> Very likely, the written data hasn't been flushed to disk
>> yet so ext4 returns zero in this case.
> No it's not. Buffered and direct IO is supposed to work correctly
> (although not fast) together. In this particular case we take care to flush
> dirty data from page cache before performing direct IO read... But
> something is broken in this path obviously.
>
> I don't have time to dig into this in detail now but what seems to be the
> problem is that with dioread_nolock option, we don't acquire i_mutex for
> direct IO reads anymore. Thus these reads can compete with
> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
> i_mutex so without dioread_nolock the race cannot happen).
>
> Hmm, the new writepages code seems to be broken in combination with direct
> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
> data is on disk but with new bio submission code this is not true because
> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
> ext4_end_io_buffer_write() but do extent conversion only after that in
> convert workqueue. So the race seems to be there all the time, just without
> dioread_nolock it's much smaller.
You are absolutely right. The real problem is that ext4_direct_IO
begins to work *after* we clear the page writeback flag and *before* we
convert the unwritten extent to a valid state. Some of my traces do show
that. I am working on it now.
>
> Fixing this is going to be non-trivial - I'm not sure we can really move
> clearing of PageWriteback bit to conversion workqueue. I think we already
> tried that once but it caused deadlocks for some reason...
I just did what you described, and yes, I met with another problem and
am trying to resolve it now. Once it is OK, I will send out the patch.

Thanks
Tao

2011-08-16 21:32:18

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
> On 08/16/2011 09:53 PM, Jan Kara wrote:
>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>> [....]
>>>>
>>>> So, it looks like this (starting with cold cache):
>>>>
>>>> 1. rename the redologs and copy them over - this will
>>>> ? make a hot copy of redologs
>>>> 2. startup oracle - it will complain that the redologs aren't
>>>> ? redologs, the header is corrupt
>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>
>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>> When shutting down, oracle calls fsync(), so that's like
>>>> sync(1) again.
>>>>
>>>> If there will be some time between 1. and 2., everything
>>>> will work too.
>>>>
>>>> Without dioread_nolock I can't trigger the problem no matter
>>>> how I tried.
>>>>
>>>>
>>>> A smaller test case. ?I used redo1.odf file (one of the
>>>> redologs) as a test file, any will work.
>>>>
>>>> ?$ cp -p redo1.odf temp
>>>> ?$ dd if=temp of=foo iflag=direct count=20
>>> Isn't this the expected behavior here? When doing
>>> 'cp -p redo1.odf temp', data is copied to temp through
>>> buffer write, but there is no guarantee when data will be
>>> actually written to disk. Then with 'dd if=temp of=foo
>>> iflag=direct count=20', data is read directly from disk.
>>> Very likely, the written data hasn't been flushed to disk
>>> yet so ext4 returns zero in this case.
>> ? No it's not. Buffered and direct IO is supposed to work correctly
>> (although not fast) together. In this particular case we take care to flush
>> dirty data from page cache before performing direct IO read... But
>> something is broken in this path obviously.
I see. Thanks a lot for the explanation.

I wonder whether the following patch will solve the problem:

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..ca90d73 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
}

retry:
- if (rw == READ && ext4_should_dioread_nolock(inode))
+ if (rw == READ && ext4_should_dioread_nolock(inode)) {
+ if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+ mutex_lock(&inode->i_mutex);
+ ext4_flush_completed_IO(inode);
+ mutex_unlock(&inode->i_mutex);
+ }
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
ext4_get_block, NULL, NULL, 0);
- else {
+ } else {
ret = blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,

I tested the patch a little bit and it seems to resolve the race
on dioread_nolock in my case. Michael, I would very much appreciate
it if you could try this patch with your test case and see whether it works.
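
Roughly, ext4_flush_completed_IO() drains the inode's i_completed_io_list
and performs the deferred unwritten->written conversion for each queued
io_end, so by the time the direct read looks at the extent tree the
extents are already marked written. A from-memory sketch of that helper
(not a verbatim copy of the 3.0 code):

int ext4_flush_completed_IO(struct inode *inode)
{
        struct ext4_inode_info *ei = EXT4_I(inode);
        ext4_io_end_t *io;
        unsigned long flags;
        int ret = 0, ret2 = 0;

        spin_lock_irqsave(&ei->i_completed_io_lock, flags);
        while (!list_empty(&ei->i_completed_io_list)) {
                io = list_entry(ei->i_completed_io_list.next,
                                ext4_io_end_t, list);
                list_del_init(&io->list);
                spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
                /* convert the unwritten extent covered by this io_end */
                ret = ext4_end_io_nolock(io);
                if (ret < 0)
                        ret2 = ret;
                spin_lock_irqsave(&ei->i_completed_io_lock, flags);
        }
        spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
        return ret2;
}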

>>
>> I don't have time to dig into this in detail now but what seems to be the
>> problem is that with dioread_nolock option, we don't acquire i_mutex for
>> direct IO reads anymore. Thus these reads can compete with
>> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
>> i_mutex so without dioread_nolock the race cannot happen).
>>
>> Hmm, the new writepages code seems to be broken in combination with direct
>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>> data is on disk but with new bio submission code this is not true because
>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>> ext4_end_io_buffer_write() but do extent conversion only after that in
>> convert workqueue. So the race seems to be there all the time, just without
>> dioread_nolock it's much smaller.
I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
So I think we should be ok with the default mount options.

Jiaying

> You are absolutely right. The really problem is that ext4_direct_IO
> begins to work *after* we clear the page writeback flag and *before* we
> convert unwritten extent to a valid state. Some of my trace does show
> that. I am working on it now.
>>
>> Fixing this is going to be non-trivial - I'm not sure we can really move
>> clearing of PageWriteback bit to conversion workqueue. I thienk we already
>> tried that once but it caused deadlocks for some reason...
> I just did as what you described and yes I met with another problem and
> try to resolve it now. Once it is OK, I will send out the patch.
>
> Thanks
> Tao
>

2011-08-16 22:28:37

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

17.08.2011 01:32, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
>>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>>> [....]
>>>>>
>>>>> So, it looks like this (starting with cold cache):
>>>>>
>>>>> 1. rename the redologs and copy them over - this will
>>>>> make a hot copy of redologs
>>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>> redologs, the header is corrupt
>>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>>
>>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>>> When shutting down, oracle calls fsync(), so that's like
>>>>> sync(1) again.
>>>>>
>>>>> If there will be some time between 1. and 2., everything
>>>>> will work too.
>>>>>
>>>>> Without dioread_nolock I can't trigger the problem no matter
>>>>> how I tried.
>>>>>
>>>>>
>>>>> A smaller test case. I used redo1.odf file (one of the
>>>>> redologs) as a test file, any will work.
>>>>>
>>>>> $ cp -p redo1.odf temp
>>>>> $ dd if=temp of=foo iflag=direct count=20
>>>> Isn't this the expected behavior here? When doing
>>>> 'cp -p redo1.odf temp', data is copied to temp through
>>>> buffer write, but there is no guarantee when data will be
>>>> actually written to disk. Then with 'dd if=temp of=foo
>>>> iflag=direct count=20', data is read directly from disk.
>>>> Very likely, the written data hasn't been flushed to disk
>>>> yet so ext4 returns zero in this case.
>>> No it's not. Buffered and direct IO is supposed to work correctly
>>> (although not fast) together. In this particular case we take care to flush
>>> dirty data from page cache before performing direct IO read... But
>>> something is broken in this path obviously.
> I see. Thanks a lot for the explanation.
>
> I wonder whether the following patch will solve the problem:
>
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c

It is in inode.c in 3.0, not in indirect.c (there's no such file there).
But it applies (after de-MIMEfying it) - well, almost with no mods... ;)

> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> }
>
> retry:
> - if (rw == READ && ext4_should_dioread_nolock(inode))
> + if (rw == READ && ext4_should_dioread_nolock(inode)) {
> + if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> + mutex_lock(&inode->i_mutex);
> + ext4_flush_completed_IO(inode);
> + mutex_unlock(&inode->i_mutex);
> + }
> ret = __blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
> ext4_get_block, NULL, NULL, 0);
> - else {
> + } else {
> ret = blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
>
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very appreciate
> if you can try this patch with your test case and see whether it works.

So I tried it just now. I tried hard, several times. No joy: I don't
see the "corruption" anymore, neither with the simple dd case nor with
the oracle version. I added a printk into the new if-unlikely path, and
it triggers several times - almost always when I run my "oracle
test-case", i.e. starting the database after newly copying the redologs.

So it appears you've hit the right place there. At least for this
test case ;) Not sure if that's exactly the right way to go - maybe
it's possible to do some optimisation in there, in case the completed
list is not empty but does not overlap with our I/O list, but that's
probably overkill, dunno.

>>> Hmm, the new writepages code seems to be broken in combination with direct
>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>>> data is on disk but with new bio submission code this is not true because
>>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>>> ext4_end_io_buffer_write() but do extent conversion only after that in
>>> convert workqueue. So the race seems to be there all the time, just without
>>> dioread_nolock it's much smaller.

> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
> So I think we should be ok with the default mount options.

Am I right that the intention was to enable dioread_nolock by default
eventually, or even to keep only its behaviour, without a mount option?

And thank you -- all -- for your work and support!

/mjt

2011-08-16 23:07:42

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 3:28 PM, Michael Tokarev <[email protected]> wrote:
> 17.08.2011 01:32, Jiaying Zhang wrote:
>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
>>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <[email protected]> wrote:
>>>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>>>> [....]
>>>>>>
>>>>>> So, it looks like this (starting with cold cache):
>>>>>>
>>>>>> 1. rename the redologs and copy them over - this will
>>>>>> ? make a hot copy of redologs
>>>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>>> ? redologs, the header is corrupt
>>>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>>>
>>>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>>>> When shutting down, oracle calls fsync(), so that's like
>>>>>> sync(1) again.
>>>>>>
>>>>>> If there will be some time between 1. and 2., everything
>>>>>> will work too.
>>>>>>
>>>>>> Without dioread_nolock I can't trigger the problem no matter
>>>>>> how I tried.
>>>>>>
>>>>>>
>>>>>> A smaller test case. ?I used redo1.odf file (one of the
>>>>>> redologs) as a test file, any will work.
>>>>>>
>>>>>> ?$ cp -p redo1.odf temp
>>>>>> ?$ dd if=temp of=foo iflag=direct count=20
>>>>> Isn't this the expected behavior here? When doing
>>>>> 'cp -p redo1.odf temp', data is copied to temp through
>>>>> buffer write, but there is no guarantee when data will be
>>>>> actually written to disk. Then with 'dd if=temp of=foo
>>>>> iflag=direct count=20', data is read directly from disk.
>>>>> Very likely, the written data hasn't been flushed to disk
>>>>> yet so ext4 returns zero in this case.
>>>> ? No it's not. Buffered and direct IO is supposed to work correctly
>>>> (although not fast) together. In this particular case we take care to flush
>>>> dirty data from page cache before performing direct IO read... But
>>>> something is broken in this path obviously.
>> I see. Thanks a lot for the explanation.
>>
>> I wonder whether the following patch will solve the problem:
>>
>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>> index 6c27111..ca90d73 100644
>> --- a/fs/ext4/indirect.c
>> +++ b/fs/ext4/indirect.c
>
> It is in inode.c in 3.0, not in indirect.c (there's no such file there).
> But it applies (after de-MIMEfying it) - well, almost with no mods... ;)
>
>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>         }
>>
>>  retry:
>> -       if (rw == READ && ext4_should_dioread_nolock(inode))
>> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
>> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>> +                       mutex_lock(&inode->i_mutex);
>> +                       ext4_flush_completed_IO(inode);
>> +                       mutex_unlock(&inode->i_mutex);
>> +               }
>>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>>                                            inode->i_sb->s_bdev, iov,
>>                                            offset, nr_segs,
>>                                            ext4_get_block, NULL, NULL, 0);
>> -       else {
>> +       } else {
>>                 ret = blockdev_direct_IO(rw, iocb, inode,
>>                                            inode->i_sb->s_bdev, iov,
>>                                            offset, nr_segs,
>>
>> I tested the patch a little bit and it seems to resolve the race
>> on dioread_nolock in my case. Michael, I would very appreciate
>> if you can try this patch with your test case and see whether it works.
>
> So I tried it just now. ?I tried hard, several times. ?No joy: I don't
> see the "corruption" anymore, neither with the simple dd case nor with
> oracle version. ?I added a printk into the new if-unlikely path, and
> it triggers several times, - almost always when I run my "oracle
> test-case" - starting the database after newly-copying the redologs.
>
> So it appears you've hit the right place there. ?At least for this
> test case ;) ?Not sure if that's exactly right way to go - maybe
> it's possible to have some optimisation in there, in case completed
> list is not empty but not overlaps with our I/O list, but that's
> probably overkill, dunno.
>
>>>> Hmm, the new writepages code seems to be broken in combination with direct
>>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>>>> data is on disk but with new bio submission code this is not true because
>>>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>>>> ext4_end_io_buffer_write() but do extent conversion only after that in
>>>> convert workqueue. So the race seems to be there all the time, just without
>>>> dioread_nolock it's much smaller.
>
>> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
>> So I think we should be ok with the default mount options.
>
> Am I right that the intention was to enable dioread_nolock by default
> eventually, or even to keep its behavour only, without mount options?
Good question. At the time when we checked in the code, we wanted to be
careful that it didn't introduce data corruptions that would affect normal
workloads. Apparently, the downside is that this code path doesn't get
a good test coverage. Maybe it is time to reconsider enabling this feature
by default. I guess we still want to guard it with a mount option given that
it doesn't work with certain options, like "data=journaled" mode and small
block size.

Jiaying
>
> And thank you -- all -- for your work and support!
>
> /mjt
>

2011-08-16 23:59:15

by Dave Chinner

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
> > On 08/16/2011 09:53 PM, Jan Kara wrote:
> I wonder whether the following patch will solve the problem:
>
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> }
>
> retry:
> - if (rw == READ && ext4_should_dioread_nolock(inode))
> + if (rw == READ && ext4_should_dioread_nolock(inode)) {
> + if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> + mutex_lock(&inode->i_mutex);
> + ext4_flush_completed_IO(inode);
> + mutex_unlock(&inode->i_mutex);
> + }
> ret = __blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
> ext4_get_block, NULL, NULL, 0);
> - else {
> + } else {
> ret = blockdev_direct_IO(rw, iocb, inode,
> inode->i_sb->s_bdev, iov,
> offset, nr_segs,
>
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very appreciate
> if you can try this patch with your test case and see whether it works.

Just my 2c worth here: this is a data corruption bug, so the root
cause needs to be fixed. The above patch does not address the root
cause.

> > You are absolutely right. The really problem is that ext4_direct_IO
> > begins to work *after* we clear the page writeback flag and *before* we
> > convert unwritten extent to a valid state. Some of my trace does show
> > that. I am working on it now.

And that's the root cause - think about what that means for a
minute. It means that extent conversion can race with anything that
requires IO to complete first. e.g. truncate or fsync. It can then
race with other subsequent operations, which can have even nastier
effects. IOWs, there is a data-corruption landmine just sitting
there waiting for the next person to trip over it.

Fix the root cause, don't put band-aids over the symptoms.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-08-17 00:08:46

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <[email protected]> wrote:
> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
>> > On 08/16/2011 09:53 PM, Jan Kara wrote:
>> I wonder whether the following patch will solve the problem:
>>
>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>> index 6c27111..ca90d73 100644
>> --- a/fs/ext4/indirect.c
>> +++ b/fs/ext4/indirect.c
>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>         }
>>
>>  retry:
>> -       if (rw == READ && ext4_should_dioread_nolock(inode))
>> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
>> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>> +                       mutex_lock(&inode->i_mutex);
>> +                       ext4_flush_completed_IO(inode);
>> +                       mutex_unlock(&inode->i_mutex);
>> +               }
>>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>>                                            inode->i_sb->s_bdev, iov,
>>                                            offset, nr_segs,
>>                                            ext4_get_block, NULL, NULL, 0);
>> -       else {
>> +       } else {
>>                 ret = blockdev_direct_IO(rw, iocb, inode,
>>                                            inode->i_sb->s_bdev, iov,
>>                                            offset, nr_segs,
>>
>> I tested the patch a little bit and it seems to resolve the race
>> on dioread_nolock in my case. Michael, I would very appreciate
>> if you can try this patch with your test case and see whether it works.
>
> Just my 2c worth here: this is a data corruption bug so the root
> cause neeeds to be fixed. The above patch does not address the root
> cause.
>
>> > You are absolutely right. The really problem is that ext4_direct_IO
>> > begins to work *after* we clear the page writeback flag and *before* we
>> > convert unwritten extent to a valid state. Some of my trace does show
>> > that. I am working on it now.
>
> And that's the root cause - think about what that means for a
> minute.  It means that extent conversion can race with anything that
> requires IO to complete first. e.g. truncate or fsync.  It can then
> race with other subsequent operations, which can have even nastier
> effects. IOWs, there is a data-corruption landmine just sitting
> there waiting for the next person to trip over it.
You are right that extent conversion can race with truncate and fsync
as well. That is why we already need to call ext4_flush_completed_IO()
in those places too. I agree this is a little nasty and there can be
some other corner cases that we haven't covered. The problem is we
cannot do extent conversion at end_io time. I haven't thought of
a better approach to deal with these races. I am curious how XFS deals
with this problem.

Jiaying

>
> Fix the root cause, don't put band-aids over the symptoms.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2011-08-17 02:22:16

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Jiaying,
On 08/17/2011 08:08 AM, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <[email protected]> wrote:
>> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
>>>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>> I wonder whether the following patch will solve the problem:
>>>
>>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>>> index 6c27111..ca90d73 100644
>>> --- a/fs/ext4/indirect.c
>>> +++ b/fs/ext4/indirect.c
>>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>>          }
>>>
>>>   retry:
>>> -        if (rw == READ && ext4_should_dioread_nolock(inode))
>>> +        if (rw == READ && ext4_should_dioread_nolock(inode)) {
>>> +                if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>>> +                        mutex_lock(&inode->i_mutex);
>>> +                        ext4_flush_completed_IO(inode);
>>> +                        mutex_unlock(&inode->i_mutex);
>>> +                }
>>>                  ret = __blockdev_direct_IO(rw, iocb, inode,
>>>                                             inode->i_sb->s_bdev, iov,
>>>                                             offset, nr_segs,
>>>                                             ext4_get_block, NULL, NULL, 0);
>>> -        else {
>>> +        } else {
>>>                  ret = blockdev_direct_IO(rw, iocb, inode,
>>>                                             inode->i_sb->s_bdev, iov,
>>>                                             offset, nr_segs,
>>>
>>> I tested the patch a little bit and it seems to resolve the race
>>> on dioread_nolock in my case. Michael, I would very appreciate
>>> if you can try this patch with your test case and see whether it works.
>>
>> Just my 2c worth here: this is a data corruption bug so the root
>> cause neeeds to be fixed. The above patch does not address the root
>> cause.
>>
>>>> You are absolutely right. The really problem is that ext4_direct_IO
>>>> begins to work *after* we clear the page writeback flag and *before* we
>>>> convert unwritten extent to a valid state. Some of my trace does show
>>>> that. I am working on it now.
>>
>> And that's the root cause - think about what that means for a
>> minute. It means that extent conversion can race with anything that
>> requires IO to complete first. e.g. truncate or fsync. It can then
>> race with other subsequent operations, which can have even nastier
>> effects. IOWs, there is a data-corruption landmine just sitting
>> there waiting for the next person to trip over it.
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered. The problem is we
> can not do extent conversion during the end_io time. I haven't thought of
> a better approach to deal with these races. I am curious how xfs deals
> with this problem.
I agree with Dave that we may need to figure out a better way for this.
What's more, your patch has another side effect: with concurrent
direct reads and buffered writes, the direct reads are affected even
when the two do not actually overlap. Do you have any test data on the
performance regression?

Thanks
Tao

>
> Jiaying
>
>>
>> Fix the root cause, don't put band-aids over the symptoms.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]
>>


2011-08-17 09:04:32

by Jan Kara

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue 16-08-11 17:08:42, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <[email protected]> wrote:
> > On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> >> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <[email protected]> wrote:
> >> > On 08/16/2011 09:53 PM, Jan Kara wrote:
> >> I wonder whether the following patch will solve the problem:
> >>
> >> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> >> index 6c27111..ca90d73 100644
> >> --- a/fs/ext4/indirect.c
> >> +++ b/fs/ext4/indirect.c
> >> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> >>          }
> >>
> >>   retry:
> >> -        if (rw == READ && ext4_should_dioread_nolock(inode))
> >> +        if (rw == READ && ext4_should_dioread_nolock(inode)) {
> >> +                if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> >> +                        mutex_lock(&inode->i_mutex);
> >> +                        ext4_flush_completed_IO(inode);
> >> +                        mutex_unlock(&inode->i_mutex);
> >> +                }
> >>                  ret = __blockdev_direct_IO(rw, iocb, inode,
> >>                                             inode->i_sb->s_bdev, iov,
> >>                                             offset, nr_segs,
> >>                                             ext4_get_block, NULL, NULL, 0);
> >> -        else {
> >> +        } else {
> >>                  ret = blockdev_direct_IO(rw, iocb, inode,
> >>                                             inode->i_sb->s_bdev, iov,
> >>                                             offset, nr_segs,
> >>
> >> I tested the patch a little bit and it seems to resolve the race
> >> on dioread_nolock in my case. Michael, I would very appreciate
> >> if you can try this patch with your test case and see whether it works.
> >
> > Just my 2c worth here: this is a data corruption bug so the root
> > cause neeeds to be fixed. The above patch does not address the root
> > cause.
> >
> >> > You are absolutely right. The really problem is that ext4_direct_IO
> >> > begins to work *after* we clear the page writeback flag and *before* we
> >> > convert unwritten extent to a valid state. Some of my trace does show
> >> > that. I am working on it now.
> >
> > And that's the root cause - think about what that means for a
> > minute.  It means that extent conversion can race with anything that
> > requires IO to complete first. e.g. truncate or fsync.  It can then
> > race with other subsequent operations, which can have even nastier
> > effects. IOWs, there is a data-corruption landmine just sitting
> > there waiting for the next person to trip over it.
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered.
Exactly. I agree with Dave here that it is asking for serious trouble to
clear the PageWriteback bit before the IO is really complete.

> The problem is we can not do extent conversion during the end_io time. I
> haven't thought of a better approach to deal with these races. I am
> curious how xfs deals with this problem.
Well, XFS cannot do extent conversion in end_io for AIO+DIO either. So it
clears the PageWriteback bit only after extent conversion has happened in the
worker thread. ext4 has problems with this (deadlocks) because of unlucky
locking of the extent tree using i_mutex. So I believe we have to find better
locking for the extent tree so that ext4 can clear the PageWriteback bit from
the worker thread...
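
A minimal sketch of that ordering, purely for illustration (this is neither
XFS nor ext4 code; the example_* names and the workqueue are made up, while
INIT_WORK(), queue_work() and end_page_writeback() are the real kernel
interfaces):

#include <linux/workqueue.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

static struct workqueue_struct *example_conversion_wq;	/* hypothetical */

struct example_io_end {
	struct work_struct	work;
	struct inode		*inode;
	loff_t			offset;
	ssize_t			size;
	struct page		*page;
};

/* hypothetical helper that rewrites the extent as initialized */
static void example_convert_unwritten(struct inode *inode, loff_t offset,
				      ssize_t size);

static void example_end_io_work(struct work_struct *work)
{
	struct example_io_end *io =
		container_of(work, struct example_io_end, work);

	/* 1. Convert the unwritten extent in process context (may sleep). */
	example_convert_unwritten(io->inode, io->offset, io->size);

	/* 2. Only now is the data really written: let waiters proceed. */
	end_page_writeback(io->page);
	kfree(io);
}

static void example_bio_end_io(struct example_io_end *io)
{
	/* Runs in IRQ context when the bio completes: just defer the rest. */
	INIT_WORK(&io->work, example_end_io_work);
	queue_work(example_conversion_wq, &io->work);
}

The point is only the ordering: the PageWriteback bit is cleared after the
conversion, from process context, which is exactly where ext4's current
i_mutex-based extent tree locking gets in the way.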

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-17 17:02:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Tue, Aug 16, 2011 at 04:07:38PM -0700, Jiaying Zhang wrote:
> Good question. At the time when we checked in the code, we wanted to be
> careful that it didn't introduce data corruptions that would affect normal
> workloads. Apparently, the downside is that this code path doesn't get
> a good test coverage. Maybe it is time to reconsider enabling this feature
> by default. I guess we still want to guard it with a mount option given that
> it doesn't work with certain options, like "data=journaled" mode and small
> block size.

What I'd like to do long-term here is to change things so that instead
of instantiating the extent as uninitialized, writing the data, and then
doing the uninit->init conversion, we write the data first and then
instantiate the extent as initialized. This would also allow us to get
rid of data=ordered mode. And we should make it work for fs block size
!= page size.

It means that we need a way of adding this sort of information to an
in-memory extent cache which isn't saved to disk until the data is
written. We've also talked about adding information about whether
an extent is subject to delalloc as well, so we don't have to grovel
through the page cache and look at individual buffers attached to the
pages. And there are folks who have been experimenting with an
in-memory extent tree cache to speed access to fast PCIe-attached
flash.

It seems to me that if we're careful a single solution should be able
to solve all of these problems...
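
Just to make the idea more concrete, a rough sketch of what one entry in
such an in-memory extent-status cache might carry is shown below; none of
these names are existing ext4 structures, they are assumptions for
illustration only:

#include <linux/rbtree.h>
#include <linux/types.h>

#define EX_STATUS_WRITTEN	(1 << 0)  /* initialized on disk */
#define EX_STATUS_UNWRITTEN	(1 << 1)  /* allocated, not yet initialized */
#define EX_STATUS_DELALLOC	(1 << 2)  /* delayed allocation, no blocks yet */
#define EX_STATUS_NEED_CONVERT	(1 << 3)  /* data written, extent still unwritten on disk */

struct example_extent_status {
	struct rb_node	node;	/* per-inode tree indexed by logical block */
	__u32		lblk;	/* first logical block covered */
	__u32		len;	/* number of blocks */
	__u64		pblk;	/* first physical block, if allocated */
	unsigned int	status;	/* EX_STATUS_* flags */
};

With something like this, the uninit->init conversion state, the delalloc
state, and a general extent lookup cache could all live in one per-inode
tree, which is roughly the single solution described above.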

- Ted

2011-08-18 06:49:34

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

17.08.2011 21:02, Ted Ts'o wrote:
[]
> What I'd like to do long-term here is to change things so that (a)
> instead of instantiating the extent as uninitialized, writing the
> data, and then doing the uninit->init conversion to writing the data
> and then instantiated the extent as initialzied. This would also
> allow us to get rid of data=ordered mode. And we should make it work
> for fs block size != page size.
>
> It means that we need a way of adding this sort of information into an
> in-memory extent cache but which isn't saved to disk until the data is
> written. We've also talked about adding the information about whether
> an extent is subject to delalloc as well, so we don't have to grovel
> through the page cache and look at individual buffers attached to the
> pages. And there are folks who have been experimenting with an
> in-memory extent tree cache to speed access to fast PCIe-attached
> flash.
>
> It seems to me that if we're careful a single solution should be able
> to solve all of these problems...

What about the current situation - what do you think? Should it be
ignored for now, bearing in mind that dioread_nolock isn't used often
(but it gives a _serious_ difference in read speed), or should this
very case, which already has real-life impact, be fixed short-term
while a long-term solution is implemented?

Thank you!

/mjt

2011-08-18 18:54:21

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <[email protected]> wrote:
> 17.08.2011 21:02, Ted Ts'o wrote:
> []
>> What I'd like to do long-term here is to change things so that (a)
>> instead of instantiating the extent as uninitialized, writing the
>> data, and then doing the uninit->init conversion to writing the data
>> and then instantiated the extent as initialzied.  This would also
>> allow us to get rid of data=ordered mode.  And we should make it work
>> for fs block size != page size.
>>
>> It means that we need a way of adding this sort of information into an
>> in-memory extent cache but which isn't saved to disk until the data is
>> written.  We've also talked about adding the information about whether
>> an extent is subject to delalloc as well, so we don't have to grovel
>> through the page cache and look at individual buffers attached to the
>> pages.  And there are folks who have been experimenting with an
>> in-memory extent tree cache to speed access to fast PCIe-attached
>> flash.
>>
>> It seems to me that if we're careful a single solution should be able
>> to solve all of these problems...
>
> What about current situation, how do you think - should it be ignored
> for now, having in mind that dioread_nolock isn't used often (but it
> gives _serious_ difference in read speed), or, short term, fix this
> very case which have real-life impact already, while implementing a
> long-term solution?
I plan to send my patch as a band-aid fix. It doesn't solve the fundamental
problem but I think it helps close the race you saw in your test. In the long
term, I agree that we should think about implementing an extent tree cache
and using it to hold pending uninitialized-to-initialized extent conversions.

Jiaying

>
> Thank you!
>
> /mjt
>

2011-08-19 03:18:49

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Michael,
On 08/18/2011 02:49 PM, Michael Tokarev wrote:
> 17.08.2011 21:02, Ted Ts'o wrote:
> []
>> What I'd like to do long-term here is to change things so that (a)
>> instead of instantiating the extent as uninitialized, writing the
>> data, and then doing the uninit->init conversion to writing the data
>> and then instantiated the extent as initialzied. This would also
>> allow us to get rid of data=ordered mode. And we should make it work
>> for fs block size != page size.
>>
>> It means that we need a way of adding this sort of information into an
>> in-memory extent cache but which isn't saved to disk until the data is
>> written. We've also talked about adding the information about whether
>> an extent is subject to delalloc as well, so we don't have to grovel
>> through the page cache and look at individual buffers attached to the
>> pages. And there are folks who have been experimenting with an
>> in-memory extent tree cache to speed access to fast PCIe-attached
>> flash.
>>
>> It seems to me that if we're careful a single solution should be able
>> to solve all of these problems...
>
> What about current situation, how do you think - should it be ignored
> for now, having in mind that dioread_nolock isn't used often (but it
> gives _serious_ difference in read speed), or, short term, fix this
> very case which have real-life impact already, while implementing a
> long-term solution?
So could you please share with us how you tested and your test results
with/without dioread_nolock? A quick test with fio and an Intel SSD doesn't
show much improvement here.

We are based on RHEL6, where dioread_nolock isn't available yet, and a large
number of our production systems use direct reads and buffered writes. So if
your test proves promising, I guess our company can arrange some
resources to try to work it out.

Thanks
Tao

2011-08-19 03:20:30

by Tao Ma

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

Hi Ted and Jiaying,
On 08/19/2011 02:54 AM, Jiaying Zhang wrote:
> On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <[email protected]> wrote:
>> 17.08.2011 21:02, Ted Ts'o wrote:
>> []
>>> What I'd like to do long-term here is to change things so that (a)
>>> instead of instantiating the extent as uninitialized, writing the
>>> data, and then doing the uninit->init conversion to writing the data
>>> and then instantiated the extent as initialzied. This would also
>>> allow us to get rid of data=ordered mode. And we should make it work
>>> for fs block size != page size.
>>>
>>> It means that we need a way of adding this sort of information into an
>>> in-memory extent cache but which isn't saved to disk until the data is
>>> written. We've also talked about adding the information about whether
>>> an extent is subject to delalloc as well, so we don't have to grovel
>>> through the page cache and look at individual buffers attached to the
>>> pages. And there are folks who have been experimenting with an
>>> in-memory extent tree cache to speed access to fast PCIe-attached
>>> flash.
>>>
>>> It seems to me that if we're careful a single solution should be able
>>> to solve all of these problems...
>>
>> What about current situation, how do you think - should it be ignored
>> for now, having in mind that dioread_nolock isn't used often (but it
>> gives _serious_ difference in read speed), or, short term, fix this
>> very case which have real-life impact already, while implementing a
>> long-term solution?
> I plan to send my patch as a bandaid fix. It doesn't solve the fundamental
> problem but I think it helps close the race you saw on your test. In the long
> term, I agree that we should think about implementing an extent tree cache
> and use it to hold pending uninitialized-to-initialized extent conversions.
Does Google have any plans to work on this soon? We use a large amount
of direct read, and we can arrange some resources to try to work it out.

Thanks
Tao

2011-08-19 07:05:24

by Michael Tokarev

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On 19.08.2011 07:18, Tao Ma wrote:
> Hi Michael,
> On 08/18/2011 02:49 PM, Michael Tokarev wrote:
[]
>> What about current situation, how do you think - should it be ignored
>> for now, having in mind that dioread_nolock isn't used often (but it
>> gives _serious_ difference in read speed), or, short term, fix this
>> very case which have real-life impact already, while implementing a
>> long-term solution?

> So could you please share with us how you test and your test result
> with/without dioread_nolock? A quick test with fio and intel ssd does't
> see much improvement here.

I have used my home-grown quick-n-dirty microbenchmark for years to measure
I/O subsystem performance. Here are the results from a 3.0 kernel on
some Hitachi NAS (FC, on Brocade adaptors) with a 14-drive RAID10 array.

The numbers are all megabytes/sec transferred (read or written), summed
over all threads. The leftmost column is the block size; the next column is
the number of concurrent threads of the same type. The remaining columns are
the tests: linear read, random read, linear write, random write, and
concurrent random read and write.
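
(The mixed random read/write column corresponds roughly to the kind of load
sketched below. This is not the actual benchmark used for these numbers, just
a minimal libaio illustration of concurrent O_DIRECT reads and writes; the
file name, queue depth and sizes are made up. Build with -laio.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK	(64 * 1024)	/* one of the block sizes from the tables */
#define SPAN	(1ULL << 30)	/* region of the file/device to hit */
#define DEPTH	32		/* outstanding requests, like "Trd" above */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[DEPTH], *cbp[DEPTH];
	struct io_event events[DEPTH];
	void *buf[DEPTH];
	int fd = open("/path/to/testfile", O_RDWR | O_DIRECT);

	if (fd < 0 || io_setup(DEPTH, &ctx))
		return 1;
	for (int i = 0; i < DEPTH; i++) {
		if (posix_memalign(&buf[i], 4096, BLK))
			return 1;
		memset(buf[i], 0, BLK);
	}
	for (int round = 0; round < 1000; round++) {
		for (int i = 0; i < DEPTH; i++) {
			long long off =
				(long long)(rand() % (int)(SPAN / BLK)) * BLK;

			if (i & 1)	/* half readers, half writers */
				io_prep_pread(&cbs[i], fd, buf[i], BLK, off);
			else
				io_prep_pwrite(&cbs[i], fd, buf[i], BLK, off);
			cbp[i] = &cbs[i];
		}
		if (io_submit(ctx, DEPTH, cbp) != DEPTH)
			return 1;
		if (io_getevents(ctx, DEPTH, DEPTH, events, NULL) != DEPTH)
			return 1;
	}
	io_destroy(ctx);
	close(fd);
	return 0;
}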

For a raw device:

BlkSz Trd linRd rndRd linWr rndWr rndR/W
4k 1 18.3 0.8 14.5 9.6 0.1/ 9.1
4 2.5 9.4 0.4/ 8.4
32 10.0 9.3 4.7/ 5.4
16k 1 59.4 2.5 49.9 35.7 0.3/ 34.7
4 10.3 36.1 1.5/ 31.4
32 38.5 36.2 17.5/ 20.4
64k 1 118.4 9.1 136.0 106.5 1.1/105.8
4 37.7 108.5 4.7/102.6
32 153.0 108.5 57.9/ 73.3
128k 1 125.9 16.5 138.8 125.8 1.1/125.6
4 68.7 128.7 6.3/122.8
32 277.0 128.7 70.3/ 98.6
1024k 1 89.9 81.2 138.9 134.4 5.0/132.3
4 254.7 137.6 19.2/127.1
32 390.7 137.5 117.2/ 90.1

For ext4fs, 1Tb file, default mount options:

BlkSz Trd linRd rndRd linWr rndWr rndR/W
4k 1 15.7 0.6 15.4 9.4 0.0/ 9.0
4 2.6 9.3 0.0/ 8.9
32 10.0 9.3 0.0/ 8.9
16k 1 47.6 2.5 53.2 34.6 0.1/ 33.6
4 10.2 34.6 0.0/ 33.5
32 39.9 34.8 0.1/ 33.6
64k 1 100.5 9.0 137.0 106.2 0.2/105.8
4 37.8 107.8 0.1/106.1
32 153.9 107.8 0.2/105.9
128k 1 115.4 16.3 138.6 125.2 0.3/125.3
4 68.8 127.8 0.2/125.6
32 274.6 127.8 0.2/126.2
1024k 1 124.5 54.2 138.9 133.6 1.0/133.3
4 159.5 136.6 0.2/134.3
32 349.7 136.5 0.3/133.6

And for a 1tb file on ext4fs with dioread_nolock:

BlkSz Trd linRd rndRd linWr rndWr rndR/W
4k 1 15.7 0.6 14.6 9.4 0.1/ 9.0
4 2.6 9.4 0.3/ 8.6
32 10.0 9.4 4.5/ 5.3
16k 1 50.9 2.4 56.7 36.0 0.3/ 35.2
4 10.1 36.4 1.5/ 34.6
32 38.7 36.4 17.3/ 21.0
64k 1 95.2 8.9 136.5 106.8 1.0/106.3
4 37.7 108.4 5.2/103.3
32 152.7 108.6 57.4/ 74.0
128k 1 115.1 16.3 138.8 125.8 1.2/126.4
4 68.9 128.5 5.7/124.0
32 276.1 128.6 70.8/ 98.5
1024k 1 128.5 81.9 138.9 134.4 5.1/132.3
4 253.4 137.4 19.1/126.8
32 385.1 137.4 111.7/ 92.3

These are the complete test results. The first four result
columns are nearly identical; the difference is
in the last column. Here are those columns side by side:

BlkSz Trd Raw Ext4nolock Ext4dflt
4k 1 0.1/ 9.1 0.1/ 9.0 0.0/ 9.0
4 0.4/ 8.4 0.3/ 8.6 0.0/ 8.9
32 4.7/ 5.4 4.5/ 5.3 0.0/ 8.9
16k 1 0.3/ 34.7 0.3/ 35.2 0.1/ 33.6
4 1.5/ 31.4 1.5/ 34.6 0.0/ 33.5
32 17.5/ 20.4 17.3/ 21.0 0.1/ 33.6
64k 1 1.1/105.8 1.0/106.3 0.2/105.8
4 4.7/102.6 5.2/103.3 0.1/106.1
32 57.9/ 73.3 57.4/ 74.0 0.2/105.9
128k 1 1.1/125.6 1.2/126.4 0.3/125.3
4 6.3/122.8 5.7/124.0 0.2/125.6
32 70.3/ 98.6 70.8/ 98.5 0.2/126.2
1024k 1 5.0/132.3 5.1/132.3 1.0/133.3
4 19.2/127.1 19.1/126.8 0.2/134.3
32 117.2/ 90.1 111.7/ 92.3 0.3/133.6

Ext4 with dioread_nolock (middle column) behaves close to
the raw device. But default ext4 greatly prefers writes over
reads; reads are almost non-existent.

This is, again, more or less a microbenchmark. Where it
comes from is my attempt to simulate an (oracle) database
workload (many years ago, when larger and now more standard
benchmarks weren't freely available). And there,
on a busy DB, the difference is quite visible.
In short, any writer makes all readers wait. Once
we start writing something, all users notice immediately.
With dioread_nolock they don't complain anymore.

There's some more background to all this. Right
now I'm evaluating a new machine for our current database.
The old hardware had 2Gb of RAM, so it was under _significant_ memory
pressure and lots of stuff couldn't be cached.
The new machine has 128Gb of RAM, which will ensure that
all the important stuff is in cache. So the effect of this
read/write imbalance will be much less visible.

For example, we have a dictionary (several tables) with
addresses - towns, streets, even buildings. When the operators
enter customer information they search in these dictionaries.
With the current 2Gb of memory these dictionaries can't be
kept in memory, so they get read from disk again every
time someone enters customer information, which is
what the operators do all the time. So no doubt disk access is
very important here.

On the new hardware, obviously, all these dictionaries will
be in memory after the first access, so even if all reads have to
wait till a write completes, it won't be as dramatic as
it is now.

That is to say - maybe I'm really paying too much attention
to the wrong problem. So far, on the new machine, I don't see
an actual noticeable difference between dioread_nolock and
running without that option.

(BTW, I found no way to remount a filesystem to EXclude
that option; I have to umount and mount it in order to
switch from using dioread_nolock to not using it. Is
there a way?)

Thanks,

/mjt

> We are based on RHEL6, and dioread_nolock isn't there by now and a large
> number of our product system use direct read and buffer write. So if
> your test proves to be promising, I guess our company can arrange some
> resources to try to work it out.
>
> Thanks
> Tao


2011-08-19 17:55:07

by Jiaying Zhang

[permalink] [raw]
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)

On Fri, Aug 19, 2011 at 12:05 AM, Michael Tokarev <[email protected]> wrote:
> On 19.08.2011 07:18, Tao Ma wrote:
>> Hi Michael,
>> On 08/18/2011 02:49 PM, Michael Tokarev wrote:
> []
>>> What about current situation, how do you think - should it be ignored
>>> for now, having in mind that dioread_nolock isn't used often (but it
>>> gives _serious_ difference in read speed), or, short term, fix this
>>> very case which have real-life impact already, while implementing a
>>> long-term solution?
>
>> So could you please share with us how you test and your test result
>> with/without dioread_nolock? A quick test with fio and intel ssd does't
>> see much improvement here.
>
> I used my home-grown quick-n-dirty microbenchmark for years to measure
> i/o subsystem performance.  Here are the results from 3.0 kernel on
> some Hitachi NAS (FC, on brocade adaptors), 14-drive raid10 array.
>
> The numbers are all megabytes/sec transferred (read or written), summed
> for all threads.  Leftmost column is the block size; next column is the
> number of concurrent threads of the same type.  And the columns are
> tests: linear read, random read, linear write, random write, and
> concurrent random read and write.
>
> For a raw device:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  18.3   0.8  14.5   9.6   0.1/  9.1
>        4         2.5         9.4   0.4/  8.4
>       32        10.0         9.3   4.7/  5.4
>  16k   1  59.4   2.5  49.9  35.7   0.3/ 34.7
>        4        10.3        36.1   1.5/ 31.4
>       32        38.5        36.2  17.5/ 20.4
>  64k   1 118.4   9.1 136.0 106.5   1.1/105.8
>        4        37.7       108.5   4.7/102.6
>       32       153.0       108.5  57.9/ 73.3
>  128k   1 125.9  16.5 138.8 125.8   1.1/125.6
>        4        68.7       128.7   6.3/122.8
>       32       277.0       128.7  70.3/ 98.6
> 1024k   1  89.9  81.2 138.9 134.4   5.0/132.3
>        4       254.7       137.6  19.2/127.1
>       32       390.7       137.5 117.2/ 90.1
>
> For ext4fs, 1Tb file, default mount options:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  15.7   0.6  15.4   9.4   0.0/  9.0
>        4         2.6         9.3   0.0/  8.9
>       32        10.0         9.3   0.0/  8.9
>  16k   1  47.6   2.5  53.2  34.6   0.1/ 33.6
>        4        10.2        34.6   0.0/ 33.5
>       32        39.9        34.8   0.1/ 33.6
>  64k   1 100.5   9.0 137.0 106.2   0.2/105.8
>        4        37.8       107.8   0.1/106.1
>       32       153.9       107.8   0.2/105.9
>  128k   1 115.4  16.3 138.6 125.2   0.3/125.3
>        4        68.8       127.8   0.2/125.6
>       32       274.6       127.8   0.2/126.2
> 1024k   1 124.5  54.2 138.9 133.6   1.0/133.3
>        4       159.5       136.6   0.2/134.3
>       32       349.7       136.5   0.3/133.6
>
> And for a 1tb file on ext4fs with dioread_nolock:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  15.7   0.6  14.6   9.4   0.1/  9.0
>        4         2.6         9.4   0.3/  8.6
>       32        10.0         9.4   4.5/  5.3
>  16k   1  50.9   2.4  56.7  36.0   0.3/ 35.2
>        4        10.1        36.4   1.5/ 34.6
>       32        38.7        36.4  17.3/ 21.0
>  64k   1  95.2   8.9 136.5 106.8   1.0/106.3
>        4        37.7       108.4   5.2/103.3
>       32       152.7       108.6  57.4/ 74.0
>  128k   1 115.1  16.3 138.8 125.8   1.2/126.4
>        4        68.9       128.5   5.7/124.0
>       32       276.1       128.6  70.8/ 98.5
> 1024k   1 128.5  81.9 138.9 134.4   5.1/132.3
>        4       253.4       137.4  19.1/126.8
>       32       385.1       137.4 111.7/ 92.3
>
> These are complete test results.  First 4 result
> columns are merely identical, the difference is
> within last column.  Here they are together:
>
> BlkSz Trd     Raw      Ext4nolock  Ext4dflt
>   4k   1   0.1/  9.1   0.1/  9.0  0.0/  9.0
>        4   0.4/  8.4   0.3/  8.6  0.0/  8.9
>       32   4.7/  5.4   4.5/  5.3  0.0/  8.9
>  16k   1   0.3/ 34.7   0.3/ 35.2  0.1/ 33.6
>        4   1.5/ 31.4   1.5/ 34.6  0.0/ 33.5
>       32  17.5/ 20.4  17.3/ 21.0  0.1/ 33.6
>  64k   1   1.1/105.8   1.0/106.3  0.2/105.8
>        4   4.7/102.6   5.2/103.3  0.1/106.1
>       32  57.9/ 73.3  57.4/ 74.0  0.2/105.9
>  128k   1   1.1/125.6   1.2/126.4  0.3/125.3
>        4   6.3/122.8   5.7/124.0  0.2/125.6
>       32  70.3/ 98.6  70.8/ 98.5  0.2/126.2
> 1024k   1   5.0/132.3   5.1/132.3  1.0/133.3
>        4  19.2/127.1  19.1/126.8  0.2/134.3
>       32 117.2/ 90.1 111.7/ 92.3  0.3/133.6
>
> Ext4 with dioread_nolock (middle column) behaves close to
> raw device.  But default ext4 greatly prefers writes over
> reads, reads are almost non-existent.
>
> This is, again, more or less a microbenchmark.  Where it
> come from is my attempt to simulate an (oracle) database
> workload (many years ago, when larger and more standard
> now benchmarks weren't (freely) available).  And there,
> on a busy DB, the difference is quite well-visible.
> In short, any writer makes all readers to wait.  Once
> we start writing something, all users immediately notice.
> With dioread_nolock they don't complain anymore.
>
> There's some more background around this all.  Right
> now I'm evaluating a new machine for our current database.
> Old hardware had 2Gb RAM so it had _significant_ memory
> pressure, and lots of stuff weren't able to be cached.
> New machine has 128Gb of RAM, which will ensure that
> all important stuff is in cache.  So the effect of this
> read/write disbalance will be much less visible.
>
> For example, we've a dictionary (several tables) with
> addresses - towns, streets, even buildings.  When they
> enter customer information they search in these dicts.
> With current 2Gb memory thses dictionaries can't be
> kept in memory, so they gets read from disk again every
> time someone enters customer information, and this is
> what they do all the time.  So no doubt disk access is
> very important here.
>
> On a new hardware, obviously, all these dictionaries will
> be in memory after first access, so even if all reads will
> wait till any write completes, it wont be as dramatic as
> it is now.
>
> That to say, -- maybe I'm really paying too much attention
> for a wrong problem.  So far, on a new machine, I don't see
> actual noticeable difference between dioread_nolock and
> without that option.
>
> (BTW, I found no way to remount a filesystem to EXclude
> that option, I have to umount and mount it in order to
> switch from using dioread_nolock to not using it.  Is
> there a way?)
I think the command to do this is:
mount -o remount,dioread_lock /dev/xxx <mountpoint>

Now looking at this, I guess it is not very intuitive that the option to
turn off dioread_nolock is dioread_lock instead of nodioread_nolock,
but nodioread_nolock does look ugly. Maybe we should try to support
both ways.

Jiaying
>
> Thanks,
>
> /mjt
>
>> We are based on RHEL6, and dioread_nolock isn't there by now and a large
>> number of our product system use direct read and buffer write. So if
>> your test proves to be promising, I guess our company can arrange some
>> resources to try to work it out.
>>
>> Thanks
>> Tao
>
>