2013-05-15 07:15:07

by EUNBONG SONG

Subject: Question about ext4 excessive stall time


Hello.
I saw some commit messages about excessive stall times in ext4.
What looks like the same problem was reproduced on my board, though I'm not sure it is exactly the same problem.
The call trace is very similar to the one in commit f783f091.

My board configuration is as follows:
Linux: 2.6.32.60
CPU: 8 CPUs
Disk: SSD
File system: ext4

I know my kernel version is very old. I just want to understand why this problem happens.
Is it because my kernel is old, or because of the disk?
If anyone knows about this problem, could you help me?

As I said, the call trace is very similar to the one in commit f783f091, as shown below.
I enabled hung task detection and set the hung task timeout to 120 seconds,
so the call traces below were printed by the hung task detector.

[ 262.455615] INFO: task swm:1692 blocked for more than 120 seconds.
[ 262.461715] Stack : 0000000000000000 ffffffff8011f684 0000000000000000 a8000001f8b33960
[ 262.469438] a8000001f7f2ce38 00000000ffffb6a9 0000000000000005 ffffffff80efa270
[ 262.477424] a8000001f7f2d0e8 0000000000000020 0000000000000000 ffffffff80ec69a0
[ 262.485732] 00000001f90a3a50 a800000107c93020 a8000001f9062eb8 a8000001f9302950
[ 262.493708] a8000001f5121280 a8000001f90a3810 a8000001f7e20000 0000000000000000
[ 262.501675] 0000000000000004 a800000107c93020 a8000001f8b339c0 ffffffff8011cd50
[ 262.509661] a8000001f9062eb8 ffffffff8011f684 0000000000000001 ffffffff8011cea8
[ 262.517647] a8000001f9302950 ffffffff804295a0 a8000001f9302950 0000001800000000
[ 262.525652] 0000000000000000 a8000001f7f2ce38 ffffffff802e3f00 a8000001f4f73bb8
[ 262.533631] a800000107c93038 ffffffff8043987c 0000000000000000 a8000001f9302950
[ 262.541603] ...
[ 262.544208] Call Trace:
[ 262.546668] [<ffffffff8011c598>] __schedule_nobkl+0x278/0x900
[ 262.552391] [<ffffffff8011cd50>] __schedule+0x48/0x80
[ 262.557431] [<ffffffff8011cea8>] schedule+0x10/0x28
[ 262.562269] [<ffffffff804295a0>] do_get_write_access+0x470/0x6a8
[ 262.568285] [<ffffffff80434620>] jbd2_journal_get_write_access+0x30/0x58
[ 262.574939] [<ffffffff804171a8>] __ext4_journal_get_write_access+0x48/0x88
[ 262.581817] [<ffffffff803fc0f8>] ext4_reserve_inode_write+0x80/0xa8
[ 262.588056] [<ffffffff803fc174>] ext4_mark_inode_dirty+0x54/0x1e8
[ 262.594121] [<ffffffff803fc4a0>] ext4_dirty_inode+0x38/0x70
[ 262.599745] [<ffffffff80381a98>] __mark_inode_dirty+0x40/0x228
[ 262.605493] [<ffffffff803747bc>] file_update_time+0xec/0x190
[ 262.611159] [<ffffffff8031cb20>] __generic_file_aio_write+0x1f8/0x3e8
[ 262.617568] [<ffffffff8031cd70>] generic_file_aio_write+0x60/0xc0
[ 262.623636] [<ffffffff8035c1f4>] do_sync_write+0xbc/0x120
[ 262.629037] [<ffffffff8035ce24>] vfs_write+0xb4/0x178
[ 262.634049] [<ffffffff8035cfd0>] SyS_write+0x48/0xa0
[ 262.639022] [<ffffffff80102fc4>] handle_sys64+0x44/0x60
[ 262.644221]
[ 262.645734] INFO: task evm:1814 blocked for more than 120 seconds.
[ 262.651855] Stack : 0000000000000000 a8000001f442b930 0000000000000000 0000000000100100
[ 262.659651] a8000001f44fc5b8 00000000ffffb4e6 0000000000000000 ffffffff80efa270
[ 262.667637] a8000001f44fc868 0000000000000001 0000000000000000 ffffffff80ec69a0
[ 262.675643] a8000001f5da63a0 a800000107ca2020 a8000001f9214438 a8000001f5da63a0
[ 262.683608] a8000001f5121280 a8000001f90a3798 a8000001f7e20000 0000000000000000
[ 262.691594] 0000000000000004 a800000107ca2020 a8000001f442ba70 ffffffff8011cd50
[ 262.699579] a8000001f9214438 ffffffff8011f684 0000000000000001 ffffffff8011cea8
[ 262.707565] a8000001f5da63a0 ffffffff804295a0 a8000001f5da63a0 0000001800000000
[ 262.715576] 0000000000000000 a8000001f44fc5b8 ffffffff802e3f00 a800000107ca2038
[ 262.723536] a800000107ca2038 ffffffff8043987c 0000000000000000 a8000001f5da63a0
[ 262.731522] ...
[ 262.734126] Call Trace:
[ 262.736578] [<ffffffff8011c598>] __schedule_nobkl+0x278/0x900
[ 262.742312] [<ffffffff8011cd50>] __schedule+0x48/0x80
[ 262.747373] [<ffffffff8011cea8>] schedule+0x10/0x28
[ 262.752185] [<ffffffff804295a0>] do_get_write_access+0x470/0x6a8
[ 262.758220] [<ffffffff80434620>] jbd2_journal_get_write_access+0x30/0x58
[ 262.764857] [<ffffffff804171a8>] __ext4_journal_get_write_access+0x48/0x88
[ 262.771738] [<ffffffff803f94f0>] ext4_new_inode+0x290/0x1298
[ 262.777367] [<ffffffff80406c40>] ext4_create+0xe8/0x1e0
[ 262.782566] [<ffffffff80367c58>] vfs_create+0xf8/0x180
[ 262.787705] [<ffffffff8036bcc8>] do_filp_open+0xab0/0xbb0
[ 262.793069] [<ffffffff8035a060>] do_sys_open+0x78/0x170
[ 262.798293] [<ffffffff80102fc4>] handle_sys64+0x44/0x60
[ 262.803480]

[ 262.804984] INFO: task logrotate:2422 blocked for more than 120 seconds.
[ 262.811674] Stack : 0000000000000000 0000000000000000 0000000000000000 ffffffff8040ebb8
[ 262.819451] a8000001f46b6738 00000000ffffafaf 0000000000000000 ffffffff80efa270
[ 262.827436] a8000001f46b69e8 0000000000000001 0000000000000000 ffffffff80ec69a0
[ 262.835421] ffffffff80ec6d48 a800000107c93020 a8000001f9062eb8 a8000001f9302950
[ 262.843407] a8000001f5121280 a8000001f90a3bb8 a8000001f7e20000 0000000000000000
[ 262.851393] 0000000000000004 a800000107c93020 a8000001f4f73ba0 ffffffff8011cd50
[ 262.859378] a8000001f9062eb8 ffffffff8011f684 0000000000000001 ffffffff8011cea8
[ 262.867364] a8000001f9302950 ffffffff804295a0 a8000001f9302950 0000001800000000
[ 262.875349] 0000000000000000 a8000001f46b6738 ffffffff802e3f00 a800000107c93038
[ 262.883335] a8000001f8b339d8 ffffffff8043987c 0000000000000000 a8000001f9302950
[ 262.891321] ...
[ 262.893925] Call Trace:
[ 262.896380] [<ffffffff8011c598>] __schedule_nobkl+0x278/0x900
[ 262.902089] [<ffffffff8011cd50>] __schedule+0x48/0x80
[ 262.907143] [<ffffffff8011cea8>] schedule+0x10/0x28
[ 262.911985] [<ffffffff804295a0>] do_get_write_access+0x470/0x6a8
[ 262.917994] [<ffffffff80434620>] jbd2_journal_get_write_access+0x30/0x58
[ 262.924655] [<ffffffff804171a8>] __ext4_journal_get_write_access+0x48/0x88
[ 262.931534] [<ffffffff803fc0f8>] ext4_reserve_inode_write+0x80/0xa8
[ 262.937775] [<ffffffff804075c8>] ext4_orphan_add+0x130/0x298
[ 262.943405] [<ffffffff804027a4>] ext4_setattr+0x1bc/0x3e0
[ 262.948806] [<ffffffff80376f74>] notify_change+0x1a4/0x3a0
[ 262.954255] [<ffffffff8035b140>] do_truncate+0x58/0xa8
[ 262.959397] [<ffffffff8035b9cc>] SyS_ftruncate+0x104/0x160
[ 262.964845] [<ffffffff80102fc4>] handle_sys64+0x44/0x60

Thanks.


2013-05-15 12:24:19

by Theodore Ts'o

Subject: Re: Question about ext4 excessive stall time

On Wed, May 15, 2013 at 07:15:02AM +0000, EUNBONG SONG wrote:
> I know my kernel version is very old. I just want to understand why
> this problem happens. Is it because my kernel is old, or because of
> the disk? If anyone knows about this problem, could you help me?

So what's happening is this. The CFQ I/O scheduler prioritizes reads
over writes, since most reads are synchronous (for example, if the
compiler is waiting for the data blocks from include/unistd.h, it can't
make forward progress until it receives those blocks; there is an
exception for readahead blocks, but those are dealt with at a low
priority), and most writes are asynchronous (since they are issued by
the writeback daemons, and unless we are doing an fsync, no one is
waiting for them).
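
Roughly speaking, the classification boils down to something like the
sketch below: every read counts as synchronous, while a write counts as
synchronous only if it is explicitly flagged that way. This is a
simplified illustration, with flag names as they appear in ~3.x kernels
(the 2.6.32 spelling differs), not the literal scheduler code.

#include <linux/blk_types.h>

/*
 * Simplified sketch of the sync/async classification the block layer
 * gives CFQ.  CFQ services the synchronous class ahead of the
 * asynchronous one, so plain writeback writes can be starved by a
 * steady stream of reads.
 */
static inline bool request_is_sync(unsigned int cmd_flags)
{
	/* Reads are always treated as sync; writes only when REQ_SYNC
	 * is set (O_DIRECT, fsync, ...).  Writeback writes are async. */
	return !(cmd_flags & REQ_WRITE) || (cmd_flags & REQ_SYNC);
}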

The problem comes when a metadata block, usually one which is shared
across multiple files is undergoing writeback, such as an inode table
block or an allocation bitmap block. The write gets issued as a low
priority I/O operation. Then during the next jbd2 transaction,
some userspace operation needs to modify that metadata block, and in
order to do that, it has to call jbd2_journal_get_write_access(). But
if there is heavy read traffic going on, due to some other process
using the disk a lot, the writeback operation may end up getting
starved, and doesn't get acted on for a very long time.

But the moment a process called jbd2_journal_get_write_access(), the
write has effectively become one which is synchronous, in that forward
progress of at least one process is now getting blocked waiting for
this I/O to complete, since the buffer_head is locked for writeback,
possibly for hundreds or thousands of milliseconds, and
jbd2_journal_get_write_access() can not proceed until it can get the
buffer_head lock.
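
That wait is exactly what the traces above show under
do_get_write_access(). As a rough sketch (simplified, not the literal
jbd2 code), the pattern is:

#include <linux/buffer_head.h>

/*
 * Simplified sketch of the wait described above.  The journalling code
 * must take the buffer lock before it can copy or modify the metadata
 * buffer, and lock_buffer() sleeps until whoever holds the lock --
 * here, a starved low-priority writeback write -- releases it when the
 * I/O completes.
 */
static void wait_for_metadata_buffer(struct buffer_head *bh)
{
	/*
	 * If the buffer was submitted as an async writeback write and
	 * that write is stuck behind heavy read traffic, this is where
	 * the task sits for the whole "blocked for more than 120
	 * seconds" interval.
	 */
	lock_buffer(bh);

	/* ... transaction bookkeeping happens here in the real code ... */

	unlock_buffer(bh);
}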

This was discussed at last month's Linux Storage, File System, and MM
workshop. The right solution is for lock_buffer() to notice if the
buffer head has been locked for writeback, and if so, to bump the
write request to the head of the elevator. Jeff Moyer is looking at
this.
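
Conceptually the proposal looks something like the sketch below. The
elevator_boost_bh_write() helper is hypothetical -- no such interface
exists in mainline -- it just stands in for "tell the I/O scheduler to
move the already-queued write to the front of the line":

#include <linux/buffer_head.h>

/*
 * Conceptual sketch of the proposed fix, not merged code.
 * elevator_boost_bh_write() is a hypothetical helper.
 */
static void lock_buffer_boosted(struct buffer_head *bh)
{
	if (!trylock_buffer(bh)) {
		/*
		 * Someone holds the lock -- most likely an async
		 * writeback write queued behind a stream of reads.
		 * Ask the I/O scheduler to expedite that request
		 * before we go to sleep waiting for it.
		 */
		elevator_boost_bh_write(bh);	/* hypothetical */
		__lock_buffer(bh);
	}
}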

The partial workaround which will be in 3.10 is that we're marking all
metadata writes with REQ_META and REQ_PRIO. This will cause metadata
writebacks to be prioritized at the same priority level as synchronous
reads. If there is heavy read traffic, the metadata writebacks will
still be in competition with the reads, but at least they will
complete.
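
In terms of the submission path, the workaround amounts to flagging
metadata buffer writes roughly as in the sketch below (a simplified
illustration of what commit 9f203507ed277 does, not the literal
ext4/jbd2 code):

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/blk_types.h>

/*
 * Simplified sketch of the 3.10 workaround: metadata buffer writes are
 * submitted with REQ_META | REQ_PRIO, so CFQ schedules them in the
 * same class as synchronous reads instead of as starvable async
 * writeback writes.
 */
static void write_metadata_buffer(struct buffer_head *bh)
{
	lock_buffer(bh);
	if (test_clear_buffer_dirty(bh)) {
		get_bh(bh);		/* dropped by the end_io handler */
		bh->b_end_io = end_buffer_write_sync;
		submit_bh(WRITE | REQ_META | REQ_PRIO, bh);
	} else {
		unlock_buffer(bh);
	}
}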

Once we get priority escalation (or priority inheritance, because what
we're seeing here is really a classic priority inversion problem),
then it would make sense for us to no longer set REQ_PRIO for metadata
writebacks, so the metadata writebacks only get prioritized when they
are blocking some process from making forward progress. (Doing this
will probably result in a slight performance degradation on some
workloads, but it will improve others with heavy read traffic and
minimal writeback interference. We'll want to benchmark what
percentage of metadata writebacks require getting bumped to the head
of the line, but I suspect it will be the right choice.)

If you want to try to backport this workaround to your older kernel,
please see commit 9f203507ed277.

Regards,

- Ted

2013-05-15 22:07:40

by EUNBONG SONG

Subject: Re: Re: Question about ext4 excessive stall time


> On Wed, May 15, 2013 at 07:15:02AM +0000, EUNBONG SONG wrote:
> > I know my kernel version is very old. I just want to understand why
> > this problem happens. Is it because my kernel is old, or because of
> > the disk? If anyone knows about this problem, could you help me?

> So what's happening is this. The CFQ I/O scheduler prioritizes reads
> over writes, since most reads are synchronous (for example, if the
> compiler is waiting for the data blocks from include/unistd.h, it can't
> make forward progress until it receives those blocks; there is an
> exception for readahead blocks, but those are dealt with at a low
> priority), and most writes are asynchronous (since they are issued by
> the writeback daemons, and unless we are doing an fsync, no one is
> waiting for them).
>
> The problem comes when a metadata block, usually one which is shared
> across multiple files is undergoing writeback, such as an inode table
> block or an allocation bitmap block. The write gets issued as a low
> priority I/O operation. Then during the next jbd2 transaction,
> some userspace operation needs to modify that metadata block, and in
> order to do that, it has to call jbd2_journal_get_write_access(). But
> if there is heavy read traffic going on, due to some other process
> using the disk a lot, the writeback operation may end up getting
> starved, and doesn't get acted on for a very long time.
>
> But the moment a process called jbd2_journal_get_write_access(), the
> write has effectively become one which is synchronous, in that forward
> progress of at least one process is now getting blocked waiting for
> this I/O to complete, since the buffer_head is locked for writeback,
> possibly for hundreds or thousands of milliseconds, and
> jbd2_journal_get_write_access() can not proceed until it can get the
> buffer_head lock.
>
> This was discussed at last month's Linux Storage, File System, and MM
> workshop. The right solution is for lock_buffer() to notice if the
> buffer head has been locked for writeback, and if so, to bump the
> write request to the head of the elevator. Jeff Moyer is looking at
> this.
>
> The partial workaround which will be in 3.10 is that we're marking all
> metadata writes with REQ_META and REQ_PRIO. This will cause metadata
> writebacks to be prioritized at the same priority level as synchronous
> reads. If there is heavy read traffic, the metadata writebacks will
> still be in competition with the reads, but at least they will
> complete.
>
> Once we get priority escalation (or priority inheritance, because what
> we're seeing here is really a classic priority inversion problem),
> then it would make sense for us to no longer set REQ_PRIO for metadata
> writebacks, so the metadata writebacks only get prioritized when they
> are blocking some process from making forward progress. (Doing this
> will probably result in a slight performance degradation on some
> workloads, but it will improve others with heavy read traffic and
> minimal writeback interference. We'll want to benchmark what
> percentage of metadata writebacks require getting bumped to the head
> of the line, but I suspect it will be the right choice.)
>
> If you want to try to backport this workaround to your older kernel,
> please see commit 9f203507ed277.


Hi, Ted.
I really appreciate your fantastic explanation. It's great and very helpful.
Now I understand this issue, thanks to you.

Thanks!
EunBong
