2014-04-09 09:14:38

by liang xie

Subject: Question about slow buffered io

Hi,

I am an Apache HDFS/HBase developer and I am debugging a slow buffered io
issue on ext4. I saw some slow sys_write calls caused by:
(mount -o noatime)
0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]

This seems to be caused by delayed allocation, right? So I reran with "mount -o
noatime,nodiratime,data=writeback,nodelalloc"; unfortunately, I saw
another stack trace contributing high latency:
0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
and from the HDFS/HBase side, no obvious improvement was found either.

In both scenarios, the following stack trace was hit as well:
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]

My questions are:
1) What is the ext4 best practice for a low-latency, append-only workload
like HBase? Are there any recommended options I could try, e.g. the
flex_bg size or nomballoc?
2) For the last stack trace, does commit
9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big win?
(I haven't run on 3.10+ so far and it's inconvenient to bump the kernel
version on my cluster currently, so forgive me if this is a stupid
question.)

PS: My current kernel is 2.6.32-220.

Liang Xie
Software Engineer, Xiaomi


2014-04-09 09:38:58

by Jan Kara

Subject: Re: Question about slow buffered io

Hello,

On Wed 09-04-14 17:14:37, liang xie wrote:
> I am an Apache HDFS/HBase developer and I am debugging a slow buffered io
> issue on ext4. I saw some slow sys_write calls caused by:
> (mount -o noatime)
> 0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
> 0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
> 0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
> 0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
> 0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
> 0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
> 0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]
>
> This seems to be caused by delayed allocation, right? So I reran with "mount -o
> noatime,nodiratime,data=writeback,nodelalloc"; unfortunately, I saw
> another stack trace contributing high latency:
> 0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
> 0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
> 0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
> 0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
> 0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
> 0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
> 0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
> 0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
> 0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
> 0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
> and from the HDFS/HBase side, no obvious improvement was found either.
>
> In both scenarios, the following stack trace was hit as well:
> 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
> 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
> 0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
> 0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
>
> My questions are:
> 1) What is the ext4 best practice for a low-latency, append-only workload
> like HBase? Are there any recommended options I could try, e.g. the
> flex_bg size or nomballoc?
> 2) For the last stack trace, does commit
> 9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big win?
> (I haven't run on 3.10+ so far and it's inconvenient to bump the kernel
> version on my cluster currently, so forgive me if this is a stupid
> question.)
>
> PS: My current kernel is 2.6.32-220.
This kernel is way too old and ext4 at that time was a lot different from
what it is now. Also, I'm not sure what the -220 suffix means; it suggests
that you carry additional patches on top of stock 2.6.32, which makes
giving any suggestions even harder. So I'm afraid we cannot help you much.

From the traces it seems to me that the processes are waiting for IO to
complete. You might want to try finding out why the IO takes so long to
complete. Maybe it's an IO scheduler issue?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-04-09 10:45:09

by Zheng Liu

Subject: Re: Question about slow buffered io

On Wed, Apr 09, 2014 at 05:14:37PM +0800, liang xie wrote:
> Hi,
>
> I am an Apache HDFS/HBase developer and I am debugging a slow buffered io
> issue on ext4. I saw some slow sys_write calls caused by:
> (mount -o noatime)
> 0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
> 0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
> 0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
> 0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
> 0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
> 0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
> 0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]

Delalloc can obviously cause a latency spike because of i_data_sem.
When the flusher thread tries to write out some dirty pages, it grabs
i_data_sem and allocates blocks for those dirty pages. If an application
tries to do a buffered write at that time, i_data_sem also needs to be
taken, so the application has to wait on writeback.

>
> This seems to be caused by delayed allocation, right? So I reran with "mount -o
> noatime,nodiratime,data=writeback,nodelalloc"; unfortunately, I saw
> another stack trace contributing high latency:
> 0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
> 0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
> 0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
> 0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
> 0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
> 0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
> 0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
> 0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
> 0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
> 0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
> and from the HDFS/HBase side, no obvious improvement was found either.

From the call trace, it seems that we are waiting to read some metadata
for block allocation.

>
> In both scenarios, the following stack trace was hit as well:
> 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
> 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
> 0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
> 0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
>
> My questions are:
> 1) What is the ext4 best practice for a low-latency, append-only workload
> like HBase? Are there any recommended options I could try, e.g. the
> flex_bg size or nomballoc?

We do the following things in our production system in order to avoid
latency spikes:
1. -o nodelalloc
2. -o data=writeback
3. disable stable page writes

> 2) For the last stack trace, does commit
> 9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big win?
> (I haven't run on 3.10+ so far and it's inconvenient to bump the kernel
> version on my cluster currently, so forgive me if this is a stupid
> question.)

TBH, I don't know. But it is not very hard to backport this patch into
your kernel.

BTW, as far as I understand, Hadoop just does some parallel buffered
append writes, right? Could you please write a simple program to
reproduce this problem?
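
For example, something along the lines of the sketch below might already
be enough: several threads doing buffered appends to separate files and
reporting every slow write().  (The thread count, write size and latency
threshold are pure guesses at your workload, so please adjust them.)

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 8		/* parallel appenders */
#define WRITE_SZ (64 * 1024)	/* size of each buffered write */
#define NWRITES  100000		/* writes per thread */
#define SLOW_US  10000		/* report writes slower than 10ms */

/* Each thread appends to its own file and reports slow write() calls. */
static void *appender(void *arg)
{
	long id = (long)arg, i, us;
	char path[64], buf[WRITE_SZ];
	struct timespec t0, t1;
	int fd;

	memset(buf, 'a', sizeof(buf));
	snprintf(path, sizeof(path), "appendtest.%ld", id);
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return NULL;
	}

	for (i = 0; i < NWRITES; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			break;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		us = (t1.tv_sec - t0.tv_sec) * 1000000L +
		     (t1.tv_nsec - t0.tv_nsec) / 1000L;
		if (us > SLOW_US)
			fprintf(stderr, "thread %ld, write %ld: %ld us\n",
				id, i, us);
	}
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, appender, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

(Build with "gcc -O2 -pthread reproducer.c -lrt" and run it inside a
directory on one of your ext4 data partitions.)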

Regards,
- Zheng

2014-04-10 02:58:12

by liang xie

Subject: Re: Question about slow buffered io

On Wed, Apr 9, 2014 at 6:51 PM, Zheng Liu <[email protected]> wrote:
> On Wed, Apr 09, 2014 at 05:14:37PM +0800, liang xie wrote:
>> Hi,
>>
>> I am an Apache HDFS/HBase developer and I am debugging a slow buffered io
>> issue on ext4. I saw some slow sys_write calls caused by:
>> (mount -o noatime)
>> 0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
>> 0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
>> 0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
>> 0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
>> 0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
>> 0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
>> 0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]
>
> Delalloc can obviously cause a latency spike because of i_data_sem.
> When the flusher thread tries to write out some dirty pages, it grabs
> i_data_sem and allocates blocks for those dirty pages. If an application
> tries to do a buffered write at that time, i_data_sem also needs to be
> taken, so the application has to wait on writeback.
>
Cool, got it :)

>>
>> This seems to be caused by delayed allocation, right? So I reran with "mount -o
>> noatime,nodiratime,data=writeback,nodelalloc"; unfortunately, I saw
>> another stack trace contributing high latency:
>> 0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
>> 0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
>> 0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
>> 0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
>> 0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
>> 0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
>> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
>> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
>> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
>> 0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
>> 0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
>> 0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
>> 0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
>> and from the HDFS/HBase side, no obvious improvement was found either.
>
> From the call trace, it seems that we are waiting to read some metadata
> for block allocation.
Any ideas on relieving the write stall caused by it?

>
>>
>> In both scenarios, the following stack trace was hit as well:
>> 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
>> 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
>> 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
>> 0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
>> 0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
>> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
>> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
>> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
>>
>> My questions are:
>> 1) What is the ext4 best practice for a low-latency, append-only workload
>> like HBase? Are there any recommended options I could try, e.g. the
>> flex_bg size or nomballoc?
>
> We do the following things in our production system in order to avoid
> latency spikes:
> 1. -o nodelalloc
> 2. -o data=writeback
> 3. disable stable page writes

ok

>> 2) For the last stack trace, does commit
>> 9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big win?
>> (I haven't run on 3.10+ so far and it's inconvenient to bump the kernel
>> version on my cluster currently, so forgive me if this is a stupid
>> question.)
>
> TBH, I don't know. But it is not very hard to backport this patch into
> your kernel.
>
> BTW, as far as I understand, Hadoop just does some parallel buffered
> append writes, right?

Not exactly. Per my current understanding, the HDFS data files are always
append-only writes, but the meta files are a different story; they may
occasionally issue a small overwrite under special conditions.

> Could you please write a simple program to reproduce this problem?
No problem, I will do that once I get a chance.
Thanks for your nice reply, Zheng :)

>
> Regards,
> - Zheng

2014-04-10 04:39:20

by Zheng Liu

Subject: Re: Question about slow buffered io

On Thu, Apr 10, 2014 at 10:58:11AM +0800, liang xie wrote:
> On Wed, Apr 9, 2014 at 6:51 PM, Zheng Liu <[email protected]> wrote:
> > On Wed, Apr 09, 2014 at 05:14:37PM +0800, liang xie wrote:
> >> Hi,
> >>
> >> I am an Apache HDFS/HBase developer and I am debugging a slow buffered io
> >> issue on ext4. I saw some slow sys_write calls caused by:
> >> (mount -o noatime)
> >> 0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
> >> 0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
> >> 0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
> >> 0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
> >> 0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
> >> 0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
> >> 0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]
> >
> > Delalloc can obviously cause a latency spike because of i_data_sem.
> > When the flusher thread tries to write out some dirty pages, it grabs
> > i_data_sem and allocates blocks for those dirty pages. If an application
> > tries to do a buffered write at that time, i_data_sem also needs to be
> > taken, so the application has to wait on writeback.
> >
> Cool, got it :)
>
> >>
> >> This seems to be caused by delayed allocation, right? So I reran with "mount -o
> >> noatime,nodiratime,data=writeback,nodelalloc"; unfortunately, I saw
> >> another stack trace contributing high latency:
> >> 0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
> >> 0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
> >> 0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
> >> 0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
> >> 0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
> >> 0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
> >> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> >> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> >> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
> >> 0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
> >> 0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
> >> 0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
> >> 0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
> >> and from the HDFS/HBase side, no obvious improvement was found either.
> >
> > From the call trace, it seems that we are waiting to read some metadata
> > for block allocation.
> Any ideas on relieving the write stall caused by it?

This commit (9f203507ed277ee86e3f76a15e09db1c92e40b94) might be useful.

>
> >
> >>
> >> In both scenarios, the following stack trace was hit as well:
> >> 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
> >> 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> >> 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
> >> 0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
> >> 0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
> >> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> >> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> >> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
> >>
> >> My questions are:
> >> 1) What is the ext4 best practice for a low-latency, append-only workload
> >> like HBase? Are there any recommended options I could try, e.g. the
> >> flex_bg size or nomballoc?
> >
> > We do the following things in our production system in order to avoid
> > latency spikes:
> > 1. -o nodelalloc
> > 2. -o data=writeback
> > 3. disable stable page writes
>
> ok
>
> >> 2) For the last stack trace, does commit
> >> 9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big win?
> >> (I haven't run on 3.10+ so far and it's inconvenient to bump the kernel
> >> version on my cluster currently, so forgive me if this is a stupid
> >> question.)
> >
> > TBH, I don't know. But it is not very hard to backport this patch into
> > your kernel.
> >
> > BTW, as far as I understand, Hadoop just does some parallel buffered
> > append writes, right?
>
> Not exactly. Per my current understanding, the HDFS data files are always
> append-only writes, but the meta files are a different story; they may
> occasionally issue a small overwrite under special conditions.
>
> > Could you please write a simple program to reproduce this problem?
> No problem, I will do that once I get a chance.

It would be great if you could provide a simple program to reproduce the
problem, because most developers don't have a cluster to run Hadoop.

Regards,
- Zheng