From: Jan Kara
Subject: Re: Question about slow buffered io
Date: Wed, 9 Apr 2014 11:38:55 +0200
Message-ID: <20140409093855.GC32103@quack.suse.cz>
To: liang xie
Cc: linux-ext4

  Hello,

On Wed 09-04-14 17:14:37, liang xie wrote:
> I am an Apache HDFS/HBase developer debugging a slow buffered IO issue
> on ext4. I saw some slow sys_write calls caused by:
> (mount -o noatime)
> 0xffffffff814ed1c3 : io_schedule+0x73/0xc0 [kernel]
> 0xffffffff81110b4d : sync_page+0x3d/0x50 [kernel]
> 0xffffffff814eda2a : __wait_on_bit_lock+0x5a/0xc0 [kernel]
> 0xffffffff81110ae7 : __lock_page+0x67/0x70 [kernel]
> 0xffffffff81111abc : find_lock_page+0x4c/0x80 [kernel]
> 0xffffffff81111b3a : grab_cache_page_write_begin+0x4a/0xc0 [kernel]
> 0xffffffffa00d05d4 : ext4_da_write_begin+0xb4/0x200 [ext4]
>
> This seems to be caused by delayed allocation, right? So I reran with
> "mount -o noatime,nodiratime,data=writeback,nodelalloc"; unfortunately,
> I saw another stack trace contributing high latency:
> 0xffffffff811a9416 : __wait_on_buffer+0x26/0x30 [kernel]
> 0xffffffffa0123564 : ext4_mb_init_cache+0x234/0x9f0 [ext4]
> 0xffffffffa0123e3e : ext4_mb_init_group+0x11e/0x210 [ext4]
> 0xffffffffa0123ffd : ext4_mb_good_group+0xcd/0x110 [ext4]
> 0xffffffffa01276eb : ext4_mb_regular_allocator+0x19b/0x410 [ext4]
> 0xffffffffa0127ced : ext4_mb_new_blocks+0x38d/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
> 0xffffffff811ab27b : __block_prepare_write+0x1db/0x570 [kernel]
> 0xffffffff811ab8cc : block_write_begin_newtrunc+0x5c/0xd0 [kernel]
> 0xffffffff811abcd3 : block_write_begin+0x43/0x90 [kernel]
> 0xffffffffa00fe408 : ext4_write_begin+0x1b8/0x2d0 [ext4]
> and no obvious improvement was seen on the HDFS/HBase side either.
>
> In both scenarios the following stack trace was also hit:
> 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
> 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
> 0xffffffffa01209ba : ext4_mb_mark_diskspace_used+0x7a/0x300 [ext4]
> 0xffffffffa0127c09 : ext4_mb_new_blocks+0x2a9/0x560 [ext4]
> 0xffffffffa011dfc3 : ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
> 0xffffffffa00fb335 : ext4_get_blocks+0xf5/0x2a0 [ext4]
> 0xffffffffa00fbdad : ext4_get_block+0xbd/0x120 [ext4]
>
> My questions are:
> 1) What is the ext4 best practice for a low-latency, append-only
> workload like the HBase application? Is there any recommended option I
> could try, e.g. flex_bg size or nomballoc?
> 2) For the last stack trace, does
> 9f203507ed277ee86e3f76a15e09db1c92e40b94 help a lot, or is it no big
> win? (I haven't run on 3.10+ so far and it's inconvenient to bump the
> kernel version on my cluster currently, so forgive this question if
> it's...)
>
> PS: My current kernel is 2.6.32-220

This kernel is way too old and ext4 at that time was a lot different
from what it is now. Also I'm not sure what the -220 suffix means; it
suggests that you carry additional patches on top of stock 2.6.32, which
makes any suggestions even harder.
So I'm afraid we cannot help you much. From the traces it seems to me
that the processes are waiting for IO to complete. You might want to try
finding out why the IO takes so long to complete. Maybe it's an IO
scheduler issue?

								Honza
--
Jan Kara
SUSE Labs, CR
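
A quick way to follow up on the IO scheduler hint is to check which
scheduler the backing device is actually using: the name shown in square
brackets in /sys/block/<dev>/queue/scheduler is the active one. The small
C sketch below only illustrates reading that file; the device name "sda"
is an assumption and should be replaced with whatever device holds the
ext4 filesystem.

  /*
   * Illustrative sketch: print the IO schedulers known to a block device.
   * The one currently in use appears in square brackets, e.g. [cfq].
   * "sda" is an assumed device name, not taken from the thread.
   */
  #include <stdio.h>

  int main(void)
  {
          const char *path = "/sys/block/sda/queue/scheduler"; /* assumed device */
          char line[256];

          FILE *f = fopen(path, "r");
          if (!f) {
                  perror("fopen");
                  return 1;
          }
          if (fgets(line, sizeof(line), f))
                  printf("IO schedulers for sda: %s", line);
          fclose(f);
          return 0;
  }

Writing another scheduler name (e.g. "deadline") into the same sysfs file
switches the scheduler at runtime, which is a cheap experiment for
latency-sensitive workloads, though it may or may not turn out to be the
culprit here.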
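
Separately from anything suggested in the thread itself, the
allocation-heavy traces above (ext4_mb_new_blocks / ext4_ext_get_blocks
inside the write path) are sometimes mitigated by preallocating file
space, so that appends land in already-reserved extents instead of
triggering block allocation on the write path. The sketch below is only
an illustration of that idea; the file path and sizes are made up, it
assumes the application can manage its own files, and behavior on a
kernel as old as 2.6.32 may differ.

  /*
   * Illustrative sketch: reserve space for an append-only file with
   * fallocate(FALLOC_FL_KEEP_SIZE) so later appends write into
   * already-reserved extents. Path and sizes are hypothetical.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          const char *path = "/data/hbase/wal.log";       /* hypothetical path */
          const off_t prealloc_bytes = 64 * 1024 * 1024;  /* 64 MiB chunk */

          int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /*
           * FALLOC_FL_KEEP_SIZE reserves blocks without changing i_size,
           * so readers still see only the data actually written.
           */
          if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, prealloc_bytes) < 0) {
                  perror("fallocate"); /* may be unsupported on old kernels/filesystems */
                  close(fd);
                  return 1;
          }

          const char buf[] = "append-only record\n";
          if (write(fd, buf, sizeof(buf) - 1) < 0)
                  perror("write");

          close(fd);
          return 0;
  }

Whether this is applicable to HDFS/HBase depends on how the application
layer creates and writes its data and WAL files, rather than on any ext4
mount option.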