From: Gui Jianfeng
Date: Wed, 16 Sep 2009 08:05:41 +0800
To: Vivek Goyal
CC: jens.axboe@oracle.com, linux-kernel@vger.kernel.org,
    containers@lists.linux-foundation.org, dm-devel@redhat.com,
    nauman@google.com, dpshah@google.com, lizf@cn.fujitsu.com,
    mikew@google.com, fchecconi@gmail.com, paolo.valente@unimore.it,
    ryov@valinux.co.jp, fernando@oss.ntt.co.jp, s-uchida@ap.jp.nec.com,
    taka@valinux.co.jp, jmoyer@redhat.com, dhaval@linux.vnet.ibm.com,
    balbir@linux.vnet.ibm.com, righi.andrea@gmail.com, m-ikeda@ds.jp.nec.com,
    agk@redhat.com, akpm@linux-foundation.org, peterz@infradead.org,
    jmarchan@redhat.com, torvalds@linux-foundation.org, mingo@elte.hu,
    riel@redhat.com
Subject: Re: [PATCH] io-controller: Fix task hanging when there are more than one groups
Message-ID: <4AB02BD5.4050107@cn.fujitsu.com>
References: <1251495072-7780-1-git-send-email-vgoyal@redhat.com>
 <4AA4B905.8010801@cn.fujitsu.com> <20090908191941.GF15974@redhat.com>
 <4AA75B71.5060109@cn.fujitsu.com> <20090909150537.GD8256@redhat.com>
 <4AA9A4BE.30005@cn.fujitsu.com> <20090915033739.GA4054@redhat.com>
In-Reply-To: <20090915033739.GA4054@redhat.com>

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I was testing IO Controller V9.
>>>>>> When three tasks run concurrently in three groups, that is, one
>>>>>> task in a parent group and the other two tasks in two different
>>>>>> child groups, all reading or writing files on some disk, say
>>>>>> "hdb", a task may hang up, and any other task accessing "hdb"
>>>>>> will hang up as well.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug on my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and can't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it:
>>>>>
>>>>> - Does the whole system hang, or does only IO to the disk seem to be hung?
>>>> Only when a task is trying to do IO to the disk will it hang up.
>>>>
>>>>> - Does switching the io scheduler on the device work?
>>>> Yes, the io scheduler can be switched, and the hung task is then resumed.
>>>>
>>>>> - If the system is not hung, can you capture a blktrace on the device?
>>>>>   The trace might give some idea of what's happening.
>>>> I ran a "find" task to do some IO on that disk; it seems the task hangs
>>>> while issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find          D a1e95787  1912  3260   2897 0x00000004
>>>>  f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>>  00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>>  0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>>  [] ? getnstimeofday+0x57/0xe0
>>>>  [] ? ktime_get_ts+0x4a/0x4e
>>>>  [] io_schedule+0x47/0x79
>>>>  [] sync_buffer+0x36/0x3a
>>>>  [] __wait_on_bit+0x36/0x5d
>>>>  [] ? sync_buffer+0x0/0x3a
>>>>  [] out_of_line_wait_on_bit+0x58/0x60
>>>>  [] ? sync_buffer+0x0/0x3a
>>>>  [] ? wake_bit_function+0x0/0x43
>>>>  [] __wait_on_buffer+0x19/0x1c
>>>>  [] ext3_bread+0x5e/0x79 [ext3]
>>>>  [] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>>  [] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>>  [] ? kmem_cache_alloc+0x86/0xf3
>>>>  [] ? trace_hardirqs_on_caller+0x107/0x12f
>>>>  [] ? trace_hardirqs_on+0xb/0xd
>>>>  [] ? ext3_readdir+0x9e/0x692 [ext3]
>>>>  [] ext3_readdir+0x1ee/0x692 [ext3]
>>>>  [] ? filldir64+0x0/0xcd
>>>>  [] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>>  [] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>>  [] ? vfs_readdir+0x46/0x94
>>>>  [] vfs_readdir+0x68/0x94
>>>>  [] ? filldir64+0x0/0xcd
>>>>  [] sys_getdents64+0x5e/0x9f
>>>>  [] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>>  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer and schedules the task
>>>> out in TASK_UNINTERRUPTIBLE state, and I found the task is resumed only
>>>> after quite a long period (more than 10 minutes).
>>> Thanks Gui. As Jens said, it does look like a case of a missing queue
>>> restart somewhere, and now we are stuck: no requests are being dispatched
>>> to the disk and the queue is already unplugged.
>>>
>>> Can you please also try capturing a trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation.
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of the ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if only the root cgroup exists and no child cgroup is available,
>> io-controller optimizes by not expiring the current ioq, on the assumption
>> that the current ioq belongs to the root group. But in some cases this
>> assumption does not hold. Consider the following scenario: a child cgroup
>> lives under the root cgroup, and task A runs in that child cgroup and
>> issues some IO. Then we kill task A and remove the child cgroup; at this
>> point only the root cgroup is left, but the ioq is still under service,
>> and from now on this ioq will never expire because of the "only root"
>> optimization. The following patch ensures the ioq really does belong to
>> the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng
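
(To make the scenario above easier to follow, here is a tiny stand-alone
model of the check my patch adds; the struct and function names below are
mine for illustration, not the actual elevator-fq code.)

/* Illustrative model only -- names are made up, not the io-controller code. */
#include <stdbool.h>
#include <stdio.h>

struct io_group { const char *name; };
struct io_queue { struct io_group *group; };

struct elv_fq_data {
	struct io_group *root_group;
	int nr_groups;			/* number of groups currently present */
	struct io_queue *active_ioq;	/* ioq currently being served */
};

/* Old logic: never expire the active ioq when only the root group exists. */
static bool keep_active_ioq_old(struct elv_fq_data *d)
{
	return d->nr_groups == 1;
}

/*
 * Fixed logic: additionally require that the active ioq really belongs to
 * the root group.  An ioq left over from a deleted child group must still
 * be expired, otherwise it is served forever and other IO to the disk hangs.
 */
static bool keep_active_ioq_new(struct elv_fq_data *d)
{
	return d->nr_groups == 1 && d->active_ioq->group == d->root_group;
}

int main(void)
{
	struct io_group root = { "root" }, child = { "child" };
	struct io_queue leftover = { &child };	/* ioq from a removed child group */
	struct elv_fq_data d = { &root, 1, &leftover };

	printf("old logic keeps ioq: %d\n", keep_active_ioq_old(&d));	/* 1 -> hang */
	printf("new logic keeps ioq: %d\n", keep_active_ioq_new(&d));	/* 0 -> expire */
	return 0;
}
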
> Hi Gui,
>
> I have modified your patch a bit to improve readability. Looking at the
> issue more closely, I realized that this optimization of not expiring the
> queue can lead to other issues, like a high vdisktime in certain scenarios.
> While fixing that, I also noticed an issue of a high rate of AS queue
> expiration in certain cases which could have been avoided.
>
> Here is a patch which should fix all of that. I am still testing this patch
> to make sure that nothing is obviously broken. Will merge it if there are
> no issues.
>
> Thanks
> Vivek
>
> o Fixed the issue of not expiring the queue for single ioq schedulers.
>   Reported and fixed by Gui.
>
> o If an AS queue is not expired for a long time and suddenly somebody
>   decides to create a group and launch a job there, the old AS queue will
>   be expired with a very high value of slice used and will get a very
>   high disk time. Fix it by marking the queue as "charge_one_slice" and
>   charging the queue for only a single time slice, not for the whole
>   duration the queue was running.
>
> o There are cases where, with AS, excessive queue expiration takes place
>   in the elevator fair queuing layer, for a few reasons:
>	- AS does not anticipate on a queue if there are no competing requests.
>	  So if only a single reader is present in a group, anticipation does
>	  not get turned on.
>
>	- The elevator layer does not know that AS is anticipating, hence it
>	  initiates expiry requests in select_ioq() thinking the queue is empty.
>
>	- The elevator layer tries to aggressively expire the last empty queue.
>	  This can lead to a lot of queue expiry.
>
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
>   the queue has completed and the associated io context is eligible to
>   anticipate. Also, AS lets the elevator layer know that it is anticipating
>   (elv_ioq_wait_request()). This solves the above mentioned issues.
>
> o Moved some of the code into separate functions to improve readability.
>
> Signed-off-by: Vivek Goyal

I'd like to give this patch a try :)

--
Regards
Gui Jianfeng
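
P.S. Just to confirm my reading of the "charge_one_slice" part: on expiry,
such a long-running queue is charged at most one time slice instead of its
whole run time. A stand-alone sketch of that idea follows; the field names
and numbers are mine, not taken from the patch.

/* Illustrative model only -- not the actual patch. */
#include <stdbool.h>
#include <stdio.h>

struct io_queue {
	unsigned long slice_len;	/* nominal slice length, e.g. in ms */
	unsigned long run_time;		/* how long the queue actually ran  */
	bool charge_one_slice;		/* set when expiry was long overdue */
};

/*
 * Disk time charged to the queue's group on expiry.  Without the cap, an
 * AS queue that ran unexpired for minutes (because no other group existed)
 * would be charged all of that time, get a huge vdisktime, and then be
 * starved as soon as a new group shows up.
 */
static unsigned long slice_to_charge(const struct io_queue *q)
{
	if (q->charge_one_slice && q->run_time > q->slice_len)
		return q->slice_len;
	return q->run_time;
}

int main(void)
{
	struct io_queue q = { .slice_len = 100, .run_time = 600000,
			      .charge_one_slice = true };

	printf("charged: %lu ms\n", slice_to_charge(&q));	/* 100, not 600000 */
	return 0;
}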