Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932892Ab2JWJud (ORCPT ); Tue, 23 Oct 2012 05:50:33 -0400
Received: from cantor2.suse.de ([195.135.220.15]:48670 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S932791Ab2JWJua (ORCPT ); Tue, 23 Oct 2012 05:50:30 -0400
Date: Tue, 23 Oct 2012 11:50:28 +0200
From: Michal Hocko
To: Qiang Gao
Cc: Balbir Singh, "linux-kernel@vger.kernel.org",
        "linux-mmc@vger.kernel.org", "cgroups@vger.kernel.org",
        linux-mm@kvack.org
Subject: Re: process hangs on do_exit when oom happens
Message-ID: <20121023095028.GD15397@dhcp22.suse.cz>
References: <20121019160425.GA10175@dhcp22.suse.cz>

MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4898
Lines: 122

On Tue 23-10-12 15:18:48, Qiang Gao wrote:
> This process was moved to RT-priority queue when global oom-killer
> happened to boost the recovery of the system..

Who did that? oom killer doesn't boost the priority (scheduling class)
AFAIK.

> but it wasn't properly dealt with. I still have no idea where
> the problem is ..

Well your configuration says that there is no runtime reserved for the
group.
Please refer to Documentation/scheduler/sched-rt-group.txt for more
information.

> On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh wrote:
> > On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao wrote:
> >> information about the system is in the attached file "information.txt"
> >>
> >> I can not reproduce it in the upstream 3.6.0 kernel..
> >>
> >> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko wrote:
> >>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
> >>>> I looked up nothing useful with google, so I'm here for help..
> >>>>
> >>>> when this happens: I use memcg to limit the memory use of a
> >>>> process, and when the memcg cgroup was out of memory,
> >>>> the process was oom-killed; however, it cannot really complete the
> >>>> exiting. here is some information
> >>>
> >>> How many tasks are in the group and what kind of memory do they use?
> >>> Is it possible that you were hit by the same issue as described in
> >>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
> >>>
> >>>> OS version: centos6.2 2.6.32.220.7.1
> >>>
> >>> Your kernel is quite old and you should probably be asking your
> >>> distribution to help you out. There were many fixes since 2.6.32.
> >>> Are you able to reproduce the same issue with the current vanilla kernel?
> >>>
> >>>> /proc/pid/stack
> >>>> ---------------------------------------------------------------
> >>>>
> >>>> [] __cond_resched+0x2a/0x40
> >>>> [] unmap_vmas+0xb49/0xb70
> >>>> [] exit_mmap+0x7e/0x140
> >>>> [] mmput+0x58/0x110
> >>>> [] exit_mm+0x11d/0x160
> >>>> [] do_exit+0x1ad/0x860
> >>>> [] do_group_exit+0x41/0xb0
> >>>> [] get_signal_to_deliver+0x1e8/0x430
> >>>> [] do_notify_resume+0xf4/0x8b0
> >>>> [] int_signal+0x12/0x17
> >>>> [] 0xffffffffffffffff
> >>>
> >>> This looks strange because this is just an exit part which shouldn't
> >>> deadlock or anything. Is this stack stable? Have you tried to check
> >>> it several times?
> >
> > Looking at information.txt, I found something interesting
> >
> > rt_rq[0]:/1314
> >   .rt_nr_running : 1
> >   .rt_throttled : 1
> >   .rt_time : 0.856656
> >   .rt_runtime : 0.000000
> >
> >
> > cfs_rq[0]:/1314
> >   .exec_clock : 8738.133429
> >   .MIN_vruntime : 0.000001
> >   .min_vruntime : 8739.371271
> >   .max_vruntime : 0.000001
> >   .spread : 0.000000
> >   .spread0 : -9792.255554
> >   .nr_spread_over : 1
> >   .nr_running : 0
> >   .load : 0
> >   .load_avg : 7376.722880
> >   .load_period : 7.203830
> >   .load_contrib : 1023
> >   .load_tg : 1023
> >   .se->exec_start : 282004.715064
> >   .se->vruntime : 18435.664560
> >   .se->sum_exec_runtime : 8738.133429
> >   .se->wait_start : 0.000000
> >   .se->sleep_start : 0.000000
> >   .se->block_start : 0.000000
> >   .se->sleep_max : 0.000000
> >   .se->block_max : 0.000000
> >   .se->exec_max : 77.977054
> >   .se->slice_max : 0.000000
> >   .se->wait_max : 2.664779
> >   .se->wait_sum : 29.970575
> >   .se->wait_count : 102
> >   .se->load.weight : 2
> >
> > So 1314 is a real time process and
> >
> > cpu.rt_period_us:
> > 1000000
> > ----------------------
> > cpu.rt_runtime_us:
> > 0
> >
> > When did it move to being a Real Time process (hint: see nr_running
> > and nr_throttled)?
> >
> > Balbir

--
Michal Hocko
SUSE Labs
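
P.S. For anyone who wants to scan a dump like the quoted information.txt for the same symptom, here is a minimal sketch. It assumes the section layout matches the excerpt quoted above (an "rt_rq[cpu]:/group" header followed by ".field : value" lines); the function name and the sample text are hypothetical and only for illustration. It flags groups that have runnable RT tasks but no RT runtime assigned, which is the combination Balbir pointed out.

```python
import re

# Header of a per-CPU, per-group RT runqueue section, e.g. "rt_rq[0]:/1314".
RT_RQ_HEADER = re.compile(r"^rt_rq\[(\d+)\]:(\S+)")
# A field line inside a section, e.g. ".rt_runtime : 0.000000".
FIELD = re.compile(r"^\.(\S+)\s*:\s*(-?\d+(?:\.\d+)?)\s*$")


def find_starved_rt_groups(dump):
    """Return (cpu, group) pairs with runnable RT tasks but zero RT runtime.

    Assumes the dump layout quoted in this thread; other kernels may differ.
    """
    sections = []
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        header = RT_RQ_HEADER.match(line)
        if header:
            current = {"cpu": int(header.group(1)),
                       "group": header.group(2),
                       "fields": {}}
            sections.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current["fields"][field.group(1)] = float(field.group(2))
        elif line and not line.startswith("."):
            # Any other header (for example "cfs_rq[0]:/1314") ends the section.
            current = None
    return [
        (s["cpu"], s["group"])
        for s in sections
        if s["fields"].get("rt_nr_running", 0) > 0
        and s["fields"].get("rt_runtime", 1) == 0
    ]


# Hypothetical sample, trimmed from the values quoted above.
SAMPLE = """\
rt_rq[0]:/1314
  .rt_nr_running : 1
  .rt_throttled : 1
  .rt_time : 0.856656
  .rt_runtime : 0.000000

cfs_rq[0]:/1314
  .nr_running : 0
"""

if __name__ == "__main__":
    print(find_starved_rt_groups(SAMPLE))
```

A group reported here has an RT task queued while its rt_runtime is zero, so the task can never be scheduled until the group gets some runtime or the task leaves the RT class.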