Date: Thu, 11 Jun 2015 16:18:13 +0200
From: Michal Hocko
To: Tetsuo Handa
Cc: linux-mm@kvack.org, rientjes@google.com, hannes@cmpxchg.org, tj@kernel.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] panic_on_oom_timeout
Message-ID: <20150611141813.GA14088@dhcp22.suse.cz>
References: <20150609170310.GA8990@dhcp22.suse.cz> <201506102120.FEC87595.OQSJLOVtMFOHFF@I-love.SAKURA.ne.jp> <20150610142801.GD4501@dhcp22.suse.cz> <201506112212.JAG26531.FLSVFMOQJOtOHF@I-love.SAKURA.ne.jp>
In-Reply-To: <201506112212.JAG26531.FLSVFMOQJOtOHF@I-love.SAKURA.ne.jp>

On Thu 11-06-15 22:12:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > > The moom_work used by SysRq-f sometimes cannot be executed
> > > because some work which is processed before the moom_work is processed is
> > > stalled for unbounded amount of time due to looping inside the memory
> > > allocator.
> >
> > Wouldn't wq code pick up another worker thread to execute the work.
> > There is also a rescuer thread as the last resort AFAIR.
> >
>
> Below is an example of moom_work lockup in v4.1-rc7 from
> http://I-love.SAKURA.ne.jp/tmp/serial-20150611.txt.xz
>
> ----------
> [ 171.710406] sysrq: SysRq : Manual OOM execution
> [ 171.720193] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> [ 171.722699] kworker/2:9 cpuset=/ mems_allowed=0
> [ 171.724603] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
> [ 171.726817] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [ 171.729727] Workqueue: events moom_callback
> (...snipped...)
> [ 258.302016] sysrq: SysRq : Manual OOM execution

Wow, this is a _lot_. I was aware that workqueues might be overloaded.
We have seen that in real loads and that led to
http://marc.info/?l=linux-kernel&m=141456398425553 where the rescuer
didn't handle pending work properly. I thought that the fix helped in
the end. But 1.5 minutes is indeed unexpected for me.

This of course disqualifies DELAYED_WORK for anything that has at least
reasonable time expectations, which is the case here.

[...]

> I think that the basic rule for handling OOM condition is that "Do not trust
> anybody, for even the kswapd and rescuer threads can be waiting for lock or
> allocation. Decide when to give up at administrator's own risk, and choose
> another mm struct or call panic()".

Yes, I agree with this, and checking the timeout at select_bad_process
time sounds much more plausible in that regard.

Thanks!
-- 
Michal Hocko
SUSE Labs
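
For illustration only, the idea of checking the timeout synchronously at
victim-selection time, instead of arming a delayed work item that may never
get to run, could be sketched in user space roughly as below. This is not the
patch under discussion and not kernel code: struct oom_state,
oom_timeout_expired() and oom_resolved() are hypothetical names invented for
this sketch, and time is passed in as plain seconds rather than jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping; these names do not come from the kernel. */
struct oom_state {
	bool in_progress;     /* an OOM condition is still unresolved */
	unsigned long start;  /* time (seconds) at which it was first seen */
};

/*
 * Meant to be called directly from the (simulated) victim-selection path,
 * so the check does not depend on any worker thread making progress.
 *
 * The first call for a new OOM condition records the start time and
 * returns false. Later calls return true once the condition has lasted
 * at least `timeout` seconds. A timeout of 0 disables the check.
 */
bool oom_timeout_expired(struct oom_state *st, unsigned long now,
			 unsigned long timeout)
{
	assert(st != NULL);

	if (!st->in_progress) {
		st->in_progress = true;
		st->start = now;
		return false;
	}
	return timeout != 0 && now - st->start >= timeout;
}

/* Called when the OOM condition is resolved, so the next one starts fresh. */
void oom_resolved(struct oom_state *st)
{
	assert(st != NULL);
	st->in_progress = false;
}
```

Fed with the two SysRq-f timestamps from the quoted log (rounded down to 171
and 258 seconds) and an assumed timeout of 60 seconds, the first call records
the start and the second one reports the timeout as expired. A caller would
then decide whether to pick another victim or panic.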