Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932899AbbFIJoQ (ORCPT ); Tue, 9 Jun 2015 05:44:16 -0400 Received: from cantor2.suse.de ([195.135.220.15]:40533 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932815AbbFIJn7 (ORCPT ); Tue, 9 Jun 2015 05:43:59 -0400 Date: Tue, 9 Jun 2015 11:43:56 +0200 From: Michal Hocko To: David Rientjes Cc: Andrew Morton , Tetsuo Handa , linux-mm@kvack.org, LKML Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured Message-ID: <20150609094356.GB29057@dhcp22.suse.cz> References: <1433159948-9912-1-git-send-email-mhocko@suse.cz> <20150605111302.GB26113@dhcp22.suse.cz> <20150608213218.GB18360@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5222 Lines: 120 On Mon 08-06-15 16:20:21, David Rientjes wrote: > On Mon, 8 Jun 2015, Michal Hocko wrote: [...] > > On Mon 08-06-15 12:51:53, David Rientjes wrote: > > Do you actually have panic_on_oops enabled? > > > > CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant. No I meant panic_on_oops > 0. > The functionality I'm referring to is that your patch now panics the > machine for configs where /proc/sys/vm/panic_on_oom is set and the same > scenario occurs as described above. You're introducing userspace breakage > because you are using panic_on_oom in a way that it hasn't been used in > the past and isn't described as working in the documentation. I am sorry, but I do not follow. The knob has been always used to _panic_ the OOM system. Nothing more and nothing less. Now you are arguing about the change being buggy because a task might be killed but that argument doesn't make much sense to me because basically _any_ other allocation which allows OOM to trigger might hit check_panic_on_oom() and panic the system well before your killed task gets a chance to terminate. I would understand your complain if we waited for oom victim(s) before check_panic_on_oom but we have not been doing that. > > > We want to send the SIGKILL, which will interrupt things like > > > > But this patch only changes the ordering of panic_on_oops vs. > > fatal_signal_pending(current) shortcut. So it can possible affect the > > case when the current is exiting during OOM. Is this the case that you > > are worried about? > > > > Yes, of course, the case specifically when the killed process is in the > exit path due to a userspace oom kill and needs access to memory reserves > to exit. That's needed because the machine is oom (typically the only > time a non-buggy userspace oom handler would kill a process). > > This: > > check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL); > ... > if (current->mm && > (fatal_signal_pending(current) || task_will_free_mem(current))) { > mark_oom_victim(current); > goto out; > } > > is obviously buggy in that regard. We need to be able to give a killed > process or an exiting process memory reserves so it may (1) allocate prior > to handling the signal and (2) be assured of exiting once it is in the > exit path. Which OOM path are you talking about here? memcg OOM user space handler killing a task? This one however doesn't go this path unless the memcg OOM is racing with the global OOM. Is this the case you are worried about? If yes then this is racy anyway and nothing to rely on as described above. If you have a global OOM detection mechanism then it is racy as well for the very same reason. User space OOM handling with panic_on_oom is simply not usable. > > The documentation clearly states: > > " > > If this is set to 1, the kernel panics when out-of-memory happens. > > However, if a process limits using nodes by mempolicy/cpusets, > > and those nodes become memory exhaustion status, one process > > may be killed by oom-killer. No panic occurs in this case. > > Because other nodes' memory may be free. This means system total status > > may be not fatal yet. > > > > If this is set to 2, the kernel panics compulsorily even on the > > above-mentioned. Even oom happens under memory cgroup, the whole > > system panics. > > > > The default value is 0. > > 1 and 2 are for failover of clustering. Please select either > > according to your policy of failover. > > " > > > > So I read this as a reliability feature to handle oom situation as soon > > as possible. > > > > A userspace process that is killed by userspace that simply needs memory > to handle the signal and exit is not oom. We have always allowed current > to have access to memory reserves to exit without triggering panic_on_oom. > This is nothing new, and is not implied by the documentation to be the > case. The documentation doesn't mention anything about exiting task or any other last minute attempt to be nice and prevent from OOM killing. And moreover the assumption that TIF_MEMDIE will help to exit the oom victim or a task with fatal_signal_pending is not true in general (and you haven't provided sound arguments yet). > I'm not going to spend all day trying to convince you that you cannot > change the semantics of sysctls that have existed for years with new > behavior especially when users require that behavior to handle userspace > kills while still keeping their machines running. Let me remind that you are trying to nack this patch and your argumentation is unclear at best. The matter seems quite simple to me. Relying on fatal_signal_pending(current) to help before check_panic_on_oom might help to prevent OOM but it is racy and cannot be relied on while not going to check_panic_on_oom might be potentially harmful, albeil unlikely, and lock up machine which is against the user defined policy to panic machine on OOM. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/