Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755281AbcDKOBb (ORCPT ); Mon, 11 Apr 2016 10:01:31 -0400
Received: from mx2.suse.de ([195.135.220.15]:37403 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S932293AbcDKNXu (ORCPT ); Mon, 11 Apr 2016 09:23:50 -0400
X-Amavis-Alert: BAD HEADER SECTION, Duplicate header field: "References"
From: Jiri Slaby
To: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Michal Hocko, KAMEZAWA Hiroyuki,
        Andrew Morton, Linus Torvalds, Jiri Slaby
Subject: [PATCH 3.12 20/98] memcg: do not hang on OOM when killed by userspace OOM access to memory reserves
Date: Mon, 11 Apr 2016 15:22:22 +0200
Message-Id: <1f7e1e7f0018706fa29c752fd88a919c7e25b456.1460380917.git.jslaby@suse.cz>
X-Mailer: git-send-email 2.8.1
In-Reply-To:
References:
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3310
Lines: 90

From: Michal Hocko

3.12-stable review patch.  If anyone has any objections, please let me know.

===============

commit d8dc595ce3909fbc131bdf5ab8c9808fe624b18d upstream.

Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly.  The only way out is to

	echo 0 > $GROUP/memory.oom_control

His usecase is:

- Setup a hierarchy with memory and the freezer (disable kernel oom and
  have a process watch for oom).

- In that memory cgroup add a process with one thread per cpu.

- In one thread slowly allocate once per second I think it is 16M of ram
  and mlock and dirty it (just to force the pages into ram and stay
  there).

- When oom is achieved loop:
  * attempt to freeze all of the tasks.
  * if frozen send every task SIGKILL, unfreeze, remove the directory in
    cgroupfs.

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received a fatal signal will bypass the charge and should
continue on their way out.  The tricky part is that the exit path might
trigger a page fault (e.g. exit_robust_list), and thus a memcg charge,
while its memcg is still under OOM because nobody has released any
charges yet.

Unlike with the in-kernel OOM handler, the exiting task doesn't get
TIF_MEMDIE set, so it doesn't shortcut further charges of the killed task
and falls into the memcg OOM again without any way out of it, as there are
no fatal signals pending anymore.

This patch fixes the issue by checking PF_EXITING early in
mem_cgroup_try_charge and bypassing the charge, the same as if the task
had a fatal signal pending or TIF_MEMDIE set.

Normally exiting tasks (i.e. not killed) will bypass the charge now, but
this should be OK as the task is leaving and will release memory;
increasing the memory pressure just to release it in a moment seems a
dubious waste of cycles.  Besides that, charges after exit_signals should
be rare.

I am bringing this patch again (rebased on the current mmotm tree).  I
hope we can move forward finally.  If there is still opposition then
I would really appreciate a concurrent approach so that we can discuss
alternatives.

http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
to the followup discussion when the patch was dropped from the mmotm
last time.

Reported-by: Eric W. Biederman
Signed-off-by: Michal Hocko
Acked-by: David Rientjes
Acked-by: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Jiri Slaby
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5904fc833523..4a1559d8739f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2710,7 +2710,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * MEMDIE process.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)
-		     || fatal_signal_pending(current)))
+		     || fatal_signal_pending(current)
+		     || current->flags & PF_EXITING))
 		goto bypass;
 
 	if (unlikely(task_in_memcg_oom(current)))
-- 
2.8.1
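
For readers who want to see the changed condition in isolation, here is a
minimal userspace sketch of the bypass decision before and after the
patch.  It is not kernel code: struct model_task, MODEL_PF_EXITING,
bypass_before_patch(), bypass_after_patch() and run_examples() are
hypothetical names invented for this illustration, and the flag value is
arbitrary.  It only models the three conditions named in the changelog
(TIF_MEMDIE, a pending fatal signal, PF_EXITING).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit; it stands in for PF_EXITING, the value is made up. */
#define MODEL_PF_EXITING 0x1u

/* Hypothetical stand-in for the few bits of task state the check reads. */
struct model_task {
	unsigned int flags;        /* has MODEL_PF_EXITING once exit has begun */
	bool tif_memdie;           /* models TIF_MEMDIE (in-kernel OOM victim) */
	bool fatal_signal_pending; /* models fatal_signal_pending(current) */
};

/* Condition before the patch: exiting tasks get no special treatment. */
static bool bypass_before_patch(const struct model_task *t)
{
	return t->tif_memdie || t->fatal_signal_pending;
}

/* Condition after the patch: an exiting task also bypasses the charge. */
static bool bypass_after_patch(const struct model_task *t)
{
	return t->tif_memdie || t->fatal_signal_pending ||
	       (t->flags & MODEL_PF_EXITING) != 0;
}

static void run_examples(void)
{
	/*
	 * Task killed from userspace that faults in its exit path: the
	 * fatal signal is no longer pending and TIF_MEMDIE was never set.
	 */
	struct model_task exiting = { .flags = MODEL_PF_EXITING };

	/* Ordinary running task: must still be charged in both versions. */
	struct model_task running = { .flags = 0 };

	/* Task with SIGKILL still pending: bypassed in both versions. */
	struct model_task killed = { .fatal_signal_pending = true };

	assert(!bypass_before_patch(&exiting)); /* would wait in memcg OOM */
	assert(bypass_after_patch(&exiting));   /* now skips the charge */
	assert(!bypass_before_patch(&running));
	assert(!bypass_after_patch(&running));
	assert(bypass_before_patch(&killed));
	assert(bypass_after_patch(&killed));
}
```

Calling run_examples() exercises the three cases; only the exiting task
without a pending signal changes behaviour between the two predicates,
which is the case described in the changelog.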