Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751556Ab3FYBj4 (ORCPT ); Mon, 24 Jun 2013 21:39:56 -0400
Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:56014 "EHLO
	fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751071Ab3FYBjy (ORCPT );
	Mon, 24 Jun 2013 21:39:54 -0400
X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4
Message-ID: <51C8F4B9.9060604@jp.fujitsu.com>
Date: Tue, 25 Jun 2013 10:39:05 +0900
From: Kamezawa Hiroyuki
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6
MIME-Version: 1.0
To: David Rientjes
CC: Michal Hocko, Andrew Morton, Johannes Weiner,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org
Subject: Re: [patch] mm, memcg: add oom killer delay
References: <20130603193147.GC23659@dhcp22.suse.cz>
	<20130604095514.GC31242@dhcp22.suse.cz>
	<20130605093937.GK15997@dhcp22.suse.cz>
	<20130610142321.GE5138@dhcp22.suse.cz>
	<20130612202348.GA17282@dhcp22.suse.cz>
	<20130613151602.GG23070@dhcp22.suse.cz>
	<51BA6A2A.3060107@jp.fujitsu.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 6077
Lines: 163

(2013/06/14 19:12), David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
>
>> Reading your discussion, I think I understand your requirements.
>> The problem is that I don't think you took all options into
>> account before concluding that this new oom_delay is the best way.
>> IOW, I'm not convinced that oom-delay is the best way to handle
>> your issue.
>>
>
> Ok, let's talk about it.
>

I'm sorry that my RTT is long these days.

>> Your requirement is
>>  - Allowing userland oom-handler within local memcg.
>>
>
> Another requirement:
>
>  - Allow userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there, we can simply duplicate all of the oom functionality that we'll be
> adding for the root memcg.
>

At the mm summit this was discussed, and people seemed to think a
userland oom handler is impossible. Hm, and in-kernel scripting was
discussed, as far as I remember.

>> Thinking about it straightforwardly, the answer should be
>>  - Allowing the oom-handler daemon to run outside of memcg's control
>>    by its limit.
>>    (For example, a flag/capability for a task can achieve this.)
>>    Or attaching some *fixed* resource to the task rather than cgroup.
>>
>>    Allow to set task->secret_saving=20M.
>>
>
> Exactly!
>
> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)
>
> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first.  It ensures that this userspace oom handler can
> actually do some work in determining which process to kill.  The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes.  This does indeed
> become more difficult with true and complete kmem charging.  That "work"
> could be opening the tasks file (which allocates the pidlist within the
> kernel), checking /proc/pid/status for rss, checking for how long a
> process has been running, checking for tid, sending a signal to drop
> caches, etc.
>

Considering only memcg, bypassing all charge-limit checks will work.
But as you say, that will not work against global oom.
That is why in-kernel scripting was discussed.

> We'd also like to do this for global oom conditions, which makes it even
> more interesting.
> I was thinking of using a fraction of memory reserves as the oom killer
> currently does (that memory below the min watermark) for these purposes.
>
> Memory charging is simply bypassed for these oom handlers (we only grant
> access to those waiting on the memory.oom_control eventfd) up to
> memory.limit_in_bytes + (min_free_kbytes / 4), for example.  I don't think
> this is entirely insane because these oom handlers should lead to future
> memory freeing, just like TIF_MEMDIE processes.
>

I think that kind of bypassing is acceptable.

>> Going back to your patch, what's confusing is your approach.
>> Why should a problem caused by the amount of memory be solved by
>> some delay, i.e. an amount of time?
>>
>> This exchange sounds confusing to me.
>>
>
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs.  I probably made
> a mistake in describing what that is addressing if it seems like it's
> trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves",

How does this happen?

> the kernel needs to kill within that memcg.  Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted?  That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace and
> I don't think we should ever allow it to be disabled.  I think we should
> allow userspace a reasonable amount of time to respond and then kill if it
> is exceeded.
>
> For the global oom case, we want to have a priority-based memcg selection.
> Select the lowest priority top-level memcg and kill within it.  If it has
> an oom notifier, send it a signal to kill something.  If it fails to
> react, kill something after memory.oom_delay_millisecs has elapsed.
> If there isn't a userspace oom notifier, kill something within that lowest
> priority memcg.
>

Someone may be against that kind of control and say "Hey, I have a
better idea". That was another reason that oom scripting was discussed.
No one can implement a general-purpose victim-selection logic.

> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely.  That's why I hate
> memory.oom_control == 1 and I think for the global notification it would
> be deemed a nonstarter since you couldn't even login to the machine.
>
>> I'm not against what you finally want to do, but I don't like the fix.
>>
>
> I'm thrilled to hear that, and I hope we can work to make userspace oom
> handling more effective.
>
> What do you think about that above?

IMHO, it will be difficult, but allowing users to write a script/filter
for oom-killing would be worth trying. Something like:

==
for_each_process :
	if comm == mem_manage_daemon :
		continue
	if user == root :
		continue
	score = default_calc_score()
	if score > high_score :
		high_score = score
		selected = current
==

BTW, if you love the logic in the userland oom daemon, why can't you
implement it in the kernel? Does it do anything fancy other than
sending SIGKILL?

Thanks,
-Kame