Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752105Ab3FEGkn (ORCPT ); Wed, 5 Jun 2013 02:40:43 -0400 Received: from mail-pb0-f51.google.com ([209.85.160.51]:57136 "EHLO mail-pb0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752092Ab3FEGkU (ORCPT ); Wed, 5 Jun 2013 02:40:20 -0400 Date: Tue, 4 Jun 2013 23:40:16 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Michal Hocko cc: Andrew Morton , Johannes Weiner , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org Subject: Re: [patch] mm, memcg: add oom killer delay In-Reply-To: <20130604095514.GC31242@dhcp22.suse.cz> Message-ID: References: <20130530150539.GA18155@dhcp22.suse.cz> <20130531081052.GA32491@dhcp22.suse.cz> <20130531112116.GC32491@dhcp22.suse.cz> <20130601102058.GA19474@dhcp22.suse.cz> <20130603193147.GC23659@dhcp22.suse.cz> <20130604095514.GC31242@dhcp22.suse.cz> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4021 Lines: 76 On Tue, 4 Jun 2013, Michal Hocko wrote: > > I'm not sure a userspace oom notifier would want to keep a > > preallocated buffer around that is mlocked in memory for all possible > > lengths of this file. > > Well, an oom handler which allocates memory under the same restricted > memory doesn't make much sense to me. Tracking all kmem allocations > makes it almost impossible to implement a non-trivial handler. > This isn't only about oom notifiers that reside in the same oom memcg, they can be at a higher level or a different subtree entirely. The point is that they can be unresponsive either because userspace is flaky, its oom itself, or we do track all slab. The kernel is the only thing in the position to fix the issue after a sensible user-defined amount of time has elapsed, and that's what this patch is. > OK, maybe I just wasn't clear enough or I am missing your point. Your > users _can_ implement and register their oom handlers. But as your > requirements are rather benevolent for handlers implementation you would > have a global watchdog which would sit on the oom_control of those > groups (which are allowed to have own handlers - all of them in your > case I guess) and trigger (user defined/global) timeout when it gets a > notification. If the group was under oom always during the timeout then > just disable oom_control until oom is settled (under_oom is 0). > > Why wouldn't something like this work for your use case? > For the aforementioned reason that we give users the ability to manipulate their own memcg trees and userspace is untrusted. Their oom notifiers cannot be run as root, not only because of security but also because it would not properly isolate their memory usage to their memcg tree. > Hohmm, so you are insisting on something that can be implemented in the > userspace and put it into the kernel because it is more convenient for > you and your use case. This doesn't sound like a way for accepting a > feature. > I don't think you yet understand the problem, which is probably my fault. I'm not insisting this be implemented in the kernel, I'm saying it's not possible to do it in userspace. Your idea of a timeout implemented in userspace doesn't work in practice: userspace is both untrusted and cannot be guaranteed to be perfect and always wakeup, get the information it does according to its implementation, and issue a SIGKILL. This is the result of memcg allowing users to disable the oom killer entirely for a memcg, which is still ridiculous, because if the user doesn't respond then you've wasted all that memory and cannot get it back without admin intervention or a reboot. There are no other "features" in the kernel that put such a responsibility on a userspace process such that if it doesn't work then the entire memcg deadlocks forever without admin intervention. We need a failsafe in the kernel. Real users like this cannot run as root, and we cannot run in the root memcg without charging that memory usage to the user's container for that share of a global resource. Real users do have to tollerate buggy and flaky userspace implementations that cannot be guaranteed to run or do what they are supposed to do. It's too costly of a problem to not address with a failsafe. I speak strictly from experience on this. > And yes we should make memcg oom handling less deadlock prone and > Johannes' work in this thread is a good step forward. The long-term solution to that, which I already have patches for, is something you would cringe even more at: memcg memory reserves that are shared with per-zone memory reserves that get the global oom killer to kill off that process without notification in the case the memcg memory reserves cause a global oom. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/