Date: Thu, 30 Jan 2014 13:30:44 +0100
From: Michal Hocko
To: Greg Thelen
Cc: linux-mm@kvack.org, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro, Tejun Heo
Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
Message-ID: <20140130123044.GB13509@dhcp22.suse.cz>
References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 29-01-14 11:08:46, Greg Thelen wrote:
[...]
> The series looks useful.  We (Google) have been using something similar.
> In practice such a low_limit (or memory guarantee), doesn't nest very
> well.
>
> Example:
>   - parent_memcg: limit 500, low_limit 500, usage 500
>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>   - child_memcg: limit 500, low_limit 500, usage 499

I am not sure this is a good example. Your setup basically says that no
single page should be reclaimed. I can imagine this might be useful in
some cases and I would like to allow it, but it sounds too extreme (e.g.
a load which would start thrashing heavily once the reclaim starts, so
that it makes more sense to start it again rather than crawl - think
about some mathematical simulation which might diverge).

> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> page cache it will lead to an oom kill instead of reclaiming.

Does it make any sense to protect all of such memory although it is
easily reclaimable?
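
To make the numbers in the example concrete, here is a toy Python model
(not kernel code). It assumes a group is protected as long as its
hierarchical usage does not exceed its low_limit, and that a protected
ancestor also protects its descendants. The Memcg class and the
reclaim_eligible() helper are hypothetical names used only for this
sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memcg:
    # Toy stand-in for a memory cgroup; all values are in pages.
    name: str
    limit: int
    low_limit: int
    usage: int  # hierarchical usage (includes children)
    parent: Optional["Memcg"] = None


def reclaim_eligible(group: Memcg, root: Memcg) -> bool:
    """Assumed rule: a group may be reclaimed only if it and every
    ancestor up to the reclaim root are above their low_limit."""
    node = group
    while node is not None:
        if node.usage <= node.low_limit:
            return False  # still within its guarantee
        if node is root:
            break
        node = node.parent
    return True


# The setup from the quoted example.
parent = Memcg("parent_memcg", limit=500, low_limit=500, usage=500)
child = Memcg("child_memcg", limit=500, low_limit=500, usage=499,
              parent=parent)

eligible = [g.name for g in (parent, child) if reclaim_eligible(g, parent)]
print(eligible)  # []
```

With both low limits equal to the hard limit, no group is ever above its
guarantee, so reclaim from parent_memcg finds nothing to scan and the
only remaining way to make progress is the OOM killer.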
> One could
> argue that this is working as intended because child_memcg was promised
> 500 but can only get 499.  So child_memcg is oom killed rather than
> being forced to operate below its promised low limit.
>
> This has led to various internal workarounds like:
> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>   only charge memory to cgroup leafs.  This gets tricky when dealing
>   with reparented memory inherited to parent from child during cgroup
>   deletion.

Do those need any protection at all?

> - don't set low_limit on non leafs (e.g. do not set low limit on
>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>   customers want to purchase $MEM and setup their workload with a few
>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>   for top-level containers (e.g. parent_memcg).  Thereafter such
>   customers are able to partition their workload with sub memcg below
>   child_memcg.  Example:
>      parent_memcg
>          \
>           child_memcg
>             /     \
>         server   backup

I think that the low_limit makes sense where you actually want to
protect something from reclaim. And backup sounds like a bad fit for
that.

> Thereafter customers often want some weak isolation between server and
> backup.  To avoid undesired oom kills the server/backup isolation is
> provided with a softer memory guarantee (e.g. soft_limit).  The soft
> limit acts like the low_limit until priority becomes desperate.

Johannes was already suggesting that the low_limit should allow for a
weaker semantic as well. I am not very much inclined to that but I can
live with a knob which would say oom_on_lowlimit (on by default but
allowed to be set to 0). We would fall back to the full reclaim if no
groups turn out to be reclaimable.
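
The intended behaviour of such a knob could be sketched roughly as
follows. This is a toy Python model, not kernel code:
pick_reclaim_targets() and the (name, usage, low_limit) tuples are
hypothetical, the usage numbers are invented for illustration, and
"full reclaim" is modelled simply as ignoring the low limits.

```python
from typing import List, Tuple

Group = Tuple[str, int, int]  # (name, usage, low_limit), in pages


def pick_reclaim_targets(groups: List[Group],
                         oom_on_lowlimit: bool = True
                         ) -> Tuple[List[str], bool]:
    """Return (groups to reclaim from, whether to invoke the OOM killer)."""
    # First pass: only groups above their low_limit are candidates.
    eligible = [name for name, usage, low in groups if usage > low]
    if eligible:
        return eligible, False

    # Nothing is reclaimable without breaking a guarantee.
    if oom_on_lowlimit:
        return [], True  # strict semantic: OOM rather than break it

    # Weaker semantic: fall back to full reclaim, ignoring low limits.
    return [name for name, usage, _ in groups if usage > 0], False


# Hypothetical numbers: both groups sit within their guarantee.
groups = [("server", 300, 300), ("backup", 199, 200)]

strict = pick_reclaim_targets(groups)
relaxed = pick_reclaim_targets(groups, oom_on_lowlimit=False)
```

With the default setting the result is an OOM; with the knob set to 0
both groups become reclaim targets again, which is the weaker guarantee
discussed above.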
-- 
Michal Hocko
SUSE Labs