Date: Mon, 31 Aug 2015 19:51:32 +0300
From: Vladimir Davydov
To: Tejun Heo
CC: Michal Hocko, Andrew Morton, Christoph Lameter, Pekka Enberg,
    David Rientjes, Joonsoo Kim
Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
Message-ID: <20150831165131.GD15420@esperanza>
References: <20150831132414.GG29723@dhcp22.suse.cz>
    <20150831134335.GB2271@mtj.duckdns.org>
    <20150831143007.GA13814@esperanza>
    <20150831143939.GC2271@mtj.duckdns.org>
    <20150831151814.GC13814@esperanza>
    <20150831154756.GE2271@mtj.duckdns.org>
In-Reply-To: <20150831154756.GE2271@mtj.duckdns.org>

On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> > We have to be cautious about placing memcg_charge in slab/slub. To
> > understand why, consider SLAB case, which first tries to allocate from
> > all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> > falls back on an allocation from any node w/ __GFP_WAIT. This is its
> > internal algorithm. If we blindly put memcg_charge to alloc_slab method,
> > then, when we are near the memcg limit, we will go over all NUMA nodes
> > in vain, then finally fall back to __GFP_WAIT allocation, which will get
> > a slab from a random node.
> > Not only we do more work than necessary due
> > to walking over all NUMA nodes for nothing, but we also break SLAB
> > internal logic! And you just can't fix it in memcg, because memcg knows
> > nothing about the internal logic of SLAB, how it handles NUMA nodes.
> >
> > SLUB has a different problem. It tries to avoid high-order allocations
> > if there is a risk of invoking costly memory compactor. It has nothing
> > to do with memcg, because memcg does not care if the charge is for a
> > high order page or not.
>
> Maybe I'm missing something but aren't both issues caused by memcg
> failing to provide headroom for NOWAIT allocations when the
> consumption gets close to the max limit?

That's correct.

> Regardless of the specific usage, !__GFP_WAIT means "give me memory if
> it can be spared w/o inducing direct time-consuming maintenance work"
> and the contract around it is that such requests will mostly succeed
> under nominal conditions. Also, slab/slub might not stay as the only
> user of try_charge().

Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB play such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, so memory.high gets breached and memory.max gets hit.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in the kernel invoking lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay above memory.high for a relatively long time (suppose
the process was throttled in the kernel), possibly hurting the rest of
the system.
What is worse, if the process happens to invoke a real GFP_NOWAIT
allocation when it's about to hit the limit, that allocation will fail.

If we want to allow the slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do
that). That's a way to go, but what's the point of complicating things
prematurely when it seems we can fix the problem by using a technique
similar to the one behind memory.high?

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub to blindly insert try_charge. Let me repeat the
examples of SLAB/SLUB sub-optimal behavior caused by thoughtless usage
of try_charge I gave above:

 - memcg knows nothing about NUMA nodes, so what's the point in failing
   !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
 - memcg knows nothing about high order pages, so what's the point in
   failing !__GFP_WAIT allocations used by SLUB to try to allocate a
   high order page?

Thanks,
Vladimir

> I still think solving this from memcg side is the right direction.
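
For illustration only, the accounting problem described in this message
can be sketched with a toy model. This is not kernel code: ToyMemcg,
reclaim(), run() and the limits of 50 and 80 pages are all hypothetical,
and reclaim is assumed to always succeed. It only shows that when every
charge is made w/o waiting, nothing ever pulls usage back to the high
mark, so usage climbs to max and further non-waiting charges fail.

```python
class ToyMemcg:
    """Toy model of a memory cgroup with high and max limits (hypothetical)."""

    def __init__(self, high, max_):
        self.high = high
        self.max = max_
        self.usage = 0

    def reclaim(self, target):
        # Assumption: reclaim always manages to bring usage down to target.
        self.usage = min(self.usage, max(target, 0))

    def try_charge(self, pages, can_wait):
        if self.usage + pages > self.max:
            if not can_wait:
                return False  # not allowed to reclaim, so the charge fails
            self.reclaim(self.max - pages)
        self.usage += pages
        if can_wait and self.usage > self.high:
            # Only a waiting charge can reclaim down to the high mark.
            self.reclaim(self.high)
        return True


def run(always_nowait, charges=100):
    """Charge one page at a time; return (final usage, failed charges)."""
    memcg = ToyMemcg(high=50, max_=80)
    failed = 0
    for _ in range(charges):
        if not memcg.try_charge(1, can_wait=not always_nowait):
            failed += 1
    return memcg.usage, failed


if __name__ == "__main__":
    print("waiting charges:    ", run(always_nowait=False))
    print("non-waiting charges:", run(always_nowait=True))
```

With waiting charges the model settles at the high mark (usage 50, no
failures). With only non-waiting charges it ends at max (usage 80) and
the remaining 20 charges fail, which is the situation a real GFP_NOWAIT
caller would then run into.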