Date: Fri, 6 Nov 2015 11:35:55 -0500
From: Johannes Weiner
To: Vladimir Davydov
Cc: Michal Hocko, David Miller, akpm@linux-foundation.org, tj@kernel.org, netdev@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
Message-ID: <20151106163555.GB7813@cmpxchg.org>
In-Reply-To: <20151106090555.GK29259@esperanza>

On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> ...
> > > 3) keep only some (safe) cache types enabled by default with the current
> > >    failing semantic and require an explicit enabling for the complete
> > >    kmem accounting. [di]cache code paths should be quite robust to
> > >    handle allocation failures.
> >
> > Vladimir, what would be your opinion on this?
>
> I'm all for this option.
> Actually, I've been thinking about this since I
> introduced the __GFP_NOACCOUNT flag. Not because of the failing
> semantics, since we can always let kmem allocations breach the limit.
> This shouldn't be critical, because I don't think it's possible to issue
> a series of kmem allocations w/o a single user page allocation, which
> would reclaim/kill the excess.
>
> The point is there are allocations that are shared system-wide and
> therefore shouldn't go to any memcg. Most obvious examples are: mempool
> users and radix_tree/idr preloads. Accounting them to memcg is likely to
> result in noticeable memory overhead as memory cgroups are
> created/destroyed, because they pin dead memory cgroups with all their
> kmem caches, which aren't tiny.
>
> Another funny example is objects destroyed lazily for performance
> reasons, e.g. vmap_area. Such objects are usually very small, so
> delaying destruction of a bunch of them will normally go unnoticed.
> However, if kmemcg is used the effective memory consumption caused by
> such objects can be multiplied by many times due to dangling kmem
> caches.
>
> We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
> problem is they are tricky to identify, because they are scattered all
> over the kernel source tree. E.g. Dave Chinner mentioned that XFS
> internals do a lot of allocations that are shared among all XFS
> filesystems and therefore should not be accounted (BTW that's why
> list_lru's used by XFS are not marked as memcg-aware). There must be
> more out there. Besides, kernel developers don't usually even know about
> kmemcg (they just write the code for their subsys, so why should they?)
> so they won't care thinking about using __GFP_NOACCOUNT, and hence new
> falsely-accounted allocations are likely to appear.
>
> That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
> (__GFP_ACCOUNT) kmem accounting policy we would make the system more
> predictable and robust IMO.
> OTOH what would we lose? Security? Well,
> containers aren't secure IMHO. In fact, I doubt they will ever be (as
> secure as VMs). Anyway, if a runaway allocation is reported, it should
> be trivial to fix by adding __GFP_ACCOUNT where appropriate.

I wholeheartedly agree with all of this.

> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space. This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

Awesome, I'm looking forward to that patch!
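
For readers following the black-list versus white-list discussion, the white-list idea can be sketched in user space roughly as below. This is only an illustrative model, not kernel code: the `model_` names, the `MODEL_GFP_*` bit values, and the `struct model_memcg` counter are all hypothetical stand-ins, and the sketch assumes the simplest possible semantics, where an allocation is charged only if its caller opts in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bits for this model; these are NOT the kernel's gfp_t values. */
#define MODEL_GFP_KERNEL  0x1u
#define MODEL_GFP_ACCOUNT 0x2u /* opt in: charge this allocation to the group */

/* Hypothetical stand-in for a memory cgroup: just a byte counter. */
struct model_memcg {
	size_t kmem_charged;
};

/*
 * White-list policy: an allocation is charged only when the caller
 * explicitly passes MODEL_GFP_ACCOUNT. Everything else is unaccounted
 * by default, so shared, system-wide objects need no special marking.
 */
static void *model_alloc(struct model_memcg *cg, size_t size, unsigned int flags)
{
	void *p = malloc(size);

	if (p && cg && (flags & MODEL_GFP_ACCOUNT))
		cg->kmem_charged += size;
	return p;
}

/* Allocates one opted-in object and one shared object; returns bytes charged. */
static size_t model_demo(void)
{
	struct model_memcg cg = { 0 };

	/* An object user space can create at will (think task_struct): opt in. */
	void *task = model_alloc(&cg, 512, MODEL_GFP_KERNEL | MODEL_GFP_ACCOUNT);
	/* A shared pool (think mempool or preload): no flag, so not charged. */
	void *pool = model_alloc(&cg, 4096, MODEL_GFP_KERNEL);

	assert(task && pool);
	free(task);
	free(pool);
	return cg.kmem_charged;
}
```

In this model only the 512-byte opted-in object is charged; the 4096-byte shared pool never touches the counter. A black-list policy would invert the test in `model_alloc()`, which is why every shared allocation site would then have to be found and marked by hand.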