From: Greg Thelen
To: Tejun Heo
Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jan Kara,
 Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan, Hugh Dickins
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma
In-reply-to: <20150204170656.GA18858@htj.dyndns.org>
Date: Wed, 04 Feb 2015 15:51:01 -0800

On Wed, Feb 04 2015, Tejun Heo wrote:

> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top level memcg trying to get some form of
>> isolation (using low, min, soft limit) then a shared libc will be
>> moved to the root memcg where it's not protected from global memory
>> pressure.  At least with the current per-page accounting such shared
>> pages often land in some protected memcg.
>
> Yes, it becomes interesting with the low limit as the pressure
> direction is reversed, but at the same time overcommitting low limits
> doesn't lead to a sane setup to begin with as it's asking for global
> OOMs anyway, which means that things like libc would end up competing
> at least fairly with other pages for global pressure and should stay
> in memory under most circumstances, which may or may not be
> sufficient.

I agree.  One clarification: I don't plan to overcommit low or min
limits.  On machines without overcommitted min limits the existing
system offers shared libs some protection from global reclaim;
pushing them to root doesn't.

> Hmm.... need to think more about it but this only becomes a problem
> with the root cgroup because it doesn't have a min setting which is
> expected to be inclusive of all descendants, right?  Maybe the right
> thing to do here is treating the inodes which get pushed to the root
> as a special case and we can implement a mechanism where the root is
> effectively borrowing from the mins of its children, which doesn't
> have to be completely correct - e.g. just charge it against all
> children repeatedly and if any has min protection, put it under min
> protection.  IOW, make it the baseload for all of them.

I think the linux-next low (and the TBD min) limits also have this
problem for more than just the root memcg.  Consider a 2M file shared
between C and D below; the file will be charged to the common parent B.

    A
    +-B    (usage=2M lim=3M min=2M)
      +-C  (usage=0  lim=2M min=1M shared_usage=2M)
      +-D  (usage=0  lim=2M min=1M shared_usage=2M)
      \-E  (usage=0  lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private
reclaimable file data.  This pushes A/B into reclaim, which will
reclaim both the shared file from A/B and the private file from A/B/E.
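To make the failure concrete, here is a toy sketch of the arithmetic
(user-space Python, illustrative only; it assumes a cgroup's effective
min protection is capped at its actual usage, roughly min(usage, min),
and picks 1.5M as an arbitrary "more than 1M" allocation):

    # Units are MB, matching the diagram above.
    def min_protected(usage, min_limit):
        # Assumed rule: min can only shield pages actually charged to
        # the cgroup (or its subtree).
        return min(usage, min_limit)

    # Inode-granular charging puts the 2M shared file in B itself, so
    # C's and D's min settings shield none of it:
    assert min_protected(usage=0, min_limit=1) == 0   # C
    assert min_protected(usage=0, min_limit=1) == 0   # D

    # Once E adds >1M of private file data, B exceeds its 3M limit and
    # reclaim at B scans the shared file like any other unprotected page:
    assert 2 + 1.5 > 3   # shared file + E's private data vs. B's limit

    # Per-page charging would have landed the file in C (or D) instead,
    # where min shields at least part of it:
    assert min_protected(usage=2, min_limit=1) == 1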
In contrast, the current per-page memcg would've protected the shared
file in either C or D, leaving A/B reclaim to only attack A/B/E.

Pinning the shared file to either C or D, using a TBD policy such as a
mount option, would solve this for tightly shared files.  But for a
wide-fanout file (libc) the admin would need to assign a global bucket,
and this would be a pain to size due to various job requirements.

>> If two cgroups collude they can use more memory than their limit and
>> oom the entire machine.  Admittedly the current per-page system isn't
>> perfect because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root
>> resulting in the same issue.  But I'd argue this is more likely with
>
> Hmmm... why does it do that?  Can you point me to where it's
> happening?

My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better.
Instead of reparenting the mlocked memory, it's left in an invisible
(offline) memcg.  Unlike older kernels, the memory doesn't appear in
root/memory.stat[unevictable]; instead it's buried in
root/memory.stat[total_unevictable], which includes mlocked memory in
visible (online) and invisible (offline) children.

>> the RFC because it doesn't involve the cgroup deletion/reparenting.  A
>
> One approach could be expanding on the aforementioned scheme and
> making all sharing cgroups get charged for the shared inodes they're
> using, which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
>     A (usage=48M)
>     +-B (usage=16M)
>     \-C (usage=32M)
>
> And let's say C starts accessing an inode which is 8M and currently
> associated with B.
>
>     A (usage=48M, hosted= 8M)
>     +-B (usage= 8M, shared= 8M)
>     \-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charging C with an
> extra 8M.  Let's say another cgroup D gets created and uses 4M.
>
>     A (usage=56M, hosted= 8M)
>     +-B (usage= 8M, shared= 8M)
>     +-C (usage=32M, shared= 8M)
>     \-D (usage= 8M)
>
> and it also accesses the inode.
>
>     A (usage=56M, hosted= 8M)
>     +-B (usage= 8M, shared= 8M)
>     +-C (usage=32M, shared= 8M)
>     \-D (usage= 8M, shared= 8M)
>
> We'd need to track the shared charges separately as they should count
> only once in the parent, but that shouldn't be too hard.  The problem
> here is that we'd need to track which inodes are being accessed by
> which children, which can get painful for things like libc.  Maybe we
> can limit it to be level-by-level - track sharing only from the
> immediate children and always move a shared inode one level at a
> time.  That would lose some ability to track the sharing beyond the
> immediate children, but it should be enough to solve the root case
> and allow us to adapt to changing usage patterns over time.  Given
> that sharing is mostly a corner case, this could be good enough.
>
> Now, if D accesses a 4M area of the inode which hasn't been accessed
> by others yet, we'd want it to look like the following.
>
>     A (usage=64M, hosted=16M)
>     +-B (usage= 8M, shared=16M)
>     +-C (usage=32M, shared=16M)
>     \-D (usage= 8M, shared=16M)
>
> But charging it to B and C at the same time prolly wouldn't be
> particularly convenient.  We can prolly just do D -> A charging and
> let B and C sort themselves out later.  Note that such charging would
> still maintain the overall integrity of memory limits.  The only
> thing which may overflow is the pseudo shared charges to keep sharing
> in check, and dealing with them later when B and C try to create
> further charges should be completely fine.
>
> Note that we can also try to split the shared charge across the
> users; however, charging the full amount seems like the better
> approach to me.  We don't have any way to tell how the usage is
> distributed anyway.  For use cases where this sort of sharing is
> expected, I think it's perfectly reasonable to provision the sharing
> children to have enough to accommodate the possible full size of the
> shared resource.
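To make the bookkeeping concrete, here is a toy user-space model of
that scheme (Python, illustrative only: the Memcg class, share_inode()
and hier_usage() are made-up names for this sketch, not kernel
structures).  It reproduces the first three diagrams above:

    M = 1 << 20  # 1M in bytes

    class Memcg:
        def __init__(self, name, parent=None):
            self.name, self.parent = name, parent
            self.usage = 0    # private charges
            self.hosted = 0   # shared inodes re-homed here from children
            self.shared = 0   # pseudo charge for shared inodes used below

    def share_inode(size, owner, new_user, host):
        # new_user starts using an inode currently charged to owner:
        # re-home the inode's charge to the common ancestor `host` and
        # give every user a pseudo `shared` charge of the full size.
        owner.usage -= size
        host.hosted += size
        owner.shared += size
        new_user.shared += size

    def hier_usage(cg, children):
        # As in the diagrams: a hosted inode counts once in the parent,
        # children's private usage sums normally.
        return cg.usage + cg.hosted + sum(c.usage for c in children)

    a = Memcg("A"); b = Memcg("B", a); c = Memcg("C", a)
    b.usage, c.usage = 16 * M, 32 * M
    share_inode(8 * M, owner=b, new_user=c, host=a)
    assert (b.usage, b.shared) == (8 * M, 8 * M)  # B (usage= 8M, shared= 8M)
    assert hier_usage(a, [b, c]) == 48 * M        # A (usage=48M, hosted= 8M)

    d = Memcg("D", a); d.usage = 8 * M            # D per the third diagram
    d.shared += 8 * M                             # D also accesses the inode
    assert hier_usage(a, [b, c, d]) == 56 * M     # A (usage=56M, hosted= 8M)

Each sharer carries the full pseudo charge against its own limit while
the parent counts the inode only once via hosted, which is what makes
the collusion pointless: no combination of sharers escapes what the
parent already accounts for.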
>> possible tweak to shore up the current system is to move such mlocked
>> pages to the memcg of the surviving locker.  When the machine is oom
>> it's often nice to examine memcg state to determine which container
>> is using the memory.  Tracking down who's contributing to a shared
>> container is non-trivial.
>>
>> I actually have a set of patches which add a memcg=M mount option to
>> memory backed file systems.  I was planning on proposing them,
>> regardless of this RFC, and this discussion makes them even more
>> appealing.  If we go in this direction, then we'd need a similar
>> notion for disk based filesystems.  As Konstantin suggested, it'd be
>> really nice to specify charge policy on a per-file, per-directory, or
>> per-bind-mount basis.  This allows shared files to be deterministically
>
> I'm not too sure about that.  We might add that later if absolutely
> justifiable, but designing assuming that level of intervention from
> userland may not be such a good idea.
>
>> When there's large incidental sharing, things get sticky.  A
>> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
>> a small container would pull all pages to the root memcg where they
>> are exposed to root pressure, which breaks isolation.  This is
>> concerning.  Perhaps such accesses could be decorated with
>> (O_NO_MOVEMEM).
>
> If such a thing is really necessary, FADV_NOREUSE would be a better
> indicator; however, yes, such incidental sharing is easier to handle
> with a per-page scheme as such a scanner can be limited in the number
> of pages it can carry throughout its operation regardless of which
> cgroup it's looking at.  It still has the nasty corner case where
> random target cgroups can latch onto pages faulted in by the scanner
> and keep accessing them though, so, even now, FADV_NOREUSE would be a
> good idea.  Note that such scanning, if repeated on cgroups under
> high memory pressure, is *likely* to accumulate residue escaped
> pages, and if such a management cgroup is transient, those escaped
> pages will accumulate over time outside any limit in a way which is
> unpredictable and invisible.
>
>> So this RFC change will introduce significant changes to user space
>> machine managers and perturb isolation.  Is the resulting system
>> better?  It's not clear: it's the devil known vs the devil unknown.
>> Maybe it'd be easier if the memcgs I'm talking about were not allowed
>> to share page cache (aka copy-on-read) even for files which are
>> jointly visible.  That would provide today's interface while avoiding
>> the problematic sharing.
>
> Yeah, compatibility would be the stickiest part.
>
> Thanks.
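As a coda to the FADV_NOREUSE point above, a minimal user-space sketch
of a well-behaved scanner (Python; os.posix_fadvise() and the
POSIX_FADV_* constants are real interfaces, though Linux has
historically treated POSIX_FADV_NOREUSE as a no-op, so the
POSIX_FADV_DONTNEED call does the actual work here):

    import os

    def scan_file(path, chunk=1 << 20):
        # Read a file once (virus scan, grep -r) without trying to keep
        # its pages cached -- or charged -- on the scanner's behalf.
        fd = os.open(path, os.O_RDONLY)
        try:
            # Declare up front that we won't reuse these pages.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
            while True:
                data = os.read(fd, chunk)
                if not data:
                    break
                # ... examine data ...
            # Ask the kernel to drop the scanned range from the page
            # cache so the pages don't migrate toward the scanner's
            # (or the root) memcg.  Note this also evicts the range
            # for any concurrent users of the file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)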