Date: Wed, 04 Feb 2015 13:49:08 +0300
From: Konstantin Khlebnikov
To: Greg Thelen, Tejun Heo
CC: Johannes Weiner, Michal Hocko, Cgroups, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Jan Kara, Dave Chinner, Jens Axboe,
 Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma
Message-ID: <54D1F924.5000001@yandex-team.ru>

On 04.02.2015 02:30, Greg Thelen wrote:
> On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo wrote:
>> Hey,
>>
>> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>>
>>> Keeping shared inodes in the common ancestor is reasonable.
>>> We could schedule asynchronous moving when somebody opens or mmaps
>>> an inode from outside of its current cgroup. But it's not clear when
>>> an inode should be moved in the opposite direction: when an inode
>>> should become private, and how to detect that it's no longer shared.
>>>
>>> For example, each inode could keep yet another pointer to a memcg
>>> where it tracks the subtree of cgroups where it was accessed in the
>>> past 5 minutes or so. And sometimes that information goes to the
>>> moving thread.
>>>
>>> Actually I don't see other options except that time-based estimation:
>>> tracking all cgroups for each inode is too expensive, and moving pages
>>> from one lru to another is expensive too. So, moving inodes back and
>>> forth at each access from the outside world is not an option.
>>> That should be a rare operation which runs in the background or in
>>> the reclaimer.
>>
>> Right, what strategy to use for migration is up for debate, even for
>> moving to the common ancestor. e.g. should we do that on the first
>> access? In the other direction, it gets more interesting. Let's say
>> we decide to move an inode back to a descendant: what if that
>> triggers an OOM condition? Do we still go through with it and cause
>> OOM in the target? Do we even want automatic moving in this direction?
>>
>> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>>
>> Thanks.
>>
>> --
>> tejun
>
> I don't have any killer objections, most of my worries are isolation
> concerns.
>
> If a machine has several top level memcgs trying to get some form of
> isolation (using low, min, soft limit) then a shared libc will be
> moved to the root memcg where it's not protected from global memory
> pressure. At least with the current per-page accounting such shared
> pages often land in some protected memcg.
>
> If two cgroups collude they can use more memory than their limit and
> oom the entire machine.
> Admittedly the current per-page system isn't
> perfect because deleting a memcg which contains mlocked memory
> (referenced by a remote memcg) moves the mlocked memory to root,
> resulting in the same issue. But I'd argue this is more likely with
> the RFC because it doesn't involve the cgroup deletion/reparenting. A
> possible tweak to shore up the current system is to move such mlocked
> pages to the memcg of the surviving locker. When the machine is oom
> it's often nice to examine memcg state to determine which container is
> using the memory. Tracking down who's contributing to a shared
> container is non-trivial.
>
> I actually have a set of patches which add a memcg=M mount option to
> memory backed file systems. I was planning on proposing them
> regardless of this RFC, and this discussion makes them even more
> appealing. If we go in this direction, then we'd need a similar
> notion for disk based filesystems. As Konstantin suggested, it'd be
> really nice to specify charge policy on a per file, or directory, or
> bind mount basis. This allows shared files to be deterministically
> charged to a known container. We'd need to flesh out the policies:
> e.g. if two bind mounts each specify different charge targets for the
> same inode, I guess we just pick one. Though the nature of this
> catch-all shared container is strange. Presumably a machine manager
> would need to create it as an unlimited container (or at least as big
> as the sum of all shared files) so that any app which decides it wants
> to mlock all shared files has a way to do so without ooming the shared
> container. In the current per-page approach it's possible to lock
> shared libs. But the machine manager would need to decide how much
> system ram to set aside for this catch-all shared container.
>
> When there's large incidental sharing, then things get sticky. A
> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
> a small container would pull all pages to the root memcg where they
> are exposed to root pressure, which breaks isolation. This is
> concerning. Perhaps such accesses could be decorated with
> O_NO_MOVEMEM.
>
> So this RFC change will introduce significant change to user space
> machine managers and perturb isolation. Is the resulting system
> better? It's not clear, it's the devil known vs the devil unknown.
> Maybe it'd be easier if the memcgs I'm talking about were not allowed
> to share page cache (aka copy-on-read) even for files which are
> jointly visible. That would provide today's interface while avoiding
> the problematic sharing.

I think important shared data must be handled and protected explicitly.
That 'catch-all' shared container could be separated into several memory
cgroups depending on the importance of the files: glibc is protected with
a soft guarantee, while less important stuff is placed into another cgroup
and cannot push top-priority libraries out of RAM.

If shared files are free for use, then that 'shared' container must be
ready to keep them in memory. Otherwise this needs to be fixed on the
container side: we could ignore mlock for shared inodes, or the number of
such vmas could be limited on a per-container basis. But shared
responsibility for a shared file is a vague concept: the memory usage and
limit of a container must depend only on its own behavior, not on its
neighbors on the same machine.
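To make that explicit handling a bit more concrete, here is a rough
userspace sketch of how Greg's proposed memcg= mount option might be used
to pin a memory-backed filesystem to a dedicated cgroup. The "memcg="
option syntax and the cgroup path are hypothetical; nothing like this
exists in mainline, it only exists as Greg's unposted patches.

/* Hypothetical sketch: charge all page cache of this tmpfs to one memcg,
 * no matter which container touches the files. The "memcg=" mount option
 * is only a proposal, not an existing kernel interface. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("tmpfs", "/mnt/shared-libs", "tmpfs", 0,
		  "size=256m,memcg=/sys/fs/cgroup/memory/shared-libs") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}

A per-file, per-directory, or per-bind-mount variant would still need a
policy for conflicting charge targets, as Greg notes above.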
Generally, incidental sharing could be handled as temporary sharing: the
default policy (if the inode isn't pinned to a memory cgroup) should
detect after some time that the inode is no longer shared and migrate it
back into its original cgroup. Of course, a task could provide a hint:
O_NO_MOVEMEM, or even the whole memory cgroup where it runs could be
marked as a "scanner" which shouldn't disturb memory classification.

BTW, the same algorithm which determines who has used an inode recently
could tell who has been using a shared inode even if it's pinned to the
shared container.

Another cool option which could fix false sharing after scanning is
FADV_NOREUSE, which would tell the kernel to keep page-cache pages read or
written via this file descriptor out of the lru and to drop them from the
inode when the file descriptor is closed. Something like a private
per-struct-file page cache. Probably somebody has already tried that?

I've missed an obvious solution for controlling the memory cgroup for
files: the project id. This is a persistent integer id stored in the file
system. For now it's implemented only for xfs and is used for project
quota, which is orthogonal to user/group quotas. We could map some project
ids to memory cgroups. That is more flexible than a per-superblock mark
and has no conflicts like a mark on a bind-mount.

--
Konstantin
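P.S. A rough sketch of the userspace half of the project-id idea: tagging
a directory tree with a project id through the existing xfs ioctls (struct
fsxattr via XFS_IOC_FSGETXATTR/XFS_IOC_FSSETXATTR, assuming the xfsprogs
headers are installed). The project id itself is existing functionality;
mapping it to a memory cgroup would be new, hypothetical kernel-side
policy, and the path and id below are made up.

/* Tag a directory with xfs project id 42; a (hypothetical) kernel policy
 * could then charge all page cache under this project to one memcg. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* struct fsxattr, XFS_IOC_FS[GS]ETXATTR */

static int set_project_id(const char *path, unsigned int projid)
{
	struct fsxattr fsx;
	int ret = -1;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) == 0) {
		fsx.fsx_projid = projid;
		/* new files and subdirectories inherit the project id */
		fsx.fsx_xflags |= XFS_XFLAG_PROJINHERIT;
		ret = ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
	}
	close(fd);
	return ret;
}

int main(void)
{
	if (set_project_id("/srv/shared-libs", 42) < 0) {
		perror("set_project_id");
		return 1;
	}
	return 0;
}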