Date: Wed, 29 Jul 2015 17:47:18 +0200
From: Michal Hocko
To: Vladimir Davydov
Cc: Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T,
    Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes,
    Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Message-ID: <20150729154718.GN15801@dhcp22.suse.cz>
References: <20150729123629.GI15801@dhcp22.suse.cz>
    <20150729135907.GT8100@esperanza>
    <20150729142618.GJ15801@dhcp22.suse.cz>
    <20150729152817.GV8100@esperanza>
In-Reply-To: <20150729152817.GV8100@esperanza>

On Wed 29-07-15 18:28:17, Vladimir Davydov wrote:
> On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> > On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > > [...]
> > > > > ---- USER API ----
> > > > >
> > > > > The user API consists of two new proc files:
> > > >
> > > > I was thinking about this for a while. I dislike the interface. It is
> > > > quite awkward to use - e.g. you have to read the full memory to check a
> > > > single memcg idleness.
> > > > This might turn out to be a problem especially on
> > > > large machines.
> > >
> > > Yes, with this API estimating the wss of a single memory cgroup will
> > > cost almost as much as doing this for the whole system.
> > >
> > > Come to think of it, does anyone really need to estimate idleness of one
> > > particular cgroup?
> >
> > It is certainly interesting for setting the low limit.
>
> Yes, but IMO there is no point in setting the low limit for one
> particular cgroup w/o considering what's going on with the rest of the
> system.

If you use the low limit for isolating an important load then you do not
have to care about the others that much. All you care about is to set
a reasonable protection level and let the others compete for the rest.

[...]
> > > > I would assume that most users are interested only in a single number
> > > > which tells the idleness of the system/memcg.
> > >
> > > Yes, that's what I need it for - estimating containers' wss for setting
> > > their limits accordingly.
> >
> > So why don't we export the single per memcg and global knobs then?
> > This would have a few advantages. First of all it would be much easier to
> > use, you wouldn't have to export memcg ids and finally the implementation
> > could be changed without any user visible changes (e.g. lru vs. pfn walks),
> > potential caching and who knows what. In other words. Michel had a
> > single number interface AFAIR, what was the primary reason to move away
> > from that API?
>
> Because there is too much to be taken care of in the kernel with such an
> approach and chances are high that it won't satisfy everyone. What
> should the scan period be equal to?

No, just gather the data on the read request and let userspace
decide when/how often etc. If we are clever enough we can cache
the numbers and avoid the walk. Write to the file and do the
mark_idle stuff.

> Knob. How many kthreads do we want?
> Knob.
> I want to keep history for last N intervals (this was a part of
> Michel's implementation), what should N be equal to? Knob.

This all relates to the kernel thread implementation which I wasn't
suggesting. I was referring to Michel's work which might induce that.
I was merely referring to a single number output. Sorry about the
confusion.

> I want to be
> able to choose between an instant scan and a scan distributed in time.
> Knob. I want to see stats for anon/locked/file/dirty memory separately,

Why is this useful for the memcg limits setting or the wss estimation? I
can imagine that a further breakdown of the numbers might be interesting
from the debugging POV but I fail to see what kind of decisions you
would make in userspace based on them.

[...]
> > Yes this is really tricky with the current LRU implementation. I
> > was playing with some ideas (do some checkpoints on the way) but
> > none of them was really working out on busy systems. But the LRU
> > implementation might change in the future.
>
> It might. Then we could come up with a new /proc or /sys file which
> would do the same as /proc/kpageidle, but on per LRU^w whatever-it-is
> basis, and give people a choice which one to use.

This just leads to the proc file count explosion we are seeing already...
Proc ended up as a dumping ground for different things which didn't fit
elsewhere and I am not very happy about it to be honest.

[...]
--
Michal Hocko
SUSE Labs