Date: Wed, 9 Jan 2019 18:10:21 +0100
From: Michal Hocko
To: Kirill Tkhai
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, josef@toxicpanda.com, jack@suse.cz, hughd@google.com, darrick.wong@oracle.com, aryabinin@virtuozzo.com, guro@fb.com, mgorman@techsingularity.net, shakeelb@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction
Message-ID: <20190109171021.GY31793@dhcp22.suse.cz>
References: <154703479840.32690.6504699919905946726.stgit@localhost.localdomain> <20190109141113.GW31793@dhcp22.suse.cz>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 09-01-19
18:43:05, Kirill Tkhai wrote:
> Hi, Michal,
>
> On 09.01.2019 17:11, Michal Hocko wrote:
> > On Wed 09-01-19 15:20:18, Kirill Tkhai wrote:
> >> On nodes without memory overcommit, it's common a situation,
> >> when memcg exceeds its limit and pages from pagecache are
> >> shrinked on reclaim, while node has a lot of free memory.
> >
> > Yes, that is the semantic of the hard limit. If the system is not
> > overcommitted then the hard limit can be used to prevent unexpected
> > direct reclaim from unrelated activity.
>
> According to Documentation/admin-guide/cgroup-v2.rst:
>
>   memory.max
>         Memory usage hard limit. This is the final protection
>         mechanism. If a cgroup's memory usage reaches this limit and
>         can't be reduced, the OOM killer is invoked in the cgroup.
>         Under certain circumstances, the usage may go over the limit
>         temporarily.
>
> There is nothing about direct reclaim in another memcg. I don't think
> we break something here.

Others in the thread have pointed that out already. What is a hard limit
for one memcg is isolation protection for another one, especially when
the system is not overcommitted.

> File pages are accounted to memcg, and this guarantees, that single
> memcg won't occupy all system memory by its unevictable page cache.
> But the suggested patchset follows the same way. Pages, which remain
> in pagecache, are easy-to-be-evicted, since they are not dirty and
> not under writeback. System can drop them fast and in foreseeable time.
> This is cardinal thing about the patchset: remained pages do not
> introduce principal burden on system memory or reclaim time.

What ensures that the page cache is easily reclaimable, i.e. clean and
ready to be dropped? Not to mention that even when the reclaim is fast
it is not free, especially when you do not expect it because you
haven't reached your hard limit and the admin made sure that hard
limits do not overcommit.

[...]
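To make "hard limits do not overcommit" concrete, here is a minimal Python sketch that is not part of the original thread. It only assumes the cgroup v2 convention that a memory.max file holds either a byte count or the literal string "max"; the helper names parse_memory_max and hard_limits_overcommit, and the example numbers, are hypothetical.

```python
from typing import Iterable, Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse the contents of a cgroup v2 memory.max file.

    Returns the limit in bytes, or None for the literal "max" (no limit).
    """
    value = text.strip()
    return None if value == "max" else int(value)


def hard_limits_overcommit(limits: Iterable[Optional[int]],
                           total_bytes: int) -> bool:
    """Return True if the hard limits together can exceed total_bytes.

    A single unlimited group (None) already counts as overcommit,
    because nothing stops it from filling all of memory.
    """
    total = 0
    for limit in limits:
        if limit is None:
            return True
        total += limit
    return total > total_bytes


if __name__ == "__main__":
    GiB = 1 << 30
    # Hypothetical node: 16 GiB of RAM, three containers limited to
    # 4 GiB, 4 GiB and 2 GiB (values as they would appear in memory.max).
    raw = ("4294967296\n", "4294967296\n", "2147483648\n")
    limits = [parse_memory_max(s) for s in raw]
    print(hard_limits_overcommit(limits, 16 * GiB))  # False
```

In the non-overcommitted case printed above, every group can sit at its hard limit without the node running out of memory, which is the situation where a hard limit doubles as isolation from reclaim caused by other groups.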
> > But this also means that any hard limited memcg can fill up all the
> > memory and break the above assumption about the isolation from direct
> > reclaim. Not to mention the OOM or is there anything you do anything
> > about preventing that?
>
> This is discussed thing. We may add such the pages into tail of LRU list
> instead of head. We may introduce one more separate list to link such
> the pages only, and fastly evict them in case of global reclaim. I don't
> think there is a problem.
>
> > That being said, I do not think we want to or even can change the
> > semantic of the hard limit and break existing setups.
>
> Using the original description and the comments I gave in this message,
> could you please to clarify the way we break existing setups?

Isolation, as explained above.

> > I am still
> > interested to hear more about more detailed/specific usecases that might
> > benefit from this behavior. Why do those users even use hard limit at
> > all? To protect from anon memory leaks?
>
> In multi-user machine people want to have size of available to container
> memory equal to the size, which they pay. So, hard limit is needed to prevent
> one container to occupy all system memory via slowly-evictable writeback
> pages, unevictable anon pages, etc. You can't fastly allocate a page,
> in case of many pages are under writeback, this operation is very slow.
>
> (But unmapped pagecache pages introduced by patchset is another thing:
> you just need to take not sleeping spinlock to call __delete_from_page_cache()
> only. This is fast)
>
> Multi-user machine may have more memory, than sum of all containers hard
> limit. This may be used as an optimization just to reduce disk IO. There
> is no contradiction to sane sense here. And it's not a rare situation.
> In our kernel we have cleancache driver for handling this situation, but
> cleancache is not the best solution like I wrote.
>
> Not overcommitted system is likely case for the patchset, while the below
> is a little less likely:

I believe Johannes has explained that you are trying to use the hard
limit for something it is not designed for.
-- 
Michal Hocko
SUSE Labs