Date: Thu, 10 Jun 2010 09:38:42 +0900
From: KAMEZAWA Hiroyuki
To: Mel Gorman
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
    Rik van Riel
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible
Message-Id: <20100610093842.6a038ab0.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20100609095200.GA5650@csn.ul.ie>
References: <1275987745-21708-1-git-send-email-mel@csn.ul.ie>
    <20100609115211.435a45f7.kamezawa.hiroyu@jp.fujitsu.com>
    <20100609095200.GA5650@csn.ul.ie>
Organization: FUJITSU Co. LTD.

On Wed, 9 Jun 2010 10:52:00 +0100 Mel Gorman wrote:

> On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:
> >
> > My concern is how memcg should work. IOW, what changes will be
> > necessary for memcg to work with the new vmscan logic as
> > no-direct-writeback.
> >
>
> At worst, memcg waits on background flushers to clean their pages but
> obviously this could lead to stalls in containers if it happened to be
> full of dirty pages.
>

Yes.

> Do you have test scenarios already set up for functional and performance
> regression testing of containers? If so, can you run tests with this
> series and see what sort of impact you find? I haven't done performance
> testing with containers to date so I don't know what the expected values
> are.
>

Maybe kernbench is enough; I think it does enough writes and mallocs. The
'limit' size for the test depends on your host. I sometimes do this on an
8-cpu SMP box:

 # mount -t cgroup none /cgroups -o memory
 # mkdir /cgroups/A
 # echo $$ > /cgroups/A/tasks
 # echo 300M > /cgroups/A/memory.limit_in_bytes
 # make -j 8   (or make -j 16)

Comparing the amount of swap used and the speed will be interesting.
(The 300M above is small enough because my test machine has 24G of
memory.)

Or:

 # mount -t cgroup none /cgroups -o memory
 # mkdir /cgroups/A
 # echo $$ > /cgroups/A/tasks
 # echo 50M > /cgroups/A/memory.limit_in_bytes
 # dd if=/dev/zero of=./tmpfile bs=65536 count=100000

or something similar.

When I tested the original patch for "avoiding writeback" by Dave Chinner,
I saw 2 OOMs in 10 tests. Without the patch, I never saw an OOM.

> > Maybe an ideal solution would be
> >  - support buffered I/O tracking in the I/O cgroup.
> >  - flusher threads should work with the I/O cgroup.
> >  - memcg itself should support a dirty ratio, and add a trigger to
> >    kick flusher threads for dirty pages in a memcg.
> > But I know it's a long way.
> >
>
> I'm not very familiar with memcg I'm afraid or its requirements so I am
> having trouble guessing which of these would behave the best. You could
> take a gamble on having memcg doing writeback in direct reclaim but you
> may run into the same problem of overflowing stacks.
>

Maybe.

> I'm not sure how a flusher thread would work just within a cgroup. It
> would have to do a lot of searching to find the pages it needs
> considering that it's looking at inodes rather than pages.
>

Yes. So I (we) need some way of coloring inodes for selectable writeback.
But people in this area are very nervous about performance (me too ;),
and I've not found the answer yet.

> One possibility I guess would be to create a flusher-like thread if a
> direct reclaimer finds that the dirty pages in the container are above
> the dirty ratio. It would scan and clean all dirty pages in the
> container LRU on behalf of dirty reclaimers.
>

Yes, that's possible. But Andrew recommends not adding more threads, so
I'll use a workqueue if necessary.

> Another possibility would be to have kswapd work in containers.
> Specifically, if wakeup_kswapd() is called with a cgroup, it's added
> to a list. kswapd gives priority to global reclaim but would
> occasionally check if there is a container that needs kswapd on a
> pending list and if so, work within the container. Is there a good
> reason why kswapd does not work within container groups?
>

One reason is node vs. memcg. Because memcg doesn't limit memory
placement, a container can contain pages from all nodes, so which node's
kswapd we should run is a bit of a problem (but yes, maybe a small one).
Another is memory-reclaim priority between memcgs (I don't want to add
such a knob...).

Maybe it's time to consider that. We already use kswapd for the soft
limit; I think similar hints for kswapd should work, yes.

> Finally, you could just allow reclaim within a memcg to do writeback.
> Right now, the check is based on current_is_kswapd() but I could create
> a helper function that also checked for sc->mem_cgroup. Direct reclaim
> from the page allocator never appears to work within a container group
> (which raises questions in itself such as why a process in a container
> would reclaim pages outside the container?) so it would remain safe.
>

isolate_lru_pages() for memcg finds only pages in a memcg ;)
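
A rough sketch of such a helper might look like the following (untested
and only illustrative; the name is made up, and it assumes the check sits
next to struct scan_control in mm/vmscan.c):

	/*
	 * Illustrative sketch: let kswapd and memcg (container) reclaim
	 * issue ->writepage, but keep global direct reclaim away from it,
	 * since that is where the deep call chains are the worry.
	 */
	static bool reclaim_may_writepage(struct scan_control *sc)
	{
		return current_is_kswapd() || sc->mem_cgroup != NULL;
	}

The pageout() check in shrink_page_list() would then use this instead of
calling current_is_kswapd() directly.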

> > How does the new logic work with memcg? Because memcg doesn't trigger
> > kswapd, does memcg have to wait for a flusher thread to make pages
> > clean?
>
> Right now, memcg has to wait for a flusher thread to make pages clean.
>

OK.

> > Or should memcg have a kswapd-for-memcg?
> >
> > Is it okay to call writeback directly when !scanning_global_lru()?
> > memcg's reclaim routine is only called from specific positions, so I
> > guess there is no stack problem.
>
> It's a judgement call from you really. I see that direct reclaimers do
> not set mem_cgroup so it's down to - are you reasonably sure that all
> the paths that reclaim based on a container are not deep?

One concern is add_to_page_cache(). If it's called with a deep stack, my
assumption is wrong.

> I looked around for a while and the bulk appeared to be in the fault
> path so I would guess "yes" but as I'm not familiar with the memcg
> implementation I'll have missed a lot.
>
> > But we just have an I/O pattern problem.
>
> True.
>

Okay, I'll think about how to kick kswapd via memcg, or a
flusher-for-memcg. Please go ahead as you see fit. I love good I/O
patterns, too.

Thanks,
-Kame