Date: Fri, 16 Apr 2010 14:14:12 +1000
From: Dave Chinner
To: Mel Gorman
Cc: KOSAKI Motohiro, Chris Mason, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback
Message-ID: <20100416041412.GY2493@dastard>
In-Reply-To: <20100415102837.GB10966@csn.ul.ie>

On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
>
> At worst, it'll distort the LRU ordering slightly. Let's say the
> file-adjacent-page you clean was near the end of the LRU.
> Before such a patch, it may have gotten cleaned and done another lap of
> the LRU. After, it would be reclaimed sooner. I don't know if we depend
> on such behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.

Fundamentally, we have so many pages on the LRU, getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages a whole lot
faster. And if we have more clean pages available, faster, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....

[ snip questions Chris answered ]

> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaim's cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.

#4 is important to me, too, because that has direct impact on large
file IO workloads. However, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're
concerned about. :)

> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.

Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that.
I've been unable to confirm that lumpy reclaim is broken by
disallowing writeback in my testing, so I'm interested to know what
tests you are running that show it is broken...

> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > that needs to be done.
> >
> > Then we can work towards allowing lumpy reclaim to use background
> > threads as Chris suggested for doing specific writeback operations
> > to solve the remaining problems being seen. Does this seem like a
> > reasonable compromise and approach to dealing with the problem?
> >
>
> I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> enough or come up with an alternative fix. From the goals above it mitigates
> 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> the LRU with 4 until the background cleaner or kswapd comes along.

We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.

We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
is 50-70 functions deep...
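
For reference, a back-of-envelope sketch of the stack arithmetic above.
It assumes 4k pages, so an order-1 stack is 8k and an order-2 stack is
16k. The helper names are made up for illustration and none of this is
kernel code:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def stack_size(order):
    """Size in bytes of a stack made of 2**order pages."""
    return PAGE_SIZE << order


def stack_used(order, bytes_left):
    """Bytes consumed, given how many bytes were still free."""
    return stack_size(order) - bytes_left


def overrun(used, order):
    """Bytes by which `used` exceeds a stack of the given order (0 if it fits)."""
    return max(0, used - stack_size(order))


# Reported case: 7200 bytes left on an order-2 stack.
reported = stack_used(order=2, bytes_left=7200)

# Add the 1k-2k a complex storage subsystem might cost (figures from the text).
for extra in (0, 1024, 2048):
    total = reported + extra
    print(f"+{extra} bytes of storage stack: {total} used, "
          f"{overrun(total, order=1)} over an 8k stack")
```

These numbers count only what the call chain itself consumes. Any
headroom for interrupts or per-thread data would make the amount to
trim larger than the raw overrun.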
> One reason why I am edgy about this is that lumpy reclaim can kick in
> for low-enough orders too like order-1 pages for stacks in some cases or
> order-2 pages for network cards using jumbo frames or some wireless
> cards. The network cards in particular could still cause the stack
> overflow but be much harder to reproduce and detect.

So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com