Date: Thu, 10 Feb 2011 13:33:24 +0000
From: Mel Gorman
To: Andrea Arcangeli
Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done
Message-ID: <20110210133323.GH17873@csn.ul.ie>
References: <20110209154606.GJ27110@cmpxchg.org>
	<20110209164656.GA1063@csn.ul.ie>
	<20110209182846.GN3347@random.random>
	<20110210102109.GB17873@csn.ul.ie>
	<20110210124838.GU3347@random.random>
In-Reply-To: <20110210124838.GU3347@random.random>

On Thu, Feb 10, 2011 at 01:48:38PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> > We should not be ending up in a situation where the LRU list consists
> > only of unevictable pages and that situation persists, causing
> > excessive (or infinite) looping. As unevictable pages are encountered
> > on the LRU list, they should be moved to the unevictable lists by
> > putback_lru_page(). Are you aware of a situation where this becomes
> > broken?
> >
> > I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and
> > they all get moved. In this case, nr_scanned is positive and we
> > continue to scan, but this is expected and desirable: reclaim/compaction
> > needs more pages to be freed before it starts compaction. If it stops
> > scanning early, it would just fail the allocation later. This is what
> > the "NOTE" is about.
> >
> > I prefer Johannes' fix for the observed problem.
>
> should_continue_reclaim is only needed for compaction. It tries to
> free enough pages so that compaction can succeed in its defrag
> attempt.

Correct.

> So breaking the loop faster isn't going to cause failures for
> order-0 pages.

Also true. I commented on this in the "NOTE" your patch deletes, with a
suggestion that an alternative would be to break early unless GFP_REPEAT
is set.

> My worry is that we loop too much in shrink_zone just
> for compaction even when we don't make any progress. shrink_zone would
> never scan more than SWAP_CLUSTER_MAX pages before this change.

Sort of. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX, so
scanning was still pretty high. The other costs of lumpy reclaim would
hide it, of course.

> Now
> it can loop over the whole LRU as long as we're scanning stuff.

True, the alternative being failing the allocation. Returning sooner is
of course an option, but it would be preferable to see a case where the
logic after Johannes' patch is failing.

> Ok to
> overboost shrink_zone if we're making progress to allow compaction at
> the next round, but if we don't visibly make progress, I'm concerned
> that it may be too aggressive to scan the whole list.
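To make sure we are talking about the same code, the loop in question is
the tail of shrink_zone(). I'm paraphrasing from memory rather than
quoting a tree, so treat this as a sketch of the structure only:

static void shrink_zone(int priority, struct zone *zone,
				struct scan_control *sc)
{
	unsigned long nr_reclaimed, nr_scanned;

restart:
	nr_reclaimed = 0;
	nr_scanned = sc->nr_scanned;

	/*
	 * Scan the LRU lists in SWAP_CLUSTER_MAX-sized batches,
	 * accumulating nr_reclaimed as we go (elided here).
	 */

	/*
	 * Reclaim/compaction may decide that more pages need to be
	 * freed before compaction has a chance of succeeding, in
	 * which case we take another full pass over the lists.
	 */
	if (should_continue_reclaim(zone, nr_reclaimed,
					sc->nr_scanned - nr_scanned, sc))
		goto restart;
}

The behaviour you are worried about is that "goto restart" being taken
repeatedly while nr_scanned stays positive but nr_reclaimed does not grow.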
> The performance
> benefit of having a hugepage isn't as huge as the cost of scanning all
> pages in the LRU, when before we would have broken the loop and
> declared failure after only SWAP_CLUSTER_MAX pages and then fallen back
> to an order-0 allocation.

What about other cases, such as order-1 allocations for stacks or
order-3 allocations for those network cards using jumbo frames without
scatter/gather? Don't get me wrong, I see your point, but I'm wondering
if there really are cases where we routinely scan an entire LRU list of
unevictable pages that are somehow not being migrated properly to the
unevictable lists. If that is happening, we are also in trouble when
reclaiming for order-0 pages, right?

> The fix may help of course, maybe it's enough
> for his case, I don't know, but I don't see it making a whole lot of
> difference, except that now it will stop when the LRU is practically
> empty, which clearly is an improvement. I think we shouldn't be so
> worried about compaction succeeding; the important thing is that we
> don't waste time in compaction if there's not enough free memory, but
> compaction_suitable, used by both logics, should be enough for that.
>
> I'd rather prefer that, if hugetlbfs has special needs, it uses a __GFP_

It uses GFP_REPEAT. That is why I specifically mentioned it in the
"NOTE" as an alternative for how we could break early while still being
aggressive when required. The only reason it's not that way now is
because a) I didn't consider an LRU mostly full of unevictable pages to
be the normal case and b) allocations such as order-3 are preferable not
to fail.

> flag or similar that increases how strict compaction is about
> succeeding, up to scanning the whole LRU in one go in order to make
> some free memory for compaction to succeed.
>
> Going ahead with the scan until compaction_suitable is true makes
> sense instead in the absence of memory pressure, when nr_reclaimed is
> never zero.
>
> Maybe we should try a few more times than just nr_reclaim, but going
> over the whole LRU sounds a bit extreme.

Where should we draw the line? We could come up with a ratio of the
lists depending on priority, but it would be hard to measure the gain or
loss without a profile of a problem case to look at.

> The issue isn't just for unevictable pages, which will be refiled
> during the scan; it will also happen in the presence of lots of
> referenced pages. For example, if we don't apply my fix, the current
> code can take down all the young bits in all the ptes in the whole
> system in one go before returning from shrink_zone. That is too much
> in my view, and losing all that information at once (not to mention
> the cost of losing it) can hardly be offset by the improvement of one
> more hugepage.
>
> But please let me know if I've misread something...

I don't think you have misread anything, but if we're going to weaken
this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
preserve being aggressive if it is set. I'm also not convinced that we
routinely get into a situation where the LRU consists almost entirely of
unevictable pages and, if we do, that is a serious problem on its own.
It would also be preferable if we could get latency figures for
alloc_pages for hugepage-sized allocations, plus a count of how many are
succeeding or failing, to measure the impact (if any).
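As an untested sketch of what I mean by "preserve being aggressive if
set", the exit check in should_continue_reclaim() could look something
like the following, keeping Johannes' behaviour for ordinary
allocations:

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_REPEAT) {
		/*
		 * For __GFP_REPEAT allocations, stop only when the
		 * full LRU list has been scanned and we are still
		 * failing to reclaim. Potentially expensive, but a
		 * __GFP_REPEAT caller really wants to succeed.
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/*
		 * For allocations that can fail without serious
		 * consequence, stop as soon as a batch is scanned
		 * without reclaiming anything. This returns to the
		 * caller faster at the risk that reclaim/compaction
		 * and the resulting allocation attempt fail.
		 */
		if (!nr_reclaimed)
			return false;
	}

That would keep Johannes' fix for the common case while hugetlbfs, which
passes GFP_REPEAT, stays aggressive. Again, untested and only to
illustrate the direction.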
-- 
Mel Gorman
SUSE Labs