Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756440Ab1BJOPY (ORCPT ); Thu, 10 Feb 2011 09:15:24 -0500 Received: from mx1.redhat.com ([209.132.183.28]:17142 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756407Ab1BJOPV (ORCPT ); Thu, 10 Feb 2011 09:15:21 -0500 Date: Thu, 10 Feb 2011 15:14:47 +0100 From: Andrea Arcangeli To: Mel Gorman Cc: Johannes Weiner , Andrew Morton , Rik van Riel , Michal Hocko , Kent Overstreet , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch] vmscan: fix zone shrinking exit when scan work is done Message-ID: <20110210141447.GW3347@random.random> References: <20110209154606.GJ27110@cmpxchg.org> <20110209164656.GA1063@csn.ul.ie> <20110209182846.GN3347@random.random> <20110210102109.GB17873@csn.ul.ie> <20110210124838.GU3347@random.random> <20110210133323.GH17873@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110210133323.GH17873@csn.ul.ie> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5393 Lines: 110 On Thu, Feb 10, 2011 at 01:33:24PM +0000, Mel Gorman wrote: > Also true, I commented on this in the "Note" your patch deletes and a > suggestion on how an alternative would be to break early unless GFP_REPEAT. Yep noticed that ;), doing that with __GFP_REPEAT sounds just fine to me. > Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so > scanning was still pretty high. The other costs of lumpy reclaim would hide > it of course. Ok but we know lumpy reclaim was not ok to start with. > What about other cases such as order-1 allocations for stack or order-3 > allocations for those network cards using jumbo frames without > scatter/gather? stack order 1 is one of the few cases that come to mind where failing an allocation becomes fatal. Maybe we should use __GFP_REPEAT there too. But we probably need a way to discriminate callers that can gracefully fallback. I'd be extremely surprised if the cost of looping all over the lru taking down all young bits could ever be offseted by the jumbo frame. In fact the jumbo frame I'm afraid might be better off without using compaction at all because it's probably very latency sensitive. We need a 'lowlatency' version of compaction for these users where the improvement of having a compound page instead of a regular page isn't very significant. On a separated topic, I'm currently trying to use the new async compaction code upstream with jumbo frames. I'm also wondering if I'll have to set sync=0 by default unless __GFP_REPEAT is set. It seems adding compaction to jumbo frames is increasing latency to certain workloads in a measurable way. Things were fine when compaction was only used by THP and not for all order allocations (but I didn't test the async mode yet plus the other optimizations for compaction you did recently, I hope they're enough to jumbo frames). > Don't get me wrong, I see your point but I'm wondering if there really are > cases where we routinely scan an entire LRU list of unevictable pages that > are somehow not being migrated properly to the unevictable lists. If > this is happening, we are also in trouble for reclaiming for order-0 > pages, right? Well unevictable pages are just an example and like you said they last only one round of the loop at most. But other caching bits like the referenced bits and all young bits will get all taken down during all later loops too. We definitely don't want to swap just to allow compaction to succeed! I think this argument explains it pretty well, if you takedown all young bits in a constant loop, then system may end up swapping. That's definitely something we don't want. Things may be different if this is a stack allocation without fallback, or if it's hugetlbfs again without kernel fallback (only userland fallback). > It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE" > as an alternative to how we could break early while still being agressive > when required. The only reason it's not that way now is because a) I didn't > consider an LRU mostly full of unevictable pages to be the normal case and b) > for allocations such as order-3 that are preferable not to fail. Ok. > Where should be draw the line? We could come up with ratio of the lists > depending on priority but it'd be hard to measure the gain or loss > without having a profile of a problem case to look at. I would just stick to !nr_reclaimed to break the loop, and ignore the nr_scanned unless __GFP_REPEAT is set, in which case you're welcome to scan everything. Then we've to decide if to add __GFP_REPEAT to the stack allocation... > I don't think you have misread anything but if we're going to weaken > this logic, I'd at least like to see the GFP_REPEAT option tried - i.e. I see the point of __GFP_REPEAT, that sounds the best, I should have just followed your comment but I felt scanning everything was too heavyweight regardless, but ok I see you want as much accuracy as possible in that case, even if that end up in a swap storm. > preserve being aggressive if set. I'm also not convinced we routinely get > into a situation where the LRU consists of almost all unevictable pages > and if we are in this situation, that is a serious problem on its own. It > would also be preferable if we could get latency figures on alloc_pages for > hugepage-sized allocations and a count of how many are succeeding or failing > to measure the impact (if any). I think I made not the best example talking about unevictable pages. I said that because the code is like below, but to me all the goto something following the !page_evictable check were also relevant for this shrink_zone loop. The real life issue is avoid swap storm (or expensive loop flooding the whole system of ipis to takedown all young bits in all ptes), to allocate an hugepage or jumboframe that has a graceful fallback that performs not hugely slower than the hugepage/jumboframe. sc->nr_scanned++; if (unlikely(!page_evictable(page, NULL))) goto cull_mlocked; I think making the !nr_scanned check conditional to __GFP_REPEAT as the comment suggested, for now is the best way to go. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/