Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756865Ab2EBGXb (ORCPT ); Wed, 2 May 2012 02:23:31 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:45149 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755089Ab2EBGX3 (ORCPT ); Wed, 2 May 2012 02:23:29 -0400 Date: Wed, 2 May 2012 08:23:09 +0200 From: Johannes Weiner To: Andrea Arcangeli Cc: linux-mm@kvack.org, Rik van Riel , Peter Zijlstra , Mel Gorman , Andrew Morton , Minchan Kim , Hugh Dickins , KOSAKI Motohiro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [patch 5/5] mm: refault distance-based file cache sizing Message-ID: <20120502062308.GN2536@cmpxchg.org> References: <1335861713-4573-1-git-send-email-hannes@cmpxchg.org> <1335861713-4573-6-git-send-email-hannes@cmpxchg.org> <20120502015741.GE22923@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120502015741.GE22923@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3925 Lines: 93 On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote: > On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote: > > frequently used active page. Instead, for each refault with a > > distance smaller than the size of the active list, we deactivate an > > Shouldn't this be the size of active list + size of inactive list? > > If the active list is 500M, inactive 500M and the new working set is > 600M, the refault distance will be 600M, it won't be smaller than the > size of the active list, and it won't deactivate the active list as it > should and it won't be detected as working set. > > Only the refault distance bigger than inactive+active should not > deactivate the active list if I understand how this works correctly. The refault distance is what's missing, not the full reuse frequency. You ignore the 500M worth of inactive LRU time the page had in memory. The distance in that scenario would be 100M, the time between eviction and refault: +-----------------------------++-----------------------------+ | || | | inactive || active | +-----------------------------++-----------------------------+ +~~~~~~~------------------------------+ | | | new set | +~~~~~~~------------------------------+ ^ ^ | | | eviction refault The ~~~'d part could fit into memory if the active list was 100M smaller. > > @@ -1726,6 +1728,11 @@ zonelist_scan: > > if ((alloc_flags & ALLOC_CPUSET) && > > !cpuset_zone_allowed_softwall(zone, gfp_mask)) > > continue; > > + if ((alloc_flags & ALLOC_WMARK_LOW) && > > + current->refault_distance && > > + !workingset_zone_alloc(zone, current->refault_distance, > > + &distance, &active)) > > + continue; > > /* > > * When allocating a page cache page for writing, we > > * want to get it from a zone that is within its dirty > > It's a bit hard to see how this may not run oom prematurely if the > distance is always bigger, this is just an implementation question and > maybe I'm missing a fallback somewhere where we actually allocate > memory from whatever place in case no place is ideal. Sorry, this should be documented better. The ALLOC_WMARK_LOW check makes sure this only applies in the fastpath. It will prepare reclaim with lruvec->shrink_active, then wake up kswapd and retry the zonelist without this constraint. > > + /* > > + * Lower zones may not even be full, and free pages are > > + * potential inactive space, too. But the dirty reserve is > > + * not available to page cache due to lowmem reserves and the > > + * kswapd watermark. Don't include it. > > + */ > > + zone_free = zone_page_state(zone, NR_FREE_PAGES); > > + if (zone_free > zone->dirty_balance_reserve) > > + zone_free -= zone->dirty_balance_reserve; > > + else > > + zone_free = 0; > > Maybe also remove the high wmark from the sum? It can be some hundred > meg so it's better to take it into account, to have a more accurate > math and locate the best zone that surely fits. > > For the same reason it looks like the lowmem reserve should also be > taken into account, on the full sum. dirty_balance_reserve IS the sum of the high watermark and the biggest lowmem reserve for a particular zone, see how it's calculated in mm/page_alloc.c::calculate_totalreserve_pages(). nr_free - dirty_balance_reserve is the number of pages available to page cache allocations without keeping kswapd alive or having to dip into lowmem reserves. Or did I misunderstand you? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/