Date: Mon, 15 Mar 2004 19:51:47 +0100
From: Andrea Arcangeli
To: Andrew Morton
Cc: piggin@cyberone.com.au, riel@redhat.com, marcelo.tosatti@cyclades.com,
    j-nomura@ce.jp.nec.com, linux-kernel@vger.kernel.org, torvalds@osdl.org
Subject: Re: [2.4] heavy-load under swap space shortage
Message-ID: <20040315185147.GH30940@dualathlon.random>
In-Reply-To: <20040315103510.25c955a3.akpm@osdl.org>

On Mon, Mar 15, 2004 at 10:35:10AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> > > This case I think is well worth the unfairness it causes, because it
> > > means your zone's pages can be freed quickly and without freeing pages
> > > from other zones.
> >
> > freeing pages from other zones is perfectly fine, the classzone design
> > gets it right, you have to free memory from the other zones too or you
> > have no way to work on a 1G machine. you call the thing "unfair" when it
> > has nothing to do with fairness, your unfairness is the slowdown I
> > pointed out,
>
> This "slowdown" is purely theoretical and has never been demonstrated.

on a 32G box the slowdown is zero, as it's zero on a 1G box too; you
definitely need a 2G box to measure it. The effect is that you can do stuff
like 'cvs up' and end up caching just 1G instead of 2G. Or am I missing
something? If I owned a 2G box I would hate being able to cache just 1G
(yes, the cache is 2G, but half of it is pinned and sits there with
years-old data, so effectively you lose 50% of the ram in the box in terms
of cache utilization).

> One could just as easily point at the fact that on a 32GB machine with a
> single LRU we have to send 64 highmem pages to the wrong end of the LRU for
> each scanned lowmem page, thus utterly destroying any concept of it being
> an LRU in the first place.  But this is also theoretical, and has never
> been demonstrated and is thus uninteresting.

the lowmem zone on a 32G box is completely reserved for zone-normal
allocations, and dcache shrinks aren't too frequent in some workloads, but
you're certainly right that on a 32G box the per-zone lru is optimal in
terms of cpu utilization (on 64bit either way makes no difference, the
GFP_DMA allocations are so seldom that throwing a bit of cpu at those rare
allocations is fine).

> Worked out why my box is going into a 3-5 minute coma with one test.
> Think what the LRUs look like when the test first hits page reclaim
> on this 2.5G ia32 box:
>
>                  head                   tail
>   active_list:   <800M of ZONE_NORMAL>  <200M of ZONE_HIGHMEM>
>   inactive_list: <1.5G of ZONE_HIGHMEM>
>
> now, somebody does a GFP_KERNEL allocation.
>
> uh-oh.
>
> VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
> the inactive list.  It then scans 5000 pages, achieving nothing.

I fixed this in my tree a long time ago; you certainly don't need a per-zone
lru to fix it (though on a 32G box the per-zone lru doesn't only fix it, it
also saves lots of cpu compared to the global lru). See the refill_inactive
code in my tree:

static void refill_inactive(int nr_pages, zone_t * classzone)
{
	struct list_head * entry;
	unsigned long ratio;

	ratio = (unsigned long) nr_pages * classzone->nr_active_pages /
		(((unsigned long) classzone->nr_inactive_pages * vm_lru_balance_ratio) + 1);

	entry = active_list.prev;
	while (ratio && entry != &active_list) {
		struct page * page;
		int related_metadata = 0;

		page = list_entry(entry, struct page, lru);
		entry = entry->prev;

		if (!memclass(page_zone(page), classzone)) {
			/*
			 * Hack to address an issue found by Rik. The problem
			 * is that highmem pages can hold buffer headers
			 * allocated from the slab on lowmem, and so if we are
			 * working on the NORMAL classzone here, it is correct
			 * not to try to free the highmem pages themselves
			 * (that would be useless), but we must make sure to
			 * drop any lowmem metadata related to those highmem
			 * pages.
			 */
			if (page->buffers && page->mapping) { /* fast path racy check */
				if (unlikely(TryLockPage(page)))
					continue;
				if (page->buffers && page->mapping &&
				    memclass_related_bhs(page, classzone)) /* non racy check */
					related_metadata = 1;
				UnlockPage(page);
			}
			if (!related_metadata)
				continue;
		}

		if (PageTestandClearReferenced(page)) {
			list_del(&page->lru);
			list_add(&page->lru, &active_list);
			continue;
		}

		if (!related_metadata)
			ratio--;
		del_page_from_active_list(page);
		add_page_to_inactive_list(page);
		SetPageReferenced(page);
	}
	if (entry != &active_list) {
		list_del(&active_list);
		list_add(&active_list, entry);
	}
}

the memclass checks guarantee that we make progress. the old vm code (that
you inherited in 2.5) missed those bits, I believe. without those fixes the
2.4 vm wouldn't perform on 32G (as you also found during 2.5).
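
To make the progress argument concrete, here is a minimal userspace sketch
(hypothetical, simplified types and list handling; not code from my tree or
from mainline). It only illustrates the one point: walking the active list
from the tail and skipping pages that are not in the target classzone,
instead of stopping after a fixed number of scanned pages, so lowmem pages
are still reached and deactivated even when highmem pages dominate the tail
of the list:

/*
 * Userspace sketch of the memclass "skip but keep walking" idea.
 * Build with: cc -std=c99 -o sketch sketch.c
 */
#include <stdio.h>

enum zone { ZONE_NORMAL, ZONE_HIGHMEM };

struct page {
	enum zone zone;
	int active;		/* 1 = on the active list, 0 = deactivated */
	struct page *prev;	/* towards the head of the active list */
};

/* Deactivate up to nr_pages pages of 'classzone', scanning tail -> head. */
static int refill_inactive_sketch(struct page *tail, enum zone classzone,
				  int nr_pages)
{
	int moved = 0;

	for (struct page *p = tail; p && nr_pages; p = p->prev) {
		if (p->zone != classzone)
			continue;	/* skip, but keep walking: progress */
		p->active = 0;		/* "move to the inactive list" */
		moved++;
		nr_pages--;
	}
	return moved;
}

int main(void)
{
	/* 5 ZONE_NORMAL pages at the head, 20 ZONE_HIGHMEM pages at the tail. */
	struct page pages[25];

	for (int i = 0; i < 25; i++) {
		pages[i].zone = i < 5 ? ZONE_NORMAL : ZONE_HIGHMEM;
		pages[i].active = 1;
		pages[i].prev = i ? &pages[i - 1] : NULL;
	}

	/* A GFP_KERNEL-style request wants ZONE_NORMAL pages deactivated. */
	int moved = refill_inactive_sketch(&pages[24], ZONE_NORMAL, 3);
	printf("deactivated %d ZONE_NORMAL pages despite 20 HIGHMEM pages in the way\n",
	       moved);
	return 0;
}

Obviously the real code above also has to rotate referenced pages and deal
with the lowmem buffer-head metadata; the sketch only shows why skipping
non-memclass pages (rather than counting them against the scan budget)
guarantees the scan eventually reaches pages of the zone the allocation
actually needs.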