Andrea Arcangeli <[email protected]> wrote:
>
> >
> > Having a magic knob is a weak solution: the majority of people who are
> > affected by this problem won't know to turn it on.
>
> that's why I turned it _on_ by default in my tree ;)
So maybe Marcelo should apply this patch, and also turn it on by default.
> There are workloads where adding anonymous pages to the lru is
> suboptimal for both the vm (cache shrinking) and the fast path too
> (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> you're forced to add those pages to the lru somehow and that implies
> some form of locking.
Basically a bunch of tweaks:
- Per-zone lru locks (which implicitly made them per-node)
- Adding/removing sixteen pages for one taking of the lock.
- Making the lock irq-safe (it had to be done for other reasons, but
reduced contention by 30% on 4-way due to not having a CPU wander off to
service an interrupt while holding a critical lock).
- In page reclaim, snip 32 pages off the lru completely and drop the
lock while we go off and process them.
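As a rough illustration of the batching, here is a user-space toy (invented
names, not the actual 2.6 implementation): pages are staged in a small
per-CPU array and the lru lock is taken only once per batch.

/*
 * Toy model of batched LRU insertion: stage up to BATCH_SIZE pages,
 * then take the lru lock a single time to splice them all in.
 * Names and structures are simplified/invented for illustration.
 */
#include <pthread.h>
#include <stddef.h>

#define BATCH_SIZE 16

struct page_stub { struct page_stub *next; };

static struct page_stub *lru_head;      /* stand-in for the zone lru */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

struct lru_batch {
	unsigned int nr;
	struct page_stub *pages[BATCH_SIZE];
};

/* Flush the whole batch under a single lock acquisition. */
static void lru_batch_drain(struct lru_batch *b)
{
	pthread_mutex_lock(&lru_lock);
	for (unsigned int i = 0; i < b->nr; i++) {
		b->pages[i]->next = lru_head;
		lru_head = b->pages[i];
	}
	pthread_mutex_unlock(&lru_lock);
	b->nr = 0;
}

/* Fast path: no lock is touched until the batch fills up. */
static void lru_batch_add(struct lru_batch *b, struct page_stub *page)
{
	b->pages[b->nr++] = page;
	if (b->nr == BATCH_SIZE)
		lru_batch_drain(b);
}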
On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > >
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> >
> > that's why I turned it _on_ by default in my tree ;)
>
> So maybe Marcelo should apply this patch, and also turn it on by default.
Yes, I would suggest so. If anybody can find any swap regression on
small UP machines then a report to us on l-k will be welcome. So far
nobody has noticed any swap difference under swap load AFAIK, and the
improvement for the fast path is dramatic on the big SMP boxes.
> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.
>
> Basically a bunch of tweaks:
>
> - Per-zone lru locks (which implicitly made them per-node)
the 16-ways weren't NUMA, and these days 16-way HT boxes (8-way
physical) are not so uncommon anymore.
>
> - Adding/removing sixteen pages for one taking of the lock.
>
> - Making the lock irq-safe (it had to be done for other reasons, but
> reduced contention by 30% on 4-way due to not having a CPU wander off to
> service an interrupt while holding a critical lock).
>
> - In page reclaim, snip 32 pages off the lru completely and drop the
> lock while we go off and process them.
sounds good, thanks.
I don't see other ways to optimize it (and I never liked the per-zone
lru too much since it has some downsides too, with a worst case on 2G
systems). Perhaps a further optimization could be a transient per-cpu
lru refiled only by the page reclaim (so absolutely lazy while lots of
ram is free), but maybe that's already what you're doing when you say
"Adding/removing sixteen pages for one taking of the lock". Though the
fact you say "sixteen pages" sounds like it's not as lazy as it could
be.
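To make the "transient per-cpu lru" idea a bit more concrete, a rough
user-space sketch (names invented, not a patch; a real version would also
have to handle other cpus still appending while reclaim drains):

/*
 * Toy model of a lazy per-cpu lru: the fast path only touches a
 * cpu-local list and never takes the global lock; the global lru is
 * refiled only when page reclaim actually runs.  Illustration only;
 * cross-cpu races are ignored here.
 */
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 4

struct page_stub { struct page_stub *next; };

static struct page_stub *percpu_staging[NR_CPUS]; /* cpu-local, lock-free */
static struct page_stub *global_lru;              /* protected by lru_lock */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fast path: absolutely lazy while lots of ram is free. */
static void anon_page_add(int cpu, struct page_stub *page)
{
	page->next = percpu_staging[cpu];
	percpu_staging[cpu] = page;
}

/* Called only by page reclaim: splice every cpu's staging list in. */
static void refile_percpu_lists(void)
{
	pthread_mutex_lock(&lru_lock);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		while (percpu_staging[cpu]) {
			struct page_stub *page = percpu_staging[cpu];
			percpu_staging[cpu] = page->next;
			page->next = global_lru;
			global_lru = page;
		}
	}
	pthread_mutex_unlock(&lru_lock);
}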
Andrea Arcangeli wrote:
>
>I don't see other ways to optimize it (and I never liked the per-zone
>lru too much since it has some downsides too, with a worst case on 2G
>systems). Perhaps a further optimization could be a transient per-cpu
>lru refiled only by the page reclaim (so absolutely lazy while lots of
>ram is free), but maybe that's already what you're doing when you say
>"Adding/removing sixteen pages for one taking of the lock". Though the
>fact you say "sixteen pages" sounds like it's not as lazy as it could
>be.
>
Hi Andrea,
What are the downsides on a 2G system?
Hi Nick,
On Mon, Mar 15, 2004 at 03:38:51PM +1100, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >
> >I don't see other ways to optimize it (and I never liked the per-zone
> >lru too much since it has some downsides too, with a worst case on 2G
> >systems). Perhaps a further optimization could be a transient per-cpu
> >lru refiled only by the page reclaim (so absolutely lazy while lots of
> >ram is free), but maybe that's already what you're doing when you say
> >"Adding/removing sixteen pages for one taking of the lock". Though the
> >fact you say "sixteen pages" sounds like it's not as lazy as it could
> >be.
> >
>
> Hi Andrea,
> What are the downsides on a 2G system?
it is absolutely the worst case since both lrus could be around the same
size (an 800M zone-normal lru and a 1.2G zone-highmem lru), maximizing the
loss of "age" information needed for optimal reclaim decisions.
On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
> it is absolutely the worst case since both lrus could be around the same
> size (an 800M zone-normal lru and a 1.2G zone-highmem lru), maximizing the
> loss of "age" information needed for optimal reclaim decisions.
You only lose age information if you don't put equal aging
pressure on both zones. If you make sure the allocation and
pageout pressure are more or less in line with the zone sizes,
why would you lose any aging information ?
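As a back-of-the-envelope illustration (toy code, made-up numbers), keeping
the pressure "in line with the zone sizes" just means scaling each zone's
scan target by its share of the memory:

/*
 * Toy calculation: distribute a total scan target across zones in
 * proportion to their sizes, so the aging pressure per page is the
 * same in every zone.  Zone sizes here are the 2G example (800M
 * normal + 1.2G highmem) expressed in 4k pages; purely illustrative.
 */
#include <stdio.h>

int main(void)
{
	long zone_pages[2] = { 800L * 256, 1200L * 256 };
	long total = zone_pages[0] + zone_pages[1];
	long to_scan = 1024;	/* pages we want scanned overall */

	for (int i = 0; i < 2; i++) {
		long share = to_scan * zone_pages[i] / total;
		printf("zone %d: scan %ld of its %ld pages\n",
		       i, share, zone_pages[i]);
	}
	return 0;
}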
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel wrote:
>On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
>
>
>>it is absolutely the worst case since both lrus could be around the same
>>size (an 800M zone-normal lru and a 1.2G zone-highmem lru), maximizing the
>>loss of "age" information needed for optimal reclaim decisions.
>>
>
>You only lose age information if you don't put equal aging
>pressure on both zones. If you make sure the allocation and
>pageout pressure are more or less in line with the zone sizes,
>why would you lose any aging information ?
>
>
I can't see that you would, no. But maybe I've missed something.
We apply pressure equally except when there is a shortage in a
low memory zone, in which case we can scan only the required
zone(s).
This case I think is well worth the unfairness it causes, because it
means your zone's pages can be freed quickly and without freeing pages
from other zones.
On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> This case I think is well worth the unfairness it causes, because it
> means your zone's pages can be freed quickly and without freeing pages
> from other zones.
freeing pages from other zones is perfectly fine, the classzone design
gets it right, you have to free memory from the other zones too or you
have no way to work on a 1G machine. you call the thing "unfair" when it
has nothing to do with fariness, your unfariness is the slowdown I
pointed out, it's all about being able to maintain a more reliable cache
information from the point of view of the pagecache users (the pagecache
users cares at the _classzone_, they can't care about the zones
themself), it has nothing to do with fairness.
Andrea Arcangeli <[email protected]> wrote:
>
> On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> > This case I think is well worth the unfairness it causes, because it
> > means your zone's pages can be freed quickly and without freeing pages
> > from other zones.
>
> freeing pages from other zones is perfectly fine, the classzone design
> gets it right: you have to free memory from the other zones too or you
> have no way to work on a 1G machine. You call the thing "unfair" when it
> has nothing to do with fairness; your "unfairness" is the slowdown I
> pointed out,
This "slowdown" is purely theoretical and has never been demonstrated.
One could just as easily point at the fact that on a 32GB machine with a
single LRU we have to send 64 highmem pages to the wrong end of the LRU for
each scanned lowmem page, thus utterly destroying any concept of it being
an LRU in the first place. But this is also theoretical, and has never
been demonstrated and is thus uninteresting.
What _is_ interesting is the way in which the single LRU collapses when
there are a huge number of highmem pages on the tail and then there
is a surge in lowmem demand. This was demonstrated, and is what prompted
the per-zone LRU.
Begin forwarded message:
Date: Sun, 04 Aug 2002 01:35:22 -0700
From: Andrew Morton <[email protected]>
To: "[email protected]" <[email protected]>
Subject: how not to write a search algorithm
Worked out why my box is going into a 3-5 minute coma with one test.
Think what the LRUs look like when the test first hits page reclaim
on this 2.5G ia32 box:
head tail
active_list: <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
inactive_list: <1.5G of ZONE_HIGHMEM>
now, somebody does a GFP_KERNEL allocation.
uh-oh.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans 5000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 10000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 20000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 40000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 80000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 160000 pages, achieving nothing.
VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
the inactive list. It then scans about 320000 pages, achieving nothing.
The page allocation fails. So __alloc_pages tries it all again.
This all gets rather boring.
Per-zone LRUs will fix it up. We need that anyway, because a ZONE_NORMAL
request will bogusly refile, on average, memory_size/800M pages to the
head of the inactive list, thus wrecking page aging.
Alan's kernel has a nice-looking implementation. I'll lift that out
next week unless someone beats me to it.
On Mon, Mar 15, 2004 at 10:35:10AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> > > This case I think is well worth the unfairness it causes, because it
> > > means your zone's pages can be freed quickly and without freeing pages
> > > from other zones.
> >
> > freeing pages from other zones is perfectly fine, the classzone design
> > gets it right: you have to free memory from the other zones too or you
> > have no way to work on a 1G machine. You call the thing "unfair" when it
> > has nothing to do with fairness; your "unfairness" is the slowdown I
> > pointed out,
>
> This "slowdown" is purely theoretical and has never been demonstrated.
on a 32G box the slowdown is zero, as it's zero on a 1G box too; you
definitely need a 2G box to measure it.
The effect is that you can do stuff like 'cvs up' and you will end up
caching just 1G instead of 2G. Or am I missing something? If I owned a
2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
but half of that cache is pinned and it sits there with years-old data,
so effectively you lose 50% of the ram in the box in terms of cache
utilization).
> One could just as easily point at the fact that on a 32GB machine with a
> single LRU we have to send 64 highmem pages to the wrong end of the LRU for
> each scanned lowmem page, thus utterly destroying any concept of it being
> an LRU in the first place. But this is also theoretical, and has never
> been demonstrated and is thus uninteresting.
the lowmem zone on a 32G box is completely reserved for zone-normal
allocations, and dcache shrinks aren't too frequent in some workloads, but
you're certainly right that on a 32G box the per-zone lru is optimal in
terms of cpu utilization (on 64bit either way makes no difference, the
GFP_DMA allocations are so rare that throwing a bit of cpu at those rare
allocations is fine).
>
> Worked out why my box is going into a 3-5 minute coma with one test.
> Think what the LRUs look like when the test first hits page reclaim
> on this 2.5G ia32 box:
>
> head tail
> active_list: <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
> inactive_list: <1.5G of ZONE_HIGHMEM>
>
> now, somebody does a GFP_KERNEL allocation.
>
> uh-oh.
>
> VM calls refill_inactive. That moves 25 ZONE_HIGHMEM pages onto
> the inactive list. It then scans 5000 pages, achieving nothing.
I fixed this in my tree a long time ago, you certainly don't need a
per-zone lru to fix this (though for a 32G box the per-zone lru doesn't
only fix it, it also saves lots of cpu compared to the global lru).
See the refill_inactive code in my tree:
static void refill_inactive(int nr_pages, zone_t * classzone)
{
	struct list_head * entry;
	unsigned long ratio;

	ratio = (unsigned long) nr_pages * classzone->nr_active_pages /
		(((unsigned long) classzone->nr_inactive_pages * vm_lru_balance_ratio) + 1);

	entry = active_list.prev;
	while (ratio && entry != &active_list) {
		struct page * page;
		int related_metadata = 0;

		page = list_entry(entry, struct page, lru);
		entry = entry->prev;

		if (!memclass(page_zone(page), classzone)) {
			/*
			 * Hack to address an issue found by Rik. The problem
			 * is that highmem pages can hold buffer headers
			 * allocated from the slab on lowmem, and so if we are
			 * working on the NORMAL classzone here, it is correct
			 * not to try to free the highmem pages themselves
			 * (that would be useless), but we must make sure to
			 * drop any lowmem metadata related to those highmem
			 * pages.
			 */
			if (page->buffers && page->mapping) { /* fast path racy check */
				if (unlikely(TryLockPage(page)))
					continue;
				if (page->buffers && page->mapping && memclass_related_bhs(page, classzone)) /* non racy check */
					related_metadata = 1;
				UnlockPage(page);
			}
			if (!related_metadata)
				continue;
		}

		if (PageTestandClearReferenced(page)) {
			list_del(&page->lru);
			list_add(&page->lru, &active_list);
			continue;
		}

		if (!related_metadata)
			ratio--;
		del_page_from_active_list(page);
		add_page_to_inactive_list(page);
		SetPageReferenced(page);
	}
	if (entry != &active_list) {
		list_del(&active_list);
		list_add(&active_list, entry);
	}
}
the memclass checks guarantee that we make progress. The old vm code
(that you inherited in 2.5) missed those bits I believe.
Without those fixes the 2.4 vm wouldn't perform on 32G (as you also
found during 2.5).
Andrea Arcangeli <[email protected]> wrote:
>
> The effect is that you can do stuff like 'cvs up' and you will end up
> caching just 1G instead of 2G. Or am I missing something? If I owned a
> 2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
> but half of that cache is pinned and it sits there with years-old data,
> so effectively you lose 50% of the ram in the box in terms of cache
> utilization).
Nope, we fill all zones with pagecache and once they've all reached
pages_low we scan all zones in proportion to their size. So the
probability of a page being scanned is independent of its zone.
It took a bit of diddling, but it seems to work OK now. Here are the
relevant bits of /proc/vmstat from a 1G machine, running 2.6.4-rc1-mm1 with
13 days uptime:
pgalloc_high 65658111
pgalloc_normal 384294820
pgalloc_dma 617780
pgrefill_high 5980273
pgrefill_normal 11873490
pgrefill_dma 69861
pgsteal_high 2377905
pgsteal_normal 10504356
pgsteal_dma 4756
pgscan_kswapd_high 3621882
pgscan_kswapd_normal 15652593
pgscan_kswapd_dma 99
pgscan_direct_high 54120
pgscan_direct_normal 162353
pgscan_direct_dma 69377
These are approximately balanced wrt the zone sizes, with a bias towards
ZONE_NORMAL because of non-highmem allocations. It's not perfect, but we
did fix a few things up after 2.6.4-rc1-mm1.
On Mon, Mar 15, 2004 at 11:02:40AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > The effect is that you can do stuff like 'cvs up' and you will end up
> > caching just 1G instead of 2G. Or am I missing something? If I owned a
> > 2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
> > but half of that cache is pinned and it sits there with years-old data,
> > so effectively you lose 50% of the ram in the box in terms of cache
> > utilization).
>
> Nope, we fill all zones with pagecache and once they've all reached
> pages_low we scan all zones in proportion to their size. So the
> probability of a page being scanned is independent of its zone.
>
> It took a bit of diddling, but it seems to work OK now. Here are the
> relevant bits of /proc/vmstat from a 1G machine, running 2.6.4-rc1-mm1 with
> 13 days uptime:
>
> pgalloc_high 65658111
> pgalloc_normal 384294820
> pgalloc_dma 617780
>
> pgrefill_high 5980273
> pgrefill_normal 11873490
> pgrefill_dma 69861
>
> pgsteal_high 2377905
> pgsteal_normal 10504356
> pgsteal_dma 4756
>
> pgscan_kswapd_high 3621882
> pgscan_kswapd_normal 15652593
> pgscan_kswapd_dma 99
>
> pgscan_direct_high 54120
> pgscan_direct_normal 162353
> pgscan_direct_dma 69377
>
> These are approximately balanced wrt the zone sizes, with a bias towards
> ZONE_NORMAL because of non-highmem allocations. It's not perfect, but we
> did fix a few things up after 2.6.4-rc1-mm1.
as long as you don't always start from the highmem zone (so you need a
per-classzone variable to keep track of the last zone scanned, and to
start shrinking from zone-normal and zone-dma if needed), the above
should avoid the problem I mentioned for the 2G setup.
Andrea Arcangeli wrote:
>On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
>
>>This case I think is well worth the unfairness it causes, because it
>>means your zone's pages can be freed quickly and without freeing pages
>>from other zones.
>>
>
>freeing pages from other zones is perfectly fine, the classzone design
>gets it right: you have to free memory from the other zones too or you
>have no way to work on a 1G machine. You call the thing "unfair" when it
>has nothing to do with fairness; your "unfairness" is the slowdown I
>pointed out. It's all about being able to maintain more reliable cache
>information from the point of view of the pagecache users (the pagecache
>users care about the _classzone_, they can't care about the zones
>themselves); it has nothing to do with fairness.
>
>
What I meant by unfairness is that low zone scanning in response
to low zone pressure will not put any pressure on higher zones.
Thus pages in higher zones have an advantage.
We do scan lowmem in response to highmem pressure.
On Tue, Mar 16, 2004 at 09:05:32AM +1100, Nick Piggin wrote:
>
> What I meant by unfairness is that low zone scanning in response
> to low zone pressure will not put any pressure on higher zones.
> Thus pages in higher zones have an advantage.
Ok I see what you mean now; in this sense the unfairness is the same
with the global lru too.
> We do scan lowmem in response to highmem pressure.
As I told Andrew, you also have to make sure not to always start from the
highmem zone, and from the code this seems not to be the case, so my 2G
scenario still applies.
Obviously I expected that you would scan the lowmem zones too, otherwise
you couldn't cache more than 100M or so on a 1G box.
shrink_caches(struct zone **zones, int priority, int *total_scanned,
		int gfp_mask, int nr_pages, struct page_state *ps)
{
	int ret = 0;
	int i;

	for (i = 0; zones[i] != NULL; i++) {
		int to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX);
		struct zone *zone = zones[i];
		int nr_mapped = 0;
		int max_scan;
you seem to always start from zones[0] (that zone ** thing is the
zonelist, so it starts with highmem, then normal, then dma, depending on
the classzone you're shrinking). That will generate the waste of cache
in a 2G box that I described.
I'm reading 2.6.4 mainline here.
to really fix it, you need global information keeping track of the
last zone shrunk, to keep going round robin.
Either that, or you can choose to do some extra work and shrink from all
the zones by removing this break:

		if (ret >= nr_pages)
			break;
but as far as I can tell, the 50% waste of cache in a 2G box can happen
in 2.6.4 and it won't happen in 2.4.x.
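Roughly, the round-robin idea would look like this (user-space sketch with
invented names, not a patch against 2.6.4):

/*
 * Toy sketch of round-robin shrinking: remember where the previous
 * shrink stopped and start the next one from the following zone, so
 * zones[0] (highmem) doesn't absorb all of the pressure.  Names and
 * structure are invented for the illustration.
 */
struct zone_stub { long nr_pages; long nr_scanned; };

static unsigned int last_zone_shrunk;	/* would live per-classzone */

static long shrink_round_robin(struct zone_stub *zones[], int nr_zones,
			       long nr_pages)
{
	long freed = 0;

	for (int n = 0; n < nr_zones && freed < nr_pages; n++) {
		unsigned int i = (last_zone_shrunk + 1 + n) % nr_zones;
		struct zone_stub *zone = zones[i];
		long want = nr_pages - freed;

		/* stand-in for shrink_zone(): pretend we free what we scan */
		long got = want < zone->nr_pages ? want : zone->nr_pages;
		zone->nr_scanned += got;
		freed += got;
		last_zone_shrunk = i;
	}
	return freed;
}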
On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
> As I told Andrew, you also have to make sure not to always start from the
> highmem zone, and from the code this seems not to be the case, so my 2G
> scenario still applies.
Agreed, the scenario applies. However, I don't see how a
global LRU would fix it in eg. the case of an AMD64 NUMA
system...
And once we fix it right for those NUMA systems, we can
use the same code to take care of balancing between zones
on normal PCs, giving us the scalability benefits of the
per-zone lists and locks.
> for (i = 0; zones[i] != NULL; i++) {
> int to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX);
> Either that, or you can choose to do some extra work and shrink from all
> the zones by removing this break:
>
> if (ret >= nr_pages)
> break;
That's probably the nicest solution. Though you will want
to cap it at a certain high water mark (2 * pages_high?) so
you don't end up freeing all of highmem on a burst of lowmem
pressure.
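Something like this toy sketch is what I mean (invented names, not actual
2.6 code; the 2 * pages_high cap is just the figure suggested above):

/*
 * Toy sketch: scan every zone in the list (no early break), but skip
 * any zone that is already comfortably above its high watermark, so a
 * burst of lowmem pressure can't empty all of highmem.
 */
struct zone_toy {
	long free_pages;
	long pages_high;
	long nr_scanned;
};

static void shrink_all_zones(struct zone_toy *zones[], int nr_zones,
			     long nr_to_scan)
{
	for (int i = 0; i < nr_zones; i++) {
		struct zone_toy *zone = zones[i];

		/* Already well above the watermark: leave its cache alone. */
		if (zone->free_pages > 2 * zone->pages_high)
			continue;

		zone->nr_scanned += nr_to_scan;	/* stand-in for shrink_zone() */
	}
}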
> but as far as I can tell, the 50% waste of cache in a 2G box can happen
> in 2.6.4 and it won't happen in 2.4.x.
How about AMD64 NUMA systems ?
What evens out the LRU pressure there in 2.4 ?
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Tue, Mar 16, 2004 at 09:41:24AM +1100, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >
> >Either that, or you can choose to do some extra work and shrink from all
> >the zones by removing this break:
> >
> > if (ret >= nr_pages)
> > break;
> >
> >but as far as I can tell, the 50% waste of cache in a 2G box can happen
> >in 2.6.4 and it won't happen in 2.4.x.
> >
> >
>
> Yeah you are right. Some patches have since gone into 2.6-bk and
> this is one of the things fixed up.
sounds great, thanks!
Andrea Arcangeli wrote:
>
>Either that, or you can choose to do some extra work and shrink from all
>the zones by removing this break:
>
> if (ret >= nr_pages)
> break;
>
>but as far as I can tell, the 50% waste of cache in a 2G box can happen
>in 2.6.4 and it won't happen in 2.4.x.
>
>
Yeah you are right. Some patches have since gone into 2.6-bk and
this is one of the things fixed up.
Nick
On Mon, Mar 15, 2004 at 05:41:54PM -0500, Rik van Riel wrote:
> On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
>
> > As I told Andrew, you also have to make sure not to always start from the
> > highmem zone, and from the code this seems not to be the case, so my 2G
> > scenario still applies.
>
> Agreed, the scenario applies. However, I don't see how a
> global LRU would fix it in eg. the case of an AMD64 NUMA
> system...
I think I mentioned that a per-node lru would be enough for numa; I'm only
talking here about the per-zone lru, per-node numa needs are another matter.
For 64bit, per-node or per-zone is basically the same in practice.
However, after 2.6.4 this will be fixed, so even the per-zone lru should
not generate a loss of caching info; with that part fixed I'm not against
per-zone even if it's more difficult to be fair.
> And once we fix it right for those NUMA systems, we can
> use the same code to take care of balancing between zones
> on normal PCs, giving us the scalability benefits of the
> per-zone lists and locks.
I question those scalability benefits of the locks: on a 32G machine or on
a 1G machine the benefit of those locks is near zero. The only significant
benefit is in terms of computational complexity of the normal-zone
allocations, where we'll only walk the zone-normal and zone-dma
pages.
> How about AMD64 NUMA systems ?
> What evens out the LRU pressure there in 2.4 ?
By the time you say 64bit you can forget the per-zone/per-node
differences. Sure, there will still be a difference but it's cosmetic,
so I don't care about those per-zone lru issues on 64bit hardware; in
fact on 64bit hardware per-zone (even if totally unfair) is the most
optimal, just in case somebody asks for ZONE_DMA more than once per day.
But the difference is so small in practice that even global would be ok.
The per-node issue on numa (not necessarily on amd64; in fact on amd64
the penalty is so small that I doubt things like that will pay off big)
still remains, but that's not the thing I was discussing here.
Andrea Arcangeli wrote:
>
>I question those scalability benefits of the locks: on a 32G machine or on
>a 1G machine the benefit of those locks is near zero. The only significant
>benefit is in terms of computational complexity of the normal-zone
>allocations, where we'll only walk the zone-normal and zone-dma
>pages.
>
>
Out of interest, are there workloads on 8 and 16-way UMA systems
that have lru_lock scalability problems in 2.6? Anyone know?
On Tue, 16 Mar 2004, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> >
> >>This case I think is well worth the unfairness it causes, because it
> >>means your zone's pages can be freed quickly and without freeing pages
> >>from other zones.
> >>
> >
> >freeing pages from other zones is perfectly fine, the classzone design
> >gets it right: you have to free memory from the other zones too or you
> >have no way to work on a 1G machine. You call the thing "unfair" when it
> >has nothing to do with fairness; your "unfairness" is the slowdown I
> >pointed out. It's all about being able to maintain more reliable cache
> >information from the point of view of the pagecache users (the pagecache
> >users care about the _classzone_, they can't care about the zones
> >themselves); it has nothing to do with fairness.
> >
> >
>
> What I meant by unfairness is that low zone scanning in response
> to low zone pressure will not put any pressure on higher zones.
> Thus pages in higher zones have an advantage.
>
> We do scan lowmem in response to highmem pressure.
Hi Nick,
I'm having a good time reading this discussion, so let me jump in.
Sure, the "unfairness" between lowmem and highmem exists. Quoting what
you said, "pages in higher zones have an advantage".
That is natural; after all, the need for lowmem pages is much higher
than the need for highmem pages. And this need for the precious lowmem
increases as the lowmem/highmem ratio grows.
As Andrew has demonstrated, the problems previously caused by such
"unfairness" are nonexistent with per-zone LRU lists.
So, yes, we have unfairness between lowmem and highmem, and yes, that is
the way it should be.
I felt you had a problem with such a thing, however I don't see one.
Am I missing something?
Regards
On Sun, 14 Mar 2004, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > >
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> >
> > that's why I turned it _on_ by default in my tree ;)
>
> So maybe Marcelo should apply this patch, and also turn it on by default.
Hhhmm, not so easy I guess. What about the added overhead of
lru_cache_add() for every anonymous page created?
I bet this will cause problems for users who are happy with the current
behaviour. Won't it?
Andrea, do you have any numbers (or at least estimates) for the added
overhead of instant addition of anon pages to the LRU? That would be
cool to know.
> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.
>
> Basically a bunch of tweaks:
>
> - Per-zone lru locks (which implicitly made them per-node)
>
> - Adding/removing sixteen pages for one taking of the lock.
>
> - Making the lock irq-safe (it had to be done for other reasons, but
> reduced contention by 30% on 4-way due to not having a CPU wander off to
> service an interrupt while holding a critical lock).
>
> - In page reclaim, snip 32 pages off the lru completely and drop the
> lock while we go off and process them.
Obviously we don't have, and don't want, such things in 2.4.
Anyway, it seems this discussion is being productive. Glad!
On Tue, Mar 16, 2004 at 03:31:33AM -0300, Marcelo Tosatti wrote:
>
>
> On Sun, 14 Mar 2004, Andrew Morton wrote:
>
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > >
> > > > Having a magic knob is a weak solution: the majority of people who are
> > > > affected by this problem won't know to turn it on.
> > >
> > > that's why I turned it _on_ by default in my tree ;)
> >
> > So maybe Marcelo should apply this patch, and also turn it on by default.
>
> Hhhmm, not so easy I guess. What about the added overhead of
> lru_cache_add() for every anonymous page created?
>
> I bet this will cause problems for users who are happy with the current
> behaviour. Won't it?
the lru_cache_add is happening in 2.4 mainline; the only point of the
patch is to _avoid_ calling lru_cache_add (tunable with a sysctl so you
can get back to the old behaviour of calling lru_cache_add for every anon
page).
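The effect of the patch is roughly this (toy model, not the actual -aa
diff; the sysctl name is indicative only):

/*
 * Toy model of the patch's effect: the anonymous-fault fast path only
 * does the lru_cache_add() work when the sysctl asks for the old
 * behaviour; otherwise the page reaches the lru later via swap_out().
 */
#include <stdio.h>

static int vm_anon_lru = 0;	/* 0: lazy (skip lru_cache_add), 1: old behaviour */

static void lru_cache_add_stub(void *page)
{
	/* in the kernel this takes pagemap_lru_lock and files the page */
	printf("page %p added to lru\n", page);
}

static void do_anonymous_page_stub(void *page)
{
	/* ... allocate, clear and map the anonymous page ... */
	if (vm_anon_lru)
		lru_cache_add_stub(page);
}

int main(void)
{
	int dummy;

	do_anonymous_page_stub(&dummy);	/* default: nothing added now */
	vm_anon_lru = 1;
	do_anonymous_page_stub(&dummy);	/* old behaviour restored */
	return 0;
}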
> Andrea, do you have any numbers (or at least estimates) for the added
> overhead of instant addition of anon pages to the LRU? That would be
> cool to know.
I have the numbers for the removed overhead; it's significant in some
workloads, but only on >=16-ways.
> Obviously we don't have, and don't want, such things in 2.4.
agreed ;)
> Anyway, it seems this discussion is being productive. Glad!
yep!
On Tue, 16 Mar 2004, Andrea Arcangeli wrote:
> On Tue, Mar 16, 2004 at 03:31:33AM -0300, Marcelo Tosatti wrote:
> >
> >
> > On Sun, 14 Mar 2004, Andrew Morton wrote:
> >
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > >
> > > > > Having a magic knob is a weak solution: the majority of people who are
> > > > > affected by this problem won't know to turn it on.
> > > >
> > > > that's why I turned it _on_ by default in my tree ;)
> > >
> > > So maybe Marcelo should apply this patch, and also turn it on by default.
> >
> > Hhhmm, not so easy I guess. What about the added overhead of
> > lru_cache_add() for every anonymous page created?
> >
> > I bet this will cause problems for users who are happy with the current
> > behaviour. Won't it?
>
> the lru_cache_add is happening in 2.4 mainline; the only point of the
> patch is to _avoid_ calling lru_cache_add (tunable with a sysctl so you
> can get back to the old behaviour of calling lru_cache_add for every anon
> page).
Uh oh, just ignore me.
I misread the message, and misunderstood the whole thing. Will go reread
the patch, and the code.
> > Andrea, do you have any numbers (or at least estimates) for the added
> > overhead of instant addition of anon pages to the LRU? That would be
> > cool to know.
>
> I have the numbers for the removed overhead; it's significant in some
> workloads, but only on >=16-ways.
And for those workloads one should be able to turn it off - right?
> > Obviously we don't have, and don't want, such things in 2.4.
>
> agreed ;)
>
> > Anyway, it seems this discussion is being productive. Glad!
>
> yep!
>
On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > >
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> >
> > that's why I turned it _on_ by default in my tree ;)
>
> So maybe Marcelo should apply this patch, and also turn it on by default.
I've been pondering this again for 2.4.29pre - the thing I'm not sure about
is what negative effect will be caused by not adding anonymous pages to the
LRU immediately on creation.
The scanning algorithm will apply more pressure to pagecache pages initially
(which are on the LRU) - but that is _hopefully_ no problem because swap_out()
will kick in soon, moving anon pages to the LRU as soon as they are swap-allocated.
I'm afraid that might be a significant problem for some workloads. No?
Marc-Christian-Petersen claims it improves behaviour for him - how so Marc,
and what is your workload/hardware description?
This is known to decrease contention on pagemap_lru_lock.
Guys, do you have any further thoughts on this?
I think I'll give it a shot in 2.4.29-pre.
> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.
On Mon, Nov 22, 2004 at 01:01:38PM -0200, Marcelo Tosatti wrote:
> On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > >
> > > > Having a magic knob is a weak solution: the majority of people who are
> > > > affected by this problem won't know to turn it on.
> > >
> > > that's why I turned it _on_ by default in my tree ;)
> >
> > So maybe Marcelo should apply this patch, and also turn it on by default.
>
> I've been pondering this again for 2.4.29pre - the thing I'm not sure about
> is what negative effect will be caused by not adding anonymous pages to the
> LRU immediately on creation.
>
> The scanning algorithm will apply more pressure to pagecache pages initially
> (which are on the LRU) - but that is _hopefully_ no problem because swap_out()
> will kick in soon, moving anon pages to the LRU as soon as they are swap-allocated.
>
> I'm afraid that might be a significant problem for some workloads. No?
>
> Marc-Christian-Petersen claims it improves behaviour for him - how so Marc,
> and what is your workload/hardware description?
>
> This is known to decrease contention on pagemap_lru_lock.
>
> Guys, do you have any further thoughts on this?
> I think I'll give it a shot in 2.4.29-pre.
I think you mean the one-liner patch that avoids the lru_cache_add
during anonymous page allocation (you didn't quote it, and I can't see
the start of the thread). I developed that patch for 2.4-aa and I've been
using it for years, and it runs in all the latest SLES8 kernels too; plus
2.4-aa is the only kernel I'm sure can sustain certain extreme VM loads
with heavy swapping of shmfs during heavy I/O. So you can apply it
safely I think.
On Mon, Nov 22, 2004 at 08:49:53PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 22, 2004 at 01:01:38PM -0200, Marcelo Tosatti wrote:
> > On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > >
> > > > > Having a magic knob is a weak solution: the majority of people who are
> > > > > affected by this problem won't know to turn it on.
> > > >
> > > > that's why I turned it _on_ by default in my tree ;)
> > >
> > > So maybe Marcelo should apply this patch, and also turn it on by default.
> >
> > I've been pondering this again for 2.4.29pre - the thing I'm not sure about
> > is what negative effect will be caused by not adding anonymous pages to the
> > LRU immediately on creation.
> >
> > The scanning algorithm will apply more pressure to pagecache pages initially
> > (which are on the LRU) - but that is _hopefully_ no problem because swap_out()
> > will kick in soon, moving anon pages to the LRU as soon as they are swap-allocated.
> >
> > I'm afraid that might be a significant problem for some workloads. No?
> >
> > Marc-Christian-Petersen claims it improves behaviour for him - how so Marc,
> > and what is your workload/hardware description?
> >
> > This is known to decrease contention on pagemap_lru_lock.
> >
> > Guys, do you have any further thoughts on this?
> > I think I'll give it a shot in 2.4.29-pre.
>
> I think you mean the one-liner patch that avoids the lru_cache_add
> during anonymous page allocation (you didn't quote it, and I can't see
> the start of the thread). I developed that patch for 2.4-aa and I've been
> using it for years, and it runs in all the latest SLES8 kernels too; plus
> 2.4-aa is the only kernel I'm sure can sustain certain extreme VM loads
> with heavy swapping of shmfs during heavy I/O. So you can apply it
> safely I think.
Yes, it is your patch I am talking about, Andrea. Ok, good to hear that.