LinuxLists.cc - Why active list and inactive list?

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Andrea Arcangeli wrote:
> On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote:
>> Dear Linux Developers/Enthusiasts,
>>
>> For a course at my university I'm implementing parts of an operating
>> system where I get most ideas from the Linux Kernel (which I like very
>> much). One book I gain information from is [1].
>>
>> Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained
>> lists - one active list and one active list.
>> I implemented my PRA this way too.
>>
>> No the big question is: WHY do I need *two* lists? Isn't it just
>> overhead/more work? Are there any reasons for that?
>>
>> In my opinion, it would be better to have just one just; pop frames to
>> be swapped out from the end of the list and push new frames in front of
>> it. Then just evaluate the frames and shift them around in the list.
>>
>> Is there any explanation why Linux uses two lists?
>
> Back then I designed it with two lru lists because by splitting the
> active from the inactive cache allows to detect the cache pollution
> before it starts discarding the working set. The idea is that the
> pollution will enter and exit the inactive list without ever being
> elected to the active list because by definition it will never
> generate a cache hit. The working set will instead trigger cache hits
> during page faults or repeated reads, and it will be preserved better
> by electing it to enter the active list.
>
> A page in the inactive list will be collected much more quickly than a
> page in the active list, so the pollution will be collected more
> quickly than the working set. Then the VM while freeing cache tries to
> keep a balance between the size of the two lists to avoid being too
> unfair, obviously at some point the active list have to be
> de-activated too. If your server "fits in ram" you'll find lots of
> cache to be active and so the I/O activity not part of the working set
> will be collected without affecting the working set much.

This makes me wonder if it makes sense to split up the LRU into page
cache LRU and mapped pages LRU. I see two benefits

1. Currently based on swappiness, we might walk an entire list
searching for page cache pages or mapped pages. With these
lists separated, it should get easier and faster to implement
this scheme
2. There is another parallel thread on implementing page cache
limits. If the lists split out, we need not scan the entire
list to find page cache pages to evict them.

Of course I might be missing something (some piece of history)

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 01:40:58

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Tue, 23 Jan 2007, Balbir Singh wrote:

> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits
>
> 1. Currently based on swappiness, we might walk an entire list
> searching for page cache pages or mapped pages. With these
> lists separated, it should get easier and faster to implement
> this scheme
> 2. There is another parallel thread on implementing page cache
> limits. If the lists split out, we need not scan the entire
> list to find page cache pages to evict them.
>
> Of course I might be missing something (some piece of history)

This means page cache = unmapped file backed page right? Otherwise this
would not work. I always thought that the page cache were all file backed
pages both mapped and unmapped.

With the proposed schemd you would have to move pages between lists if
they are mapped and unmapped by a process. Terminating a process could
lead to lots of pages moving to the unnmapped list.

2007-01-23 01:50:19

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Balbir Singh wrote:

> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits

Unlikely. I have seen several workloads fall over because they
did not throw out mapped pages soon enough.

If the kernel does not keep the most frequently accessed pages
resident, hit rates will suffer. Sometimes (well, usually)
those are the mapped pages, but this is not true in all workloads.

Some workloads are very page cache heavy and it pays to keep
the more frequently accessed page cache pages resident while
discarding the less frequently accessed ones.

Since memory size has increased a lot more than disk speed
over the last decade (and this is likely to continue for the
next decades), the quality of page replacement algorithms is
likely to become more and more important over time.

> 1. Currently based on swappiness, we might walk an entire list
> searching for page cache pages or mapped pages. With these
> lists separated, it should get easier and faster to implement
> this scheme

How do you classify a mapped page cache page?

Another issue is that you'll want to make sure that the page
cache pages that are referenced more frequently than the least
referenced mapped (I assume you mean anonymous?) pages in
memory, while swapping out those least used anonymous pages.

One way to do this could be to compare the scan rates, list
sizes and referenced percentage of both lists, to find out
which of the two caches is hotter.

> 2. There is another parallel thread on implementing page cache
> limits. If the lists split out, we need not scan the entire
> list to find page cache pages to evict them.

If the lists split out, there is no reason to limit the page
cache size because you can easily reclaim them. Right?

> Of course I might be missing something (some piece of history)

http://linux-mm.org/AdvancedPageReplacement

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-23 01:56:33

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:

> With the proposed schemd you would have to move pages between lists if
> they are mapped and unmapped by a process. Terminating a process could
> lead to lots of pages moving to the unnmapped list.

That could be a problem.

Another problem is that any such heuristic in the VM is
bound to have corner cases that some workloads will hit.

It would be really nice if we came up with a page replacement
algorithm that did not need many extra heuristics to make it
work...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-23 02:04:08

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Mon, 22 Jan 2007, Rik van Riel wrote:

> It would be really nice if we came up with a page replacement
> algorithm that did not need many extra heuristics to make it
> work...

I guess the "clock" type algorithms are the most promising in that
area. What happened to all those advanced page replacement endeavors?
What is the most promising of those? You seem to have done a lot of work
on those.

2007-01-23 02:12:37

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Tue, Jan 23, 2007 at 07:01:33AM +0530, Balbir Singh wrote:
> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits
> 1. Currently based on swappiness, we might walk an entire list
> searching for page cache pages or mapped pages. With these
> lists separated, it should get easier and faster to implement
> this scheme

When I tried that long time ago, I recall I had troubles, but there
wasn't the reclaim_mapped based on static values so it was even
harder. However it would be still a problem today to decide when to
switch from the unmapped to the mapped lru. When reclaim_mapped is
set, you'll still have to shrink some unmapped page, and by splitting
you literally lose age information to save some cpu. Eventually you
risk spending time in trying to free unfreeable pinned pages that sits
in the unmapped list before finally jumping to the mapped list. So
you've to add yet another list to get rid of the pinned stuff in the
unmapped list and I stopped when I had to refile pages from the
"pinned" list to the unmapped list in irq I/O completion context, now
it's all spin_lock_irq so it would be more natural at least...

> 2. There is another parallel thread on implementing page cache
> limits. If the lists split out, we need not scan the entire
> list to find page cache pages to evict them.

BTW I'm unsure about the cache limit thread, the overhead of the vm
collection shouldn't be an issue, and those tends to hide vm
inefficiencies.

For example Neil has a patch to reduce the writeback cache to 10M-50M
(much lower than the current 1% minimum) to hide huge unfariness in
the writeback cache. I think they should mount the fs with -o sync
instead of using that patch until the unfariness is fixed or
tunable. The patch itself is fine though but for that problem it only
looks a workaround. So I at least try to be always quite skeptical
when I hear about cache "fixed size limiting" patches that improve
responsiveness or performance ;)

> Of course I might be missing something (some piece of history)

Partly ;) Code was very different back then, today it would be easier
thanks to reclaim_mapped but the partial loss of age information and
potential loss of cpu in a pinned&unmapped walk would probably remain.

2007-01-23 02:24:55

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:
> On Mon, 22 Jan 2007, Rik van Riel wrote:
>
>> It would be really nice if we came up with a page replacement
>> algorithm that did not need many extra heuristics to make it
>> work...
>
> I guess the "clock" type algorithms are the most promising in that
> area. What happened to all those advanced page replacement endeavors?
> What is the most promising of those? You seem to have done a lot of work
> on those.

CLOCK-Pro seems the most promising algorithm, because it can
act well both as a first level cache (operating system running
applications) and as a second level cache (operating system
running as a file server), because it tracks both recency and
frequency well.

However, there are a few unanswered questions on clock-pro.

The big one is how we are to do some background aging in a
clock-pro system, so referenced bits don't just pile up when
the VM has enough memory - otherwise we might not know the
right pages to evict when a new process starts up and starts
allocating lots of memory.

At least we've solved the problems of keeping track of the
recently evicted pages in a cheap way, and balancing the
pressure/hotness of different caches against each other.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-23 02:44:40

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Mon, 22 Jan 2007, Rik van Riel wrote:

> The big one is how we are to do some background aging in a
> clock-pro system, so referenced bits don't just pile up when
> the VM has enough memory - otherwise we might not know the
> right pages to evict when a new process starts up and starts
> allocating lots of memory.

There are two bad choices right?

1. Scan for reference bits

Bad because we may have to scan quite a bit without too much
result. LRU allows us to defer this until memory is tight.
Any such scan will pollute the cache and cause a stall of
the app. You really do not want this for a realtime system.

2. Take faults on reference and update the page state.
Bad because this means a fault if the reference bit
has not been set. Could be many faults.

Clock pro really requires 2 right? So lots of additional page faults?

2007-01-23 02:51:39

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:
> On Mon, 22 Jan 2007, Rik van Riel wrote:
>
>> The big one is how we are to do some background aging in a
>> clock-pro system, so referenced bits don't just pile up when
>> the VM has enough memory - otherwise we might not know the
>> right pages to evict when a new process starts up and starts
>> allocating lots of memory.
>
> There are two bad choices right?
>
> 1. Scan for reference bits
>
> Bad because we may have to scan quite a bit without too much
> result. LRU allows us to defer this until memory is tight.
> Any such scan will pollute the cache and cause a stall of
> the app. You really do not want this for a realtime system.
>
> 2. Take faults on reference and update the page state.
> Bad because this means a fault if the reference bit
> has not been set. Could be many faults.
>
> Clock pro really requires 2 right? So lots of additional page faults?

Nope, the faults are not required.

I suspect you're confused with the part where it keeps track
of recently evicted (not resident in RAM at all) pages. That
kind of info is common in database replacement schemes, but
not in general purpose OS memory management.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-23 03:18:52

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:
> On Tue, 23 Jan 2007, Balbir Singh wrote:
>
>> This makes me wonder if it makes sense to split up the LRU into page
>> cache LRU and mapped pages LRU. I see two benefits
>>
>> 1. Currently based on swappiness, we might walk an entire list
>> searching for page cache pages or mapped pages. With these
>> lists separated, it should get easier and faster to implement
>> this scheme
>> 2. There is another parallel thread on implementing page cache
>> limits. If the lists split out, we need not scan the entire
>> list to find page cache pages to evict them.
>>
>> Of course I might be missing something (some piece of history)
>
> This means page cache = unmapped file backed page right? Otherwise this
> would not work. I always thought that the page cache were all file backed
> pages both mapped and unmapped.
>

Yes, unfortunately my terminology was not clear. I mean unmapped file
backed pages.

> With the proposed schemd you would have to move pages between lists if
> they are mapped and unmapped by a process. Terminating a process could
> lead to lots of pages moving to the unnmapped list.
>

When you unmap or map, you need to touch the pte entries and know the
pages involved, so shouldn't be equivalent to a list_del and list_add
for each page impacted by the map/unmap operation?

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 03:28:45

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Tue, 23 Jan 2007, Balbir Singh wrote:

> When you unmap or map, you need to touch the pte entries and know the
> pages involved, so shouldn't be equivalent to a list_del and list_add
> for each page impacted by the map/unmap operation?

When you unmap and map you must currently get exclusive access to the
cachelines of the pte and the cacheline of the page struct. If we use a
list_move on page->lru then we have would have to update pointers in up
to 4 other page structs. Thus we need exclusive access to 4 additional
cachelines. This triples the number of cachelines touched. Instead of 2
cachelines we need 6.

2007-01-23 03:37:00

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Rik van Riel wrote:
> Christoph Lameter wrote:
>
>> With the proposed schemd you would have to move pages between lists if
>> they are mapped and unmapped by a process. Terminating a process could
>> lead to lots of pages moving to the unnmapped list.
>
> That could be a problem.
>
> Another problem is that any such heuristic in the VM is
> bound to have corner cases that some workloads will hit.
>
> It would be really nice if we came up with a page replacement
> algorithm that did not need many extra heuristics to make it
> work...
>

Yes, it's damn hard at times. I was reading through an article
(Architectural support for translation table management in large address
space machines - Huck and Hayes), it talks about how Object Oriented
Systems encourage more sharing and decrease the locality of resulting
virtual address memory stream. Even multi threading tends to impact
locality of references.

Unfortunately, we have only heuristics to go by and of-course their
mathematical model.

I have always wondered if it would be useful to have a kernel debug
feature that can extract page references per task, it would be good
to see the page references (last 'n') of a workload that is not
doing too well on a particular system.

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 03:43:42

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Tue, 23 Jan 2007, Balbir Singh wrote:

> I have always wondered if it would be useful to have a kernel debug
> feature that can extract page references per task, it would be good
> to see the page references (last 'n') of a workload that is not
> doing too well on a particular system.

perfmon can do much of what you are looking for.

2007-01-23 03:45:52

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:
> On Tue, 23 Jan 2007, Balbir Singh wrote:
>
>> When you unmap or map, you need to touch the pte entries and know the
>> pages involved, so shouldn't be equivalent to a list_del and list_add
>> for each page impacted by the map/unmap operation?
>
> When you unmap and map you must currently get exclusive access to the
> cachelines of the pte and the cacheline of the page struct. If we use a
> list_move on page->lru then we have would have to update pointers in up
> to 4 other page structs. Thus we need exclusive access to 4 additional
> cachelines. This triples the number of cachelines touched. Instead of 2
> cachelines we need 6.
>
>

Yes, good point, I see what you mean in terms of impact. But the trade
off could come from shrink_active_list() which does

list_del(&page->lru)
if (!reclaim_mapped && other_conditions)
list_add(&page->lru, &l_active);
...

In the case mentioned above, we would triple the cachlines when an area
is mapped/unmapped (which might be acceptable since it is a state change
for the page ;) ). In the trade-off I mentioned, it would happen
everytime reclaim is invoked and it has nothing to do with a page changing
state.

Did I miss something?

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 03:51:18

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Christoph Lameter wrote:
>
> perfmon can do much of what you are looking for.
>

Thanks, I'll look into it.

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 03:53:36

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Tue, 23 Jan 2007, Balbir Singh wrote:

> Yes, good point, I see what you mean in terms of impact. But the trade
> off could come from shrink_active_list() which does
>
> list_del(&page->lru)
> if (!reclaim_mapped && other_conditions)
> list_add(&page->lru, &l_active);
> ...
>
> In the case mentioned above, we would triple the cachlines when an area
> is mapped/unmapped (which might be acceptable since it is a state change
> for the page ;) ). In the trade-off I mentioned, it would happen
> everytime reclaim is invoked and it has nothing to do with a page changing
> state.
>
> Did I miss something?

We do the list_del/list_add right now in reclaim while moving pages
between active and inactive lists. However, reclaim is not run until the
systems is under memory pressure. Reclaim is run rarely and then lots of
these movements are occurring. At that point is it likely that the
cachelines are already available since the page structs had to be touched
for earlier movements.

2007-01-23 04:17:54

by Nick Piggin

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Balbir Singh wrote:

> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits
>
> 1. Currently based on swappiness, we might walk an entire list
> searching for page cache pages or mapped pages. With these
> lists separated, it should get easier and faster to implement
> this scheme
> 2. There is another parallel thread on implementing page cache
> limits. If the lists split out, we need not scan the entire
> list to find page cache pages to evict them.
>
> Of course I might be missing something (some piece of history)

I actually had patches to do "split active lists" a while back.

They worked by lazily moving the page at reclaim-time, based on
whether or not it is mapped. This isn't too much worse than the
kernel's current idea of what a mapped page is.

They actually got a noticable speedup of the swapping kbuild
workload, but at this stage there were some more basic
improvements needed, so the difference could be smaller today.

The other nice thing about it was that it didn't have a hard
cutoff that the current reclaim_mapped toggle does -- you could
opt to scan the mapped list at a lower ratio than the unmapped
one. Of course, it also has some downsides too, and would
require retuning...

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-23 04:36:30

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Nick Piggin wrote:

> The other nice thing about it was that it didn't have a hard
> cutoff that the current reclaim_mapped toggle does -- you could
> opt to scan the mapped list at a lower ratio than the unmapped
> one. Of course, it also has some downsides too, and would
> require retuning...

Here's a simple idea for tuning.

For each list we keep track of:
1) the size of the list
2) the rate at which we scan the list
3) the fraction of (non new) pages that get
referenced

That way we can determine which list has the largest
fraction of "idle" pages sitting around and consequently
which list should be scanned more aggressively.

For each list we can calculate how frequently the pages
in the list are being used:

pressure = referenced percentage * scan rate / list size

The VM can equalize the pressure by scanning the list with
lower usage less than the other list. This way the VM can
give the right amount of memory to each type.

Of course, each list needs to be divided into inactive and
active like the current VM, in order to make sure that the
pages which are used once cannot push the real working set
of that list out of memory.

There is a more subtle problem when the list's working set
is larger than the amount of memory the list has. In that
situation the VM will be faulting pages back in just after
they got evicted. Something like my /proc/refaults code
can detect that and adjust the size of the undersized list
accordingly.

Of course, once we properly distinguish between the more
frequently and less frequently accessed pages within each
of the page sets (mapped/anonymous vs. unmapped) and have
the pressure between the lists equalized, why do we need
to keep them separate again?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-23 05:11:14

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Nick Piggin wrote:
> Balbir Singh wrote:
>
>> This makes me wonder if it makes sense to split up the LRU into page
>> cache LRU and mapped pages LRU. I see two benefits
>>
>> 1. Currently based on swappiness, we might walk an entire list
>> searching for page cache pages or mapped pages. With these
>> lists separated, it should get easier and faster to implement
>> this scheme
>> 2. There is another parallel thread on implementing page cache
>> limits. If the lists split out, we need not scan the entire
>> list to find page cache pages to evict them.
>>
>> Of course I might be missing something (some piece of history)
>
> I actually had patches to do "split active lists" a while back.
>
> They worked by lazily moving the page at reclaim-time, based on
> whether or not it is mapped. This isn't too much worse than the
> kernel's current idea of what a mapped page is.
>
> They actually got a noticable speedup of the swapping kbuild
> workload, but at this stage there were some more basic
> improvements needed, so the difference could be smaller today.
>
> The other nice thing about it was that it didn't have a hard
> cutoff that the current reclaim_mapped toggle does -- you could
> opt to scan the mapped list at a lower ratio than the unmapped
> one. Of course, it also has some downsides too, and would
> require retuning...
>

Thanks, I am motivated to experiment with the idea. I guess I need
to (re)discover the downsides for myself :-)

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 05:52:16

[permalink] [raw]

Subject: Re: Why active list and inactive list?

Rik van Riel wrote:
> Nick Piggin wrote:
>
>> The other nice thing about it was that it didn't have a hard
>> cutoff that the current reclaim_mapped toggle does -- you could
>> opt to scan the mapped list at a lower ratio than the unmapped
>> one. Of course, it also has some downsides too, and would
>> require retuning...
>
> Here's a simple idea for tuning.
>
> For each list we keep track of:
> 1) the size of the list
> 2) the rate at which we scan the list
> 3) the fraction of (non new) pages that get
> referenced
>
> That way we can determine which list has the largest
> fraction of "idle" pages sitting around and consequently
> which list should be scanned more aggressively.
>
> For each list we can calculate how frequently the pages
> in the list are being used:
>
> pressure = referenced percentage * scan rate / list size
>
> The VM can equalize the pressure by scanning the list with
> lower usage less than the other list. This way the VM can
> give the right amount of memory to each type.
>

This sounds like a good thing to start with. I think we can
then use swappiness to decide what to evict.

> Of course, each list needs to be divided into inactive and
> active like the current VM, in order to make sure that the
> pages which are used once cannot push the real working set
> of that list out of memory.
>

Yes, that makes sense.

> There is a more subtle problem when the list's working set
> is larger than the amount of memory the list has. In that
> situation the VM will be faulting pages back in just after
> they got evicted. Something like my /proc/refaults code
> can detect that and adjust the size of the undersized list
> accordingly.
>
> Of course, once we properly distinguish between the more
> frequently and less frequently accessed pages within each
> of the page sets (mapped/anonymous vs. unmapped) and have
> the pressure between the lists equalized, why do we need
> to keep them separate again?
>

:-)

--
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-01-23 08:33:14

by Peter Zijlstra

[permalink] [raw]

Subject: Re: Why active list and inactive list?

On Mon, 2007-01-22 at 18:03 -0800, Christoph Lameter wrote:
> What happened to all those advanced page replacement endeavors?

They are here:
http://programming.kicks-ass.net/kernel-patches/page-replace/2.6.19-pr1/

I should update to .20 soonish.

> What is the most promising of those?

I'm still torn between CLOCK-Pro and CART.

CLOCK-Pro is still vulnerable to the cyclic scan use case, since at that
time all pages will have equal distance.

CART might be fixable (my CART-r approach) but I still have to study the
full ramifications of that (does it hurt other workloads?).

Both seems quite capable to distinguish between recency and frequency.
Neither use this horrid swappiness knob to distinguish between mapped
and unmapped pages.

The main problem I'm having is test cases, notably the lack thereof.
(and lack of time ofcourse ;-)

2007-01-23 15:10:37