2002-01-28 17:14:03

by Josh MacDonald

[permalink] [raw]
Subject: Note describing poor dcache utilization under high memory pressure

When memory pressure becomes high, the Linux kswapd begins calling
shrink_caches() from try_to_free_pages() with an integer priority from
6 (the default, lowest priority) to 1 (high priority). Looking
specifically at the dcache, this results in a calls to
shrink_dcache_memory() that attempt to free a fraction (1/priority) of
the inactive dcache entries. This ultimately leads to prune_dcache()
scanning the dcache in least-recently-used order attempting to call
kmem_cache_free() on some number of dcache entries.

Dcache entries are allocated from the kmem_slab_cache, which manages
objects in page-size "slabs", but the kmem_slab_cache cannot free a
page until every object in a slab becomes unused. The problem is that
freeing dcache entries in LRU-order is effectively freeing entries
from randomly-selected slabs, and therefore applying shrink_caches()
pressure to the dcache has an undesired result. In the attempt to
reduce its size, the dcache must free objects from random slabs in
order to actually release full pages. The result is that under high
memory pressure the dcache utilization drops dramatically. The
prune_dcache() mechanism doesn't just reduce the page utilization as
desired, it reduces the intra-page utilization, which is bad.

In order to measure this effect (via /proc/slabinfo) I first populated
a large dcache and then ran a memory-hog to force swapping to occur.
The dcache utilization drops to between 20-35%. For example, before
running the memory-hog my dcache reports:

dentry_cache 10170 10170 128 339 339 1 : 252 126

(i.e., 10170 active dentry objects, 10170 available dentry objects @
128 bytes each, 339 pages with at least one object, and 339 allocated
pages, an approximately 1.4MB dcache)

While running the memory-hog program to initiate swapping, the dcache
stands at:

dentry_cache 693 3150 128 105 105 1 : 252 126

Meaning, the randomly-applied cache pressure was successful at freeing
234 (= 339-105) pages, leaving a 430KB dcache, but at the same time it
reduced the cache utilization to 22%, meaning that although it was
able to free nearly 1MB of space, 335KB are now wasted as a result of
the high memory-pressure condition.

So, it would seem that the dcache and kmem_slab_cache memory allocator
could benefit from a way to shrink the dcache in a less random way.
Any thoughts?

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist


2002-01-28 17:40:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Mon, 28 Jan 2002, Josh MacDonald wrote:
>
> So, it would seem that the dcache and kmem_slab_cache memory allocator
> could benefit from a way to shrink the dcache in a less random way.
> Any thoughts?

The way I want to solve this problem generically is to basically get rid
of the special-purpose memory shrinkers, and have everything done with one
unified interface, namely the physical-page-based "writeout()" routine. We
do that for the page cache, and there's nothing that says that we couldn't
do the same for all other caches, including very much the slab allocator.

Thus any slab user that wants to, could just register their own per-page
memory pressure logic. The dcache "reference" bit would go away, to be
replaced by a per-page reference bit (that part could be done already, of
course, and might help a bit on its own).

Basically, the more different "pools" of memory we have, the harder it
gets to balance them. Clearly, the optimal number of pools from a
balancing standpoint is just a single, direct physical pool.

Right now we have several pools - we have the pure physical LRU, we have
the virtual mapping (where scanning is directly tied to the physical LRU,
but where the separate pool still _does_ pose some problems), and we have
separate balancing for inodes, dentries and quota. And there's no question
that it hurts us under memory pressure.

(There's a related question, which is whether other caches might also
benefit from being able to grow more - right now there are some caches
that are of a limited size partly because they have no good way of
shrinking back on demand).

I am, for example, very interested to see if Rik can get the overhead of
the rmap stuff down low enough that it's not a noticeable hit under
non-VM-pressure. I'm looking at the issue of doing COW on the page tables
(which really is a separate issue), because it might make it more
palatable to go with the rmap approach.

Linus

2002-01-28 18:02:27

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002, Linus Torvalds wrote:

> I am, for example, very interested to see if Rik can get the overhead of
> the rmap stuff down low enough that it's not a noticeable hit under
> non-VM-pressure. I'm looking at the issue of doing COW on the page tables
> (which really is a separate issue), because it might make it more
> palatable to go with the rmap approach.

I'd be interested to know exactly how much overhead -rmap is
causing for both page faults and fork (but I'm sure one of
the regular benchmarkers can figure that one out while I fix
the RSS limit stuff ;))

About page table COW ... I've thought about it a lot and it
wouldn't surprise me if the 4 MB granularity of page tables
is too large to be of a real benefit since the classic path
of fork+exec would _still_ get all 3 page tables of the
typical process copied.

OTOH, it wouldn't surprise me at all if it was a win ;))

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-28 18:22:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Mon, 28 Jan 2002, Rik van Riel wrote:
>
> I'd be interested to know exactly how much overhead -rmap is
> causing for both page faults and fork (but I'm sure one of
> the regular benchmarkers can figure that one out while I fix
> the RSS limit stuff ;))

I doubt it is noticeable on page faults (the cost of maintaining the list
at COW should be basically zero compared to all the other costs), but I've
seen several people reporting fork() overheads of ~300% or so.

Which is not that surprising, considering that most of the fork overhead
by _far_ is the work to copy the page tables, and rmap makes them three
times larger or so.

And I agree that COW'ing the page tables may not actually help. But it
might be worth it even _without_ rmap, so it's worth a look.

(Also, I'd like to understand why some people report so much better times
on dbench, and some people reports so much _worse_ times with dbench.
Admittedly dbench is a horrible benchmark, but still.. Is it just the
elevator breakage, or is it rmap itself?)

Linus

2002-01-28 18:37:31

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002, Linus Torvalds wrote:
> On Mon, 28 Jan 2002, Rik van Riel wrote:
> >
> > I'd be interested to know exactly how much overhead -rmap is
> > causing for both page faults and fork (but I'm sure one of
> > the regular benchmarkers can figure that one out while I fix
> > the RSS limit stuff ;))
>
> I doubt it is noticeable on page faults (the cost of maintaining the list
> at COW should be basically zero compared to all the other costs), but I've
> seen several people reporting fork() overheads of ~300% or so.

Dave McCracken has tested with applications of different
sizes and has found fork() speed differences of 10% for
small applications up to 400% for a 10 MB (IIRC) program.

This was with some debugging code enabled, however...
(some of the debugging code I've only disabled now)

> Which is not that surprising, considering that most of the fork overhead
> by _far_ is the work to copy the page tables, and rmap makes them three
> times larger or so.

For dense page tables they'll be 3 times larger, but for a page
table with is only occupied for 10% (eg. bash with 1.5 MB spread
over executable+data, libraries and stack) the space overhead is
much smaller.

The amount of RAM touched in fork() is mostly tripled though, if
the program is completely resident, because fork() follows VMA
boundaries.

> And I agree that COW'ing the page tables may not actually help. But it
> might be worth it even _without_ rmap, so it's worth a look.

Absolutely, this is something to try...

> (Also, I'd like to understand why some people report so much better
> times on dbench, and some people reports so much _worse_ times with
> dbench. Admittedly dbench is a horrible benchmark, but still.. Is it
> just the elevator breakage, or is it rmap itself?)

We're still looking into this. William Irwin is running a
nice script to see if the settings in /proc/sys/vm/bdflush
have an observable influence on dbench.

Another thing which could have to do with decreased dbench
and increased tiobench performance is drop behind vs. use-once.
It turns out drop behind is better able to sustain IO streams
of different speeds and can fit more IO streams in the same
amount of cache (people running very heavily loaded ftp or
web download servers can find a difference here).

For the interested parties, I've put some text and pictures of
this phenomenon online at:

http://linux-mm.org/wiki/moin.cgi/StreamingIo

It basically comes down to the fact that use-once degrades into
FIFO, which isn't too efficient when different programs do IO
at different speeds. I'm not sure how this is supposed to
affect dbench, but it could have an influence...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-28 19:26:03

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

If I understand you right, your scheme has the fundamental flaw that one
dcache entry on a page can keep an entire page full of "slackers" in
memory, and since there is little correlation in usage between dcache
entries that happen to get stored on a page, the result is that the
effectiveness per megabyte of the dcache is decreased by an order of
magnitude. It would be worse to have one dcache entry per page, but
maybe not by as much as you might expect.

When objects smaller than a page are stored on a page but not correlated
in their usage, they need to be aged individually not as a page, and
then garbage collected as needed. Neither the current model nor your
proposed scheme solve the fundamental problem Josh's measurements prove
exists.

We need subcache plugins with a subcache size proportional centralized
memory pressure distributor, not a unified cache.

Josh went to bed, but I'll encourage him to say more about this in the
morning Moscow time.

Hans


Linus Torvalds wrote:

>On Mon, 28 Jan 2002, Josh MacDonald wrote:
>
>>So, it would seem that the dcache and kmem_slab_cache memory allocator
>>could benefit from a way to shrink the dcache in a less random way.
>>Any thoughts?
>>
>
>The way I want to solve this problem generically is to basically get rid
>of the special-purpose memory shrinkers, and have everything done with one
>unified interface, namely the physical-page-based "writeout()" routine. We
>do that for the page cache, and there's nothing that says that we couldn't
>do the same for all other caches, including very much the slab allocator.
>
>Thus any slab user that wants to, could just register their own per-page
>memory pressure logic. The dcache "reference" bit would go away, to be
>replaced by a per-page reference bit (that part could be done already, of
>course, and might help a bit on its own).
>
>Basically, the more different "pools" of memory we have, the harder it
>gets to balance them. Clearly, the optimal number of pools from a
>balancing standpoint is just a single, direct physical pool.
>
>Right now we have several pools - we have the pure physical LRU, we have
>the virtual mapping (where scanning is directly tied to the physical LRU,
>but where the separate pool still _does_ pose some problems), and we have
>separate balancing for inodes, dentries and quota. And there's no question
>that it hurts us under memory pressure.
>
>(There's a related question, which is whether other caches might also
>benefit from being able to grow more - right now there are some caches
>that are of a limited size partly because they have no good way of
>shrinking back on demand).
>
>I am, for example, very interested to see if Rik can get the overhead of
>the rmap stuff down low enough that it's not a noticeable hit under
>non-VM-pressure. I'm looking at the issue of doing COW on the page tables
>(which really is a separate issue), because it might make it more
>palatable to go with the rmap approach.
>
> Linus
>
>
>



2002-01-28 19:27:33

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

> On Mon, 28 Jan 2002, Linus Torvalds wrote:
>> (Also, I'd like to understand why some people report so much better
>> times on dbench, and some people reports so much _worse_ times with
>> dbench. Admittedly dbench is a horrible benchmark, but still.. Is it
>> just the elevator breakage, or is it rmap itself?)

On Mon, Jan 28, 2002 at 04:37:02PM -0200, Rik van Riel wrote:
> We're still looking into this. William Irwin is running a
> nice script to see if the settings in /proc/sys/vm/bdflush
> have an observable influence on dbench.

They have observable effects, but the preliminary results (the
runs are so time-intensive the repetitions needed to average out
dbench's wild fluctuations are going to take a while) seem to
give me three intuitions:

(1) the bdflush logic is brittle and/or not sufficiently adaptive
(2) dbench results fluctuate so wildly it's difficult to
reproduce results accurately
(3) dbench results fluctuate so wildly it obscures the true
performance curve

The winner of the first round seems to be 7 0 0 0 500 3000 28 0 0

Cheers,
Bill

2002-01-28 19:58:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 07:21 pm, Linus Torvalds wrote:
> On Mon, 28 Jan 2002, Rik van Riel wrote:
> >
> > I'd be interested to know exactly how much overhead -rmap is
> > causing for both page faults and fork (but I'm sure one of
> > the regular benchmarkers can figure that one out while I fix
> > the RSS limit stuff ;))
>
> I doubt it is noticeable on page faults (the cost of maintaining the list
> at COW should be basically zero compared to all the other costs), but I've
> seen several people reporting fork() overheads of ~300% or so.
>
> Which is not that surprising, considering that most of the fork overhead
> by _far_ is the work to copy the page tables, and rmap makes them three
> times larger or so.
>
> And I agree that COW'ing the page tables may not actually help. But it
> might be worth it even _without_ rmap, so it's worth a look.

True. I've been working on this question fairly intensively since around
Christmas. At that time I came up with the algorithm attached below, which I
circulated to a number of people, including Alan (who thought it would work)
and Rik. (Rik, please be more thorough about quoting your sources ;-).

I've been a little slow to 'publish' this on lkml because I wanted a working
prototype first, as proof of concept. My efforts to dragoon one or more of
the notably capable kernelnewbies crowd into coding it haven't been
particularly successful, perhaps due to the opacity of the code in question
(pgtable.h et al). So I've begun coding it myself, and it's rather slow
going, again because of the opacity of the code. Oh, and the difficult
nature of the problem itself, since it requires understanding pretty much all
of Unix memory management semantics first, including the bizarre (and useful)
properties of process forking.

The good news is, the code changes required do fit very cleanly into the
current design. Changes are required in three places I've identified so far:

copy_page_range
Intead of copying the page tables, just increment their use counts

zap_page_range:
If a page table is shared according to its use count, just decrement
the use count and otherwise leave it alone.

handle_mm_fault:
If a page table is shared according to its use count and the faulting
instruction is a write, allocate a new page table and do the work that
would have normally been done by copy_page_range at fork time.
Decrement the use count of the (perhaps formerly) shared page table.

I'd cheerfully hand this coding effort off to someone more familiar with this
particular neck of the kernel woods - you, Davem and Marcelo come to mind,
but if nobody bites I'll just continue working on it at my own pace. I
consider this work a potentially important evolutionary step, since it could
possibly lead to a breakup of the current VM logjam.

Here's the original writeup I circulated earlier this month:

Lazy page table instantiation algorithm
---------------------------------------

This algorithm implements page table sharing between page directories, with
a view to reducing the amount of work required at fork time for a child to
inherit its parent's memory. With this algorithm, only a single page (the
page directory) is copied at fork time.

This work is aimed at making Rik's rmap design practical, i.e., by eliminating
the current high cost of creating potentially millions of pte_chain list nodes
at fork time. However, the usefulness of this approach is not limited to
rmap -
it should speed up fork in the current vm as well, by copying only one page to
the child (the page directory) instead of all page table pages, and by
instantiating only those page tables that refer to pages the child actually
writes to.

The algorithm relies on a page table's use count, which specifies how many
page directories point at it. A page table can be shared by several page
directories until one of its non-shared pages is written. Then the page and
its page table are 'privatized', i.e., a copy of the page is entered into a
copy of the page table (and the page's pte_chain is extended).

In this way, instantiation of page tables does not happen at fork time, but
later, when pages referenced by page tables are privatized. Furthermore,
only page tables that need to be instantiated are instantiated, as opposed
to the current algorithm, which instantiates all page tables at fork time.

Algorithm
---------

On fork:
- Increment the use counts on all page tables mapped by the page directory
- Copy the page directory to the child

On page privatize:
- If the page table has use count greater than one, privatize the page
table. At this time, extend all the pte_chains to include the privatized
page table. Reduce the use count on the original page table by one and
set the use count of the new page table to one.

On process termination:
- Decrement the use counts on all page tables mapped by the page directory;
free those with zero use count.

With this lazy page table instantiation, a simple reverse-mapping memory
management scheme becomes much more attractive, because the overhead of
creating reverse links to ptes has been reduced (to only the page tables that
need it) and is not done at fork time.

--
Daniel

2002-01-28 21:33:34

by Rick Stevens

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Daniel Phillips wrote:
[snip]

> I've been a little slow to 'publish' this on lkml because I wanted a working
> prototype first, as proof of concept. My efforts to dragoon one or more of
> the notably capable kernelnewbies crowd into coding it haven't been
> particularly successful, perhaps due to the opacity of the code in question
> (pgtable.h et al). So I've begun coding it myself, and it's rather slow
> going, again because of the opacity of the code. Oh, and the difficult
> nature of the problem itself, since it requires understanding pretty much all
> of Unix memory management semantics first, including the bizarre (and useful)
> properties of process forking.
>
> The good news is, the code changes required do fit very cleanly into the
> current design. Changes are required in three places I've identified so far:
>
> copy_page_range
> Intead of copying the page tables, just increment their use counts
>
> zap_page_range:
> If a page table is shared according to its use count, just decrement
> the use count and otherwise leave it alone.
>
> handle_mm_fault:
> If a page table is shared according to its use count and the faulting
> instruction is a write, allocate a new page table and do the work that
> would have normally been done by copy_page_range at fork time.
> Decrement the use count of the (perhaps formerly) shared page table.


Perhaps I'm missing this, but I read that as the child gets a reference
to the parent's memory. If the child attempts a write, then new memory
is allocated, data copied and the write occurs to this new memory. As
I read this, it's only invoked on a child write.

Would this not leave a hole where the parent could write and, since the
child shares that memory, the new data would be read by the child? Sort
of a hidden shm segment? If so, I think we've got problems brewing.
Now, if a parent write causes the same behaviour as a child write, then
my point is moot.

Could you clarify this for me? I'm probably way off base here.

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. [email protected] -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- grep me no patterns and I'll tell you no lines -
----------------------------------------------------------------------

2002-01-28 21:43:45

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002, Rick Stevens wrote:
> Daniel Phillips wrote:
> [snip]
[page table COW description]

> Perhaps I'm missing this, but I read that as the child gets a reference
> to the parent's memory. If the child attempts a write, then new memory
> is allocated, data copied and the write occurs to this new memory. As
> I read this, it's only invoked on a child write.
>
> Would this not leave a hole where the parent could write and, since the
> child shares that memory, the new data would be read by the child? Sort
> of a hidden shm segment? If so, I think we've got problems brewing.
> Now, if a parent write causes the same behaviour as a child write, then
> my point is moot.

Daniel and I discussed this issue when Daniel first came up with
the idea of doing page table COW. He seemed a bit confused by
fork semantics when we first discussed this idea, too ;)

You're right though, both parent and child need to react in the
same way, preferably _without_ having to walk all of the parent's
page tables and mark them read-only ...

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-28 22:00:46

by Rick Stevens

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Rik van Riel wrote:

> On Mon, 28 Jan 2002, Rick Stevens wrote:
>
>>Daniel Phillips wrote:
>>[snip]
>>
> [page table COW description]
>
>
>>Perhaps I'm missing this, but I read that as the child gets a reference
>>to the parent's memory. If the child attempts a write, then new memory
>>is allocated, data copied and the write occurs to this new memory. As
>>I read this, it's only invoked on a child write.
>>
>>Would this not leave a hole where the parent could write and, since the
>>child shares that memory, the new data would be read by the child? Sort
>>of a hidden shm segment? If so, I think we've got problems brewing.
>>Now, if a parent write causes the same behaviour as a child write, then
>>my point is moot.
>>
>
> Daniel and I discussed this issue when Daniel first came up with
> the idea of doing page table COW. He seemed a bit confused by
> fork semantics when we first discussed this idea, too ;)
>
> You're right though, both parent and child need to react in the
> same way, preferably _without_ having to walk all of the parent's
> page tables and mark them read-only ...


Ah, good! I wasn't losing it then. That's a relief!

I've gotta read up on the kernel's VM system. I use to write them
for a certain three-letter-acronymed company--many, many moons ago.
Maybe I'd have some ideas. Then again, perhaps not.

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. [email protected] -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- Veni, Vidi, VISA: I came, I saw, I did a little shopping. -
----------------------------------------------------------------------

2002-01-28 22:02:36

by Momchil Velikov

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

>>>>> "Daniel" == Daniel Phillips <[email protected]> writes:

Daniel> I'd cheerfully hand this coding effort off to someone more familiar with this
Daniel> particular neck of the kernel woods - you, Davem and Marcelo come to mind,
Daniel> but if nobody bites I'll just continue working on it at my own pace. I

BTW, I'm doing just this, working on it at my own pace.

2002-01-28 22:16:16

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 11:01 pm, Momchil Velikov wrote:
> >>>>> "Daniel" == Daniel Phillips <[email protected]> writes:
>
> Daniel> I'd cheerfully hand this coding effort off to someone more familiar with this
> Daniel> particular neck of the kernel woods - you, Davem and Marcelo come to mind,
> Daniel> but if nobody bites I'll just continue working on it at my own pace. I
>
> BTW, I'm doing just this, working on it at my own pace.

Right, well in a couple of days we can compare notes. I'm a little
embarrassed at the state of the code as of today, I think I'm interpreting
some of those ulongs as things they shouldn't be.

This would be a whole lot easier if those ugly macros in pgtable.h were inlines
with pagetable_t etc. parameters instead.

--
Daniel

2002-01-28 22:22:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 10:43 pm, Rik van Riel wrote:
> On Mon, 28 Jan 2002, Rick Stevens wrote:
> > Daniel Phillips wrote:
> > [snip]
> [page table COW description]
>
> > Perhaps I'm missing this, but I read that as the child gets a reference
> > to the parent's memory. If the child attempts a write, then new memory
> > is allocated, data copied and the write occurs to this new memory. As
> > I read this, it's only invoked on a child write.
> >
> > Would this not leave a hole where the parent could write and, since the
> > child shares that memory, the new data would be read by the child? Sort
> > of a hidden shm segment? If so, I think we've got problems brewing.
> > Now, if a parent write causes the same behaviour as a child write, then
> > my point is moot.
>
> Daniel and I discussed this issue when Daniel first came up with
> the idea of doing page table COW. He seemed a bit confused by
> fork semantics when we first discussed this idea, too ;)

Oh yes, I admit it, confused is me. That way I avoid heading off in
directions that people have already gone, and found nothing ;-)

> You're right though, both parent and child need to react in the
> same way, preferably _without_ having to walk all of the parent's
> page tables and mark them read-only ...

Yes, and look at the algorithm as I've stated it, it's symmetric with respect
to parent and child. Getting it into this simple and robust form took a lot
of work, and as you know, my initial attempts were complex and, yes, confused.

--
Daniel

2002-01-28 22:35:28

by Brian Gerst

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Rick Stevens wrote:
>
> Daniel Phillips wrote:
> [snip]
>
> > I've been a little slow to 'publish' this on lkml because I wanted a working
> > prototype first, as proof of concept. My efforts to dragoon one or more of
> > the notably capable kernelnewbies crowd into coding it haven't been
> > particularly successful, perhaps due to the opacity of the code in question
> > (pgtable.h et al). So I've begun coding it myself, and it's rather slow
> > going, again because of the opacity of the code. Oh, and the difficult
> > nature of the problem itself, since it requires understanding pretty much all
> > of Unix memory management semantics first, including the bizarre (and useful)
> > properties of process forking.
> >
> > The good news is, the code changes required do fit very cleanly into the
> > current design. Changes are required in three places I've identified so far:
> >
> > copy_page_range
> > Intead of copying the page tables, just increment their use counts
> >
> > zap_page_range:
> > If a page table is shared according to its use count, just decrement
> > the use count and otherwise leave it alone.
> >
> > handle_mm_fault:
> > If a page table is shared according to its use count and the faulting
> > instruction is a write, allocate a new page table and do the work that
> > would have normally been done by copy_page_range at fork time.
> > Decrement the use count of the (perhaps formerly) shared page table.
>
> Perhaps I'm missing this, but I read that as the child gets a reference
> to the parent's memory. If the child attempts a write, then new memory
> is allocated, data copied and the write occurs to this new memory. As
> I read this, it's only invoked on a child write.
>
> Would this not leave a hole where the parent could write and, since the
> child shares that memory, the new data would be read by the child? Sort
> of a hidden shm segment? If so, I think we've got problems brewing.
> Now, if a parent write causes the same behaviour as a child write, then
> my point is moot.
>
> Could you clarify this for me? I'm probably way off base here.

Fortunately, the kernel's page mappings are shared by all processes
(except the top level), so if you mark the page containing the user page
table as read-only from the child, it will also be read-only in the
parent.

--

Brian Gerst

2002-01-28 22:35:07

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 10:33 pm, Rick Stevens wrote:
> Daniel Phillips wrote:
> [snip]
>
> > I've been a little slow to 'publish' this on lkml because I wanted a
working
> > prototype first, as proof of concept. My efforts to dragoon one or more
of
> > the notably capable kernelnewbies crowd into coding it haven't been
> > particularly successful, perhaps due to the opacity of the code in
question
> > (pgtable.h et al). So I've begun coding it myself, and it's rather slow
> > going, again because of the opacity of the code. Oh, and the difficult
> > nature of the problem itself, since it requires understanding pretty much
all
> > of Unix memory management semantics first, including the bizarre (and
useful)
> > properties of process forking.
> >
> > The good news is, the code changes required do fit very cleanly into the
> > current design. Changes are required in three places I've identified so
> > far:
> >
> > copy_page_range
> > Intead of copying the page tables, just increment their use counts
> >
> > zap_page_range:
> > If a page table is shared according to its use count, just decrement
> > the use count and otherwise leave it alone.
> >
> > handle_mm_fault:
> > If a page table is shared according to its use count and the faulting
> > instruction is a write, allocate a new page table and do the work
> > that would have normally been done by copy_page_range at fork time.
> > Decrement the use count of the (perhaps formerly) shared page table.
>
>
> Perhaps I'm missing this, but I read that as the child gets a reference
> to the parent's memory. If the child attempts a write, then new memory
> is allocated, data copied and the write occurs to this new memory. As
> I read this, it's only invoked on a child write.

Notice that the word 'child' is only used in reference to the intial page
directory copy. After that things are symmetric with respect to parent and
child, a fundamental simplification that allows the algorithm to work without
explicit knowledge of the structure of the mm tree (and also simplifies the
locking considerably).

> Would this not leave a hole where the parent could write and, since the
> child shares that memory, the new data would be read by the child? Sort
> of a hidden shm segment? If so, I think we've got problems brewing.
> Now, if a parent write causes the same behaviour as a child write, then
> my point is moot.
>
> Could you clarify this for me? I'm probably way off base here.

Since the page was copied to the child, the child's page table must be
altered, and since it is shared, it must first be instantiated by the child.
So after all the dust settles, the parent and child have their own copies of
a page table page, which differ only at a single location: the child's page
table points at its freshly made CoW copy, and the parent's page table points
at the original page.

The beauty of this is, the page table could just as easily have been shared
by a sibling of the child, not the parent at all, in the case that the parent
had already instantiated its own copy of the page table page because of an
earlier CoW.

Confused yet? Welcome to the club ;-)

--
Daniel

2002-01-28 22:39:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 11:00 pm, Rick Stevens wrote:
> I've gotta read up on the kernel's VM system. I use to write them
> for a certain three-letter-acronymed company--many, many moons ago.
> Maybe I'd have some ideas. Then again, perhaps not.

Well, you really want to take a trip over to irc.openprojects.net,
#kernelnewbies, and there you'll find a number of still-current IBM people
happily helping cook up plots to take over Linu^H^H^H^H the world ;-)

--
Daniel

Subject: Re: Note describing poor dcache utilization under high memory pressure



--On Monday, 28 January, 2002 9:39 AM -0800 Linus Torvalds
<[email protected]> wrote:

> Thus any slab user that wants to, could just register their own per-page
> memory pressure logic.

It might be useful to use a similar type interface to aid
in defragmentation - i.e. 'relocate the stuff on this
(physical) page please, I want it back'. If nothing is
registered, the default mechanism could be just to free it
a la writepage().

--
Alex Bligh

2002-01-28 23:06:52

by Rick Stevens

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Daniel Phillips wrote:

> On January 28, 2002 11:00 pm, Rick Stevens wrote:
>
>>I've gotta read up on the kernel's VM system. I use to write them
>>for a certain three-letter-acronymed company--many, many moons ago.
>>Maybe I'd have some ideas. Then again, perhaps not.
>>
>
> Well, you really want to take a trip over to irc.openprojects.net,
> #kernelnewbies, and there you'll find a number of still-current IBM people
> happily helping cook up plots to take over Linu^H^H^H^H the world ;-)


Uh, I never said IBM ;-) I said "a three-letter-acronym" company.
There were several. The one I dealt with was in Massachusetts, had
a real penchant for three-letter acronyms and used a programming
dialect which was the only single word oxymoron in the English
language (enough hints yet?).

And, no, I wasn't an employee.

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. [email protected] -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- "More hay, Trigger?" "No thanks, Roy, I'm stuffed!" -
----------------------------------------------------------------------

2002-01-28 23:09:43

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 11:34 pm, Brian Gerst wrote:
> Rick Stevens wrote:
> >
> > Daniel Phillips wrote:
> > [snip]
> >
> > > I've been a little slow to 'publish' this on lkml because I wanted a working
> > > prototype first, as proof of concept. My efforts to dragoon one or more of
> > > the notably capable kernelnewbies crowd into coding it haven't been
> > > particularly successful, perhaps due to the opacity of the code in question
> > > (pgtable.h et al). So I've begun coding it myself, and it's rather slow
> > > going, again because of the opacity of the code. Oh, and the difficult
> > > nature of the problem itself, since it requires understanding pretty much all
> > > of Unix memory management semantics first, including the bizarre (and useful)
> > > properties of process forking.
> > >
> > > The good news is, the code changes required do fit very cleanly into the
> > > current design. Changes are required in three places I've identified so far:
> > >
> > > copy_page_range
> > > Intead of copying the page tables, just increment their use counts
> > >
> > > zap_page_range:
> > > If a page table is shared according to its use count, just decrement
> > > the use count and otherwise leave it alone.
> > >
> > > handle_mm_fault:
> > > If a page table is shared according to its use count and the faulting
> > > instruction is a write, allocate a new page table and do the work that
> > > would have normally been done by copy_page_range at fork time.
> > > Decrement the use count of the (perhaps formerly) shared page table.
> >
> > Perhaps I'm missing this, but I read that as the child gets a reference
> > to the parent's memory. If the child attempts a write, then new memory
> > is allocated, data copied and the write occurs to this new memory. As
> > I read this, it's only invoked on a child write.
> >
> > Would this not leave a hole where the parent could write and, since the
> > child shares that memory, the new data would be read by the child? Sort
> > of a hidden shm segment? If so, I think we've got problems brewing.
> > Now, if a parent write causes the same behaviour as a child write, then
> > my point is moot.
> >
> > Could you clarify this for me? I'm probably way off base here.
>
> Fortunately, the kernel's page mappings are shared by all processes
> (except the top level), so if you mark the page containing the user page
> table as read-only from the child, it will also be read-only in the
> parent.

Actually, the page tables don't even have to be marked read-only. It's
done with use counts instead: use count > 1 -> shared.

We don't have to take faults on page tables themselves since we are
always servicing a fault, or we are carrying out some explicit VM
operation when we encounter the page table. That means we are already
in control and don't have to get clever. Or as it's been expressed
here before, we don't have to 'play stupid fault tricks', which would
be far less efficient and likely more complex to code.

--
Daniel

2002-01-28 23:13:13

by Rick Stevens

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Daniel Phillips wrote:

> Since the page was copied to the child, the child's page table must be
> altered, and since it is shared, it must first be instantiated by the child.
> So after all the dust settles, the parent and child have their own copies of
> a page table page, which differ only at a single location: the child's page
> table points at its freshly made CoW copy, and the parent's page table points
> at the original page.

>
> The beauty of this is, the page table could just as easily have been shared
> by a sibling of the child, not the parent at all, in the case that the parent
> had already instantiated its own copy of the page table page because of an
> earlier CoW.


Ok. Still seems like a bit more copying than necessary. I'd have
to look at it a bit more and do some noodling.


> Confused yet? Welcome to the club ;-)


Does my head exploding qualify for "confused"? If so, then I'm not
yet "confused". I'm "concerned", since my ears are bleeding (a
precursor to an explosion) ;-p

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. [email protected] -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- "More hay, Trigger?" "No thanks, Roy, I'm stuffed!" -
----------------------------------------------------------------------

2002-01-28 23:23:03

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 12:12 am, Rick Stevens wrote:
> Daniel Phillips wrote:
>
> > Since the page was copied to the child, the child's page table must be
> > altered, and since it is shared, it must first be instantiated by the child.
> > So after all the dust settles, the parent and child have their own copies of
> > a page table page, which differ only at a single location: the child's page
> > table points at its freshly made CoW copy, and the parent's page table points
> > at the original page.
>
> >
> > The beauty of this is, the page table could just as easily have been shared
> > by a sibling of the child, not the parent at all, in the case that the parent
> > had already instantiated its own copy of the page table page because of an
> > earlier CoW.
>
>
> Ok. Still seems like a bit more copying than necessary.

I think I can show that it's exactly as much copying as necessary, and no more.
Oh, the page directory could also be shared and, on a 3 level page table, so
could the mid level tables, but that's not a really big win because of the 1K/1
fanout. I.e, we already took care of 99.9% of the problem just by sharing the
bottom-level page tables.

> I'd have to look at it a bit more and do some noodling.
>
> > Confused yet? Welcome to the club ;-)
>
> Does my head exploding qualify for "confused"? If so, then I'm not
> yet "confused". I'm "concerned", since my ears are bleeding (a
> precursor to an explosion) ;-p

:-)

--
Daniel

2002-01-28 23:48:37

by Daniel Phillips

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On January 28, 2002 08:25 pm, Hans Reiser wrote:
> If I understand you right, your scheme has the fundamental flaw that one
> dcache entry on a page can keep an entire page full of "slackers" in
> memory, and since there is little correlation in usage between dcache
> entries that happen to get stored on a page, the result is that the
> effectiveness per megabyte of the dcache is decreased by an order of
> magnitude. It would be worse to have one dcache entry per page, but
> maybe not by as much as you might expect.
>
> When objects smaller than a page are stored on a page but not correlated
> in their usage, they need to be aged individually not as a page, and
> then garbage collected as needed.

I had the identical thought - i.e., that this is a job for object aging and
not lru, then I realized that a slight modification to lru can do the job,
that is:

- An access to any object on the page promotes the page to the hot end
of the lru list.

- When it's time to recover a page (or pages) scan from the cold end
towards the hot end, and recover the first page(s) on which all
objects are free.

> Neither the current model nor your
> proposed scheme solve the fundamental problem Josh's measurements prove
> exists.

My suggestion might.

--
Daniel

2002-01-28 23:52:17

by Jeff Epler

[permalink] [raw]
Subject: [OT] Re: Note describing poor dcache utilization under high memory pressure

On Mon, Jan 28, 2002 at 03:06:24PM -0800, Rick Stevens wrote:
> Uh, I never said IBM ;-) I said "a three-letter-acronym" company.
> There were several. The one I dealt with was in Massachusetts, had
> a real penchant for three-letter acronyms and used a programming
> dialect which was the only single word oxymoron in the English
> language (enough hints yet?).

Nope, I haven't got it yet. But, a note on "single-word oxymorons", from
http://www.wordways.com/oxymoron.htm:
Appropriately, the word oxymoron is itself oxymoronic because
it is formed from two Greek roots of opposite meaning, oxys
"sharp, keen," and moros "foolish," the same root that gives us
the word moron . Noting that oxymoron is a single-word oxymoron
consisting of two morphemes that are dependent in English, the
intrepid linguist senses a rich opportunity to impose order on
seeming chaos, to extract significance from the swirl of data
that escape through the holes in people's faces, leak from their
pens, and glow on their computer screens.

With books such as Warren S. Blumenfeld's Jumbo Shrimp and Pretty
Ugly (Perigee, 1986, 1989) selling so well, oxymora (my preferred
plural form) were a hot language item in the 1980s. Now that we
can recollect that decade with some tranquility, it is time to
attempt a taxonomy of the collected oxymoronic specimens and to
set the aborning discipline of oxymoronology in some order.

Single-word oxymora composed of dependent morphemes The more in
oxymoron also gives us the more in sophomore, a "wise fool"--and
there are indeed many sophomoric sophomores. Other, examples:
pianoforte ("soft-loud"), preposterous ("before-after"), and
superette ("big-small").

Jeff

2002-01-29 00:17:17

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Daniel Phillips wrote:

>On January 28, 2002 08:25 pm, Hans Reiser wrote:
>
>>If I understand you right, your scheme has the fundamental flaw that one
>>dcache entry on a page can keep an entire page full of "slackers" in
>>memory, and since there is little correlation in usage between dcache
>>entries that happen to get stored on a page, the result is that the
>>effectiveness per megabyte of the dcache is decreased by an order of
>>magnitude. It would be worse to have one dcache entry per page, but
>>maybe not by as much as you might expect.
>>
>>When objects smaller than a page are stored on a page but not correlated
>>in their usage, they need to be aged individually not as a page, and
>>then garbage collected as needed.
>>
>
>I had the identical thought - i.e., that this is a job for object aging and
>not lru, then I realized that a slight modification to lru can do the job,
>that is:
>
> - An access to any object on the page promotes the page to the hot end
> of the lru list.
>
> - When it's time to recover a page (or pages) scan from the cold end
> towards the hot end, and recover the first page(s) on which all
> objects are free.
>
>>Neither the current model nor your
>>proposed scheme solve the fundamental problem Josh's measurements prove
>>exists.
>>
>
>My suggestion might.
>
This fails to recover an object (e.g. dcache entry) which is used once,
and then spends a year in cache on the same page as an object which is
hot all the time. This means that the hot set of objects becomes
diffused over an order of magnitude more pages than if garbage
collection squeezes them all together. That makes for very poor caching.

Hans

2002-01-29 00:31:11

by Alexander Viro

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure



On Tue, 29 Jan 2002, Hans Reiser wrote:

> This fails to recover an object (e.g. dcache entry) which is used once,
> and then spends a year in cache on the same page as an object which is
> hot all the time. This means that the hot set of objects becomes
> diffused over an order of magnitude more pages than if garbage
> collection squeezes them all together. That makes for very poor caching.

Any GC that is going to move active dentries around is out of question.
It would need a locking of such strength that you would be the first
to cry bloody murder - about 5 seconds after you look at the scalability
benchmarks.

2002-01-29 00:47:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 01:16 am, Hans Reiser wrote:
> Daniel Phillips wrote:
>
> >On January 28, 2002 08:25 pm, Hans Reiser wrote:
> >
> >>If I understand you right, your scheme has the fundamental flaw that one
> >>dcache entry on a page can keep an entire page full of "slackers" in
> >>memory, and since there is little correlation in usage between dcache
> >>entries that happen to get stored on a page, the result is that the
> >>effectiveness per megabyte of the dcache is decreased by an order of
> >>magnitude. It would be worse to have one dcache entry per page, but
> >>maybe not by as much as you might expect.
> >>
> >>When objects smaller than a page are stored on a page but not correlated
> >>in their usage, they need to be aged individually not as a page, and
> >>then garbage collected as needed.
> >>
> >
> >I had the identical thought - i.e., that this is a job for object aging and
> >not lru, then I realized that a slight modification to lru can do the job,
> >that is:
> >
> > - An access to any object on the page promotes the page to the hot end
> > of the lru list.
> >
> > - When it's time to recover a page (or pages) scan from the cold end
> > towards the hot end, and recover the first page(s) on which all
> > objects are free.
> >
> >>Neither the current model nor your
> >>proposed scheme solve the fundamental problem Josh's measurements prove
> >>exists.
> >>
> >
> >My suggestion might.
> >
> This fails to recover an object (e.g. dcache entry) which is used once,
> and then spends a year in cache on the same page as an object which is
> hot all the time.

You don't worry about that case. If there's so much pressure that you
scan all the way to the hot end of the lru list then you will recover
that hot/cold page[1] and all will be well. Not that the hot/cold page
will tend to migrate further away from the hot end of the list than a
hot/hot page.

[1] Assuming all the objects are freeable, otherwise you have a very
different problem: fragmentation.

> This means that the hot set of objects becomes
> diffused over an order of magnitude more pages than if garbage
> collection squeezes them all together.

I don't see that. How does garbage collection squeeze things together?
I'm assuming these objects are pinned, at least from the VM's point of
view.

--
Daniel

2002-01-29 01:29:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 01:51 am, Daniel Phillips wrote:
> You don't worry about that case. If there's so much pressure that you
> scan all the way to the hot end of the lru list then you will recover
> that hot/cold page[1] and all will be well. Not that the hot/cold page
"Note" ^^^
> will tend to migrate further away from the hot end of the list than a
> hot/hot page.

--
Daniel

2002-01-29 01:30:54

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002, Daniel Phillips wrote:

> copy_page_range
> Intead of copying the page tables, just increment their use counts
>
> zap_page_range:
> If a page table is shared according to its use count, just decrement
> the use count and otherwise leave it alone.
>
> handle_mm_fault:
> If a page table is shared according to its use count and the faulting
> instruction is a write, allocate a new page table and do the work that
> would have normally been done by copy_page_range at fork time.
> Decrement the use count of the (perhaps formerly) shared page table.

Somewhere in here, the pages have got to all be marked read-only or
something. If they're not, then either parent or child writing to
non-faulting addresses will be writing to shared memory.

I think something more is needed, such as creating a minimal page table
for the child process with read-only mappings to the current %EIP and %EBP
pages in it. This gets us past the fork/exec hurdle. Without the exec, we
copy over chunks when they're accessed as above in handle_mm_fault. But
you can't actually _share_ the page tables without marking the pages
themselves readonly.


--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 01:38:34

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002 19:29:49 CST, Oliver Xymoron said:

> I think something more is needed, such as creating a minimal page table
> for the child process with read-only mappings to the current %EIP and %EBP
> pages in it. This gets us past the fork/exec hurdle. Without the exec, we
> copy over chunks when they're accessed as above in handle_mm_fault. But
> you can't actually _share_ the page tables without marking the pages
> themselves readonly.

Actually, you can. But everybody agrees that vfork() was blecherous.
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech


Attachments:
(No filename) (226.00 B)

2002-01-29 01:42:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 02:29 am, Oliver Xymoron wrote:
> On Mon, 28 Jan 2002, Daniel Phillips wrote:
>
> > copy_page_range
> > Intead of copying the page tables, just increment their use counts
> >
> > zap_page_range:
> > If a page table is shared according to its use count, just decrement
> > the use count and otherwise leave it alone.
> >
> > handle_mm_fault:
> > If a page table is shared according to its use count and the faulting
> > instruction is a write, allocate a new page table and do the work that
> > would have normally been done by copy_page_range at fork time.
> > Decrement the use count of the (perhaps formerly) shared page table.
>
> Somewhere in here, the pages have got to all be marked read-only or
> something.

Yes, that's an essential detail I omitted: when a page table's use count
transitions from 1 to 2, mark all the CoW pages on the page table RO.

> If they're not, then either parent or child writing to
> non-faulting addresses will be writing to shared memory.

Yes, and after all, the whole point is to generalize CoW of pages to include
instantiation of page tables.

> I think something more is needed, such as creating a minimal page table
> for the child process with read-only mappings to the current %EIP and %EBP
> pages in it. This gets us past the fork/exec hurdle. Without the exec, we
> copy over chunks when they're accessed as above in handle_mm_fault. But
> you can't actually _share_ the page tables without marking the pages
> themselves readonly.

Oh yes, it's what I intended, thanks. Um, and I think you just told me what
one of my bugs is.

--
Daniel

2002-01-29 02:35:20

by IPmonger

[permalink] [raw]
Subject: Re: [OT] Re: Note describing poor dcache utilization under high memory pressure

[email protected] writes:

> On Mon, Jan 28, 2002 at 03:06:24PM -0800, Rick Stevens wrote:
>> Uh, I never said IBM ;-) I said "a three-letter-acronym"
>> company. There were several. The one I dealt with was in
>> Massachusetts, had a real penchant for three-letter acronyms and
>> used a programming dialect which was the only single word oxymoron
>> in the English language (enough hints yet?).

I'm guessing DEC, but I must admit that the oxymoronic (scripting?)
language escapes me...

-IPmonger
--
------------------
IPmonger
[email protected]
CCIE #8338

2002-01-29 08:38:19

by Momchil Velikov

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

>>>>> "Oliver" == Oliver Xymoron <[email protected]> writes:
Oliver> you can't actually _share_ the page tables without marking the pages
Oliver> themselves readonly.

Of course, ptes are made COW, just like now. Which brings up the
question how much speedup we'll gain with a code that touches every
single pte anyway ?

2002-01-29 08:51:31

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 09:39 am, Momchil Velikov wrote:
> >>>>> "Oliver" == Oliver Xymoron <[email protected]> writes:
> Oliver> you can't actually _share_ the page tables without marking the pages
> Oliver> themselves readonly.
>
> Of course, ptes are made COW, just like now. Which brings up the
> question how much speedup we'll gain with a code that touches every
> single pte anyway ?

It's only touching the ptes on tables that are actually used, so if a parent
with a massive amount of mapped memory forks a child that only instantiates
a small portion of it (common situation) then the saving is pretty big.

Note that I'm not counting on this to be a huge performance win, except in
the specific case that that is bothering rmap. This is already worth the
price of admission.

This code might also be helpful if we want to get into swapping page tables
at some point, and it can also be pushed in the direction of sharing page
tables for mmaps. There it could be quite helpful, for example, with glibc.
(Thanks to Suparna for that observation.)

--
Daniel

2002-01-29 09:19:00

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> It's only touching the ptes on tables that are actually used, so if a parent
> with a massive amount of mapped memory forks a child that only instantiates
> a small portion of it (common situation) then the saving is pretty big.

Please correct my attempt at clarifying this:
The COW markings are done at the next higher level of hierarchy above
the pte's themselves, and so experience the radix tree branch factor
reduction in the amount of work done at fork-time in comparison to a
full pagetable copy on fork.

On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> Note that I'm not counting on this to be a huge performance win, except in
> the specific case that that is bothering rmap. This is already worth the
> price of admission.

It is an overall throughput loss in the cases where the majority of the
page table entries are in fact referenced by the child, and this is
more than acceptable because it is more incremental, reference-all is
an uncommon case, and once all the page table entries are referenced,
there are no longer any penalties. Defeating this scheme would truly
require a contrived application, and penalizes only that application.

Cheers,
Bill

2002-01-29 09:52:15

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 10:20 am, William Lee Irwin III wrote:
> On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> > It's only touching the ptes on tables that are actually used, so if a parent
> > with a massive amount of mapped memory forks a child that only instantiates
> > a small portion of it (common situation) then the saving is pretty big.
>
> Please correct my attempt at clarifying this:

Sorry, it's my fault, my explanation above is ambiguous, or even incorrect.

> The COW markings are done at the next higher level of hierarchy above
> the pte's themselves, and so experience the radix tree branch factor
> reduction in the amount of work done at fork-time in comparison to a
> full pagetable copy on fork.

The CoW markings are done at the same level they always have been - directly
in the ptes (the 4 byte thingies, please don't get confused by the unfortunate
overloading of 'pte' to mean 'page table' in some contexts, e.g.,
zap_pte_range). But the CoW marking only has to be done for page tables
that have use count == 1 at the time of fork. So if the parent inherited
some page tables then these already have use count > 1, i.e., are shared,
and they don't have to be set up for CoW again when the child forks. Only
page tables that the parent instantiated for itself have to be re-marked
for CoW.

Wow, this is getting subtle, isn't it?

> On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> > Note that I'm not counting on this to be a huge performance win, except in
> > the specific case that that is bothering rmap. This is already worth the
> > price of admission.
>
> It is an overall throughput loss in the cases where the majority of the
> page table entries are in fact referenced by the child, and this is
> more than acceptable because it is more incremental, reference-all is
> an uncommon case, and once all the page table entries are referenced,
> there are no longer any penalties. Defeating this scheme would truly
> require a contrived application, and penalizes only that application.

True. Also, since the child is doing all that work anyway, the cost of
instantiating one page table (and extending the pte_chains) per 1K
referenced pages will get lost in the noise.

--
Daniel

2002-01-29 10:14:15

by Momchil Velikov

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

>>>>> "Daniel" == Daniel Phillips <[email protected]> writes:

Daniel> On January 29, 2002 09:39 am, Momchil Velikov wrote:
>> >>>>> "Oliver" == Oliver Xymoron <[email protected]> writes:
Oliver> you can't actually _share_ the page tables without marking the pages
Oliver> themselves readonly.
>>
>> Of course, ptes are made COW, just like now. Which brings up the
>> question how much speedup we'll gain with a code that touches every
>> single pte anyway ?

Daniel> It's only touching the ptes on tables that are actually used, so if a parent
Daniel> with a massive amount of mapped memory forks a child that only instantiates
Daniel> a small portion of it (common situation) then the saving is pretty big.

Umm, all the ptes af the parent ought to be made COW, no ?

2002-01-29 10:17:35

by Momchil Velikov

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

>>>>> "William" == William Lee Irwin <[email protected]> writes:

William> On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
>> It's only touching the ptes on tables that are actually used, so if a parent
>> with a massive amount of mapped memory forks a child that only instantiates
>> a small portion of it (common situation) then the saving is pretty big.

William> Please correct my attempt at clarifying this:
William> The COW markings are done at the next higher level of hierarchy above
William> the pte's themselves, and so experience the radix tree branch factor
William> reduction in the amount of work done at fork-time in comparison to a
William> full pagetable copy on fork.

COW at pgd/pmd level is ia32-ism, unlike COW at pte level.

Regards,
-velco

PS. Well, the whole pgd/pmd/ptb stuff is ia32-ism, but that's another
story.

2002-01-29 10:23:55

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 10:20 am, Momchil Velikov wrote:
> >>>>> "Daniel" == Daniel Phillips <[email protected]> writes:
>
> Daniel> On January 29, 2002 09:39 am, Momchil Velikov wrote:
> >> >>>>> "Oliver" == Oliver Xymoron <[email protected]> writes:
> Oliver> you can't actually _share_ the page tables without marking the pages
> Oliver> themselves readonly.
> >>
> >> Of course, ptes are made COW, just like now. Which brings up the
> >> question how much speedup we'll gain with a code that touches every
> >> single pte anyway ?
>
> Daniel> It's only touching the ptes on tables that are actually used, so if a parent
> Daniel> with a massive amount of mapped memory forks a child that only instantiates
> Daniel> a small portion of it (common situation) then the saving is pretty big.
>
> Umm, all the ptes af the parent ought to be made COW, no ?

My explanation above is borked, please see my reply to wli. In short, each
page table of the parent is already either shared - in which case the CoW
ptes are already marked RO - or it was instantiated by the parent, in which
case the work of marking its pte's RO is insignificant, and is certainly not
more than the cost incurred by the current method.

--
Daniel

2002-01-29 10:47:06

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Alexander Viro wrote:

>
>On Tue, 29 Jan 2002, Hans Reiser wrote:
>
>>This fails to recover an object (e.g. dcache entry) which is used once,
>>and then spends a year in cache on the same page as an object which is
>>hot all the time. This means that the hot set of objects becomes
>>diffused over an order of magnitude more pages than if garbage
>>collection squeezes them all together. That makes for very poor caching.
>>
>
>Any GC that is going to move active dentries around is out of question.
>It would need a locking of such strength that you would be the first
>to cry bloody murder - about 5 seconds after you look at the scalability
>benchmarks.
>
>

I don't mean to suggest that the dentry cache locking is an easy problem
to solve, but the problem discussed is a real one, and it is sufficient
to illustrate that the unified cache is fundamentally flawed as an
algorithm compared to using subcache plugins.

Hans

2002-01-29 11:02:00

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Mon, 28 Jan 2002, Oliver Xymoron wrote:

> Somewhere in here, the pages have got to all be marked read-only or
> something. If they're not, then either parent or child writing to
> non-faulting addresses will be writing to shared memory.

Either that, or we don't populate the page tables of the
parent and the child at all and have the page tables
filled in at fault time.

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 11:24:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 11:59 am, Rik van Riel wrote:
> On Mon, 28 Jan 2002, Oliver Xymoron wrote:
>
> > Somewhere in here, the pages have got to all be marked read-only or
> > something. If they're not, then either parent or child writing to
> > non-faulting addresses will be writing to shared memory.
>
> Either that, or we don't populate the page tables of the
> parent and the child at all and have the page tables
> filled in at fault time.

Yes, you could go that route but you'd have to do some weird and wonderful
bookkeeping to figure out how to populate those page tables.

--
Daniel

2002-01-29 11:39:27

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Daniel Phillips wrote:

> > Either that, or we don't populate the page tables of the
> > parent and the child at all and have the page tables
> > filled in at fault time.
>
> Yes, you could go that route but you'd have to do some weird and wonderful
> bookkeeping to figure out how to populate those page tables.

Not really, if the page table isn't present you just check whether
you need to allocate a new one or whether you need to instantiate
one.

That can all be done from within pte_alloc, which is always called
by handle_mm_fault()...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 11:57:57

by Helge Hafting

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Momchil Velikov wrote:
>
> >>>>> "Daniel" == Daniel Phillips <[email protected]> writes:
[...]
> Daniel> It's only touching the ptes on tables that are actually used, so if a parent
> Daniel> with a massive amount of mapped memory forks a child that only instantiates
> Daniel> a small portion of it (common situation) then the saving is pretty big.
>
> Umm, all the ptes af the parent ought to be made COW, no ?

Sure. But quite a few of them may be COW already, if the parent
itself is a result of some earlier fork.

Helge Hafting

2002-01-29 12:06:22

by Karl & Betty Schendel

[permalink] [raw]
Subject: Re: [OT] Re: Note describing poor dcache utilization under high memory pressure

At 9:30 PM -0500 1/28/02, IPmonger wrote:
>[email protected] writes:
>
>> On Mon, Jan 28, 2002 at 03:06:24PM -0800, Rick Stevens wrote:
>>> Uh, I never said IBM ;-) I said "a three-letter-acronym"
>>> company. There were several. The one I dealt with was in
>>> Massachusetts, had a real penchant for three-letter acronyms and
>>> used a programming dialect which was the only single word oxymoron
>>> in the English language (enough hints yet?).
>
> I'm guessing DEC, but I must admit that the oxymoronic (scripting?)
> language escapes me...
>

Bliss, surely?

Which actually wasn't too bad once you got used to putting the
goddamed dots in front of all your variable (contents) accesses.

--
Karl R. Schendel, Jr. [email protected]
K/B Computer Associates http://www.kbcomputer.com
Ingres, Unix, VMS Consulting and Training

2002-01-29 12:06:22

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 12:38 pm, Rik van Riel wrote:
> On Tue, 29 Jan 2002, Daniel Phillips wrote:
>
> > > Either that, or we don't populate the page tables of the
> > > parent and the child at all and have the page tables
> > > filled in at fault time.
> >
> > Yes, you could go that route but you'd have to do some weird and wonderful
> > bookkeeping to figure out how to populate those page tables.
>
> Not really, if the page table isn't present you just check whether
> you need to allocate a new one or whether you need to instantiate
> one.

Since you didn't store it in the parent and you didn't store it in the child,
how are you going to find it? This is my point about the weird and wonderful
bookkeeping, which I managed to avoid entirely.

> That can all be done from within pte_alloc, which is always called
> by handle_mm_fault()...

--
Daniel

2002-01-29 12:30:47

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 12:54 pm, Helge Hafting wrote:
> Momchil Velikov wrote:
> >
> > >>>>> "Daniel" == Daniel Phillips <[email protected]> writes:
> [...]
> > Daniel> It's only touching the ptes on tables that are actually used, so if a parent
> > Daniel> with a massive amount of mapped memory forks a child that only instantiates
> > Daniel> a small portion of it (common situation) then the saving is pretty big.
> >
> > Umm, all the ptes af the parent ought to be made COW, no ?
>
> Sure. But quite a few of them may be COW already, if the parent
> itself is a result of some earlier fork.

Right, or if the parent has already forked at least one child.

--
Daniel

2002-01-29 14:52:52

by Chris Mason

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure



On Tuesday, January 29, 2002 01:46:43 PM +0300 Hans Reiser <[email protected]> wrote:

> Alexander Viro wrote:
>
>>
>> On Tue, 29 Jan 2002, Hans Reiser wrote:
>>
>>> This fails to recover an object (e.g. dcache entry) which is used once,
>>> and then spends a year in cache on the same page as an object which is
>>> hot all the time. This means that the hot set of objects becomes
>>> diffused over an order of magnitude more pages than if garbage
>>> collection squeezes them all together. That makes for very poor caching.
>>>
>>
>> Any GC that is going to move active dentries around is out of question.
>> It would need a locking of such strength that you would be the first
>> to cry bloody murder - about 5 seconds after you look at the scalability
>> benchmarks.
>>
>>
>
> I don't mean to suggest that the dentry cache locking is an easy problem to solve, but the problem discussed is a real one, and it is sufficient to illustrate that the unified cache is fundamentally flawed as an algorithm compared to using subcache plugins.

It isn't just dentries. If a subcache object is in use, it can't be moved
to a warmer page without invalidating all existing pointers to it.

If it isn't in use, it can be migrated when the VM asks for the page to
be flushed.

-chris

2002-01-29 16:57:54

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Rik van Riel wrote:

> On Mon, 28 Jan 2002, Oliver Xymoron wrote:
>
> > Somewhere in here, the pages have got to all be marked read-only or
> > something. If they're not, then either parent or child writing to
> > non-faulting addresses will be writing to shared memory.
>
> Either that, or we don't populate the page tables of the
> parent and the child at all and have the page tables
> filled in at fault time.

That's very nearly what I proposed in the second half of my message (with
the exception that we ought to pre-fault the current stack and code page
tables as we're sure to need these immediately).

Daniel's approach seems to be workable (once he's spelled out all the
details) but it misses the big performance win for fork/exec, which is
surely the common case. Given that exec will be throwing away all these
mappings, we can safely assume that we will not be inheriting many shared
mappings from parents of parents so Daniel's approach also still ends up
marking most of the pages RO still.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 17:26:06

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Oliver Xymoron wrote:

> Daniel's approach seems to be workable (once he's spelled out all the
> details) but it misses the big performance win for fork/exec, which is
> surely the common case. Given that exec will be throwing away all these
> mappings, we can safely assume that we will not be inheriting many shared
> mappings from parents of parents so Daniel's approach also still ends up
> marking most of the pages RO still.

It gets worse. His approach also needs to adjust the reference
counts on all pages (and swap pages).

kind regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 17:27:37

by Josh MacDonald

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Quoting Linus Torvalds ([email protected]):
>
> On Mon, 28 Jan 2002, Josh MacDonald wrote:
> >
> > So, it would seem that the dcache and kmem_slab_cache memory allocator
> > could benefit from a way to shrink the dcache in a less random way.
> > Any thoughts?
>
> The way I want to solve this problem generically is to basically get rid
> of the special-purpose memory shrinkers, and have everything done with one
> unified interface, namely the physical-page-based "writeout()" routine. We
> do that for the page cache, and there's nothing that says that we couldn't
> do the same for all other caches, including very much the slab allocator.
>
> Thus any slab user that wants to, could just register their own per-page
> memory pressure logic. The dcache "reference" bit would go away, to be
> replaced by a per-page reference bit (that part could be done already, of
> course, and might help a bit on its own).
>
> Basically, the more different "pools" of memory we have, the harder it
> gets to balance them. Clearly, the optimal number of pools from a
> balancing standpoint is just a single, direct physical pool.
>
> Right now we have several pools - we have the pure physical LRU, we have
> the virtual mapping (where scanning is directly tied to the physical LRU,
> but where the separate pool still _does_ pose some problems), and we have
> separate balancing for inodes, dentries and quota. And there's no question
> that it hurts us under memory pressure.
>
> (There's a related question, which is whether other caches might also
> benefit from being able to grow more - right now there are some caches
> that are of a limited size partly because they have no good way of
> shrinking back on demand).

Using a physical-page-based "writeout()" routine seems like a nice way
to unify the application of memory pressure to various caches, but it
does not address the issue of fragmentation within a cache slab. You
could have a situation in which a number of hot dcache entries are
occupying some number of pages, such that dcache pages are always more
recently used than other pages in the system. Would the VM ever tell
the dcache to writeout() in that case?

It seems that the current special-purpose memory "shrinkers" approach
has some advantages in this regard: when memory pressure is applied
every cache attempts to free some resources. Do you envision the
unified interface approach applying pressure to pages of every kind of
cache under memory pressure?

Even so, the physical-page writeout() approach results in a less
effective cache under memory pressure. Suppose the VM chooses some
number of least-recently-used physical pages belonging to the dcache
and tells the slab allocator to release those pages. Assume that the
dcache entries are not currently in use and that the dcache is in fact
able to release them. Some of the dcache entries being tossed from
memory could instead replace less-recently-used objects on more
recently-used physical pages. In other words, the dcache would
benefit from relocating its more frequently used entries onto the same
physical pages under memory pressure.

Unless the cache ejects entries based on the object access and not
physical page access, the situation will never improve. Pages with
hot dcache entries will never clean-out the inactive entries on the
same page. For this reason, I don't think it makes sense to eliminate
the object-based aging of cache entries altogether.

Perhaps a combination of the approaches would work best. When the VM
system begins forcing the dcache to writeout(), the dcache could both
release some of its pages by ejecting all the entries (as above) and
in addition it could run something like prune_dcache(), thus creating
free space in the hotter set of physical pages so that over a period
of prolonged memory pressure, the hotter dcache entries would
eventually become located on the same pages.

A solution that relocates dcache entries to reduce total page
consumption, however, makes the most effective use of cache space.

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist

2002-01-29 17:29:27

by Josh MacDonald

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Quoting Alexander Viro ([email protected]):
>
>
> On Tue, 29 Jan 2002, Hans Reiser wrote:
>
> > This fails to recover an object (e.g. dcache entry) which is used once,
> > and then spends a year in cache on the same page as an object which is
> > hot all the time. This means that the hot set of objects becomes
> > diffused over an order of magnitude more pages than if garbage
> > collection squeezes them all together. That makes for very poor caching.
>
> Any GC that is going to move active dentries around is out of question.
> It would need a locking of such strength that you would be the first
> to cry bloody murder - about 5 seconds after you look at the scalability
> benchmarks.

We're not talking about actively referenced entries, we're talking about
entries on the d_lru list with zero references. Relocating those objects
should not require any more locking than currently required to remove and
re-insert the dcache entry. Right?

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist

2002-01-29 18:46:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On Jan 29, 2002 09:28 -0800, Josh MacDonald wrote:
> Quoting Alexander Viro ([email protected]):
> > On Tue, 29 Jan 2002, Hans Reiser wrote:
> > > This fails to recover an object (e.g. dcache entry) which is used once,
> > > and then spends a year in cache on the same page as an object which is
> > > hot all the time. This means that the hot set of objects becomes
> > > diffused over an order of magnitude more pages than if garbage
> > > collection squeezes them all together. That makes for very poor caching.
> >
> > Any GC that is going to move active dentries around is out of question.
> > It would need a locking of such strength that you would be the first
> > to cry bloody murder - about 5 seconds after you look at the scalability
> > benchmarks.
>
> We're not talking about actively referenced entries, we're talking about
> entries on the d_lru list with zero references. Relocating those objects
> should not require any more locking than currently required to remove and
> re-insert the dcache entry. Right?

But if it is unused and not recently referenced, there is little benefit
in keeping it around, is there? I suppose there might be some benefit in
moving hot negative dentries to be on the same page as other hot dentries,
and that should be possible. Negative dentries are a special case, though,
in that they are not actually in use, but their presence is a performance
improvement (i.e. not having to look up /usr/bin/ls, /usr/sbin/ls, etc).

Being able to move active entries between slabs is just too hard in
most cases, and the overhead (locking, back references, etc) in doing so
would probably outweigh the benefits of more tightly packed slab caches.
This could be up to the per-page (or per-slab) "try_to_free_page"
callback function to handle though.


I do agree with the assertion that you shouldn't judge the utility of
all objects on the page in which it lives by just the page use, otherwise
you will never free any other entries on a hot page. There should still
be per-item reference bits that give hints on which entries in the slab
should be freed.

For example we can use the following simple algorithm to free pages/entries:
1) Walk slab pages (maybe page LRU order), and if all entries in the page
can be freed, free them and then remove the page from the slab.
This gives the VM a free page to work with, and any entries that
are later reloaded will go to another page.

As a special case for dcache (which would be handled in the dcache
try_to_free_page callback) we could potentially move referenced
negative dentries to another free entry in that slab page in order to
free a page but not take the hit to go out to disk again to recreate
the negative dentry.

2) For slab pages with in-use entries, free some fraction of entries based
on whether the entries have been referenced. There is no point in
freeing all but one of the entries on a page if we can't free the page
in the end, and having a 10% cache page utilization does nobody any good.
Freeing some of the entries gives us free space to add new entries, but
does not discard too many entries when we can't free the pages anyways.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-01-29 19:55:09

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

"William" == William Lee Irwin <[email protected]> writes:
William> Please correct my attempt at clarifying this:
William> The COW markings are done at the next higher level of hierarchy above
William> the pte's themselves, and so experience the radix tree branch factor
William> reduction in the amount of work done at fork-time in comparison to a
William> full pagetable copy on fork.

On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> COW at pgd/pmd level is ia32-ism, unlike COW at pte level.

Pain! Well, at least the pte markings dodge the page_add_rmap() bullet.

On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> PS. Well, the whole pgd/pmd/ptb stuff is ia32-ism, but that's another
> story.

Perhaps something can be done about that.


Cheers,
Bill

2002-01-29 20:02:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Andreas Dilger wrote:
>
> But if it is unused and not recently referenced, there is little benefit
> in keeping it around, is there?

In all of this, please remember that all caches are not of
equal value-per-byte. A single page contains 32 dentries,
and can thus save up to 32 disk seeks. It's potentially
a *lot* more valuable than a (single-seek) pagecache page.

Just a thought...

-

2002-01-29 20:10:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Tue, 29 Jan 2002, William Lee Irwin III wrote:
>
> On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> > COW at pgd/pmd level is ia32-ism, unlike COW at pte level.
>
> Pain! Well, at least the pte markings dodge the page_add_rmap() bullet.

You can portably mark then non-present, at some higher cost. That choice
would have to be left to the architecture configuration (the same way the
dirty and accessed bits are left to the architecture - some hardware
supports them directly, others have to trap and emulate them)

So if would not be out of the question to have a

pmd_mkrdonly()

function that would (on architectures that do not support it) just remove
the present bit instead of removing the WR bit.

> On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> > PS. Well, the whole pgd/pmd/ptb stuff is ia32-ism, but that's another
> > story.
>
> Perhaps something can be done about that.

The pgd/pmd/pte thing is _not_ a ia32-ism.

A tree-based data structure is very much a generic portable construct, and
is, in fact, the only sane way to maintain VM mappings.

The fact that Linux tries very hard to allow the portable data structure
to be _represented_ using the same data structures that the hardware uses
for the TLB lookup is a separate issue, and mainly helps performance and
TLB coherency. And even then it's not a ia32-ism: every single sane
architecture out there is either soft-fill TLB (in which case a tree is
fine), or already is a tree (ia32, x86-64, alpha, m68k etc).

The others are just stupid aberrations (ie the ppc hash-tables), and can
portably be considered nothing but in-memory TLB's.

However, in order to allow other architectures to share their hardware
page tables with the Linux software VM mappings, we _should_ take their
page walkers into account. It would be a pity if the Linux page table
abstraction was too removed from some machines reality.

Linus

2002-01-29 20:38:23

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
>>> PS. Well, the whole pgd/pmd/ptb stuff is ia32-ism, but that's another
>>> story.

On Tue, 29 Jan 2002, William Lee Irwin III wrote:
>> Perhaps something can be done about that.

On Tue, Jan 29, 2002 at 12:08:20PM -0800, Linus Torvalds wrote:
> The pgd/pmd/pte thing is _not_ a ia32-ism.
> A tree-based data structure is very much a generic portable construct, and
> is, in fact, the only sane way to maintain VM mappings.

In my mind it's not about the form but about how much of it is exposed.
For instance, exposing the number of levels seems to require emulating
an extra level for machines with 2-level pagetables. There is also a
(perhaps minor) issue with this exposure creating some extra codework
in mm/*.c -- here most of what's needed is largely a for_each_pte()
that can pick a range to walk over.

On Tue, Jan 29, 2002 at 12:08:20PM -0800, Linus Torvalds wrote:
> The fact that Linux tries very hard to allow the portable data structure
> to be _represented_ using the same data structures that the hardware uses
> for the TLB lookup is a separate issue, and mainly helps performance and
> TLB coherency. And even then it's not a ia32-ism: every single sane
> architecture out there is either soft-fill TLB (in which case a tree is
> fine), or already is a tree (ia32, x86-64, alpha, m68k etc).

It's quite a happy coincidence when this happens, and in my mind making
it happen more often would be quite nice.

On Tue, Jan 29, 2002 at 12:08:20PM -0800, Linus Torvalds wrote:
> The others are just stupid aberrations (ie the ppc hash-tables), and can
> portably be considered nothing but in-memory TLB's.

Ouch! Well, you have spoken.

On Tue, Jan 29, 2002 at 12:08:20PM -0800, Linus Torvalds wrote:
> However, in order to allow other architectures to share their hardware
> page tables with the Linux software VM mappings, we _should_ take their
> page walkers into account. It would be a pity if the Linux page table
> abstraction was too removed from some machines reality.

This is definitely one of the directions in which I intended to move.
If I can't bring the software pagetables closer to the hardware ones,
then it's not worthwhile.


Thanks,
Bill

2002-01-29 20:45:13

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 06:25 pm, Rik van Riel wrote:
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> > Daniel's approach seems to be workable (once he's spelled out all the
> > details) but it misses the big performance win for fork/exec, which is
> > surely the common case. Given that exec will be throwing away all these
> > mappings, we can safely assume that we will not be inheriting many shared
> > mappings from parents of parents so Daniel's approach also still ends up
> > marking most of the pages RO still.
>
> It gets worse. His approach also needs to adjust the reference
> counts on all pages (and swap pages).

Well, Rik, time to present your algorithm. I assume it won't reference
counts on pages, and will do some kind of traversal of the mm tree. Note
however, that I did investigate the class of algorithm you are interested in,
and found only nasty, complex solutions there, with challenging locking
problems. (I also looked at a number of possible improvements to virtual
scanning, as you know, and likewise only found ugly or inadequate solutions.)

Before you sink a lot of time into it though, you might add up the actual
overhead you're worried about above, and see if it moves the needle in a real
system.

--
Daniel

2002-01-29 20:50:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Tue, 29 Jan 2002, William Lee Irwin III wrote:
>
> In my mind it's not about the form but about how much of it is exposed.
> For instance, exposing the number of levels seems to require emulating
> an extra level for machines with 2-level pagetables.

Well, you have two choices:

- _not_ exposing fundamental details like the number of levels causes
different architectures to have wildly different code (see how UNIX
traditionally does MM, and puke)

- trivial "folding" macros to take 3 levels down to 2 (or four levels
down to 3 or two).

Note that the folding macros really _are_ trivial. The pmd macros for x86
are basically these few lines:

static inline int pgd_none(pgd_t pgd) { return 0; }
static inline int pgd_bad(pgd_t pgd) { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }
#define pgd_clear(xp) do { } while (0)

static inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
{
return (pmd_t *) dir;
}

And that's it.

So I'd much rather have a generic VM and do some trivial folding.

> It's quite a happy coincidence when this happens, and in my mind making
> it happen more often would be quite nice.

I really isn't a co-incidence. The reason so many architectures have page
table trees is that most architects try to make good decisions, and a tree
layout is a simple and efficient data structure that maps well to both
hardware and to usage patterns.

Hashed page tables are incredibly naive, and perform badly for build-up
and tear-down (and mostly have horrible cache access patterns). At least
in some version of the UltraSparc, the Linux tree-based software TLB fill
outperformed the Solaris version, even though the Solaris version was
handtuned assembly and used hardware acceleration for the hash
computations. That should tell you something.

Linus

2002-01-29 21:00:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, Jan 29, 2002 at 12:49:24PM -0800, Linus Torvalds wrote:
> I really isn't a co-incidence. The reason so many architectures have page
> table trees is that most architects try to make good decisions, and a tree
> layout is a simple and efficient data structure that maps well to both
> hardware and to usage patterns.

No debate there.

On Tue, Jan 29, 2002 at 12:49:24PM -0800, Linus Torvalds wrote:
> Hashed page tables are incredibly naive, and perform badly for build-up
> and tear-down (and mostly have horrible cache access patterns). At least
> in some version of the UltraSparc, the Linux tree-based software TLB fill
> outperformed the Solaris version, even though the Solaris version was
> handtuned assembly and used hardware acceleration for the hash
> computations. That should tell you something.

Given this, it appears less useful to play with the representations.
There also appears to be some other material on this subject which
you yourself have written. I'll review that for my own enlightenment,
and regardless, I'll focus on something else.

Thanks again,
Bill

2002-01-29 21:01:26

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Daniel Phillips wrote:

> On January 29, 2002 06:25 pm, Rik van Riel wrote:
> > On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >
> > > Daniel's approach seems to be workable (once he's spelled out all the
> > > details) but it misses the big performance win for fork/exec, which is
> > > surely the common case. Given that exec will be throwing away all these
> > > mappings, we can safely assume that we will not be inheriting many shared
> > > mappings from parents of parents so Daniel's approach also still ends up
> > > marking most of the pages RO still.
> >
> > It gets worse. His approach also needs to adjust the reference
> > counts on all pages (and swap pages).
>
> Well, Rik, time to present your algorithm. I assume it won't reference
> counts on pages, and will do some kind of traversal of the mm tree. Note
> however, that I did investigate the class of algorithm you are interested in,
> and found only nasty, complex solutions there, with challenging locking
> problems. (I also looked at a number of possible improvements to virtual
> scanning, as you know, and likewise only found ugly or inadequate solutions.)

I think it goes something like this:

fork:
detach page tables from parent
retain pointer to "backing page tables" in parent and child
update use count in page tables
"prefault" tables for current stack and instruction pages in both parent
and child

page fault:
if faulted on page table:
look up backing page tables
if use count > 1: copy, dec use count
else: take ownership

> Before you sink a lot of time into it though, you might add up the actual
> overhead you're worried about above, and see if it moves the needle in a real
> system.

I'm pretty sure something like the above does signficantly less work in
the fork/exec case, which is the important one.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 21:09:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> fork:
> detach page tables from parent

- leave the option ot just mark them read-only on architectures that
support it (ie x86, I think alpha does this too).

> retain pointer to "backing page tables" in parent and child
> update use count in page tables

You want to copy the top-level page table directory (with the "present
bit" disabled or something), not just retain a pointer to it. Otherwise
you just get really confused after two fork() calls, where you can have
multiple "backing page tables".

> "prefault" tables for current stack and instruction pages in both parent
> and child

Don't unconditionally prefault. There are many potentially useful things
that do _not_ want to do this, for example snapshot creation.

So the generic "do_fork()" thing should _not_ do prefaulting, although
"sys_fork()" may well choose to do it.

Linus

2002-01-29 21:13:56

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Linus Torvalds wrote:

>
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >
> > fork:
> > detach page tables from parent
>
> - leave the option ot just mark them read-only on architectures that
> support it (ie x86, I think alpha does this too).

I don't think read-only for the tables is sufficient if the pages
themselves are writable.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 21:52:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> I don't think read-only for the tables is sufficient if the pages
> themselves are writable.

At least on x86, the WRITE bit in the page directory entries will override
any bits int he PTE. In other words, it doesn't make the page directory
entries thmselves unwritable - it makes the final pages unwritable.

Which are exactly the semantics we want.

I have this strong feeling (but am lazy enough to not try to find the
documentation) that on alpha the access bits in the upper page tables are
just ignored (ie you have to actually turn off the present bit), which is
a bit sad as it shouldn't matter from a PAL-code standpoint (just two more
"and" instructions to and all the levels access bits together).

Linus

2002-01-29 22:03:13

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Linus Torvalds wrote:

>
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >
> > I don't think read-only for the tables is sufficient if the pages
> > themselves are writable.
>
> At least on x86, the WRITE bit in the page directory entries will override
> any bits int he PTE. In other words, it doesn't make the page directory
> entries thmselves unwritable - it makes the final pages unwritable.
>
> Which are exactly the semantics we want.

Oh. Cool. I knew I must have been missing some detail.

> I have this strong feeling (but am lazy enough to not try to find the
> documentation) that on alpha the access bits in the upper page tables are
> just ignored (ie you have to actually turn off the present bit), which is
> a bit sad as it shouldn't matter from a PAL-code standpoint (just two more
> "and" instructions to and all the levels access bits together).

The "detached mm" approach should be sufficiently parallel to the
read-only page directory entries that the two can use almost the same
framework. The downside is faults on reads in the detached case, but that
shouldn't be significantly worse than the original copy, thanks to the
large fanout.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 22:12:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure


On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> The "detached mm" approach should be sufficiently parallel to the
> read-only page directory entries that the two can use almost the same
> framework.

Yes. I suspect that it can be trivially hidden in just two
architecture-specific functions, ie something like "detach_pgd(pgd)" and
"attach_pgd_entry(mm, address)".

> The downside is faults on reads in the detached case, but that
> shouldn't be significantly worse than the original copy, thanks to the
> large fanout.

Right. We'd get a few "unnecessary" page faults, but they should be on the
order of 0.1% of the necessary ones. In fact, with pre-faulting in
sys_fork(), I wouldn't be surprised if the common case is to not have any
directory-related page faults at all.

Linus

2002-01-29 22:17:33

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Chris Mason wrote:

>>I don't mean to suggest that the dentry cache locking is an easy problem to solve, but the problem discussed is a real one, and it is sufficient to illustrate that the unified cache is fundamentally flawed as an algorithm compared to using subcache plugins.
>>
>
>It isn't just dentries. If a subcache object is in use, it can't be moved
>to a warmer page without invalidating all existing pointers to it.
>
>If it isn't in use, it can be migrated when the VM asks for the page to
>be flushed.
>
garbage collection is a lot of work to implement --- there are a lot of
good reasons why ext2 doesn't shrink directories.....;-)

really guys, you can get me to agree that it is more work to code, you
can even get me to agree to skip it for now because we are all busy, but
the design principle remains valid --- using per page aging of subpage
objects that do not correlate in accesses leads to diffused hot sets,
and that means that the cache will perform as though it was much smaller
than it is.

Unless you do a properly designed (and 2.2 was a wrongly designed)
central pressure distributor that tells subcache plugins what aging
pressure to apply to themselves, and ensures that aging pressure is
evenly distributed in proportion to subcache size, you are going to have
suboptimal performance for those kernel subsystems for which pages are
not the natural unit of cache object insertion and removal.

Josh MacDonald wrote:

>
>Perhaps a combination of the approaches would work best. When the VM
>system begins forcing the dcache to writeout(), the dcache could both
>release some of its pages by ejecting all the entries (as above) and
>in addition it could run something like prune_dcache(), thus creating
>free space in the hotter set of physical pages so that over a period
>of prolonged memory pressure, the hotter dcache entries would
>eventually become located on the same pages.
>
This seems very reasonable to me. It could be implemented in a caching
plugin model as I am asking for.

On the other hand, what Linus wants is simple enough to code for, the
impact on ReiserFS of aging pages not slums is (I think) not nearly so
bad as the impact on the dcache, aging slums would be more code, there
are hardware support advantages to using page units, and given that we
have a tight Sept. 30 deadline for reiser4, we'll do it. Hey, doing
things too well usually leads to missing the market.

We can implement ReiserFS as flushing whole slums even if the slum as a
whole is hot just because a single node in the slum has gotten old, and
the algorithm isn't ideal, but it will still be faster than ReiserFS V3.

So hey, let's make this yet another optimization deferred until V4.1.....

Hans

2002-01-29 22:49:13

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 10:50 pm, Linus Torvalds wrote:
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >
> > I don't think read-only for the tables is sufficient if the pages
> > themselves are writable.
>
> At least on x86, the WRITE bit in the page directory entries will override
> any bits int he PTE. In other words, it doesn't make the page directory
> entries thmselves unwritable - it makes the final pages unwritable.

Ah, didn't know that.

> Which are exactly the semantics we want.

Yes. This feature might be useful to grab back a few cycles at the expense
of some extra complexity. It is not however, a fundamental change to my
algorithm, just a decoration of it. It's also possible that the cost of the
resulting extra fault will wipe out the (small) saving of setting pte entries
RO in the average case.

It's likely there are architectures where this won't work, and just not doing
this optimization is the correct approach.

--
Daniel

2002-01-29 22:50:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 29, 2002 10:00 pm, Oliver Xymoron wrote:
> On Tue, 29 Jan 2002, Daniel Phillips wrote:
>
> > On January 29, 2002 06:25 pm, Rik van Riel wrote:
> > > On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> > >
> > > > Daniel's approach seems to be workable (once he's spelled out all the
> > > > details) but it misses the big performance win for fork/exec, which is
> > > > surely the common case. Given that exec will be throwing away all these
> > > > mappings, we can safely assume that we will not be inheriting many shared
> > > > mappings from parents of parents so Daniel's approach also still ends up
> > > > marking most of the pages RO still.
> > >
> > > It gets worse. His approach also needs to adjust the reference
> > > counts on all pages (and swap pages).
> >
> > Well, Rik, time to present your algorithm. I assume it won't reference
> > counts on pages, and will do some kind of traversal of the mm tree. Note
> > however, that I did investigate the class of algorithm you are interested in,
> > and found only nasty, complex solutions there, with challenging locking
> > problems. (I also looked at a number of possible improvements to virtual
> > scanning, as you know, and likewise only found ugly or inadequate solutions.)
>
> I think it goes something like this:
>
> fork:
> detach page tables from parent
> retain pointer to "backing page tables" in parent and child
> update use count in page tables
> "prefault" tables for current stack and instruction pages in both parent
> and child
>
> page fault:
> if faulted on page table:
> look up backing page tables
> if use count > 1: copy, dec use count
> else: take ownership
>
> > Before you sink a lot of time into it though, you might add up the actual
> > overhead you're worried about above, and see if it moves the needle in a real
> > system.
>
> I'm pretty sure something like the above does signficantly less work in
> the fork/exec case, which is the important one.

With fork/exec, for each page table there are two cases:

- The parent instantiated the page table. In this case the extra work to
set the ptes RO (only for CoW pages) is insignificant.

- The parent is still sharing the page table with its parent and so the
ptes are still set RO.

I don't see how there is a whole lot of fat to cut here.

--
Daniel

2002-01-29 23:05:07

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Daniel Phillips wrote:

> > I think it goes something like this:
> >
> > fork:
> > detach page tables from parent
> > retain pointer to "backing page tables" in parent and child
> > update use count in page tables
> > "prefault" tables for current stack and instruction pages in both parent
> > and child
> >
> > page fault:
> > if faulted on page table:
> > look up backing page tables
> > if use count > 1: copy, dec use count
> > else: take ownership
> >
> > > Before you sink a lot of time into it though, you might add up the actual
> > > overhead you're worried about above, and see if it moves the needle in a real
> > > system.
> >
> > I'm pretty sure something like the above does signficantly less work in
> > the fork/exec case, which is the important one.
>
> With fork/exec, for each page table there are two cases:
>
> - The parent instantiated the page table. In this case the extra work to
> set the ptes RO (only for CoW pages) is insignificant.

Marking the page table entries rather than the page directory entries
read-only is a lot of work on a large process. And it doesn't make a lot
of sense for a large process that wants to fork/exec something tiny. In
fact, I'm slightly worried about the possible growth of the page
directories on really big boxes. Detaching the entire mm is comparatively
cheap and doesn't grow with process size.

> - The parent is still sharing the page table with its parent and so the
> ptes are still set RO.

Fork/exec is far and away the most common case, and the fork/fork case is
rare enough that it's not even worth thinking about.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-29 23:20:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 30, 2002 12:02 am, Oliver Xymoron wrote:
> On Tue, 29 Jan 2002, Daniel Phillips wrote:
> > With fork/exec, for each page table there are two cases:
> >
> > - The parent instantiated the page table. In this case the extra work
> > to set the ptes RO (only for CoW pages) is insignificant.
>
> Marking the page table entries rather than the page directory entries
> read-only is a lot of work on a large process.

I'm still missing your point. When the parent's page table was instantiated
we took a fault. Later, we walk through up to 1024 ptes setting them RO, if
they are not already (which they probably are). Don't you think the cost of
the former dwarves the latter? In fact, if we are worried about this, we can
keep a flag on the page table telling us all the ptes are still set RO so we
don't have to do it again.

> And it doesn't make a lot
> of sense for a large process that wants to fork/exec something tiny.
>
> In
> fact, I'm slightly worried about the possible growth of the page
> directories on really big boxes. Detaching the entire mm is comparatively
> cheap and doesn't grow with process size.
>
> > - The parent is still sharing the page table with its parent and so the
> > ptes are still set RO.
>
> Fork/exec is far and away the most common case, and the fork/fork case is
> rare enough that it's not even worth thinking about.

I'm not sure I agree with this. I matters a lot if that rare case happens to
be the application your using all the time, and then it becomes the common
case.

--
Daniel

2002-01-30 07:12:18

by Oliver Xymoron

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On Wed, 30 Jan 2002, Hans Reiser wrote:

> Chris Mason wrote:
>
> >>I don't mean to suggest that the dentry cache locking is an easy problem to solve, but the problem discussed is a real one, and it is sufficient to illustrate that the unified cache is fundamentally flawed as an algorithm compared to using subcache plugins.
> >>
> >
> >It isn't just dentries. If a subcache object is in use, it can't be moved
> >to a warmer page without invalidating all existing pointers to it.
> >
> >If it isn't in use, it can be migrated when the VM asks for the page to
> >be flushed.
> >
> garbage collection is a lot of work to implement --- there are a lot of
> good reasons why ext2 doesn't shrink directories.....;-)
>
> really guys, you can get me to agree that it is more work to code, you
> can even get me to agree to skip it for now because we are all busy, but
> the design principle remains valid --- using per page aging of subpage
> objects that do not correlate in accesses leads to diffused hot sets,
> and that means that the cache will perform as though it was much smaller
> than it is.

Can we get you to agree that basically all subpage objects are immovable?
And as a consequence that garbage collecting at subpage levels doesn't
guarantee freeing up any pages that can then be given up to other
subsystems in response to VM pressure? The GC must think in terms of pages
to actually make progress.

One of the design goals of slab by the way is that objects of a similar
type will end up having similar lifetimes, avoiding some of the worst
cases of sub-page allocations.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-30 07:18:07

by Oliver Xymoron

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

On Tue, 29 Jan 2002, Andrew Morton wrote:

> Andreas Dilger wrote:
> >
> > But if it is unused and not recently referenced, there is little benefit
> > in keeping it around, is there?
>
> In all of this, please remember that all caches are not of
> equal value-per-byte. A single page contains 32 dentries,
> and can thus save up to 32 disk seeks. It's potentially
> a *lot* more valuable than a (single-seek) pagecache page.

Or it might equally well be 32 contiguous directory entries that you
scanned over to get to the file you wanted. If it's 32 hot items, as a
page it's going to be aged significantly less than one equally hot
pagecache page, so I don't think we need to worry about that too much.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-30 07:40:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poordcache utilization under high memory pressure

Oliver Xymoron wrote:
>
> On Tue, 29 Jan 2002, Andrew Morton wrote:
>
> > Andreas Dilger wrote:
> > >
> > > But if it is unused and not recently referenced, there is little benefit
> > > in keeping it around, is there?
> >
> > In all of this, please remember that all caches are not of
> > equal value-per-byte. A single page contains 32 dentries,
> > and can thus save up to 32 disk seeks. It's potentially
> > a *lot* more valuable than a (single-seek) pagecache page.
>
> Or it might equally well be 32 contiguous directory entries that you
> scanned over to get to the file you wanted.

The `scanned over' entries will be retained in the pagecache,
not in the dentry cache.

> If it's 32 hot items, as a
> page it's going to be aged significantly less than one equally hot
> pagecache page, so I don't think we need to worry about that too much.

Sure they're LRU at present and we could use the referenced bit
in their backing page in the future.

But what we do *not* want to do is to reclaim memory "fairly" from
all caches. The value of a cached page should be measured in terms
of the number of seeks required to repopulate it. That's all.

Andrea's as-yet-unmerged VM patch is much more aggressive about
shrinking the inode and dentry caches. From reading the code, I
have a bad feeling that these caches will shrivel to zilch as soon
as we come under a bit of memory pressure. If so, this could
represent significant imbalance. But this is speculation - I
have not yet tested for this, nor looked at the code closely, and
I probably shan't until it's put up for merging.

-

2002-01-30 07:52:43

by Oliver Xymoron

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poordcache utilization under high memory pressure

On Tue, 29 Jan 2002, Andrew Morton wrote:

> Oliver Xymoron wrote:
> >
> > On Tue, 29 Jan 2002, Andrew Morton wrote:
> >
> > > Andreas Dilger wrote:
> > > >
> > > > But if it is unused and not recently referenced, there is little benefit
> > > > in keeping it around, is there?
> > >
> > > In all of this, please remember that all caches are not of
> > > equal value-per-byte. A single page contains 32 dentries,
> > > and can thus save up to 32 disk seeks. It's potentially
> > > a *lot* more valuable than a (single-seek) pagecache page.
> >
> > Or it might equally well be 32 contiguous directory entries that you
> > scanned over to get to the file you wanted.
>
> The `scanned over' entries will be retained in the pagecache,
> not in the dentry cache.

For most values of 'scanned over', but not all. There are lots of ways of
looking for something that will prime the dcache..

> > If it's 32 hot items, as a
> > page it's going to be aged significantly less than one equally hot
> > pagecache page, so I don't think we need to worry about that too much.
>
> Sure they're LRU at present and we could use the referenced bit
> in their backing page in the future.
>
> But what we do *not* want to do is to reclaim memory "fairly" from
> all caches. The value of a cached page should be measured in terms
> of the number of seeks required to repopulate it. That's all.

..which may be zero in the above case if the dentries are all backed by
dirs in the page cache you mentioned. Or one in the case of dentries found
by visiting neighboring files. It's very hard to guess at what that number
is for any discardable VM object because it's entirely dependent on what
else is in cache and what order the data's accessed.

I'm sure that there is some merit to valuing some caches more highly than
other but there are obviously other issues to sort out before we can begin
weighing them against each other.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-01-30 09:08:15

by Horst von Brand

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

Daniel Phillips <[email protected]> said:
> On January 29, 2002 12:54 pm, Helge Hafting wrote:
> > Momchil Velikov wrote:

[...]

> > > Umm, all the ptes af the parent ought to be made COW, no ?

> > Sure. But quite a few of them may be COW already, if the parent
> > itself is a result of some earlier fork.

> Right, or if the parent has already forked at least one child.

But most of this will be lost on exec(2). Also, it is my impression that
the tree of _running_ processes isn't usually very deep (Say init --> X -->
[Random processes] --> [compilations &c], this would make 5 or 6 deep, no
more. Should take a pstree(1) listing on a busy machine and work out some
statistics... here (a personal worstation) the tree is very fat at the
first level below init(8), and just 5 deep when running pstree(1)). Sure,
all processes will all end up sharing glibc, and the graphical stuff will
share the X &c libraries, so this would end up being a win this way.
--
Horst von Brand http://counter.li.org # 22616

2002-01-30 09:57:56

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-dev] Re: Note describing poor dcache utilization under high memory pressure

Oliver Xymoron wrote:

>
>Can we get you to agree that basically all subpage objects are immovable?
>
No. Certainly not in the general case, and I think Josh found ways to
handle the dcache case. If we can simply free the old objects, we don't
actually have to move the hot ones, as he points out.

>
>And as a consequence that garbage collecting at subpage levels doesn't
>guarantee freeing up any pages that can then be given up to other
>subsystems in response to VM pressure? The GC must think in terms of pages
>to actually make progress.
>
>One of the design goals of slab by the way is that objects of a similar
>type will end up having similar lifetimes, avoiding some of the worst
>cases of sub-page allocations.
>



2002-01-30 10:04:07

by Hans Reiser

[permalink] [raw]
Subject: Re: [reiserfs-list] Re: [reiserfs-dev] Re: Note describing poordcache utilization under high memory pressure

Andrew Morton wrote:

>
>But what we do *not* want to do is to reclaim memory "fairly" from
>all caches. The value of a cached page should be measured in terms
>of the number of seeks required to repopulate it. That's all.
>

This is a good point, and I would like it to be considered by ReiserFS
developers in regards to internal nodes, and why Sizif's making internal
nodes have a longer lifetime in his experimental code for squid helped
performance.

Hans

2002-01-30 10:51:28

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 30, 2002 10:07 am, Horst von Brand wrote:
> Daniel Phillips <[email protected]> said:
> > On January 29, 2002 12:54 pm, Helge Hafting wrote:
> > > Momchil Velikov wrote:
>
> [...]
>
> > > > Umm, all the ptes af the parent ought to be made COW, no ?
>
> > > Sure. But quite a few of them may be COW already, if the parent
> > > itself is a result of some earlier fork.
>
> > Right, or if the parent has already forked at least one child.
>
> But most of this will be lost on exec(2).

Even if we doing nothing more than the algorithm on the table, I doubt you'll
see a measurable overhead on fork+exec. Certainly it will be as good or
better than what we have currently.

If that's not good enough, I'm considering keeping a bit on the page table
indicating whether are particular page table is currently in the 'all CoWable
ptes set RO' state, and if so, don't do it again. I think that with this
small optimization, the value of further improvements will be small indeed.

That said, Linus's suggestion of using the x86's ability to have the
writeprotect bits in a page directory override the protections at the page
level is a good one, and reduces the cost of detecting the fork+exec case to
a very small number of faults - none if we are clever. But this is entirely
secondary to the main goal of sharing page tables at all, which is a rather
fundamental shift in the way the Linux VM works. (Though it seems the patch
will be small.)

> Also, it is my impression that
> the tree of _running_ processes isn't usually very deep (Say init --> X -->
> [Random processes] --> [compilations &c], this would make 5 or 6 deep, no
> more.

Worst case is just as important as typical case here, since there will always
be x% of users out there whose normal workload consists entirely of worst
case.

> Should take a pstree(1) listing on a busy machine and work out some
> statistics... here (a personal worstation) the tree is very fat at the
> first level below init(8), and just 5 deep when running pstree(1)).

Here's my tree - on a non-very-busy laptop. Why is my X tree so much deeper?
I suppose if I was running java this would look considerably more interesting.

init-+-apache---8*[apache]
|-apmd
|-bash---bash---xinit-+-XFree86
| `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
| | `-gnome-pty-helpe
| `-xfgnome
|-cardmgr
|-cupsd---http
|-5*[getty]
|-gpm
|-kapm-idled
|-kdeinit---kdeinit
|-5*[kdeinit]
|-kdesud
|-keventd
|-kmail
|-mozilla-bin---mozilla-bin---3*[mozilla-bin]
|-portmap
|-sshd
`-xchat

> Sure, all processes will all end up sharing glibc, and the graphical stuff
> will share the X &c libraries, so this would end up being a win this way.

Nobody has suggested that the sharing algorithm as described isn't a win,
IMHO, we are quibbling over the last few percent of the win. It's getting
high time to end the suspense by benchmarking the code.

Caveat: the page table sharing as described does not do a lot for shared
mmaps, such as glibc. (Unless those are inherited through a fork of course,
then it helps a lot.) Let me reiterate my goal with this patch: *Fix The
Fork Problem With Rmap* so that we can quit spending months fiddling with
virtual scanning, trying to get it to work properly (it never will).

I see the value in the various suggestions I've received, but what I don't
see is the value in delaying, or getting stuck adding new features. Let's
concentrate on making the simple thing I've described work *now* and add
features to it later.

I'm gratified that nobody has yet pointed out any fundamental flaws that
would keep it from working. I wasn't at all sure of that when I set out on
this path a month ago.

--
Daniel

2002-01-30 14:47:26

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Wed, 30 Jan 2002, Daniel Phillips wrote:
> On January 30, 2002 10:07 am, Horst von Brand wrote:

> > But most of this will be lost on exec(2).

> > Also, it is my impression that
> > the tree of _running_ processes isn't usually very deep (Say init --> X -->
> > [Random processes] --> [compilations &c], this would make 5 or 6 deep, no
> > more.

> Here's my tree - on a non-very-busy laptop. Why is my X tree so much deeper?
> I suppose if I was running java this would look considerably more interesting.

> |-bash---bash---xinit-+-XFree86
> | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree

It doesn't matter how deep the tree is, on exec() all
previously shared page tables will be blown away.

In this part of the tree, I see exactly 2 processes
which could be sharing page tables (the two bash
processes).

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-30 14:55:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 30, 2002 03:46 pm, Rik van Riel wrote:
> On Wed, 30 Jan 2002, Daniel Phillips wrote:
> > On January 30, 2002 10:07 am, Horst von Brand wrote:
>
> > > But most of this will be lost on exec(2).
>
> > > Also, it is my impression that
> > > the tree of _running_ processes isn't usually very deep (Say init --> X -->
> > > [Random processes] --> [compilations &c], this would make 5 or 6 deep, no
> > > more.
>
> > Here's my tree - on a non-very-busy laptop. Why is my X tree so much deeper?
> > I suppose if I was running java this would look considerably more interesting.
>
> > |-bash---bash---xinit-+-XFree86
> > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
>
> It doesn't matter how deep the tree is, on exec() all
> previously shared page tables will be blown away.
>
> In this part of the tree, I see exactly 2 processes
> which could be sharing page tables (the two bash
> processes).

Sure, your point is that there is no problem and the speed of rmap on fork
is not something to worry about?

--
Daniel

2002-01-30 15:55:46

by Rik van Riel

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On Wed, 30 Jan 2002, Daniel Phillips wrote:
> On January 30, 2002 03:46 pm, Rik van Riel wrote:
> > On Wed, 30 Jan 2002, Daniel Phillips wrote:

> > > |-bash---bash---xinit-+-XFree86
> > > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
> >
> > It doesn't matter how deep the tree is, on exec() all
> > previously shared page tables will be blown away.
> >
> > In this part of the tree, I see exactly 2 processes
> > which could be sharing page tables (the two bash
> > processes).
>
> Sure, your point is that there is no problem and the speed of rmap on
> fork is not something to worry about?

No. The point is that we should optimise for fork()+exec(),
not for a long series of consecutive fork()s all sharing the
same page tables.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-30 16:31:06

by Daniel Phillips

[permalink] [raw]
Subject: Re: Note describing poor dcache utilization under high memory pressure

On January 30, 2002 04:54 pm, Rik van Riel wrote:
> On Wed, 30 Jan 2002, Daniel Phillips wrote:
> > On January 30, 2002 03:46 pm, Rik van Riel wrote:
> > > On Wed, 30 Jan 2002, Daniel Phillips wrote:
>
> > > > |-bash---bash---xinit-+-XFree86
> > > > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
> > >
> > > It doesn't matter how deep the tree is, on exec() all
> > > previously shared page tables will be blown away.
> > >
> > > In this part of the tree, I see exactly 2 processes
> > > which could be sharing page tables (the two bash
> > > processes).
> >
> > Sure, your point is that there is no problem and the speed of rmap on
> > fork is not something to worry about?
>
> No. The point is that we should optimise for fork()+exec(),
> not for a long series of consecutive fork()s all sharing the
> same page tables.

Fork+exec is adequately optimized for. Fork+100 execs is supremely well
optimized for. I'm entirely satisfied with the way the performance looks
at this point, it will outdo anything we've seen to date. With Linus's
write-protect-in-page-directory optimization there's not a lot more fat
to be squeezed out, if any, and even without it, it will be a screamer.
I think we've done this one, it's time to move on from here.

--
Daniel