Linus Torvalds wrote:
>
> ...
> Dream on. It's good, and it's not getting removed. The "struct page" is
> size-critical, and also correctness-critical (see above on gcc issues).
>
Plan B is to remove page->index.
- Replace ->mapping with a pointer to the page's radix tree
slot. Use address masking to go from page.radix_tree_slot
to the radix tree node.
- Store the base index in the radix tree node, use math to
derive page->index. Gives 64-bit index without increasing
the size of struct page. 4 bytes saved.
- Implement radix_tree_gang_lookup() as previously described. Use
this in truncate_inode_pages, invalidate_inode_pages[2], readahead
and writeback.
- The only thing we now need page.list for is tracking dirty pages.
Implement a 64-bit dirtiness bitmap in radix_tree_node, propagate
that up the radix tree so we can efficiently traverse dirty pages
in a mapping. This also allows writeback to always write in ascending
index order. Remove page->list. 8 bytes saved.
- Few pages use ->private for much. Hash for it. 4(ish) bytes
saved.
- Remove ->virtual, do page_address() via a hash. 4(ish) bytes saved.
- Remove the rmap chain (I just broke ptep_to_address() anyway). 4 bytes
saved. struct page is now 20 bytes.
There look. In five minutes I shrunk 24 bytes from the page
structure. Who said programming was hard?
-
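Andrew's first two bullets, slot-pointer masking plus a per-node base index, can be sketched in userspace. Everything here (rt_node, the 64-slot fanout, the alignment constant) is illustrative, not the kernel's actual structures:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model: tree nodes come from an aligned slab, so a pointer to any
 * slot inside a node can be masked down to the node itself, and the
 * page index is recomputed from the node's base index plus the slot
 * offset -- nothing stored in struct page. */

#define RT_MAP_SIZE   64          /* slots per node (assumed fanout) */
#define RT_NODE_ALIGN 1024UL      /* node allocation size, power of two */

struct rt_node {
        unsigned long base_index;           /* index of slots[0] */
        void *slots[RT_MAP_SIZE];
};

/* page->mapping replacement: mask the slot pointer down to its node. */
static struct rt_node *rt_slot_to_node(void **slot)
{
        return (struct rt_node *)((uintptr_t)slot & ~(RT_NODE_ALIGN - 1));
}

/* page->index replacement: base index plus position within the node. */
static unsigned long rt_slot_to_index(void **slot)
{
        struct rt_node *node = rt_slot_to_node(slot);

        return node->base_index + (unsigned long)(slot - node->slots);
}
```

sizeof(struct rt_node) here is 520 bytes on a 64-bit box, so rounding node allocations up to an aligned 1024 bytes is what makes the masking trick work; the cost, as the replies below argue, is that going the other direction (index to page) still takes a full tree walk.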
Linus Torvalds wrote:
>> Dream on. It's good, and it's not getting removed. The "struct page" is
>> size-critical, and also correctness-critical (see above on gcc issues).
32-bit is a sad, broken, and depressing reality we're going to be
saddled with on mainstream systems for ages. It's stinking up the
kernel like a dead woodchuck under the porch as it is, and the 64GB
abominations on their way out the ass-end of hardware vendor pipelines
are truly vomitous.
On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> Plan B is to remove page->index.
> - Replace ->mapping with a pointer to the page's radix tree
> slot. Use address masking to go from page.radix_tree_slot
> to the radix tree node.
> - Store the base index in the radix tree node, use math to
> derive page->index. Gives 64-bit index without increasing
> the size of struct page. 4 bytes saved.
> - Implement radix_tree_gang_lookup() as previously described. Use
> this in truncate_inode_pages, invalidate_inode_pages[2], readahead
> and writeback.
> - The only thing we now need page.list for is tracking dirty pages.
> Implement a 64-bit dirtiness bitmap in radix_tree_node, propagate
> that up the radix tree so we can efficiently traverse dirty pages
> in a mapping. This also allows writeback to always write in ascending
> index order. Remove page->list. 8 bytes saved.
> - Few pages use ->private for much. Hash for it. 4(ish) bytes
> saved.
> - Remove ->virtual, do page_address() via a hash. 4(ish) bytes saved.
> - Remove the rmap chain (I just broke ptep_to_address() anyway). 4 bytes
> saved. struct page is now 20 bytes.
> There look. In five minutes I shrunk 24 bytes from the page
> structure. Who said programming was hard?
This is so aggressive I'm obligated to pursue it. The pte_chain will
die shortly if I get my way as it is.
Cheers,
Bill
On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> Linus Torvalds wrote:
> >
> > ...
> > Dream on. It's good, and it's not getting removed. The "struct page" is
> > size-critical, and also correctness-critical (see above on gcc issues).
> >
>
> Plan B is to remove page->index.
>
> - Replace ->mapping with a pointer to the page's radix tree
> slot. Use address masking to go from page.radix_tree_slot
> to the radix tree node.
That's no longer immediate; you have to walk the tree backwards, as you
do for the lookups.
I recall you are benchmarking the radix tree with dbench. You have to
benchmark the worst case, not dbench with small files; with small files
even the rbtree was nicer than the hashtable. The right benchmark is:
	truncate(800GB)
	write(1G)
	lseek(800GB)
	fsync()
	addr = mmap(1G)
	benchmark_start()
	mlock(1G)
	benchmark_stop()
Instead of mlock you can also do read(1G) if you prefer, but the
overhead of the copy to userspace would certainly be more significant
than the overhead of filling the ptes, so an mlock or *even* a page
fault should be lighter and allow us to better benchmark the pagecache
performance. The no-copy hack from Lincol would be fine too; then you
could read the whole thing with one syscall and no pagetable overhead.
(Of course you need >1G of RAM to avoid hitting the disk during the
read; the read pass should probably be "read 1 byte, lseek to the next
one", and you also need the hashtable allocated with the bootmem
allocator.)
Nobody has done that yet AFAIK, so it's no surprise nobody has found any
regression with the radix tree (yet). I expect walking 6/7 cacheline
steps every time is going to be a significant hit (even worse if you
need to do that every time to derive the index with the gang method). Of
course, if you work with small files you'll never walk more than a few
steps and the regression doesn't show up; that was true with the rbtree
too, two or three years ago.
The other major problem with radix trees is the GFP_ATOMIC allocations,
which can lead at the very least to I/O failures; the mempool usage
seems just a band-aid to hide those failures, and it appears to work by
pure luck.
On the same GFP_ATOMIC lines we can find in rmap.c:

static void alloc_new_pte_chains()
{
	struct pte_chain *pte_chain = (void *) get_zeroed_page(GFP_ATOMIC);
	int i = PAGE_SIZE / sizeof(struct pte_chain);

	if (pte_chain) {
		inc_page_state(nr_pte_chain_pages);
		for (; i-- > 0; pte_chain++)
			pte_chain_push(pte_chain);
	} else {
		/* Yeah yeah, I'll fix the pte_chain allocation ... */
		panic("Fix pte_chain allocation, you lazy bastard!\n");
	}
}

How will you fix the pte_chain allocation to avoid deadlocks? Please
elaborate. None of the callers can handle a failure there; no surprise
there is a panic in it.
> - The only thing we now need page.list for is tracking dirty pages.
> Implement a 64-bit dirtiness bitmap in radix_tree_node, propagate
> that up the radix tree so we can efficiently traverse dirty pages
> in a mapping. This also allows writeback to always write in ascending
> index order. Remove page->list. 8 bytes saved.
>
page->list is needed for the freelist, but OK, you could now share the
freelist and lru since they're mutually exclusive. This seems a nice
idea, for the ordering in particular, but again the "find" algorithm
will be slower, and walking a tree in order is inherently a recursive
operation: it will require either a clean recursive algorithm, which
will overflow the stack with a big radix tree at offset 200T on a 64-bit
arch, or dynamic allocations and overcomplex code.
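For what it's worth, a height-bounded tree can be walked in order with a small fixed stack rather than true recursion; this toy 4-way tree (invented names, not the kernel's radix tree) shows the shape:

```c
#include <assert.h>
#include <stddef.h>

#define FANOUT     4
#define MAX_HEIGHT 16   /* covers any 32-bit index at 2 bits per level */

struct node {
        void *slots[FANOUT];  /* inner levels: child nodes; bottom: leaves */
};

/* Visit every non-NULL leaf left to right (ascending index order).
 * The explicit stack is bounded by the tree height, so there is no
 * unbounded recursion and no allocation during the walk. */
static size_t walk(struct node *root, int height,
                   void (*visit)(void *leaf, void *arg), void *arg)
{
        struct { struct node *n; int i; } stack[MAX_HEIGHT];
        int top = 0;
        size_t seen = 0;

        stack[0].n = root;
        stack[0].i = 0;
        while (top >= 0) {
                if (stack[top].i == FANOUT) {   /* node exhausted: pop */
                        top--;
                        continue;
                }
                void *slot = stack[top].n->slots[stack[top].i++];
                if (!slot)
                        continue;
                if (top + 1 == height) {        /* bottom level: a leaf */
                        if (visit)
                                visit(slot, arg);
                        seen++;
                } else {                        /* descend into the child */
                        top++;
                        stack[top].n = slot;
                        stack[top].i = 0;
                }
        }
        return seen;
}
```

Even the 64-bit case fits: at 6 bits per level a full 64-bit index needs at most 11 levels, so the stack frame count is known at compile time.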
> - Remove ->virtual, do page_address() via a hash. 4(ish) bytes saved.
note you still have to handle the collisions, it's not like the
waitqueue hash were you avoid handling the collisions by doing a
wake-all. You should handle the collision with kmalloc dyn-alloc
but you have no failure path there.
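A toy of the collision-chained lookup under discussion; the names (page_address_slot, pa_hash) are invented, and note this particular shape dodges the kmalloc objection by embedding the chain link in the entry itself, so insertion never allocates:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct page { int dummy; };             /* stand-in for struct page */

/* One entry per mapped page; the chain pointer lives in the entry
 * itself, so inserting can never fail with -ENOMEM. */
struct page_address_slot {
        struct page *page;
        void *virtual;
        struct page_address_slot *next; /* collision chain */
};

#define PA_HASH_SIZE 128
static struct page_address_slot *pa_hash[PA_HASH_SIZE];

static unsigned pa_hash_fn(struct page *page)
{
        return (unsigned)(((uintptr_t)page >> 4) % PA_HASH_SIZE);
}

static void set_page_address(struct page_address_slot *slot,
                             struct page *page, void *virtual)
{
        unsigned h = pa_hash_fn(page);

        slot->page = page;
        slot->virtual = virtual;
        slot->next = pa_hash[h];        /* push onto the bucket's chain */
        pa_hash[h] = slot;
}

static void *page_address(struct page *page)
{
        struct page_address_slot *s;

        for (s = pa_hash[pa_hash_fn(page)]; s; s = s->next)
                if (s->page == page)
                        return s->virtual;
        return NULL;                    /* page not currently mapped */
}
```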
In short, none of these things comes for free; it's not like the
page-based writeback that removes overhead. They're not obvious
optimizations to my eyes. Yes, you could theoretically shrink the
struct page that way, but I'm pretty much fine with paying RAM if the
other option is to run slower.
Keeping track of dirty pages in a tree, and walking the tree in
ascending order to flush those dirty pages, may actually pay off in
userspace with a recursive stack; but in kernel, even just the fact
that we cannot handle a kmalloc failure during dirty flushing is a
showstopper for those algorithms: it means OOM deadlock. You cannot
just skip flushing dirty pages and try again; everybody may be trying
to flush dirty pages to make progress. In userspace that just means
"task dies with -ENOMEM or a sigkill from the kernel, plug in some more
RAM or add some more swap and try again", but for an operating system
the situation is different.
Andrea
On Sun, Jul 28, 2002 at 05:43:25PM -0700, William Lee Irwin III wrote:
> Linus Torvalds wrote:
> >> Dream on. It's good, and it's not getting removed. The "struct page" is
> >> size-critical, and also correctness-critical (see above on gcc issues).
>
> 32-bit is a sad, broken, and depressing reality we're going to be
> saddled with on mainstream systems for ages. It's stinking up the
> kernel like a dead woodchuck under the porch as it is, and the 64GB
> abominations on their way out the ass-end of hardware vendor pipelines
> are truly vomitous.
>
>
> On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> > Plan B is to remove page->index.
> > - Replace ->mapping with a pointer to the page's radix tree
> > slot. Use address masking to go from page.radix_tree_slot
> > to the radix tree node.
> > - Store the base index in the radix tree node, use math to
> > derive page->index. Gives 64-bit index without increasing
> > the size of struct page. 4 bytes saved.
> > - Implement radix_tree_gang_lookup() as previously described. Use
> > this in truncate_inode_pages, invalidate_inode_pages[2], readahead
> > and writeback.
> > - The only thing we now need page.list for is tracking dirty pages.
> > Implement a 64-bit dirtiness bitmap in radix_tree_node, propagate
> > that up the radix tree so we can efficiently traverse dirty pages
> > in a mapping. This also allows writeback to always write in ascending
> > index order. Remove page->list. 8 bytes saved.
> > - Few pages use ->private for much. Hash for it. 4(ish) bytes
> > saved.
> > - Remove ->virtual, do page_address() via a hash. 4(ish) bytes saved.
> > - Remove the rmap chain (I just broke ptep_to_address() anyway). 4 bytes
> > saved. struct page is now 20 bytes.
> > There look. In five minutes I shrunk 24 bytes from the page
> > structure. Who said programming was hard?
>
> This is so aggressive I'm obligated to pursue it. The pte_chain will
> die shortly if I get my way as it is.
If you look at DaveM's first full rmap implementation, it never had a
pte_chain. He used the same rmap logic we have always had in Linux,
since the first 2.1 kernel I looked at, to handle truncate correctly
against MAP_SHARED. Unfortunately that's not very efficient, and it
requires some metadata allocation for anonymous pages (that's the
address space pointer; anon pages normally don't have a dedicated
address space), overhead that we never had without full rmap (for
inode-backed mappings we just have this info in the inode; only the
shared_lock locking isn't trivial). I hope you can come up with a
better algorithm (nevertheless, the current rmap implementation also
adds significant, measurable overhead in the fast paths). Rik told me a
few days ago that he also wanted to drop the pte_chain, but I assume
you're just in sync with him.
Andrea
On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> - Few pages use ->private for much. Hash for it. 4(ish) bytes
> saved.
Do you know an approximate reasonable constant of proportionality
for how many pages have ->private attached?
On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> - Remove the rmap chain (I just broke ptep_to_address() anyway). 4 bytes
> saved. struct page is now 20 bytes.
How did ptep_to_address() break? I browsed over your latest changes and
missed the bit where that fell apart. I'll at least take a stab at fixing
it up until the other bits materialize.
Cheers,
Bill
On Sun, Jul 28, 2002 at 05:43:25PM -0700, William Lee Irwin III wrote:
>> This is so aggressive I'm obligated to pursue it. The pte_chain will
>> die shortly if I get my way as it is.
On Mon, Jul 29, 2002 at 02:56:12AM +0200, Andrea Arcangeli wrote:
> If you look at DaveM's first full rmap implementation, it never had a
> pte_chain. He used the same rmap logic we have always had in Linux,
> since the first 2.1 kernel I looked at, to handle truncate correctly
> against MAP_SHARED. Unfortunately that's not very efficient, and it
> requires some metadata allocation for anonymous pages (that's the
> address space pointer; anon pages normally don't have a dedicated
> address space), overhead that we never had without full rmap (for
> inode-backed mappings we just have this info in the inode; only the
> shared_lock locking isn't trivial). I hope you can come up with a
> better algorithm (nevertheless, the current rmap implementation also
> adds significant, measurable overhead in the fast paths). Rik told me a
> few days ago that he also wanted to drop the pte_chain, but I assume
> you're just in sync with him.
I've seen davem's implementation. The anonymous page metadata
allocations, while they are overhead, are likely to be significantly
smaller than per-pte overhead. The rest is a matter of details. You're
welcome to participate with the design and/or implementation.
Cheers,
Bill
On Mon, 29 Jul 2002, Andrea Arcangeli wrote:
> if you look at DaveM first full rmap implementation it never had a
> pte-chain. He used the same rmap logic we always hand in linux since the
> first 2.1 kernel I looked at, to handle correctly truncate against
> MAP_SHARED. Unfortunately that's not very efficient and requires some
> metadata allocation for anonymous pages (that's the address space
> pointer, anon pages regularly doesn't have a dedicated address space),
Together with the K42 people we found a way to avoid the
badnesses of an object-based VM.
The space overhead will be a "double wide" radix tree
per anonymous memory object, where each entry has a
copy-on-write count and either the page frame number
or the position in swap.
Added benefits are not having to modify page->count for
most pages on fork(), exec(), etc. and being able to just
throw away page tables.
This scheme doesn't have deep tree walking of memory objects,
doesn't have the disadvantage of leaving 'stale' pages behind
in parent objects after COW and can still do refcounting on an
object basis instead of a page by page basis.
It'll also allow us to drop the usage count for swap entries,
turning those into a simple bitmap (or maybe a better form?).
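One guess at what such a "double wide" entry could look like; the layout, the discriminator bit, and all the names here are purely speculative, not anything from K42 or Rik's patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each slot in the per-anonymous-object radix tree is twice the usual
 * width: a copy-on-write share count plus either a page frame number
 * or a swap position, told apart by a type bit in the value word. */
struct anon_entry {
        uint32_t cow_count;     /* address spaces sharing this page */
        uint32_t value;         /* pfn or swap slot, plus the type bit */
};

#define ANON_SWAP_BIT 0x80000000u

static struct anon_entry anon_entry_pfn(uint32_t pfn, uint32_t cows)
{
        struct anon_entry e = { cows, pfn };
        return e;
}

static struct anon_entry anon_entry_swap(uint32_t slot, uint32_t cows)
{
        struct anon_entry e = { cows, slot | ANON_SWAP_BIT };
        return e;
}

static bool anon_entry_in_swap(struct anon_entry e)
{
        return (e.value & ANON_SWAP_BIT) != 0;
}

static uint32_t anon_entry_pos(struct anon_entry e)
{
        return e.value & ~ANON_SWAP_BIT;    /* pfn or swap slot number */
}
```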
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Sun, Jul 28, 2002 at 05:10:48PM -0700, Andrew Morton wrote:
> - Remove ->virtual, do page_address() via a hash. 4(ish) bytes saved.
Hmmmmmmm. page_address() is already 5 loads (on ARM) if page->virtual
isn't used. I'm seriously considering changing page_address() to cover
the 3 cases more efficiently:
1. non-discontiguous case (should be around 2 loads + math)
2. discontiguous case (currently 5 loads + lots of math)
3. weirder setups where loading page->virtual is faster
We currently ignore (1) completely, and just assume it's the same as (2).
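Case (1) really is just arithmetic, roughly the "2 loads + math" estimated above; a toy model (PAGE_OFFSET, PAGE_SHIFT, and the struct are illustrative):

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xC0000000UL        /* classic 32-bit kernel mapping */

struct page { unsigned long flags; };   /* toy struct page */

static struct page mem_map[1024];       /* single contiguous mem_map */

/* Non-discontiguous page_address(): derive the pfn from the page's
 * position in mem_map, then pure math into the linear mapping --
 * no ->virtual field and no hash lookup needed. */
static void *toy_page_address(struct page *page)
{
        unsigned long pfn = (unsigned long)(page - mem_map);

        return (void *)(PAGE_OFFSET + (pfn << PAGE_SHIFT));
}
```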
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html