There are a few VM-related patches in this series. Mainly fixes;
feature work is on hold.
We have some fairly serious locking contention problems with the reverse
mapping's pte_chains. Until we have a clear way out of that I believe
that it is best to not merge code which has a lot of rmap dependency.
It is apparent that these problems will not be solved by tweaking -
some redesign is needed. In the 2.5 timeframe the only practical
solution appears to be page table sharing, based on Daniel's February
work. Daniel and Dave McCracken are working on that.
Some bits and pieces here:
- list_splice() has an open-coded list_empty() in it. Use
list_empty() instead.
- in shrink_cache() we have a local `nr_pages' which shadows another
local. Rename the inner one. (Nikita Danilov)
- Add a BUG() on a can't-happen code path in page_remove_rmap().
- Tighten up the bug checks in the BH completion handlers - if the
buffer is still under IO then it must be locked, because we unlock it
inside the page_uptodate_lock.
fs/buffer.c | 10 ++++------
include/linux/list.h | 4 +---
mm/rmap.c | 2 ++
mm/vmscan.c | 6 +++---
4 files changed, 10 insertions(+), 12 deletions(-)
--- 2.5.29/include/linux/list.h~misc Sat Jul 27 23:38:12 2002
+++ 2.5.29-akpm/include/linux/list.h Sat Jul 27 23:38:12 2002
@@ -156,9 +156,7 @@ static inline void __list_splice(list_t
*/
static inline void list_splice(list_t *list, list_t *head)
{
- list_t *first = list->next;
-
- if (first != list)
+ if (!list_empty(list))
__list_splice(list, head);
}
--- 2.5.29/mm/vmscan.c~misc Sat Jul 27 23:38:12 2002
+++ 2.5.29-akpm/mm/vmscan.c Sat Jul 27 23:49:01 2002
@@ -200,8 +200,8 @@ shrink_cache(int nr_pages, zone_t *class
* so the direct writes to the page cannot get lost.
*/
int (*writeback)(struct page *, int *);
- const int nr_pages = SWAP_CLUSTER_MAX;
- int nr_to_write = nr_pages;
+ const int cluster_size = SWAP_CLUSTER_MAX;
+ int nr_to_write = cluster_size;
writeback = mapping->a_ops->vm_writeback;
if (writeback == NULL)
@@ -209,7 +209,7 @@ shrink_cache(int nr_pages, zone_t *class
page_cache_get(page);
spin_unlock(&pagemap_lru_lock);
(*writeback)(page, &nr_to_write);
- max_scan -= (nr_pages - nr_to_write);
+ max_scan -= (cluster_size - nr_to_write);
page_cache_release(page);
spin_lock(&pagemap_lru_lock);
continue;
--- 2.5.29/mm/rmap.c~misc Sat Jul 27 23:38:12 2002
+++ 2.5.29-akpm/mm/rmap.c Sat Jul 27 23:49:03 2002
@@ -205,6 +205,8 @@ void page_remove_rmap(struct page * page
}
printk("\n");
printk(KERN_ERR "page_remove_rmap: driver cleared PG_reserved ?\n");
+#else
+ BUG();
#endif
out:
--- 2.5.29/fs/buffer.c~misc Sat Jul 27 23:38:16 2002
+++ 2.5.29-akpm/fs/buffer.c Sat Jul 27 23:38:25 2002
@@ -514,9 +514,8 @@ static void end_buffer_async_read(struct
if (!buffer_uptodate(tmp))
page_uptodate = 0;
if (buffer_async_read(tmp)) {
- if (buffer_locked(tmp))
- goto still_busy;
- BUG();
+ BUG_ON(!buffer_locked(tmp));
+ goto still_busy;
}
tmp = tmp->b_this_page;
} while (tmp != bh);
@@ -564,9 +563,8 @@ static void end_buffer_async_write(struc
tmp = bh->b_this_page;
while (tmp != bh) {
if (buffer_async_write(tmp)) {
- if (buffer_locked(tmp))
- goto still_busy;
- BUG();
+ BUG_ON(!buffer_locked(tmp));
+ goto still_busy;
}
tmp = tmp->b_this_page;
}
.
On Sunday 28 July 2002 09:32, Andrew Morton wrote:
> We have some fairly serious locking contention problems with the reverse
> mapping's pte_chains. Until we have a clear way out of that I believe
> that it is best to not merge code which has a lot of rmap dependency.
>
> It is apparent that these problems will not be solved by tweaking -
> some redesign is needed. In the 2.5 timeframe the only practical
> solution appears to be page table sharing, based on Daniel's February
> work. Daniel and Dave McCracken are working on that.
Sadly, it turns out that there are no possibilities for page table sharing
when forking from bash. Only about 200 pages are shared amongst three page
tables (stack, text and .interp), and at least one page in each of these gets
written during the exec, so all are unshared and hence there is no reduction
in the number of pte chains that have to be created. For forking from a
larger parent, page table sharing has a measurable benefit, but not from
these little guys.
But there is something massively wrong with this whole picture. Your kickass
4 way is managing to set up and tear down only one pte chain node per
microsecond, if I'm reading your benchmark results correctly. That's really
horrible. I think we need to revisit the locking.
The idea I'm playing with now is to address an array of locks based on
something like:
spin_lock(pte_chain_locks + ((page->index >> 4) & 0xff));
so that 16 consecutive filemap pages use the same lock and there is a limited
total number of locks to keep cache pressure down. Since we are creating the
vast majority of the pte chain nodes while walking across page tables, this
should give nice locality.
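Concretely, something like this, just as a sketch (the 256-entry array and
the helper name are made up for illustration; the shift of 4 gives the 16-page
granularity suggested above):

/*
 * A small fixed pool of locks, hashed by page->index: 16 consecutive
 * filemap pages share one lock, and the limited total number of locks
 * keeps cache pressure down.
 */
static spinlock_t pte_chain_locks[256];

static inline spinlock_t *pte_chain_lockp(struct page *page)
{
	return pte_chain_locks + ((page->index >> 4) & 0xff);
}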
For this to work, anon pages will need to have something in page->index.
This isn't too much of a challenge. A reasonable value to put in there is
the creator's virtual address, shifted right, and perhaps mangled a little to
reduce contention.
We can also look at batching the pte chain node creation by preallocating 16
nodes, taking the lock, and walking through the 16 nodes filling in the
pointers. If the page index hashes to a different lock, we drop the one we
have and acquire the new one.
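In pseudocode the batched fill might look roughly like this (pte_chain_lockp()
is the hashing helper sketched above; pte_chain_alloc() and add_rmap_node()
are stand-ins for whatever really allocates a node and links it onto the
page's chain):

static void add_rmap_batch(struct page **pages, pte_t **pteps, int nr)
{
	struct pte_chain *nodes[16];	/* nr is at most 16 */
	spinlock_t *held = NULL;
	int i;

	for (i = 0; i < nr; i++)		/* preallocate outside any lock */
		nodes[i] = pte_chain_alloc();

	for (i = 0; i < nr; i++) {
		spinlock_t *lock = pte_chain_lockp(pages[i]);

		if (lock != held) {		/* this page hashes to a different lock */
			if (held)
				spin_unlock(held);
			spin_lock(lock);
			held = lock;
		}
		/* link the preallocated node onto the page's pte chain */
		add_rmap_node(pages[i], pteps[i], nodes[i]);
	}
	if (held)
		spin_unlock(held);
}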
--
Daniel
On Mon, 29 Jul 2002, Daniel Phillips wrote:
> Sadly, it turns out that there are no possibilities for page table
> sharing when forking from bash.
Are you sure bash is using fork and not vfork ?
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
From: Rik van Riel <[email protected]>
Date: Mon, 29 Jul 2002 05:44:45 -0300 (BRT)
Are you sure bash is using fork and not vfork ?
GNU Bash uses fork, GNU Make uses vfork :-)
Daniel Phillips wrote:
>
>...
> >
> > It is apparent that these problems will not be solved by tweaking -
> > some redesign is needed. In the 2.5 timeframe the only practical
> > solution appears to be page table sharing, based on Daniel's February
> > work. Daniel and Dave McCracken are working on that.
>
> Sadly, it turns out that there are no possibilities for page table sharing
> when forking from bash. Only about 200 pages are shared amongst three page
> tables (stack, text and .interp), and at least one page in each of these gets
> written during the exec, so all are unshared and hence there is no reduction
> in the number of pte chains that have to be created. For forking from a
> larger parent, page table sharing has a measurable benefit, but not from
> these little guys.
Thanks for looking into this.
> But there is something massively wrong with this whole picture. Your kickass
> 4 way is managing to set up and tear down only one pte chain node per
> microsecond, if I'm reading your benchmark results correctly.
s/kickass/slow as a wet sock/.
That little script which did 12,000 fork/exec/exits ran page_add_rmap
and page_remove_rmap about 4,000,000 times each, and total runtime
went from 30 seconds to 34. So the combined overhead of both adding and
removing the reverse mapping is around a microsecond per page, yes.
That's four locked operations.
> That's really
> horrible. I think we need to revisit the locking.
Anton did some testing on a 4-way PPC. Similar results, I think.
As an experiment he added a spinlock to the page structure and used
that instead of the PG_chainlock flag. It helped a lot. He thinks
that is because their spin_unlock() is not buslocked (like ia32).
Which tends to imply that a __clear_bit() in pte_chain_unlock()
will be beneficial. I don't know how portable that would be though.
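i.e. something along these lines (untested sketch; whether a plain
read-modify-write of page->flags is safe against concurrent atomic flag
operations from other CPUs is exactly the portability question):

static inline void pte_chain_unlock(struct page *page)
{
	/* order the critical section before the releasing store */
	smp_mb__before_clear_bit();
	/*
	 * Non-atomic clear: avoids the buslocked cycle of clear_bit(),
	 * but another CPU atomically flipping a different bit in
	 * page->flags at the same moment could be lost on some
	 * architectures.
	 */
	__clear_bit(PG_chainlock, &page->flags);
}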
> The idea I'm playing with now is to address an array of locks based on
> something like:
>
> spin_lock(pte_chain_locks + ((page->index >> 4) & 0xff));
>
> so that 16 consecutive filemap pages use the same lock and there is a limited
> total number of locks to keep cache pressure down. Since we are creating the
> vast majority of the pte chain nodes while walking across page tables, this
> should give nice locality.
Something like that could help.
At some point, when the reverse map is as CPU efficient as we can make it,
we need to decide whether the remaining cost is worth the benefit. I
wonder how to do that.
> For this to work, anon pages will need to have something in page->index.
> This isn't too much of a challenge. A reasonable value to put in there is
> the creator's virtual address, shifted right, and perhaps mangled a little to
> reduce contention.
Well you want the likely-to-be-temporally-adjacent anon pages to
use the same lock. So maybe
page->index = some_global_int++;
Except ->index gets stomped on when the page gets added to swapcache.
Which means that the address of its lock will change. I can't immediately
think of a fix for that.
On Mon, 29 Jul 2002, Andrew Morton wrote:
> At some point, when the reverse map is as CPU efficient as we can make
> it, we need to decide whether the remaining cost is worth the benefit.
> I wonder how to do that.
On a system which isn't swapping, the pte_chain based reverse
maps will never be worth it. However, it seems that well over
90% of Linux machines have data in swap ...
The object-based reverse maps K42 has should be a lot better
than what's possible with pte_chains. In fact, it should be
lower overhead than non-rmap Linux because it doesn't need
to do refcounting on a page by page basis.
However, the object-based reverse mapping scheme is something
to do after 2.5 development is done and 2.6 has more or less
stabilised. The only relevance it has is the knowledge that
we won't be stuck with pte_chains forever ;)
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Monday 29 July 2002 22:00, Andrew Morton wrote:
> > The idea I'm playing with now is to address an array of locks based on
> > something like:
> >
> > spin_lock(pte_chain_locks + ((page->index >> 4) & 0xff));
> >
> > so that 16 consecutive filemap pages use the same lock and there is a limited
> > total number of locks to keep cache pressure down. Since we are creating the
> > vast majority of the pte chain nodes while walking across page tables, this
> > should give nice locality.
>
> Something like that could help.
>
> At some point, when the reverse map is as CPU efficient as we can make it,
> we need to decide whether the remaining cost is worth the benefit. I
> wonder how to do that.
We need measurements for a few other loads I think.
> > For this to work, anon pages will need to have something in page->index.
> > This isn't too much of a challenge. A reasonable value to put in there is
> > the creator's virtual address, shifted right, and perhaps mangled a little to
> > reduce contention.
>
> Well you want the likely-to-be-temporally-adjacent anon pages to
> use the same lock. So maybe
>
> page->index = some_global_int++;
Yes, that's better.
> Except ->index gets stomped on when the page gets added to swapcache.
> Which means that the address of its lock will change. I can't immediately
> think of a fix for that.
We'd have to hold the lock while changing the page->index. Pte_chain_lock
would additionally have to check the page->index after acquiring the lock
and, if changed, drop it and take the new one. I don't think the overhead
for this check is significant.
Add_to_page_cache would want a new flavor that shortens the pte chain lock
hold time, but it looks like it should have a swap-specific variant anyway.
--
Daniel
On Monday 29 July 2002 22:55, Rik van Riel wrote:
> On Mon, 29 Jul 2002, Andrew Morton wrote:
>
> > At some point, when the reverse map is as CPU efficient as we can make
> > it, we need to decide whether the remaining cost is worth the benefit.
> > I wonder how to do that.
>
> On a system which isn't swapping, the pte_chain based reverse
> maps will never be worth it. However, it seems that well over
> 90% of Linux machines have data in swap ...
There is also the promise of being able to do active defragmentation, an
enabler for large pages, which have been shown to significantly enhance
performance due to reduced tlb pressure.
So the 'never will be worth it' case is limited to loads that don't create
any tlb pressure and don't do any swapping or mmap IO.
--
Daniel
On Monday 29 July 2002 22:00, Andrew Morton wrote:
> Anton did some testing on a 4-way PPC. Similar results, I think.
> As an experiment he added a spinlock to the page structure and used
> that instead of the PG_chainlock flag. It helped a lot. He thinks
> that is because their spin_unlock() is not buslocked (like ia32).
So what happened when you tried that on your machine?
--
Daniel
On Monday 29 July 2002 22:00, Andrew Morton wrote:
> > For this to work, anon pages will need to have something in page->index.
> > This isn't too much of a challenge. A reasonable value to put in there is
> > the creator's virtual address, shifted right, and perhaps mangled a little to
> > reduce contention.
>
> Well you want the likely-to-be-temporally-adjacent anon pages to
> use the same lock. So maybe
>
> page->index = some_global_int++;
Actually, thinking again... the nice thing about this scheme is the
opportunity to take the lock once for a whole batch of add_rmaps while
walking through a page table, and the same for remove_rmap. So it
definitely wants to be based on the virtual address.
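For example, at anonymous fault time, something as simple as (sketch; how
much extra mangling is needed to spread the load is the open question):

	page->index = address >> PAGE_SHIFT;	/* consecutive vaddrs share a lock */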
--
Daniel
On Monday 29 July 2002 23:51, Daniel Phillips wrote:
> On Monday 29 July 2002 22:00, Andrew Morton wrote:
> > > The idea I'm playing with now is to address an array of locks based on
> > > something like:
> > >
> > > spin_lock(pte_chain_locks + ((page->index >> 4) & 0xff));
> > >
> > > so that 16 consecutive filemap pages use the same lock and there is a limited
> > > total number of locks to keep cache pressure down. Since we are creating the
> > > vast majority of the pte chain nodes while walking across page tables, this
> > > should give nice locality.
> >
> > Something like that could help.
>
> [...]
>
> > > For this to work, anon pages will need to have something in page->index.
> > > This isn't too much of a challenge. A reasonable value to put in there is
> > > the creator's virtual address, shifted right, and perhaps mangled a little to
> > > reduce contention.
> >
> > Well you want the likely-to-be-temporally-adjacent anon pages to
> > use the same lock. So maybe
> >
> > page->index = some_global_int++;
>
> Yes, that's better.
>
> > Except ->index gets stomped on when the page gets added to swapcache.
> > Which means that the address of its lock will change. I can't immediately
> > think of a fix for that.
>
> We'd have to hold the lock while changing the page->index. Pte_chain_lock
> would additionally have to check the page->index after acquiring the lock
> and, if changed, drop it and take the new one. I don't think the overhead
> for this check is significant.
>
> Add_to_page_cache would want a new flavor that shortens the pte chain lock
> hold time, but it looks like it should have a swap-specific variant anyway.
Here's a status report on this:
- I replicated your results using Rik's rmap12h patch for 2.4.19-pre7.
The overhead of rmap setup/teardown is a little higher even than the
minimal rmap patch in 2.5.recent, around 20% bottom line cost for this
(slightly unrealistic) test. Rik seems to take the pte_chain_locks
more often than necessary, a likely explanation for the higher overhead
vs 2.4.27.
- I'm not doing this in 2.5 because dac960 is broken, and that's what my
only dual processor machine has. I'll return to messing with that
pretty soon (and Jens has offered help) but right now I'm focussed on
this setup/teardown slowness question.
- I've implemented the locking strategy described above and it is
apparently stable. It seems to reduce the overhead a percent or
two. On x86 it still isn't a solution, though it may well be quite
nice for ppc, judging from earlier results you mentioned. I'll
continue to fiddle with anon page assignments to page->index and see
if I can get another couple of percent. It's possible that putting
the spinlocks in separate cachelines may help as well.
- It's clear that I need to move on to batching up the pte chain creates
to get the kind of improvement we want.
I settled on a variant of the pte_chain_lock interface, and since I had to
update every use of it I adopted a more sensible name as well:
spinlock_t *lock_rmap(struct page *page);
void unlock_rmap(spinlock_t *lock);
(The functions formerly known as pte_chain_lock and pte_chain_unlock.) The
former acquires a lock indexed via page->index and returns it; the latter
is just spin_unlock. This interface is more efficient than looking up the
lock twice, and also accommodates my wrapper for resetting page->index, which
can't use page->index for the unlock, since it just changed:
static inline void set_page_index(struct page *page, unsigned long index)
{
	spinlock_t *lock = lock_rmap(page);

	page->index = index;
	unlock_rmap(lock);
}
Lock_rmap loops to handle the possibility of page->index changing while
waiting on the lock:
static inline spinlock_t *lock_rmap(struct page *page)
{
	while (1) {
		/*
		 * Re-read the index on each pass: it may have changed
		 * while we were waiting on a now-stale lock.
		 */
		unsigned long index = page->index;
		spinlock_t *lock = rmap_locks + (index & (num_rmap_locks - 1));

		spin_lock(lock);
		if (index == page->index)
			return lock;
		spin_unlock(lock);
	}
}
The normal case is nearly as efficient as a raw spin_lock. I probably have
to invalidate page->index for non-cache-coherent architectures. I'll worry
about that detail more if I get somewhere with the optimization on i386.
No doubt there are cases where the set_page_index doesn't have to be covered
by the rmap lock but I didn't spend time hunting for them - they're different
in 2.5 and this overhead doesn't seem to move the needle anyway.
I think this much is fairly solid. Now I'll see if I can take advantage of
this infrastructure by batching up the setup/teardown in copy_page_range,
which is apparently the main cause of the suckage.
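The shape I have in mind is a helper that carries the held lock across
consecutive ptes, roughly like this (sketch only; relock_rmap() is a made-up
name, and the fast path just recomputes the same hash lock_rmap() uses):

/*
 * Take the rmap lock for a page, reusing the lock the caller already
 * holds when consecutive pages hash to the same one.  Returns the lock
 * now held; the caller drops it with unlock_rmap() at the end of the
 * batch.
 */
static inline spinlock_t *relock_rmap(struct page *page, spinlock_t *held)
{
	if (held == rmap_locks + (page->index & (num_rmap_locks - 1)))
		return held;			/* fast path: same lock as last page */
	if (held)
		unlock_rmap(held);
	return lock_rmap(page);			/* handles a racing index change */
}

copy_page_range() would call this instead of lock_rmap() for each present pte,
add the rmap entry while the lock is held, and do a single unlock_rmap() when
it reaches the end of the range or has to drop the lock to sleep.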
--
Daniel