From: Martin Schwidefsky <[email protected]>
From: Hubertus Franke <[email protected]>
From: Himanshu Raj <[email protected]>
[patch 1/9] Guest page hinting: unused / free pages.
A very simple but already quite effective improvement in the handling
of guest memory vs. host memory is to tell the host when pages are
free. That allows the host to avoid the paging of guest pages without
meaningful content. The host can "forget" the page content and provide
a fresh frame containing zeroes instead.
To communicate the page states "unused" and "stable" to the host two
architecture defined primitives page_set_unused() and page_set_stable()
are introduced, which are used in the page allocator. The already
existing arch_free_page is not used for page_set_unused since it is
called before the reserved pages check. In addition arch_free_page can
do anything on a given architecture, while page_set_stable() and
page_set_unused() have a clearly defined meaning.
Signed-off-by: Martin Schwidefsky <[email protected]>
---
include/linux/mm.h | 2 ++
include/linux/page-states.h | 32 ++++++++++++++++++++++++++++++++
mm/page_alloc.c | 3 +++
3 files changed, 37 insertions(+)
diff -urpN linux-2.6/include/linux/mm.h linux-2.6-patched/include/linux/mm.h
--- linux-2.6/include/linux/mm.h 2006-09-01 12:49:32.000000000 +0200
+++ linux-2.6-patched/include/linux/mm.h 2006-09-01 12:49:35.000000000 +0200
@@ -304,6 +304,8 @@ struct page {
* routine so they can be sure the page doesn't go away from under them.
*/
+#include <linux/page-states.h>
+
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
*/
diff -urpN linux-2.6/include/linux/page-states.h linux-2.6-patched/include/linux/page-states.h
--- linux-2.6/include/linux/page-states.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6-patched/include/linux/page-states.h 2006-09-01 12:49:35.000000000 +0200
@@ -0,0 +1,32 @@
+#ifndef _LINUX_PAGE_STATES_H
+#define _LINUX_PAGE_STATES_H
+
+/*
+ * include/linux/page-states.h
+ *
+ * (C) Copyright IBM Corp. 2005, 2006
+ *
+ * Authors: Martin Schwidefsky <[email protected]>
+ * Hubertus Franke <[email protected]>
+ * Himanshu Raj <[email protected]>
+ */
+
+#if defined(CONFIG_PAGE_STATES)
+#include <asm/page-states.h>
+#else
+
+/* Guest page hinting architecture primitives:
+ * - page_set_unused:
+ * Indicates to the host that the page content is of no interest
+ * to the guest. The host can "forget" the page content and replace
+ * it with a page containing zeroes.
+ * - page_set_stable:
+ * Indicate to the host that the page content is needed by the guest.
+ */
+
+#define page_set_unused(_page,_order) do { } while (0)
+#define page_set_stable(_page,_order) do { } while (0)
+
+#endif
+
+#endif /* _LINUX_PAGE_STATES_H */
diff -urpN linux-2.6/mm/page_alloc.c linux-2.6-patched/mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/page_alloc.c 2006-09-01 12:49:35.000000000 +0200
@@ -515,6 +515,7 @@ static void __free_pages_ok(struct page
reserved += free_pages_check(page + i);
if (reserved)
return;
+ page_set_unused(page, order);
kernel_map_pages(page, 1 << order, 0);
local_irq_save(flags);
@@ -798,6 +799,7 @@ static void fastcall free_hot_cold_page(
page->mapping = NULL;
if (free_pages_check(page))
return;
+ page_set_unused(page, 0);
kernel_map_pages(page, 1, 0);
@@ -885,6 +887,7 @@ again:
put_cpu();
VM_BUG_ON(bad_range(zone, page));
+ page_set_stable(page, order);
if (prep_new_page(page, order, gfp_flags))
goto again;
return page;
On Fri, 2006-09-01 at 13:09 +0200, Martin Schwidefsky wrote:
> +++ linux-2.6-patched/mm/page_alloc.c 2006-09-01 12:49:35.000000000
> +0200
> @@ -515,6 +515,7 @@ static void __free_pages_ok(struct page
> reserved += free_pages_check(page + i);
> if (reserved)
> return;
> + page_set_unused(page, order);
>
> kernel_map_pages(page, 1 << order, 0);
> local_irq_save(flags);
Do these have anything in common with arch_free_page()? I thought
marking the pages as being "unused by the kernel" was the whole idea of
having that hook.
-- Dave
On Fri, 2006-09-01 at 08:33 -0700, Dave Hansen wrote:
> > +++ linux-2.6-patched/mm/page_alloc.c 2006-09-01 12:49:35.000000000
> > +0200
> > @@ -515,6 +515,7 @@ static void __free_pages_ok(struct page
> > reserved += free_pages_check(page + i);
> > if (reserved)
> > return;
> > + page_set_unused(page, order);
> >
> > kernel_map_pages(page, 1 << order, 0);
> > local_irq_save(flags);
>
> Do these have anything in common with arch_free_page()? I thought
> marking the pages as being "unused by the kernel" was the whole idea of
> having that hook.
This question did come up already. arch_free_page() is done before the
PageReserved() check so it isn't suitable for stable/unused state
transitions. You can argue that arch_free_page() should be moved but who
knows what the architecture defined function is supposed to do?
page_set_stable/page_set_unused on the other hand have a clearly defined
meaning.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 17:43 +0200, Martin Schwidefsky wrote:
> This question did come up already. arch_free_page() is done before the
> PageReserved() check so it isn't suitable for stable/unused state
> transitions. You can argue that arch_free_page() should be moved but who
> knows what the architecture defined function is supposed to do?
> page_set_stable/page_set_unused on the other hand have a clearly defined
> meaning.
With a patch set this large, I think it would at least be a nice thing
to go through and review the other architectures' uses.
-- Dave
On Fri, 2006-09-01 at 08:56 -0700, Dave Hansen wrote:
> > This question did come up already. arch_free_page() is done before the
> > PageReserved() check so it isn't suitable for stable/unused state
> > transitions. You can argue that arch_free_page() should be moved but who
> > knows what the architecture defined function is supposed to do?
> > page_set_stable/page_set_unused on the other hand have a clearly defined
> > meaning.
>
> With a patch set this large, I think it would at least be a nice thing
> to go through and review the other architectures' uses.
There is only one user right now: UML.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 18:05 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 08:56 -0700, Dave Hansen wrote:
> > > This question did come up already. arch_free_page() is done before the
> > > PageReserved() check so it isn't suitable for stable/unused state
> > > transitions. You can argue that arch_free_page() should be moved but who
> > > knows what the architecture defined function is supposed to do?
> > > page_set_stable/page_set_unused on the other hand have a clearly defined
> > > meaning.
> >
> > With a patch set this large, I think it would at least be a nice thing
> > to go through and review the other architectures' uses.
>
> There is only one user right now: UML.
Then it should be awfully easy to go figure out what they are doing and
coordinate by using the same functions, right? ;)
-- Dave
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <[email protected]>
> From: Hubertus Franke <[email protected]>
> From: Himanshu Raj <[email protected]>
>
> [patch 1/9] Guest page hinting: unused / free pages.
>
> A very simple but already quite effective improvement in the handling
> of guest memory vs. host memory is to tell the host when pages are
> free.
Would it be an idea to place this interface in-between the
per-cpu free page lists and the buddy allocator, so we can
move a batch of pages around at once and do the hinting in
a batched fashion ?
That way the overhead will be acceptable not just on S390
(where things are millicoded), but also on hypervisor based
virtualization like Xen.
Easy enough to pass a vector of pages to the hypervisor.
--
What is important? What you want to be true, or what is true?
Rik van Riel wrote:
> Martin Schwidefsky wrote:
>
>> From: Martin Schwidefsky <[email protected]>
>> From: Hubertus Franke <[email protected]>
>> From: Himanshu Raj <[email protected]>
>>
>> [patch 1/9] Guest page hinting: unused / free pages.
>>
>> A very simple but already quite effective improvement in the handling
>> of guest memory vs. host memory is to tell the host when pages are
>> free.
>
>
> Would it be an idea to place this interface in-between the
> per-cpu free page lists and the buddy allocator, so we can
> move a batch of pages around at once and do the hinting in
> a batched fashion ?
>
> That way the overhead will be acceptable not just on S390
> (where things are millicoded), but also on hypervisor based
> virtualization like Xen.
>
> Easy enough to pass a vector of pages to the hypervisor.
>
Rik, I thought that what we did.
Martin, I see the code actually does it when the page goes into the hot/cold
list. I can't remember conciously moving to that.
I thought we had a decent hit on the hot/cold, so that bulking makes sense.
Then the interface of bulking could be introduced and for s390 it could internally
be implemented as a sequence of ESSA instruction.
Do you remember the reason why we ended up putting it as part of hot/cold freeing?
-- Hubertus
Hubertus Franke wrote:
> Rik van Riel wrote:
>> Easy enough to pass a vector of pages to the hypervisor.
>
> Rik, I thought that what we did.
> Martin, I see the code actually does it when the page goes into the
> hot/cold
> list. I can't remember conciously moving to that.
> I thought we had a decent hit on the hot/cold, so that bulking makes sense.
>
> Then the interface of bulking could be introduced and for s390 it could
> internally be implemented as a sequence of ESSA instruction.
Note that the transition _to_ volatile can also be batched
and done somewhat lazily. For frequently mmaped pages that
could end up saving us the transition the other way, too...
That could make page hinting very acceptable performance wise,
even without a millicode implementation.
--
What is important? What you want to be true, or what is true?
On Tue, 2006-09-12 at 21:29 -0400, Rik van Riel wrote:
> Note that the transition _to_ volatile can also be batched
> and done somewhat lazily. For frequently mmaped pages that
> could end up saving us the transition the other way, too...
That would be helpful, only how to do it? We need some sort of list or
array where to store the pages that should be made volatile. The main
problem that I see is that you have to remove a page that is freed from
the list/array again, otherwise you would end up with a non page-cache
page being made volatile. That makes using per-cpu arrays hard since a
page can be freed on another cpu.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Tue, 2006-09-12 at 21:29 -0400, Rik van Riel wrote:
>
>>Note that the transition _to_ volatile can also be batched
>>and done somewhat lazily. For frequently mmaped pages that
>>could end up saving us the transition the other way, too...
>
>
> That would be helpful, only how to do it? We need some sort of list or
> array where to store the pages that should be made volatile. The main
> problem that I see is that you have to remove a page that is freed from
> the list/array again, otherwise you would end up with a non page-cache
> page being made volatile. That makes using per-cpu arrays hard since a
> page can be freed on another cpu.
>
Martin. the point was that pages
which are in the hold/cold lists are technically free.
However we keep them stable.
When the hot/cold list is spilled back to the buddy allocator
we make them volatile in buld (i.e. through the array).
So we only build the array for the duration of the bulk-release
to the buddy allocator (and potentially the other way as well).
Hence there is no "state" to maintain or track for the array.
Pages in the hot/cold lists remain stable.
This would not any of the problems you described as long as we hold
the lock for the hot/cold list during buld-volatile.
I know we did this once. I think we "discarded" the idea
in order to have as many pages volatile as possible.
But this design should be quite easily added and the architecture
can choose whether to be aggressive about this or do it in bulk.
-- Hubertus
On Wed, 2006-09-13 at 08:06 -0400, Hubertus Franke wrote:
> Martin Schwidefsky wrote:
> > On Tue, 2006-09-12 at 21:29 -0400, Rik van Riel wrote:
> >
> >>Note that the transition _to_ volatile can also be batched
> >>and done somewhat lazily. For frequently mmaped pages that
> >>could end up saving us the transition the other way, too...
> >
> >
> > That would be helpful, only how to do it? We need some sort of list or
> > array where to store the pages that should be made volatile. The main
> > problem that I see is that you have to remove a page that is freed from
> > the list/array again, otherwise you would end up with a non page-cache
> > page being made volatile. That makes using per-cpu arrays hard since a
> > page can be freed on another cpu.
> >
>
>
> Martin. the point was that pages
> which are in the hold/cold lists are technically free.
> However we keep them stable.
> When the hot/cold list is spilled back to the buddy allocator
> we make them volatile in buld (i.e. through the array).
You mean unused.
> So we only build the array for the duration of the bulk-release
> to the buddy allocator (and potentially the other way as well).
> Hence there is no "state" to maintain or track for the array.
> Pages in the hot/cold lists remain stable.
> This would not any of the problems you described as long as we hold
> the lock for the hot/cold list during buld-volatile.
I was not talking about free pages, and I don't think Rik was either.
The idea is to be lazy about the make-volatile calls. Put the pages for
which a make-volatile call should be done to some array/list and do a
bulk make-volatile. These pages are still in the page/swap cache. The
trouble is we have to be sure these pages have not been freed in the
meantime.
The bulk set-unused/set-stable to the buddy allocator should not be to
problematic. We just have to find new places where to do the calls.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Wed, 2006-09-13 at 08:06 -0400, Hubertus Franke wrote:
>
>>Martin Schwidefsky wrote:
>>
>>>On Tue, 2006-09-12 at 21:29 -0400, Rik van Riel wrote:
>>>
>>>
>>>>Note that the transition _to_ volatile can also be batched
>>>>and done somewhat lazily. For frequently mmaped pages that
>>>>could end up saving us the transition the other way, too...
>>>
>>>
>>>That would be helpful, only how to do it? We need some sort of list or
>>>array where to store the pages that should be made volatile. The main
>>>problem that I see is that you have to remove a page that is freed from
>>>the list/array again, otherwise you would end up with a non page-cache
>>>page being made volatile. That makes using per-cpu arrays hard since a
>>>page can be freed on another cpu.
>>>
>>
>>
>>Martin. the point was that pages
>>which are in the hold/cold lists are technically free.
>>However we keep them stable.
>>When the hot/cold list is spilled back to the buddy allocator
>>we make them volatile in buld (i.e. through the array).
>
>
> You mean unused.
Yes ...
>
>
>>So we only build the array for the duration of the bulk-release
>>to the buddy allocator (and potentially the other way as well).
>>Hence there is no "state" to maintain or track for the array.
>>Pages in the hot/cold lists remain stable.
>>This would not any of the problems you described as long as we hold
>>the lock for the hot/cold list during buld-volatile.
>
>
> I was not talking about free pages, and I don't think Rik was either.
> The idea is to be lazy about the make-volatile calls. Put the pages for
> which a make-volatile call should be done to some array/list and do a
> bulk make-volatile. These pages are still in the page/swap cache. The
> trouble is we have to be sure these pages have not been freed in the
> meantime.
Reread Rik's message. First message is about freeing pages.
Second is about "to-volatile".
I agree with you that if we batch "to-volatile" we need to keep
state, which does seem to have the problem you describe.
On the fly design thoughts ..
Let's assume we want to keep in the order of 1K pages to batch.
If we hash the pages that are made logically volatile but have not been
submitted yet, then on free_page we check whether the page is hashed
and if so remove it from the hashtable.
If the hashtable fills up we commit.
The hashtable is global or node specific (can't be per-cpu as you pointed
out).
This can be done reasonable efficient.
But another trouble you have not mentioned is what happens to pages with pending
make-volatile that need to and/or have been made stable in the meantime. They
too need to be removed from this pending list.
-- Hubertus
>
> The bulk set-unused/set-stable to the buddy allocator should not be to
> problematic. We just have to find new places where to do the calls.
>
Hubertus Franke wrote:
> But another trouble you have not mentioned is what happens to pages
> with pending make-volatile that need to and/or have been made stable
> in the meantime. They too need to be removed from this pending list.
At the time where you walk the set of pages (pagevec?) to make
volatile, you can check whether the page flags are still right.
A page that was set to be marked volatile with the hypervisor,
but later turned stable again would have that indicated in its
page flags, right?
--
What is important? What you want to be true, or what is true?
On Wed, 2006-09-13 at 10:45 -0400, Rik van Riel wrote:
> > But another trouble you have not mentioned is what happens to pages
> > with pending make-volatile that need to and/or have been made stable
> > in the meantime. They too need to be removed from this pending list.
>
> At the time where you walk the set of pages (pagevec?) to make
> volatile, you can check whether the page flags are still right.
A make volatile can be done anytime as long as the page is in page
cache. Before a page can be made stable the caller needs to make sure
that one of the conditions that prevent a make volatile becomes true.
So a page in the pending make-volatile array does not have to be removed
because a make stable has been done. It only has to be removed if it
gets freed.
> A page that was set to be marked volatile with the hypervisor,
> but later turned stable again would have that indicated in its
> page flags, right?
Several page flag bits and some other conditions like "has a mapping"
and "reference count is map count + 1". Most of the checks that need to
be done for make volatile are on page flags.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Wed, 2006-09-13 at 10:45 -0400, Rik van Riel wrote:
>
>>>But another trouble you have not mentioned is what happens to pages
>>>with pending make-volatile that need to and/or have been made stable
>>>in the meantime. They too need to be removed from this pending list.
>>
>>At the time where you walk the set of pages (pagevec?) to make
>>volatile, you can check whether the page flags are still right.
>
>
> A make volatile can be done anytime as long as the page is in page
> cache. Before a page can be made stable the caller needs to make sure
> that one of the conditions that prevent a make volatile becomes true.
> So a page in the pending make-volatile array does not have to be removed
> because a make stable has been done. It only has to be removed if it
> gets freed.
>
>
>>A page that was set to be marked volatile with the hypervisor,
>>but later turned stable again would have that indicated in its
>>page flags, right?
>
>
> Several page flag bits and some other conditions like "has a mapping"
> and "reference count is map count + 1". Most of the checks that need to
> be done for make volatile are on page flags.
>
Interesting..
But don't we have to do some locking on the page to avoid race conditions?
A page needs to be consistent between check through __page_discardable and
committing to the hypervisor. We could raise the PG_state_change flag for
that period.