2008-03-12 13:27:22

by Martin Schwidefsky

Subject: [patch 0/6] Guest page hinting version 6.

Greetings,
I've dusted off the guest page hinting patches and ported them to today's
upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
that is easy to fix. The code still works as expected on my test system.

Our z/VM performance team recently published a report on guest page
hinting vs. the ballooner approach on SLES10 for a farm of web servers.
The code on SLES10 differs a bit from the upstream variant but the
performance results should still be valid. You will find the report
here:

http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html

(the VMRM-CMM the web page speaks about is the balloon approach,
CMMA is the guest page hinting).

Both approaches to the memory overcommit problem show comparable benefits
for this workload, with an advantage for guest page hinting for a large
number of guests. For other workloads your mileage may vary.

The main benefit of guest page hinting over the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, for a complex algorithm that calculates the working set sizes, or
for calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.
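
A minimal sketch of the host-side eviction decision described above, assuming three simplified guest page states. The enum names and the `host_evict` function are illustrative only; they are not taken from the patches or from z/VM, and the real state set may be richer.

```c
#include <assert.h>

/* Hypothetical, simplified guest page states (names are assumptions). */
enum guest_page_state {
    GPS_STABLE,   /* guest needs the content preserved */
    GPS_UNUSED,   /* guest does not care about the content */
    GPS_VOLATILE, /* guest can recreate the content, e.g. clean page cache */
};

enum host_action {
    HOST_WRITE_TO_PAGING_DEVICE,
    HOST_DISCARD,
};

/*
 * Called after the host's normal LRU scan has already picked a page.
 * The hint only changes what happens to the victim, not which page
 * is chosen.
 */
enum host_action host_evict(enum guest_page_state state)
{
    switch (state) {
    case GPS_UNUSED:
    case GPS_VOLATILE:
        /* The guest can live without it: no paging I/O needed. */
        return HOST_DISCARD;
    case GPS_STABLE:
        return HOST_WRITE_TO_PAGING_DEVICE;
    }
    assert(!"unknown guest page state");
    return HOST_WRITE_TO_PAGING_DEVICE;
}
```

If the guest later touches a discarded page, it would take a discard fault and have to recreate the content itself; that guest-side handling is the complexity mentioned above and is not modelled here.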

The last versions of the patches do not differ much; I consider the code
to be stable. My question now is how to proceed with the code. I sure
would love to see the code going upstream some day but that depends on
the mm developers as the code adds complexity that needs to be supported.
If the general feeling is that the advantages of this approach do not
warrant the added complexity, this will likely be the last time you
will hear about guest page hinting.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.


2008-03-12 22:42:42

by Rusty Russell

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Thursday 13 March 2008 00:21:32 Martin Schwidefsky wrote:
> My question now is how to proceed with the code. I sure
> would love to see the code going upstream some day but that depends on
> the mm developers as the code adds complexity that needs to be supported.

Well, I want this feature, but I agree about complexity.

AFAICT, the trivial subset of this is the hinting of Unused pages. It seems
that would buy us something, and perhaps be a stepping stone to full page
hinting?

Cheers,
Rusty.

2008-03-13 09:47:23

by Martin Schwidefsky

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Thu, 2008-03-13 at 09:41 +1100, Rusty Russell wrote:
> On Thursday 13 March 2008 00:21:32 Martin Schwidefsky wrote:
> > My question now is how to proceed with the code. I sure
> > would love to see the code going upstream some day but that depends on
> > the mm developers as the code adds complexity that needs to be supported.
>
> Well, I want this feature, but I agree about complexity.
>
> AFAICT, the trivial subset of this is the hinting of Unused pages. It seems
> that would buy us something, and perhaps be a stepping stone to full page
> hinting?

I've been there but the unused page thing is so small that it doesn't
make sense to separate it from the patches. If I don't see any progress
then I will come up with a patch that adds the Unused state transitions
to the arch files of s390.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2008-03-13 17:00:36

by Hugh Dickins

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Wed, 12 Mar 2008, Martin Schwidefsky wrote:
> Greetings,
> I've dusted off the guest page hinting patches and ported them to today's
> upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
> that is easy to fix. The code still works as expected on my test system.
>
> Our z/VM performance team recently published a report on guest page
> hinting vs. the ballooner approach on SLES10 for a farm of web servers.
> The code on SLES10 differs a bit from the upstream variant but the
> performance results should be still valid. You will find the report
> here:
>
> http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html
>
> (the VMRM-CMM the web page speaks about is the balloon approach,
> CMMA is the guest page hinting).
>
> Both approaches to the memory overcommit problem show comparable benefits
> for this workload, with an advantage for guest page hinting for large
> number of guests. For other workloads your mileage may vary.
>
> The main benefit for guest page hinting vs. the ballooner is that there
> is no need for a monitor that keeps track of the memory usage of all the
> guests, a complex algorithm that calculates the working set sizes and for
> the calls into the guest kernel to control the size of the balloons.
> The host just does normal LRU based paging. If the host picks one of the
> pages the guest can recreate, the host can throw it away instead of writing
> it to the paging device. Simple and elegant.
> The main disadvantage is the added complexity that is introduced to the
> guests memory management code to do the page state changes and to deal
> with discard faults.
>
> The last versions of the patches do not differ much, I consider the code
> to be stable. My question now is how to proceed with the code. I sure
> would love to see the code going upstream some day but that depends on
> the mm developers as the code adds complexity that needs to be supported.
> If the general feeling is that the advantages of this approach do not
> warrant the added complexity, this will likely be the last time you
> will hear about guest page hinting.

Oh, that would be such a shame. Your guest page hinting patches remind
me of that childhood thrill, when once a year the circus comes to town ;)

But seriously, I'm ashamed to see my name in the Cc list: it would
be very unfair if your patches never made it in, just because I've
failed to find the time to wrap my own puny brain around them.

It's very encouraging to see Jeremy and Rusty weighing in. I hope
Zach will too, and I've added Andrea: their support would count a lot.
You have Nick on the list, good, I've added Christoph and Peter
(if you do resend, linux-mm might prove more useful than linux-kernel).

With support from rival virtualizers,
I do think you've a good chance of getting in.

Hugh

2008-03-13 17:15:19

by Martin Schwidefsky

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> > The last versions of the patches do not differ much, I consider the code
> > to be stable. My question now is how to proceed with the code. I sure
> > would love to see the code going upstream some day but that depends on
> > the mm developers as the code adds complexity that needs to be supported.
> > If the general feeling is that the advantages of this approach do not
> > warrant the added complexity, this will likely be the last time you
> > will hear about guest page hinting.
>
> Oh, that would be such a shame. Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)
>
> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.

It is an effort to get your head around it the first time. It gets
easier the more you talk about it :-)

> It's very encouraging to see Jeremy and Rusty weighing in. I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).

Grr, did I really forget to copy linux-mm?!? (..insert your favourite
four letter word here..). I absolutely intended to copy linux-mm but
somehow replaced it with linux-s390.

> With support from rival virtualizers,
> I do think you've a good chance of getting in.

Yes, it would be great if we can find another user for it.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2008-03-13 17:40:36

by Zachary Amsden

Subject: Re: [patch 0/6] Guest page hinting version 6.


On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> Oh, that would be such a shame. Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)

I like the circus too.

> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.

Bah! So modest.

> It's very encouraging to see Jeremy and Rusty weighing in. I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).

I agree the page hinting technique is generally useful, even
cross-architecture.

What doesn't appear to be useful however, is support for this under
VMware. It can be done, even without the writable pte support (yes,
really). But due to us exploiting optimizations at lower layers, it
doesn't appear that it will gain us any performance - and we must
already have the complex working set algorithms to support
non-paravirtualized guests.

> With support from rival virtualizers,
> I do think you've a good chance of getting in.

I would say we support it, but I don't expect us to make use of the
infrastructure anytime soon. For us it would make more sense to use the
swap-fault optimization, but this requires some significant design
changes in our monitor.

Either way, these are both great ideas and I would not want to be held
responsible for blocking their upstream progress. Someday, with the
evolving x86 architecture (if we ever get per-page dirty bits), they
might make sense for us to do as well.

Zach

2008-03-13 18:44:02

by Jeremy Fitzhardinge

Subject: Re: [patch 0/6] Guest page hinting version 6.

Hugh Dickins wrote:
> On Wed, 12 Mar 2008, Martin Schwidefsky wrote:
>
>> Greetings,
>> I've dusted off the guest page hinting patches and ported them to today's
>> upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
>> that is easy to fix. The code still works as expected on my test system.
>>
>> Our z/VM performance team recently published a report on guest page
>> hinting vs. the ballooner approach on SLES10 for a farm of web servers.
>> The code on SLES10 differs a bit from the upstream variant but the
>> performance results should be still valid. You will find the report
>> here:
>>
>> http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html
>>
>> (the VMRM-CMM the web page speaks about is the balloon approach,
>> CMMA is the guest page hinting).
>>
>> Both approaches to the memory overcommit problem show comparable benefits
>> for this workload, with an advantage for guest page hinting for large
>> number of guests. For other workloads your mileage may vary.
>>
>> The main benefit for guest page hinting vs. the ballooner is that there
>> is no need for a monitor that keeps track of the memory usage of all the
>> guests, a complex algorithm that calculates the working set sizes and for
>> the calls into the guest kernel to control the size of the balloons.
>> The host just does normal LRU based paging. If the host picks one of the
>> pages the guest can recreate, the host can throw it away instead of writing
>> it to the paging device. Simple and elegant.
>> The main disadvantage is the added complexity that is introduced to the
>> guests memory management code to do the page state changes and to deal
>> with discard faults.
>>
>> The last versions of the patches do not differ much, I consider the code
>> to be stable. My question now is how to proceed with the code. I sure
>> would love to see the code going upstream some day but that depends on
>> the mm developers as the code adds complexity that needs to be supported.
>> If the general feeling is that the advantages of this approach do not
>> warrant the added complexity, this will likely be the last time you
>> will hear about guest page hinting.
>>
>
> Oh, that would be such a shame. Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)
>
> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.
>
> It's very encouraging to see Jeremy and Rusty weighing in. I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).
>

I like the idea and it seems basically sound, but unfortunately Xen
won't be able to make use of it in the near term, because it doesn't
support any kind of backing for guest domain memory. There has been
some thought about adding this kind of functionality to Xen. Keir, Ian:
do you think this kind of support in the kernel would be useful to us?

One concern I have is that 4k is really a very fine grain. We're
thinking about moving Xen's memory management to operate in 2M chunk
units, which would allow guests to directly use large pages with the
corresponding reduction in TLB pressure. One side-effect of this is
that we'd need to change ballooning to be in 2M rather than 4k units in
order to prevent physical memory fragmentation.

Page hinting at 4k resolution poses the same problem. Would this
technique still be useful operating on 2M chunks? Certainly it seems
less likely that you could easily get a whole 2M area with the same
fine-grained properties that these patches track. Would some kind of
page/sub-page tracking be useful?
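
To make the granularity concern concrete, here is a rough sketch assuming a 2M chunk can only be discarded by the host when every 4k page inside it carries a discardable hint. The `hint` enum and `chunk_discardable` are hypothetical names for illustration, not part of Xen or of the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMALL_PAGE_SIZE (4UL * 1024)
#define CHUNK_SIZE      (2UL * 1024 * 1024)
#define PAGES_PER_CHUNK (CHUNK_SIZE / SMALL_PAGE_SIZE)

/* One 2M chunk covers 512 pages of 4k each. */
static_assert(PAGES_PER_CHUNK == 512, "2M / 4k");

/* Hypothetical per-4k-page hint, simplified to three values. */
enum hint { HINT_STABLE, HINT_UNUSED, HINT_VOLATILE };

/*
 * A chunk is discardable as a unit only if none of its 4k pages is
 * stable. A single stable page pins the whole 2M chunk.
 */
bool chunk_discardable(const enum hint hints[PAGES_PER_CHUNK])
{
    for (size_t i = 0; i < PAGES_PER_CHUNK; i++)
        if (hints[i] == HINT_STABLE)
            return false;
    return true;
}
```

Under this assumption one stable page out of 512 is enough to keep the chunk resident, which is why some form of page/sub-page tracking might still be needed.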

My other concern is just correctness over time on the Linux side. We
already have enough trouble keeping things like the pte and page
structure state in sync, with resulting rare data-loss bugs. Adding
another layer which only applies in specific environments raises the
possibility for new bugs to be un-noticed for a long time. How can we
structure the VM changes to make sure that it's robust in the face of
maintenance?

J

2008-03-13 18:59:08

by Hugh Dickins

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
>
> My other concern is just correctness over time on the Linux side. We already
> have enough trouble keeping things like the pte and page structure state in
> sync, with resulting rare data-loss bugs. Adding another layer which only
> applies in specific environments raises the possibility for new bugs to be
> un-noticed for a long time. How can we structure the VM changes to make sure
> that it's robust in the face of maintenance?

Yes, that's the main concern, as whenever lots of subtlety is added.
I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
run on anybody's x86 machine, without involving any virtualization, but
in which the PAGE_STATEs become essential to the correct working of the mm.

Hugh

2008-03-13 19:45:23

by Andrea Arcangeli

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Thu, Mar 13, 2008 at 10:45:07AM -0700, Zachary Amsden wrote:
> What doesn't appear to be useful however, is support for this under
> VMware. It can be done, even without the writable pte support (yes,
> really). But due to us exploiting optimizations at lower layers, it
> doesn't appear that it will gain us any performance - and we must
> already have the complex working set algorithms to support
> non-paravirtualized guests.

With non-paravirt all you can do is to swap the guest physical memory
(mmu notifiers allows linux to do that) or share memory (mmu notifiers
+ ksm allows linux to do that too). We also have complex working set
algorithms that we use to find which parts of the guest physical
address space are best to swap first: the core linux VM.

What paravirt allows us to do (and that's the whole point of the paper
I guess), is to go one step further than just guest swapping and to
ask the guest if the page really needs to be swapped or if it can be
freed right away. So this would be an extension of the mmu notifiers
(this also shows how EMM API is too restrictive, while MMU notifiers
will allow that extension in the future) to avoid I/O sometimes, if the
guest tells us through paravirt ops that it's not necessary to swap.

When talking with friends about ballooning I already once suggested to
auto inflate the balloon with pages in the freelist.

Now this paper goes well beyond the pages in the freelist (called
U/unused in the paper); it also covers cache and mapped-clean cache
in the guest. That would have been the next step.

Anyway plain ballooning remains useful as rss limiting or numa
compartments in the linux hypervisor, to provide unfairness to certain
guests.

I didn't read the patch yet, but I think paravirt knowledge about
U/unused pages is needed to avoid guest swapping. The cache and mapped
cache in the guest is a gray area, because linux as hypervisor will be
extremely efficient at swapping out and swapping in the guest cache
(host swapping guest cache may be faster than re-issuing a read-I/O
to refill the cache by itself, clearly with guest using
paravirt). Let's say I'm mostly interested in page-hinting for the
U pages initially.

I'm currently busy with two other features and trying to get mmu
notifier #v9 into mainline, which is orders of magnitude more important
than avoiding a few swapouts sometimes (without mmu notifiers
everything else is irrelevant, including guest page hinting and
including ballooning too, because madvise(MADV_DONTNEED) won't clear sptes
and invalidate guest tlbs).

2008-03-13 19:49:26

by Zachary Amsden

Subject: Re: [patch 0/6] Guest page hinting version 6.


On Thu, 2008-03-13 at 18:55 +0000, Hugh Dickins wrote:
> On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
> >
> > My other concern is just correctness over time on the Linux side. We already
> > have enough trouble keeping things like the pte and page structure state in
> > sync, with resulting rare data-loss bugs. Adding another layer which only
> > applies in specific environments raises the possibility for new bugs to be
> > un-noticed for a long time. How can we structure the VM changes to make sure
> > that it's robust in the face of maintenance?
>
> Yes, that's the main concern, as whenever lots of subtlety is added.
> I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
> run on anybody's x86 machine, without involving any virtualization, but
> in which the PAGE_STATEs become essential to the correct working of the mm.

How about a fake hypervisor, which is really just a random page evictor,
following the rules of CMM?

2008-03-13 21:37:39

by Zachary Amsden

Subject: Re: [patch 0/6] Guest page hinting version 6.


On Thu, 2008-03-13 at 20:45 +0100, Andrea Arcangeli wrote:
> On Thu, Mar 13, 2008 at 10:45:07AM -0700, Zachary Amsden wrote:
> > What doesn't appear to be useful however, is support for this under
> > VMware. It can be done, even without the writable pte support (yes,
> > really). But due to us exploiting optimizations at lower layers, it
> > doesn't appear that it will gain us any performance - and we must
> > already have the complex working set algorithms to support
> > non-paravirtualized guests.
>
> With non-paravirt all you can do is to swap the guest physical memory
> (mmu notifiers allows linux to do that) or share memory (mmu notifiers
> + ksm allows linux to do that too). We also have complex working set
> algorithms that we use to find which parts of the guest physical
> address space are best to swap first: the core linux VM.

We can tap into those algorithms just as effectively using ballooning,
and we've optimized the sharing and working set models from outside of
the guest. So while CMM gives slightly better information for a random
forced page eviction, the complexity doesn't appear to justify the
savings for a VMware implementation.

Things would change quite a bit if we had hardware supported per-page
dirty bits.

> than avoiding a few swapouts sometimes (without mmu notifiers
> everything else is irrelevant, including guest page hinting and
> including ballooning too, because madvise(MADV_DONTNEED) won't clear sptes
> and invalidate guest tlbs).

Ballooning still works if you use a kernel based balloon driver. Using
madvise wouldn't be a reliable way to balloon anyway. Are you talking
about an API to manage working sets and such from userspace?

Cheers,

Zach

2008-03-14 18:32:25

by Jeremy Fitzhardinge

Subject: Re: [patch 0/6] Guest page hinting version 6.

Zachary Amsden wrote:
> On Thu, 2008-03-13 at 18:55 +0000, Hugh Dickins wrote:
>
>> On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
>>
>>> My other concern is just correctness over time on the Linux side. We already
>>> have enough trouble keeping things like the pte and page structure state in
>>> sync, with resulting rare data-loss bugs. Adding another layer which only
>>> applies in specific environments raises the possibility for new bugs to be
>>> un-noticed for a long time. How can we structure the VM changes to make sure
>>> that it's robust in the face of maintenance?
>>>
>> Yes, that's the main concern, as whenever lots of subtlety is added.
>> I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
>> run on anybody's x86 machine, without involving any virtualization, but
>> in which the PAGE_STATEs become essential to the correct working of the mm.
>>
>
> How about a fake hypervisor, which is really just a random page evictor,
> following the rules of CMM?
>

Probably simpler to just have variants of the page_set_* functions which
simulate the worst-possible host action immediately (ie, stealing pages,
logically swapping them, etc). That wouldn't give you full coverage,
but it would go some way. An async variant which schedules a change in
a few milliseconds would help too.

I guess that's equivalent to having a special-purpose hypervisor built
into the kernel (hm, sounds familiar...).
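
As a rough illustration of the "worst-possible host action immediately" idea, here is a user-space model. All names (`sim_page`, `dbg_page_set_volatile`, `dbg_page_make_stable`, and so on) are made up for this sketch and are not the functions in the patches; the point is only that every transition to a discardable state is followed at once by a discard, so any caller that forgets to handle the discarded case fails right away.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum sim_state { SIM_STABLE, SIM_UNUSED, SIM_VOLATILE };

/* Toy stand-in for a page: a state, a discard flag, a little content. */
struct sim_page {
    enum sim_state state;
    bool discarded;
    char data[16];
};

/* Worst-case host: discard as soon as the guest allows it. */
static void sim_host_discard_now(struct sim_page *p)
{
    assert(p->state != SIM_STABLE); /* stable pages must never be taken */
    memset(p->data, 0, sizeof(p->data));
    p->discarded = true;
}

void dbg_page_set_volatile(struct sim_page *p)
{
    p->state = SIM_VOLATILE;
    sim_host_discard_now(p);
}

void dbg_page_set_unused(struct sim_page *p)
{
    p->state = SIM_UNUSED;
    sim_host_discard_now(p);
}

/*
 * Returns false if the content is gone; the caller then has to
 * recreate the page instead of using it.
 */
bool dbg_page_make_stable(struct sim_page *p)
{
    if (p->discarded)
        return false;
    p->state = SIM_STABLE;
    return true;
}
```

In this model `dbg_page_make_stable` always fails after a volatile or unused transition, so every recreate path gets exercised; an asynchronous variant would delay the discard instead of doing it inline.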

J

2008-03-14 21:28:10

by Zachary Amsden

Subject: Re: [patch 0/6] Guest page hinting version 6.


On Fri, 2008-03-14 at 11:30 -0700, Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
> > How about a fake hypervisor, which is really just a random page evictor,
> > following the rules of CMM?
> >
>
> Probably simpler to just have variants of the page_set_* functions which
> simulate the worst-possible host action immediately (ie, stealing pages,
> logically swapping them, etc). That wouldn't give you full coverage,
> but it would go some way. An async variant which schedules a change in
> a few milliseconds would help too.
>
> I guess that's equivalent to having a special-purpose hypervisor built
> into the kernel (hm, sounds familiar...).

It needn't be that hard on s390, I believe you don't need to worry about
PTEs becoming asynchronous when stealing a page, since if I understand
the hypervisor architecture, there is a per-page mapping level
available, allowing you to generate discard faults on access. It might
be possible to use this mapping layer without implementing a full blown
hypervisor. Martin?

For x86, at discard time, you would have to manually walk and invalidate
any PTEs potentially mapping the discarded page, but there is already
this great thing called Xen paravirt-ops which actually does that for
completely different reasons (PT page protection).

I think a random exponential distribution for discard would be needed to
catch all the racy failure modes.
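
A small sketch of how such randomized discard delays could be drawn. It uses a geometric distribution, the discrete counterpart of the exponential, so that it needs only integer arithmetic; the generator, the function names and the tick unit are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Tiny linear congruential generator so runs are reproducible by seed. */
struct lcg { uint32_t state; };

static uint32_t lcg_next(struct lcg *g)
{
    g->state = g->state * 1664525u + 1013904223u;
    return g->state;
}

/*
 * Number of ticks to wait before discarding a page. On every tick the
 * page is discarded with probability of roughly 1/mean_ticks, so short
 * delays are most likely and long ones become exponentially rarer; the
 * expected delay is about mean_ticks - 1 ticks before the cap applies.
 * Only the upper 16 bits of the generator are used, so mean_ticks
 * should stay well below 65536.
 */
unsigned geometric_delay(struct lcg *g, unsigned mean_ticks,
                         unsigned max_ticks)
{
    assert(mean_ticks > 0);
    unsigned ticks = 0;
    while (ticks < max_ticks && (lcg_next(g) >> 16) % mean_ticks != 0)
        ticks++;
    return ticks;
}
```

Mostly short delays with an occasional long one would hit both the "discarded right after the state change" window and the "discarded much later" window; logging the seed would make a failing run repeatable.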

Zach

2008-03-14 21:39:20

by Jeremy Fitzhardinge

Subject: Re: [patch 0/6] Guest page hinting version 6.

Zachary Amsden wrote:
> It needn't be that hard on s390, I believe you don't need to worry about
> PTEs becoming asynchronous when stealing a page, since if I understand
> the hypervisor architecture, there is a per-page mapping level
> available, allowing you to generate discard faults on access. It might
> be possible to use this mapping layer without implementing a full blown
> hypervisor. Martin?
>

Yes, I don't expect it's a problem for s390, but the point is making
something workalike enough to make sure there's an evenly distributed
number of explosions-in-face when things go wrong.

> For x86, at discard time, you would have to manually walk and invalidate
> any PTEs potentially mapping the discarded page, but there is already
> this great thing called Xen paravirt-ops which actually does that for
> completely different reasons (PT page protection).
>

Not sure I follow. Xen pvops pays attention to whether a particular
page is being used as part of a pagetable, and changes its permissions
accordingly. But because pagetable pages are strictly kernel-only, we
can get away with updating a single kernel-mapping pte which is shared
across all processes. In the guest page hinting case, we need to deal
with general pages which can be mapped anywhere, so that really does
require a full traversal of the pagetables. Presumably rmap would be
helpful here.
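
The general-page case above can be modelled in user space roughly as follows: a frame keeps a reverse map of every pte that points at it, and a discard has to clear all of them rather than one shared kernel pte. The structure and function names are invented for this sketch and do not correspond to the kernel's rmap API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT  0x1u
#define MAX_MAPPINGS 8

/* Toy frame with a fixed-size reverse map of the ptes that map it. */
struct sim_frame {
    uint32_t *ptes[MAX_MAPPINGS];
    size_t nr_ptes;
};

/* Map the frame through one more pte and record it in the reverse map. */
void sim_map(struct sim_frame *f, uint32_t *pte)
{
    assert(f->nr_ptes < MAX_MAPPINGS);
    *pte |= PTE_PRESENT;
    f->ptes[f->nr_ptes++] = pte;
}

/*
 * Discard the frame: every pte that maps it must be invalidated, in
 * whichever address space it lives. Returns how many ptes were cleared.
 */
size_t sim_discard(struct sim_frame *f)
{
    size_t n = f->nr_ptes;
    for (size_t i = 0; i < n; i++)
        *f->ptes[i] &= ~PTE_PRESENT;
    f->nr_ptes = 0;
    return n;
}
```

A real implementation would also need TLB invalidation and locking against concurrent mapping changes, which this model leaves out entirely.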

J

2008-03-17 08:22:04

by Martin Schwidefsky

Subject: Re: [patch 0/6] Guest page hinting version 6.

On Fri, 2008-03-14 at 14:32 -0700, Zachary Amsden wrote:
> On Fri, 2008-03-14 at 11:30 -0700, Jeremy Fitzhardinge wrote:
> > Zachary Amsden wrote:
> > > How about a fake hypervisor, which is really just a random page evictor,
> > > following the rules of CMM?
> > >
> >
> > Probably simpler to just have variants of the page_set_* functions which
> > simulate the worst-possible host action immediately (ie, stealing pages,
> > logically swapping them, etc). That wouldn't give you full coverage,
> > but it would go some way. An async variant which schedules a change in
> > a few milliseconds would help too.
> >
> > I guess that's equivalent to having a special-purpose hypervisor built
> > into the kernel (hm, sounds familiar...).
>
> It needn't be that hard on s390, I believe you don't need to worry about
> PTEs becoming asynchronous when stealing a page, since if I understand
> the hypervisor architecture, there is a per-page mapping level
> available, allowing you to generate discard faults on access. It might
> be possible to use this mapping layer without implementing a full blown
> hypervisor. Martin?

Yes, on s390 the PTEs cannot be asynchronous because there is no need to
synchronize them in the first place. A mapping layer with all primitives
without using the SIE instruction will be difficult. For one, we cannot
use the ESSA instruction, which isolates the state changes, and the host
page table is tied to the SIE. The page state is stored in the page table
extension and the discard state is basically a specially marked invalid
pte in the host table. A mapping layer with some restrictions is
certainly possible.

> For x86, at discard time, you would have to manually walk and invalidate
> any PTEs potentially mapping the discarded page, but there is already
> this great thing called Xen paravirt-ops which actually does that for
> completely different reasons (PT page protection).

If you have to walk the guest page tables you call into the guest, no?
I would characterize this more as a ballooner since you need guest
activity to do the page stealing. The trick with guest page hinting is
that you do NOT call into the guest to do the discard. You'll need a nested
page table for that I'm afraid.

> I think a random exponential distribution for discard would be needed to
> catch all the racy failure modes.

We had quite a few of these racy failures. Nasty. Hard to find.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.