Greetings,
the circus is back in town -- another version of the guest page hinting
patches. The patches differ from version 6 only in the kernel version;
they apply against 2.6.29. My short sniff test showed that the code
is still working as expected.
To recap (you can skip this if you read the boiler plate of the last
version of the patches):
The main benefit of guest page hinting vs. the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, for a complex algorithm that calculates the working set sizes, or
for calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.
Right after booting, the page states on my 256 MB z/VM guest looked like
this (r=resident, p=preserved, z=zero, S=stable, U=unused,
P=potentially volatile, V=volatile):
<state>|--tot--|---r---|---p---|---z---|
   S   |  19719|  19673|      0|     46|
   U   | 235416|   2734|      0| 232682|
   P   |      1|      1|      0|      0|
   V   |   7008|   7008|      0|      0|
 tot-> | 262144|  29416|      0| 232728|
About 25% of the pages are in volatile state. After grepping through the
Linux source tree this picture changes:
<state>|--tot--|---r---|---p---|---z---|
   S   |  43784|  43744|      0|     40|
   U   |  78631|   2397|      0|  76234|
   P   |      2|      2|      0|      0|
   V   | 139727| 139727|      0|      0|
 tot-> | 262144| 185870|      0|  76274|
About 75% of the pages are now volatile. Depending on the workload you
will get different results.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> If the host picks one of the
> pages the guest can recreate, the host can throw it away instead of writing
> it to the paging device. Simple and elegant.
Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
to call *anything* that requires a new CPU instruction elegant. ;)
I don't see any description of it in there any more, but I thought this
entire patch set was to get rid of the idiotic triple I/Os in the
following scenario:
1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
to get it written out. (I/O #1)
2. Linux comes along (being a bit late to the party) and picks the same
page, and also decides it needs to go out to disk
3. Linux tries to write the page to disk, but touches it in the
process, pulling the page back in from the store where the hypervisor
wrote it. (I/O #2)
4. Linux writes the page to its swap device (I/O #3)
I don't see that mentioned at all in the current description.
Simplifying the hypervisor is hard to get behind, but cutting system I/O
by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)
Can we persuade the hypervisor to tell us which pages it decided to page
out and just skip those when we're scanning the LRU?
-- Dave
Dave Hansen wrote:
> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
>> If the host picks one of the
>> pages the guest can recreate, the host can throw it away instead of writing
>> it to the paging device. Simple and elegant.
>
> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> to call *anything* that requires a new CPU instruction elegant. ;)
I am convinced that it could be done with a guest-writable
"bitmap", with 2 bits per page. That would make this scheme
useful for KVM, too.
> I don't see any description of it in there any more, but I thought this
> entire patch set was to get rid of the idiotic triple I/Os in the
> following scenario:
> I don't see that mentioned at all in the current description.
> Simplifying the hypervisor is hard to get behind, but cutting system I/O
> by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)
Cutting down on a fair bit of IO is absolutely worth
1200 lines of fairly well isolated code.
> Can we persuade the hypervisor to tell us which pages it decided to page
> out and just skip those when we're scanning the LRU?
The easiest "notification" points are in the page fault
handler and the page cache lookup code.
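For the lookup side, something like this (a standalone sketch with
invented types, not actual kernel code):

/* Model of a "notification point" in the page cache lookup: if the
 * host discarded a volatile page, drop the stale cache entry so the
 * caller simply rereads the page from its backing store. */
struct gpage {
        int discarded;          /* set when the host threw the page away */
        void *data;
};

static struct gpage *cache_lookup(struct gpage **slot)
{
        struct gpage *page = *slot;

        if (page && page->discarded) {
                *slot = 0;      /* forget the discarded page */
                page = 0;       /* caller falls back to reading from disk */
        }
        return page;
}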
--
All rights reversed.
On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> Greetings,
> the circus is back in town -- another version of the guest page hinting
> patches. The patches differ from version 6 only in the kernel version,
> they apply against 2.6.29. My short sniff test showed that the code
> is still working as expected.
>
> To recap (you can skip this if you read the boiler plate of the last
> version of the patches):
> The main benefit for guest page hinting vs. the ballooner is that there
> is no need for a monitor that keeps track of the memory usage of all the
> guests, a complex algorithm that calculates the working set sizes and for
> the calls into the guest kernel to control the size of the balloons.
I thought you weren't convinced of the concrete benefits over ballooning,
or am I misremembering?
Thanks,
Rusty.
On Fri, 27 Mar 2009 16:03:43 -0700
Dave Hansen <[email protected]> wrote:
> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> > If the host picks one of the
> > pages the guest can recreate, the host can throw it away instead of writing
> > it to the paging device. Simple and elegant.
>
> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> to call *anything* that requires a new CPU instruction elegant. ;)
Hey, it's cool if you can request an instruction to solve your problem :-)
> I don't see any description of it in there any more, but I thought this
> entire patch set was to get rid of the idiotic triple I/Os in the
> following scenario:
>
> 1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
> to get it written out. (I/O #1)
> 2. Linux comes along (being a bit late to the party) and picks the same
> page, also decides it needs to be out to disk
> 3. Linux tries to write the page to disk, but touches it in the
> process, pulling the page back in from the store where the hypervisor
> wrote it. (I/O #2)
> 4. Linux writes the page to its swap device (I/O #3)
>
> I don't see that mentioned at all in the current description.
> Simplifying the hypervisor is hard to get behind, but cutting system I/O
> by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)
You are right, for a newcomer to the party the advantages of this
approach are not really obvious. I should have copied some more text from
the boilerplate of the previous versions.
Yes, the guest page hinting code aims to reduce the host's swap I/O.
There are two scenarios, one is the above, the other is a simple
read-only file cache page.
Without hinting:
1. Hypervisor picks a page and evicts it, that is one write I/O
2. Linux accesses the page and causes a host page fault. The host reads
the page from its swap disk, one read I/O.
In total 2 I/O operations.
With hinting:
1. Hypervisor picks a page, finds it volatile and throws it away.
2. Linux accesses the page and gets a discard fault from the host. Linux
reads the file page from its block device.
This is just one I/O operation.
> Can we persuade the hypervisor to tell us which pages it decided to page
> out and just skip those when we're scanning the LRU?
One principle of the whole approach is that the hypervisor does not
call into an otherwise idle guest. The cost of scheduling the virtual
cpu is just too high. So we would need a means to store the information
where the guest can pick it up when it happens to do its LRU scanning.
I don't think that this will work out.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Fri, 27 Mar 2009 20:06:03 -0400
Rik van Riel <[email protected]> wrote:
> Dave Hansen wrote:
> > On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> >> If the host picks one of the
> >> pages the guest can recreate, the host can throw it away instead of writing
> >> it to the paging device. Simple and elegant.
> >
> > Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> > to call *anything* that requires a new CPU instruction elegant. ;)
>
> I am convinced that it could be done with a guest-writable
> "bitmap", with 2 bits per page. That would make this scheme
> useful for KVM, too.
This was our initial approach before we came up with the milli-code
instruction. The reason we did not use a bitmap was to prevent the
guest from changing the host state (4 guest states U/S/V/P and 3 host
states r/p/z). With the full set of states you'd need 4 bits. And the
host needs to have a "master" copy of the host bits, one the guest
cannot change, otherwise you get into trouble.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Sat, 28 Mar 2009 17:05:28 +1030
Rusty Russell <[email protected]> wrote:
> On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > Greetings,
> > the circus is back in town -- another version of the guest page hinting
> > patches. The patches differ from version 6 only in the kernel version,
> > they apply against 2.6.29. My short sniff test showed that the code
> > is still working as expected.
> >
> > To recap (you can skip this if you read the boiler plate of the last
> > version of the patches):
> > The main benefit for guest page hinting vs. the ballooner is that there
> > is no need for a monitor that keeps track of the memory usage of all the
> > guests, a complex algorithm that calculates the working set sizes and for
> > the calls into the guest kernel to control the size of the balloons.
>
> I thought you weren't convinced of the concrete benefits over ballooning,
> or am I misremembering?
The performance tests I have seen so far show that the benefits of
ballooning vs. guest page hinting are about the same. I am still
convinced that the guest page hinting is the way to go because you do
not need an external monitor. Calculating the working set size for a
guest is a challenge. With guest page hinting there is no need for a
working set size calculation.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Fri, 27 Mar 2009 20:06:03 -0400
> Rik van Riel <[email protected]> wrote:
>
>> Dave Hansen wrote:
>>> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
>>>> If the host picks one of the
>>>> pages the guest can recreate, the host can throw it away instead of writing
>>>> it to the paging device. Simple and elegant.
>>> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
>>> to call *anything* that requires a new CPU instruction elegant. ;)
>> I am convinced that it could be done with a guest-writable
>> "bitmap", with 2 bits per page. That would make this scheme
>> useful for KVM, too.
>
> This was our initial approach before we came up with the milli-code
> instruction. The reason we did not use a bitmap was to prevent the
> guest to change the host state (4 guest states U/S/V/P and 3 host
> states r/p/z). With the full set of states you'd need 4 bits. And the
> hosts need to have a "master" copy of the host bits, one the guest
> cannot change, otherwise you get into trouble.
KVM already has the info from the host bits somewhere else,
which is needed to be able to actually find the physical
pages used by a guest.
That leaves just the guest states, so a compare-and-swap may
work for non-s390.
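As a standalone sketch (nothing like this exists in KVM today; the
array layout and names are invented), the guest could pack its two
state bits per page into a shared array and update them with a
compare-and-swap, so the host never sees a torn update and never has
to write the guest bits:

#include <stdatomic.h>
#include <stdint.h>

/* 2 guest-state bits per page, 4 pages per byte. The host only ever
 * reads this array; its own r/p/z state lives elsewhere. */
enum { PG_STABLE = 0, PG_UNUSED = 1, PG_VOLATILE = 2, PG_POTVOLATILE = 3 };

#define PAGES_PER_BYTE 4

static _Atomic uint8_t hint_map[1 << 20];   /* covers 4M pages in this model */

static void set_page_hint(unsigned long pfn, unsigned int state)
{
        _Atomic uint8_t *byte = &hint_map[pfn / PAGES_PER_BYTE];
        unsigned int shift = (pfn % PAGES_PER_BYTE) * 2;
        uint8_t old = atomic_load(byte), newval;

        do {
                newval = (old & ~(3u << shift)) | (state << shift);
        } while (!atomic_compare_exchange_weak(byte, &old, newval));
}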
--
All rights reversed.
On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote:
> > Can we persuade the hypervisor to tell us which pages it decided to page
> > out and just skip those when we're scanning the LRU?
>
> One principle of the whole approach is that the hypervisor does not
> call into an otherwise idle guest. The cost of schedulung the virtual
> cpu is just too high. So we would a means to store the information where
> the guest can pick it up when it happens to do LRU. I don't think that
> this will work out.
I didn't mean for it to actively notify the guest. Perhaps, as Rik
said, have a bitmap where the host can set or clear bits for the guest to
see.
As the guest is scanning the LRU, it checks the structure (or makes an
hcall or whatever) and sees that the hypervisor has already taken care
of the page. It skips these pages in the first round of scanning.
I do see what you're saying about this saving the page-*out* operation
on the hypervisor side. It can simply toss out pages instead of paging
them itself. That's a pretty advanced optimization, though. What would
this code look like if we didn't optimize to that level?
It also occurs to me that the hypervisor could be doing a lot of this
internally. This whole scheme is about telling the hypervisor about
pages that we (the kernel) know we can regenerate. The hypervisor
should know a lot of that information, too. We ask it to populate a
page with stuff from virtual I/O devices or write a page out to those
devices. The page remains volatile until something from the guest
writes to it. The hypervisor could keep a record of how to recreate the
page as long as it remains volatile and clean.
That wouldn't cover things like page cache from network filesystems,
though.
This patch does look like the full monty but I have to wonder what other
partial approaches are out there.
-- Dave
On Mon, 30 Mar 2009 08:54:55 -0700
Dave Hansen <[email protected]> wrote:
> On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote:
> > > Can we persuade the hypervisor to tell us which pages it decided to page
> > > out and just skip those when we're scanning the LRU?
> >
> > One principle of the whole approach is that the hypervisor does not
> > call into an otherwise idle guest. The cost of schedulung the virtual
> > cpu is just too high. So we would a means to store the information where
> > the guest can pick it up when it happens to do LRU. I don't think that
> > this will work out.
>
> I didn't mean for it to actively notify the guest. Perhaps, as Rik
> said, have a bitmap where the host can set or clear bit for the guest to
> see.
Yes, agreed.
> As the guest is scanning the LRU, it checks the structure (or makes an
> hcall or whatever) and sees that the hypervisor has already taken care
> of the page. It skips these pages in the first round of scanning.
As long as we make this optional I'm fine with it. On s390 with the
current implementation that translates to an ESSA call, which is not
exactly inexpensive; we are talking about > 100 cycles. The better
solution for us is to age the page with the standard active/inactive
processing.
> I do see what you're saying about this saving the page-*out* operation
> on the hypervisor side. It can simply toss out pages instead of paging
> them itself. That's a pretty advanced optimization, though. What would
> this code look like if we didn't optimize to that level?
Why? It is just a simple test in the host's LRU scan. If the page is at
the end of the inactive list AND has the volatile state then don't
bother with writeback, just throw it away. This is the only place where
the host has to check for the page state.
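As a standalone model of that test (the types and helper names are
invented, this is not the actual z/VM code):

/* Model of the extra check in the host's inactive list scan. */
enum guest_state { G_STABLE, G_UNUSED, G_POTVOLATILE, G_VOLATILE };

struct hpage {
        enum guest_state gstate;    /* state set by the guest */
};

/* Returns 1 if the page can be dropped without any write I/O,
 * 0 if it has to go through normal writeback to the paging device.
 * (Potentially volatile pages are left out of this model.) */
static int host_can_discard(struct hpage *p)
{
        return p->gstate == G_UNUSED || p->gstate == G_VOLATILE;
}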
> It also occurs to me that the hypervisor could be doing a lot of this
> internally. This whole scheme is about telling the hypervisor about
> pages that we (the kernel) know we can regenerate. The hypervisor
> should know a lot of that information, too. We ask it to populate a
> page with stuff from virtual I/O devices or write a page out to those
> devices. The page remains volatile until something from the guest
> writes to it. The hypervisor could keep a record of how to recreate the
> page as long as it remains volatile and clean.
Unfortunately it is not that simple. There are quite a few reasons why
a page has to be made stable. You'd have to pass that information back
and forth between the guest and the host, otherwise the host will throw
away e.g. an mlocked page because it is still marked as volatile in the
virtual block device.
> That wouldn't cover things like page cache from network filesystems,
> though.
Yes, there are pages with a backing store the host knows nothing about.
> This patch does look like the full monty but I have to wonder what other
> partial approaches are out there.
I am open to suggestions. The simplest partial approach is already
implemented for s390: unused/stable transitions in the buddy allocator.
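As a sketch of that part (the real code lives behind the
arch_free_page()/arch_alloc_page() hooks, takes a struct page *, and
does the state change with the ESSA instruction; the helper below is
just a stand-in):

#define PAGE_STATE_STABLE  0
#define PAGE_STATE_UNUSED  1

/* Stand-in for the architecture-specific page hinting primitive. */
static void set_page_state(void *page, int order, int state)
{
        (void)page; (void)order; (void)state;
}

/* Pages returned to the buddy allocator become unused, so the host
 * may simply take them; pages handed out become stable again. */
void arch_free_page(void *page, int order)
{
        set_page_state(page, order, PAGE_STATE_UNUSED);
}

void arch_alloc_page(void *page, int order)
{
        set_page_state(page, order, PAGE_STATE_STABLE);
}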
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Dave Hansen wrote:
> It also occurs to me that the hypervisor could be doing a lot of this
> internally. This whole scheme is about telling the hypervisor about
> pages that we (the kernel) know we can regenerate. The hypervisor
> should know a lot of that information, too. We ask it to populate a
> page with stuff from virtual I/O devices or write a page out to those
> devices. The page remains volatile until something from the guest
> writes to it. The hypervisor could keep a record of how to recreate the
> page as long as it remains volatile and clean.
>
That potentially pushes a lot of complexity elsewhere. If you have
multiple paths to a storage device, or a cluster store shared between
multiple machines, then the underlying storage can change, making the
guest's copies of those blocks unbacked. Obviously the host/hypervisor
could deal with that, but it would be a pile of new mechanisms which
don't necessarily exist (for example, it would have to be an active
participant in a distributed locking scheme for a shared block device
rather than just passing it all through to the guest to handle).
That said, people have been looking at tracking block IO to work out
when it might be useful to try and share pages between guests under Xen.
J
Jeremy Fitzhardinge wrote:
> That said, people have been looking at tracking block IO to work out
> when it might be useful to try and share pages between guests under Xen.
Tracking block IO seems like a bass-ackwards way to figure
out what the contents of a memory page are.
The KVM KSM code has a simpler, yet still efficient, way of
figuring out which memory pages can be shared.
--
All rights reversed.
Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>
>> That said, people have been looking at tracking block IO to work out
>> when it might be useful to try and share pages between guests under Xen.
>
> Tracking block IO seems like a bass-ackwards way to figure
> out what the contents of a memory page are.
Well, they're research projects, so nobody said that they're necessarily
useful results ;). I think the rationale is that, in general, there
aren't all that many sharable pages, and aside from zero-pages, the bulk
of them are the result of IO. Since it's much simpler to compare
device+block references than to do page content matching, it is worth
looking at the IO stream to work out what your candidates are.
> The KVM KSM code has a simpler, yet still efficient, way of
> figuring out which memory pages can be shared.
How's that? Does it do page content comparison?
J
Jeremy Fitzhardinge wrote:
> Rik van Riel wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>> That said, people have been looking at tracking block IO to work out
>>> when it might be useful to try and share pages between guests under Xen.
>>
>> Tracking block IO seems like a bass-ackwards way to figure
>> out what the contents of a memory page are.
>
> Well, they're research projects, so nobody said that they're necessarily
> useful results ;). I think the rationale is that, in general, there
> aren't all that many sharable pages, and asize from zero-pages, the bulk
> of them are the result of IO.
I'll give you a hint: Windows zeroes out freed pages.
It should also be possible to hook up arch_free_page() so
freed pages in Linux guests become sharable.
Furthermore, every guest with the same OS version will be
running the same system daemons, same glibc, etc. This
means sharable pages from not just disk IO (probably from
different disks anyway), but also in the BSS and possibly
even on the heap.
>> The KVM KSM code has a simpler, yet still efficient, way of
>> figuring out which memory pages can be shared.
> How's that? Does it do page content comparison?
Eventually. It starts out with hashing the first 128 (IIRC)
bytes of page content and comparing the hashes. If the hashes
match, it will do a full content comparison.
Content comparison is done in the background on the host.
I suspect (but have not checked) that it is somehow hooked
up to the page reclaim code on the host.
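Roughly this flow (just a model; I do not remember exactly what hash
KSM uses, FNV-1a below is only an example):

#include <string.h>
#include <stdint.h>

#define PAGE_SIZE   4096
#define HASH_BYTES  128

/* Cheap first pass: hash only the start of the page. */
static uint32_t page_hash(const unsigned char *page)
{
        uint32_t h = 2166136261u;
        for (int i = 0; i < HASH_BYTES; i++)
                h = (h ^ page[i]) * 16777619u;
        return h;
}

/* Only pages whose hashes collide get the full (expensive) compare. */
static int pages_identical(const unsigned char *a, const unsigned char *b)
{
        if (page_hash(a) != page_hash(b))
                return 0;
        return memcmp(a, b, PAGE_SIZE) == 0;
}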
--
All rights reversed.
Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>> Rik van Riel wrote:
>>> Jeremy Fitzhardinge wrote:
>>>
>>>> That said, people have been looking at tracking block IO to work
>>>> out when it might be useful to try and share pages between guests
>>>> under Xen.
>>>
>>> Tracking block IO seems like a bass-ackwards way to figure
>>> out what the contents of a memory page are.
>>
>> Well, they're research projects, so nobody said that they're
>> necessarily useful results ;). I think the rationale is that, in
>> general, there aren't all that many sharable pages, and asize from
>> zero-pages, the bulk of them are the result of IO.
>
> I'll give you a hint: Windows zeroes out freed pages.
Right: "aside from zero-pages". If you exclude zero-pages from your
count of shared pages, the amount of sharing drops a lot.
> It should also be possible to hook up arch_free_page() so
> freed pages in Linux guests become sharable.
>
> Furthermore, every guest with the same OS version will be
> running the same system daemons, same glibc, etc. This
> means sharable pages from not just disk IO (probably from
> different disks anyway),
Why? If you're starting a bunch of cookie-cutter guests, then you're
probably starting them from the same template image or COW block
devices. (Also, if you're wearing the cost of physical IO anyway, then
the additional cost of hashing is relatively small.)
> but also in the BSS and possibly
> even on the heap.
Well, maybe. Modern systems generally randomize memory layouts, so even
if they're semantically the same, the pointers will all have different
values.
Other research into "sharing" mostly-similar pages is more promising for
that kind of case.
> Eventually. It starts out with hashing the first 128 (IIRC)
> bytes of page content and comparing the hashes. If that
> matches, it will do content comparison.
>
> Content comparison is done in the background on the host.
> I suspect (but have not checked) that it is somehow hooked
> up to the page reclaim code on the host.
Yeah, that's the straightforward approach; there's about one research
project per year doing a Xen implementation, but they never seem to get
very good results aside from very artificial test conditions.
J
Jeremy Fitzhardinge wrote:
> Rik van Riel wrote:
>
>> Jeremy Fitzhardinge wrote:
>>
>>> Rik van Riel wrote:
>>>
>>>> Jeremy Fitzhardinge wrote:
>>>>
>>>>
>>>>> That said, people have been looking at tracking block IO to work
>>>>> out when it might be useful to try and share pages between guests
>>>>> under Xen.
>>>>>
>>>> Tracking block IO seems like a bass-ackwards way to figure
>>>> out what the contents of a memory page are.
>>>>
>>> Well, they're research projects, so nobody said that they're
>>> necessarily useful results ;). I think the rationale is that, in
>>> general, there aren't all that many sharable pages, and asize from
>>> zero-pages, the bulk of them are the result of IO.
>>>
>> I'll give you a hint: Windows zeroes out freed pages.
>>
>
> Right: "aside from zero-pages". If you exclude zero-pages from your
> count of shared pages, the amount of sharing drops a lot.
>
>
>> It should also be possible to hook up arch_free_page() so
>> freed pages in Linux guests become sharable.
>>
>> Furthermore, every guest with the same OS version will be
>> running the same system daemons, same glibc, etc. This
>> means sharable pages from not just disk IO (probably from
>> different disks anyway),
>>
>
> Why? If you're starting a bunch of cookie-cutter guests, then you're
> probably starting them from the same template image or COW block
> devices. (Also, if you're wearing the cost of physical IO anyway, then
> additional cost of hashing is relatively small.)
>
>
>> but also in the BSS and possibly
>> even on the heap.
>>
>
> Well, maybe. Modern systems generally randomize memory layouts, so even
> if they're semantically the same, the pointers will all have different
> values.
>
> Other research into "sharing" mostly-similar pages is more promising for
> that kind of case.
>
>
>> Eventually. It starts out with hashing the first 128 (IIRC)
>> bytes of page content and comparing the hashes. If that
>> matches, it will do content comparison.
>>
The algorithm was changed quite a bit. Izik is planning to resubmit it
any day now.
>> Content comparison is done in the background on the host.
>> I suspect (but have not checked) that it is somehow hooked
>> up to the page reclaim code on the host.
>>
>
> Yeah, that's the straightforward approach; there's about a research
> project/year doing a Xen implementation, but they never seem to get very
> good results aside from very artificial test conditions.
>
Actually we got really good results using ksm along with kvm, running
a large number of Windows virtual machines. We can achieve an overcommit
ratio of up to 400% of the host RAM for VMs doing M$ Office load.
-dor
Jeremy Fitzhardinge wrote:
>> Rik van Riel wrote:
>>
>>> Jeremy Fitzhardinge wrote:
>>>
>>>> Rik van Riel wrote:
>>>>
>>>>> Jeremy Fitzhardinge wrote:
>>>>>
>>>>>
>>>>>> That said, people have been looking at tracking block IO to work
>>>>>> out when it might be useful to try and share pages between guests
>>>>>> under Xen.
>>>>>>
>>>>> Tracking block IO seems like a bass-ackwards way to figure
>>>>> out what the contents of a memory page are.
>>>>>
>>>> Well, they're research projects, so nobody said that they're
>>>> necessarily useful results ;). I think the rationale is that, in
>>>> general, there aren't all that many sharable pages, and asize from
>>>> zero-pages, the bulk of them are the result of IO.
>>> I'll give you a hint: Windows zeroes out freed pages.
>>>
>>
>> Right: "aside from zero-pages". If you exclude zero-pages from your
>> count of shared pages, the amount of sharing drops a lot.
20026 root  15   0  707m 526m 246m S  7.0 14.0   0:39.57 qemu-system-x86
20010 root  15   0  707m 526m 239m S  6.3 14.0   0:47.16 qemu-system-x86
20015 root  15   0  707m 526m 247m S  5.7 14.0   0:46.84 qemu-system-x86
20031 root  15   0  707m 526m 242m S  5.7 14.1   0:46.74 qemu-system-x86
20005 root  15   0  707m 526m 239m S  0.3 14.0   0:54.16 qemu-system-x86
I just ran 5 Debian 5.0 guests, each with 512 MB of physical RAM;
all I did was open X, and then open Thunderbird and Firefox in
them. Check the SHR field...
You cannot ignore the fact that the libraries and the kernel would be
identical among guests and would be shared...
Other than the libraries we get the big bonus called the zero page in
Windows, but that is really not the case for the above example since
these guests are Linux...
>>
>>
>>> It should also be possible to hook up arch_free_page() so
>>> freed pages in Linux guests become sharable.
>>>
>>> Furthermore, every guest with the same OS version will be
>>> running the same system daemons, same glibc, etc. This
>>> means sharable pages from not just disk IO (probably from
>>> different disks anyway),
>>>
>>
>> Why? If you're starting a bunch of cookie-cutter guests, then you're
>> probably starting them from the same template image or COW block
>> devices. (Also, if you're wearing the cost of physical IO anyway,
>> then additional cost of hashing is relatively small.)
>>
>>
>>> but also in the BSS and possibly
>>> even on the heap.
>>>
>>
>> Well, maybe. Modern systems generally randomize memory layouts, so
>> even if they're semantically the same, the pointers will all have
>> different values.
>>
>> Other research into "sharing" mostly-similar pages is more promising
>> for that kind of case.
>>
>>
>>> Eventually. It starts out with hashing the first 128 (IIRC)
>>> bytes of page content and comparing the hashes. If that
>>> matches, it will do content comparison.
>>>
> The algorithm was changed quite a bit. Izik is planning to resubmit it
> any day now.
>>> Content comparison is done in the background on the host.
>>> I suspect (but have not checked) that it is somehow hooked
>>> up to the page reclaim code on the host.
>>>
>>
>> Yeah, that's the straightforward approach; there's about a research
>> project/year doing a Xen implementation, but they never seem to get
>> very good results aside from very artificial test conditions.
I keep hearing this argument from Microsoft, but even in the hardest test
conditions, how would you make the libraries and the kernel not be
identical among the guests?
Anyway, page sharing is running and installed for our customers, and so
far I only hear from the sales guys how surprised and happy the customers
are with the overcommit that page sharing offers...
Anyway, I have a massively-changed (mostly the logical algorithm for
finding pages) ksm version that I made against the mainline version; it
is ready to be sent once I get some better benchmark numbers to
post on the list together with the patch...
On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> On Sat, 28 Mar 2009 17:05:28 +1030
>
> Rusty Russell <[email protected]> wrote:
> > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > Greetings,
> > > the circus is back in town -- another version of the guest page hinting
> > > patches. The patches differ from version 6 only in the kernel version,
> > > they apply against 2.6.29. My short sniff test showed that the code
> > > is still working as expected.
> > >
> > > To recap (you can skip this if you read the boiler plate of the last
> > > version of the patches):
> > > The main benefit for guest page hinting vs. the ballooner is that there
> > > is no need for a monitor that keeps track of the memory usage of all
> > > the guests, a complex algorithm that calculates the working set sizes
> > > and for the calls into the guest kernel to control the size of the
> > > balloons.
> >
> > I thought you weren't convinced of the concrete benefits over ballooning,
> > or am I misremembering?
>
> The performance test I have seen so far show that the benefits of
> ballooning vs. guest page hinting are about the same. I am still
> convinced that the guest page hinting is the way to go because you do
> not need an external monitor. Calculating the working set size for a
> guest is a challenge. With guest page hinting there is no need for a
> working set size calculation.
Sounds backwards to me. If the benefits are the same, then I'd rather
have the complexity in an external monitor (which, by the way, shares
many problems and goals with single-kernel resource/workload
management) than put a huge chunk of crap in the guest kernel's core
mm code.
I still think this needs much more justification.
On Thu, 2 Apr 2009 22:32:00 +1100
Nick Piggin <[email protected]> wrote:
> On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> > On Sat, 28 Mar 2009 17:05:28 +1030
> >
> > Rusty Russell <[email protected]> wrote:
> > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > > Greetings,
> > > > the circus is back in town -- another version of the guest page hinting
> > > > patches. The patches differ from version 6 only in the kernel version,
> > > > they apply against 2.6.29. My short sniff test showed that the code
> > > > is still working as expected.
> > > >
> > > > To recap (you can skip this if you read the boiler plate of the last
> > > > version of the patches):
> > > > The main benefit for guest page hinting vs. the ballooner is that there
> > > > is no need for a monitor that keeps track of the memory usage of all
> > > > the guests, a complex algorithm that calculates the working set sizes
> > > > and for the calls into the guest kernel to control the size of the
> > > > balloons.
> > >
> > > I thought you weren't convinced of the concrete benefits over ballooning,
> > > or am I misremembering?
> >
> > The performance test I have seen so far show that the benefits of
> > ballooning vs. guest page hinting are about the same. I am still
> > convinced that the guest page hinting is the way to go because you do
> > not need an external monitor. Calculating the working set size for a
> > guest is a challenge. With guest page hinting there is no need for a
> > working set size calculation.
>
> Sounds backwards to me. If the benefits are the same, then having
> complexity in an external monitor (which, by the way, shares many
> problems and goals of single-kernel resource/workload management),
> rather than putting a huge chunk of crap in the guest kernel's core
> mm code.
The benefits are the same but the algorithmic complexity is reduced.
The patch to the memory management has complexity in itself, but from a
1000-foot standpoint guest page hinting is simpler, no? The question of
how much memory each guest has to release does not exist. With the
ballooner I have seen a few problematic cases where the size of
the balloon in principle killed the guest. My favorite is the "clever"
monitor script that queried the guest's free memory and put all free
memory into the balloon. Now guess what happened with a guest that just
booted..
And could you please explain with a few more words >what< you consider
to be "crap"? I can't do anything with a general statement "this is
crap". Which translates to me as: leave me alone..
> I still think this needs much more justification.
Ok, I can understand that. We probably need a KVM based version to show
that benefits exist on non-s390 hardware as well.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
>> I still think this needs much more justification.
>>
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.
>
BTW, there was a presentation at the most recent Xen summit which makes
use of CMM ("Satori: Enlightened Page Sharing",
http://www.xen.org/files/xensummit_oracle09/Satori.pdf).
J
On Friday 03 April 2009 02:52:49 Martin Schwidefsky wrote:
> On Thu, 2 Apr 2009 22:32:00 +1100
> Nick Piggin <[email protected]> wrote:
>
> > On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> > > On Sat, 28 Mar 2009 17:05:28 +1030
> > >
> > > Rusty Russell <[email protected]> wrote:
> > > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > > > Greetings,
> > > > > the circus is back in town -- another version of the guest page hinting
> > > > > patches. The patches differ from version 6 only in the kernel version,
> > > > > they apply against 2.6.29. My short sniff test showed that the code
> > > > > is still working as expected.
> > > > >
> > > > > To recap (you can skip this if you read the boiler plate of the last
> > > > > version of the patches):
> > > > > The main benefit for guest page hinting vs. the ballooner is that there
> > > > > is no need for a monitor that keeps track of the memory usage of all
> > > > > the guests, a complex algorithm that calculates the working set sizes
> > > > > and for the calls into the guest kernel to control the size of the
> > > > > balloons.
> > > >
> > > > I thought you weren't convinced of the concrete benefits over ballooning,
> > > > or am I misremembering?
> > >
> > > The performance test I have seen so far show that the benefits of
> > > ballooning vs. guest page hinting are about the same. I am still
> > > convinced that the guest page hinting is the way to go because you do
> > > not need an external monitor. Calculating the working set size for a
> > > guest is a challenge. With guest page hinting there is no need for a
> > > working set size calculation.
> >
> > Sounds backwards to me. If the benefits are the same, then having
> > complexity in an external monitor (which, by the way, shares many
> > problems and goals of single-kernel resource/workload management),
> > rather than putting a huge chunk of crap in the guest kernel's core
> > mm code.
>
> The benefits are the same but the algorithmic complexity is reduced.
> The patch to the memory management has complexity in itself but from a
> 1000 feet standpoint guest page hinting is simpler, no?
Yeah, but that's a tradeoff I'll begrudgingly make, considering
a) lots of people doing workload management inside cgroups/containers
   need similar algorithmic complexity, so improvements to those
   algorithms will help one another
b) it may be adding complexity, but it isn't adding complexity to a
   subsystem that is already among the most complex in the kernel
c) I don't have to help maintain it
> The question
> how much memory each guest has to release does not exist. With the
> balloner I have seen a few problematic cases where the size of
> the balloon in principle killed the guest. My favorite is the "clever"
> monitor script that queried the guests free memory and put all free
> memory into the balloon. Now gues what happened with a guest that just
> booted..
>
> And could you please explain with a few more words >what< you consider
> to be "crap"? I can't do anything with a general statement "this is
> crap". Which translates to me: leave me alone..
:) No, it's cool code, interesting idea etc., and last time I looked I
don't think I saw any fundamental (or even any significant incidental)
bugs.
So I guess my problem with it is that it adds complexity to benefit a
small portion of users where there is already another solution that
another set of users already require.
> > I still think this needs much more justification.
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.
Should be significantly better than ballooning too.
Martin Schwidefsky wrote:
> The benefits are the same but the algorithmic complexity is reduced.
> The patch to the memory management has complexity in itself but from a
> 1000 feet standpoint guest page hinting is simpler, no?
Page hinting has a complex, but well understood, mechanism
and simple policy.
Ballooning has a simpler mechanism, but relies on an
as-of-yet undiscovered policy.
Having experienced a zillion VM corner cases over the
last decade and a bit, I think I prefer a complex mechanism
over complex (or worse, unknown!) policy any day.
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.
I believe it can work for KVM just fine, if we keep the host state
and the guest state in separate places (so the guest can always
write the guest state without a hypercall).
On Friday 03 April 2009 06:06:31 Rik van Riel wrote:
> Martin Schwidefsky wrote:
> > The benefits are the same but the algorithmic complexity is reduced.
> > The patch to the memory management has complexity in itself but from a
> > 1000 feet standpoint guest page hinting is simpler, no?
> Page hinting has a complex, but well understood, mechanism
> and simple policy.
>
> Ballooning has a simpler mechanism, but relies on an
> as-of-yet undiscovered policy.
>
> Having experienced a zillion VM corner cases over the
> last decade and a bit, I think I prefer a complex mechanism
> over complex (or worse, unknown!) policy any day.
I disagree with it being so clear cut. Volatile pagecache policy is completely
out of the control of the Linux VM. Whereas ballooning does have to make
some tradeoff between guests, the actual reclaim will be driven by the guests.
Neither way is perfect, but it's not like the hypervisor reclaim is foolproof
against making a bad tradeoff between guests.
On Thu, 2 Apr 2009, Martin Schwidefsky wrote:
> On Thu, 2 Apr 2009 22:32:00 +1100
> Nick Piggin <[email protected]> wrote:
>
> > I still think this needs much more justification.
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.
That would indeed help your cause enormously (I think I made the same
point last time). All these complex transitions, added to benefit only
an architecture to which few developers have access, ask for trouble -
we mm hackers already get caught out often enough by your
too-well-camouflaged page_test_dirty().
Hugh
Rik van Riel wrote:
> Page hinting has a complex, but well understood, mechanism
> and simple policy.
>
For the guest perhaps, and yes, it does push the problem out to the
host. But that doesn't make solving a performance problem any easier if
you end up in a mess.
> Ballooning has a simpler mechanism, but relies on an
> as-of-yet undiscovered policy.
>
(I'm talking about Xen ballooning here; I know KVM ballooning works
differently.)
Yes and no. If you want to be able to shrink the guest very
aggressively, then you need to be very careful about not shrinking too
much for its current and near-future needs. But you'll get into an
equivalently bad state with page hinting if the host decides to swap out
and discard lots of persistent guest pages.
When the host demands memory from the guest, the simple case of
ballooning is analogous to page hinting:
* give up free pages == mark pages unused
* give up clean pages == mark pages volatile
* cause pressure to release some memory == host swapping
The flipside is how guests can ask for memory if their needs increase
again. Page-hinting is fault-driven, so the guest may stall while the
host sorts out some memory to back the guest's pages. Ballooning
requires the guest to explicitly ask for memory, and that could be done
in advance if it notices the pool of easily-freed pages is shrinking
rapidly (though I guess it could be done on demand as well, but we don't
have hooks for that).
But of course, there are other approaches people are playing with, like
Dan Magenheimer's transcendent memory, which is a pool of
hypervisor-owned and managed pages which guests can use via a copy
interface, as a second-chance page discard cache, fast swap, etc. Such
mechanisms may be easier on both the guest complexity and policy fronts.
The more complex host policy decisions of how to balance overall memory
use system-wide are much the same for both mechanisms.
J
Nick Piggin wrote:
> On Friday 03 April 2009 06:06:31 Rik van Riel wrote:
>
>> Ballooning has a simpler mechanism, but relies on an
>> as-of-yet undiscovered policy.
>>
>> Having experienced a zillion VM corner cases over the
>> last decade and a bit, I think I prefer a complex mechanism
>> over complex (or worse, unknown!) policy any day.
>>
>
> I disagree with it being so clear cut. Volatile pagecache policy is completely
> out of the control of the Linux VM. Wheras ballooning does have to make some
> tradeoff between guests, but the actual reclaim will be driven by the guests.
> Neither way is perfect, but it's not like the hypervisor reclaim is foolproof
> against making a bad tradeoff between guests.
>
I guess we could try to figure out a simple and robust policy
for ballooning. If we can come up with a policy which nobody
can shoot holes in by just discussing it, it may be worth
implementing and benchmarking.
Maybe something based on the host passing memory pressure
on to the guests, and the guests having their own memory
pressure push back to the host.
I'll start by telling you the best auto-ballooning policy idea
I have come up with so far, and the (major?) hole in it.
Basically, the host needs the memory pressure notification,
where the VM will notify the guests when memory is running
low (and something could soon be swapped). At that point,
each guest which receives the signal will try to free some
memory and return it to the host.
Each guest can have the reverse in its own pageout code.
Once memory pressure grows to a certain point (eg. when
the guest is about to swap something out), it could reclaim
a few pages from the host.
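In code, the two directions would look roughly like this (a
standalone model; every name and counter here is invented):

/* Pages the guest could drop cheaply, and pages currently given
 * back to the host through the balloon. */
static unsigned long easily_freeable_pages;
static unsigned long balloon_pages;

/* Host -> guest: the host is running low, give some pages back. */
static void on_host_pressure(unsigned long pages_wanted)
{
        unsigned long give = pages_wanted < easily_freeable_pages ?
                             pages_wanted : easily_freeable_pages;

        easily_freeable_pages -= give;
        balloon_pages += give;          /* these now belong to the host */
}

/* Guest -> host: the guest is about to swap, take a few pages back. */
static void on_guest_pressure(unsigned long pages_wanted)
{
        unsigned long take = pages_wanted < balloon_pages ?
                             pages_wanted : balloon_pages;

        balloon_pages -= take;
        easily_freeable_pages += take;
}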
If all the guests behave themselves, this could work.
However, even with just reasonably behaving guests,
differences between the VMs in each guest could lead
to unbalanced reclaiming, penalizing better behaving
guests.
If one guest is behaving badly, it could really impact
the other guests.
Can you think of improvements to this idea?
Can you think of another balloon policy that does
not have nasty corner cases?
Jeremy Fitzhardinge wrote:
> The more complex host policy decisions of how to balance overall
> memory use system-wide are much in the same for both mechanisms.
Not at all. Page hinting is just an optimization to host swapping, where
IO can be avoided on many of the pages that hit the end of the LRU.
No decisions have to be made at all about balancing memory use
between guests, it just happens through regular host LRU aging.
Automatic ballooning requires that something on the host figures
out how much memory each guest needs and sizes the guests
appropriately. All the proposed policies for that which I have
seen have some nasty corner cases or are simply very limited
in scope.
Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>> The more complex host policy decisions of how to balance overall
>> memory use system-wide are much in the same for both mechanisms.
> Not at all. Page hinting is just an optimization to host swapping, where
> IO can be avoided on many of the pages that hit the end of the LRU.
>
> No decisions have to be made at all about balancing memory use
> between guests, it just happens through regular host LRU aging.
When the host pages out a page belonging to guest A, then it's making a
policy decision on how large guest A should be compared to B. If the
policy is a global LRU on all guest pages, then that's still a policy on
guest sizes: the target size is a function of its working set, assuming
that the working set is well modelled by LRU. I imagine that if the
guest and host are both managing their pages with an LRU-like algorithm
you'll get some nasty interactions, which page hinting tries to alleviate.
> Automatic ballooning requires that something on the host figures
> out how much memory each guest needs and sizes the guests
> appropriately. All the proposed policies for that which I have
> seen have some nasty corner cases or are simply very limited
> in scope.
Well, you could apply something equivalent to a global LRU: ask for more
pages from guests who have the most unused pages. (I'm not saying that
it's necessarily a useful policy.)
J
Rik van Riel wrote:
> I guess we could try to figure out a simple and robust policy
> for ballooning. If we can come up with a policy which nobody
> can shoot holes in by just discussing it, it may be worth
> implementing and benchmarking.
>
> Maybe something based on the host passing memory pressure
> on to the guests, and the guests having their own memory
> pressure push back to the host.
>
> I'l start by telling you the best auto-ballooning policy idea
> I have come up with so far, and the (major?) hole in it.
>
I think the first step is to reasonably precisely describe what outcome
you're trying to get to. Once you have that you can start
talking about policies and mechanisms to achieve it. I suspect we all
have basically the same thing in mind, but there's no harm in being
explicit.
I'm assuming that:
1. Each domain has a minimum guaranteed amount of resident memory.
If you want to shrink a domain to smaller than that minimum, you
may as well take all its memory away (ie suspend to disk,
completely swap out, migrate elsewhere, etc). The amount is at
least the bare-minimum WSS for the domain, but it may be higher to
achieve other guarantees.
2. Each domain has a maximum allowable resident memory, which could
be unbounded. The sums of all maximums could well exceed the
total amount of host memory, and that represents the overcommit case.
3. Each domain has a weight, or memory priority. The simple case is
that they all have the same weight, but a useful implementation
would probably allow more.
4. Domains can be cooperative, unhelpful (ignore all requests and
make none) or malicious (ignore requests, always try to claim more
memory). An incompetent cooperative domain could be effectively
unhelpful or malicious.
* hard max limits will prevent non-cooperative domains from
causing too much damage
* they could be limited in other ways, by lowering IO or CPU
priorities
* a domain's "goodness" could be measured by looking to see
how much memory it is actually using relative to its min size
and its weight
* other remedies are essentially non-technical, such as more
expensive billing the more non-good a domain is
* (it's hard to force a Xen domain to give up memory you've
already given it)
Given that, what outcome do we want? What are we optimising for?
* Overall throughput?
* Fairness?
* Minimise wastage?
* Rapid response to changes in conditions? (Cope with domains
swinging between 64MB and 3GB on a regular basis?)
* Punish wrong-doers / Reward cooperative domains?
* ...?
Trying to make one thing work for all cases isn't going to be simple or
robust. If we pick one or two (minimise wastage+overall throughput?)
then it might be more tractable.
> Basically, the host needs the memory pressure notification,
> where the VM will notify the guests when memory is running
> low (and something could soon be swapped). At that point,
> each guest which receives the signal will try to free some
> memory and return it to the host.
>
> Each guest can have the reverse in its own pageout code.
> Once memory pressure grows to a certain point (eg. when
> the guest is about to swap something out), it could reclaim
> a few pages from the host.
>
> If all the guests behave themselves, this could work.
>
Yes. It seems to me the basic metric is that each domain needs to keep
track of how much easily allocatable memory it has on hand (ie, pages it
can drop without causing a significant increase in IO). If it gets too
large, then it can afford to give pages back to the host. If it gets
too small, it must ask for more memory (preferably early enough to
prevent a real memory crunch).
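Something like this, as a sketch (the thresholds are pulled out of
the air, and balloon_pages just stands for however many pages the
guest has currently handed back to the host):

#define LOW_WATERMARK_PAGES   (16 << 8)   /* ~16 MB worth of 4K pages */
#define HIGH_WATERMARK_PAGES  (64 << 8)   /* ~64 MB worth of 4K pages */

static void balance_balloon(unsigned long easily_freeable,
                            unsigned long *balloon_pages)
{
        if (easily_freeable > HIGH_WATERMARK_PAGES) {
                /* plenty of slack: return the excess to the host */
                *balloon_pages += easily_freeable - HIGH_WATERMARK_PAGES;
        } else if (easily_freeable < LOW_WATERMARK_PAGES && *balloon_pages) {
                /* getting tight: ask for memory back before a real crunch */
                unsigned long want = LOW_WATERMARK_PAGES - easily_freeable;

                *balloon_pages -= want < *balloon_pages ? want : *balloon_pages;
        }
}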
> However, even with just reasonably behaving guests,
> differences between the VMs in each guest could lead
> to unbalanced reclaiming, penalizing better behaving
> guests.
>
Well, it depends on what you mean by penalized. If they can function
properly with the amount of memory they have, then they're fine. If
they're struggling because they don't have enough memory for their WSS,
then they got their "do I have enough memory on hand" calculation wrong.
> If one guest is behaving badly, it could really impact
> the other guests.
>
> Can you think of improvements to this idea?
>
> Can you think of another balloon policy that does
> not have nasty corner cases?
>
In fully cooperative environments you can rely on ballooning to move
things around dramatically. But with only partially cooperative guests,
the best you can hope for is that it allows you some provisioning
flexibility so you can deal with fluctuating demands in guests, but not
order-of-magnitude size changes. You just have to leave enough headroom
to make the corner cases not too pointy.
J
On Thu, 02 Apr 2009 13:34:37 -0700
Jeremy Fitzhardinge <[email protected]> wrote:
> Rik van Riel wrote:
> > Jeremy Fitzhardinge wrote:
> >> The more complex host policy decisions of how to balance overall
> >> memory use system-wide are much in the same for both mechanisms.
> > Not at all. Page hinting is just an optimization to host swapping, where
> > IO can be avoided on many of the pages that hit the end of the LRU.
> >
> > No decisions have to be made at all about balancing memory use
> > between guests, it just happens through regular host LRU aging.
>
> When the host pages out a page belonging to guest A, then its making a
> policy decision on how large guest A should be compared to B. If the
> policy is a global LRU on all guest pages, then that's still a policy on
> guest sizes: the target size is a function of its working set, assuming
> that the working set is well modelled by LRU. I imagine that if the
> guest and host are both managing their pages with an LRU-like algorithm
> you'll get some nasty interactions, which page hinting tries to alleviate.
This is the basic idea of guest page hinting. Let the host memory
manager make its decision based on the data it has. That includes page
age determined with a global LRU list, page age determined with a
per-guest LRU list, i/o rates of the guests, and any kind of policy
about which guest should have how much memory. The page hinting comes
into play AFTER the decision has been made which page to evict. Only
then should the host look at the volatile vs. stable page state and
decide what has to be done with the page. If it is volatile the host
can throw the page away because the guest can recreate it with LESS
effort. That is the optimization.
> > Automatic ballooning requires that something on the host figures
> > out how much memory each guest needs and sizes the guests
> > appropriately. All the proposed policies for that which I have
> > seen have some nasty corner cases or are simply very limited
> > in scope.
>
> Well, you could apply something equivalent to a global LRU: ask for more
> pages from guests who have the most unused pages. (I'm not saying that
> its necessarily a useful policy.)
But with page hinting you don't have to even ask. Just take the pages
if you need them. The guest already told you that you can have them by
setting the unused state.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> This is the basic idea of guest page hinting. Let the host memory
> manager make it decision based on the data it has. That includes page
> age determined with a global LRU list, page age determined with a
> per-guest LRU list, i/o rates of the guests, all kind of policy which
> guest should have how much memory.
Do you look at fault rates? Refault rates?
> The page hinting comes into play
> AFTER the decision has been made which page to evict. Only then the host
> should look at the volatile vs. stable page state and decide what has
> to be done with the page. If it is volatile the host can throw the page
> away because the guest can recreate it with LESS effort. That is the
> optimization.
>
Yes, and it's good from that perspective. Do you really implement it
purely that way, or do you bias the LRU to push volatile and free pages
down to the end of the LRU list in preference to pages which must be
preserved? If you have a small bias then you can prefer to evict easily
evictable pages compared to their near-equivalents which require IO.
> But with page hinting you don't have to even ask. Just take the pages
> if you need them. The guest already told you that you can have them by
> setting the unused state.
>
Yes. But it still depends on the guest. A very helpful guest could
deliberately preswap pages so that it can mark them as volatile, whereas
a less helpful one may keep them persistent and defer preswapping them
until there's a good reason to do so. Host swapping and page hinting
won't put any apparent memory pressure on the guest, so it has no reason
to start preswapping even if the overall system is under pressure.
Ballooning will expose each guest to its share of the overall system
memory pressure, so they can respond appropriately (one hopes).
J
On Fri, 03 Apr 2009 11:19:24 -0700
Jeremy Fitzhardinge <[email protected]> wrote:
> Martin Schwidefsky wrote:
> > This is the basic idea of guest page hinting. Let the host memory
> > manager make it decision based on the data it has. That includes page
> > age determined with a global LRU list, page age determined with a
> > per-guest LRU list, i/o rates of the guests, all kind of policy which
> > guest should have how much memory.
>
> Do you look at fault rates? Refault rates?
That is hidden in the memory management of z/VM. I know some details
of how the z/VM page manager works but in the end, as the guest
operating system, I don't care.
> > The page hinting comes into play
> > AFTER the decision has been made which page to evict. Only then the host
> > should look at the volatile vs. stable page state and decide what has
> > to be done with the page. If it is volatile the host can throw the page
> > away because the guest can recreate it with LESS effort. That is the
> > optimization.
> >
>
> Yes, and its good from that perspective. Do you really implement it
> purely that way, or do you bias the LRU to push volatile and free pages
> down the end of the LRU list in preference to pages which must be
> preserved? If you have a small bias then you can prefer to evict easily
> evictable pages compared to their near-equivalents which require IO.
We thought about a bias to prefer volatile pages but in the end decided
against it. We do prefer free pages; if the page manager finds an unused
page it will reuse it immediately.
> > But with page hinting you don't have to even ask. Just take the pages
> > if you need them. The guest already told you that you can have them by
> > setting the unused state.
> >
>
> Yes. But it still depends on the guest. A very helpful guest could
> deliberately preswap pages so that it can mark them as volatile, whereas
> a less helpful one may keep them persistent and defer preswapping them
> until there's a good reason to do so. Host swapping and page hinting
> won't put any apparent memory pressure on the guest, so it has no reason
> to start preswapping even if the overall system is under pressure.
> Ballooning will expose each guest to its share of the overall system
> memory pressure, so they can respond appropriately (one hopes).
Why should the guest want to do preswapping? It is as expensive for
the host to swap a page and get it back as it is for the guest (= one
write + one read). It is a waste of cpu time to call into the guest. You
need something we call PFAULT though: if a guest process hits a page
that is missing in the host page table, you don't want to stop the
virtual cpu until the page is back. You notify the guest that the host
page is missing. The process that caused the fault is put to sleep
until the host has retrieved the page again. You will find the pfault
code for s390 in arch/s390/mm/fault.c.
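Very roughly, the two halves look like this (a simplified standalone
model; all names are invented, the real thing in arch/s390/mm/fault.c
looks different):

enum task_state { TASK_RUNNING, TASK_SLEEPING };

struct task {
        unsigned long pfault_token;
        enum task_state state;
};

/* Host -> guest "page missing": put only the faulting process to
 * sleep instead of stopping the whole virtual cpu. */
static void pfault_page_missing(struct task *tsk, unsigned long token)
{
        tsk->pfault_token = token;
        tsk->state = TASK_SLEEPING;     /* scheduler runs something else */
}

/* Host -> guest "page back in": wake the process up again. */
static void pfault_page_back(struct task *tsk, unsigned long token)
{
        if (tsk->pfault_token == token) {
                tsk->pfault_token = 0;
                tsk->state = TASK_RUNNING;
        }
}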
So to me preswap doesn't make sense. The only thing you can gain by
putting memory pressure on the guest is to free some of the memory that
is used by the kernel for dentries, inodes, etc.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Monday 06 April 2009 17:21:11 Martin Schwidefsky wrote:
> On Fri, 03 Apr 2009 11:19:24 -0700
> > Yes. But it still depends on the guest. A very helpful guest could
> > deliberately preswap pages so that it can mark them as volatile, whereas
> > a less helpful one may keep them persistent and defer preswapping them
> > until there's a good reason to do so. Host swapping and page hinting
> > won't put any apparent memory pressure on the guest, so it has no reason
> > to start preswapping even if the overall system is under pressure.
> > Ballooning will expose each guest to its share of the overall system
> > memory pressure, so they can respond appropriately (one hopes).
>
> Why should the guest want to do preswapping? It is as expensive for
> the host to swap a page and get it back as it is for the guest (= one
> write + one read). It is a waste of cpu time to call into the guest. You
> need something we call PFAULT though: if a guest process hits a page
> that is missing in the host page table you don't want to stop the
> virtual cpu until the page is back. You notify the guest that the host
> page is missing. The process that caused the fault is put to sleep
> until the host retrieved the page again. You will find the pfault code
> for s390 in arch/s390/mm/fault.c
>
> So to me preswap doesn't make sense. The only thing you can gain by
> putting memory pressure on the guest is to free some of the memory that
> is used by the kernel for dentries, inodes, etc.
The guest kernel can have more context about usage patterns, or user
hints set on some pages or ranges. And as you say, there are
non-pagecache things to free that can be taking a significant amount or
most of the freeable memory, and there can be policy knobs set in the
guest (swappiness or vfs_cache_pressure etc).
I guess that counters or performance monitoring events in the guest
should also look more like a normal Linux kernel's (although I don't
remember what you do in that department in your patches).
Martin Schwidefsky wrote:
> Why should the guest want to do preswapping? It is as expensive for
> the host to swap a page and get it back as it is for the guest (= one
> write + one read).
Yes, perhaps for swapping, but in general it makes sense for the guest
to write the pages to backing store to prevent host swapping. For swap
pages there's no big benefit, but for file-backed pages it's better for
the guest to do it.
> The only thing you can gain by
> putting memory pressure on the guest is to free some of the memory that
> is used by the kernel for dentries, inodes, etc.
>
Well, that's also significant. My point is that the guest has multiple
ways in which it can relieve its own memory pressure in response to
overall system memory pressure; it's just that I happened to pick the
example where it's much of a muchness.
J