hi,
i'm starting to play with XEN - the virtualisation project
(http://xen.sf.net).
i'll give some background first of all, and then the question - at the
bottom - will make sense [when posting to lkml i often get asked
questions that are already answered by the background material i
provide... *sigh*]
each virtual machine requires (typically) its own physical ram (a chunk
of the host's real memory) and some virtual memory - swapspace. xen
itself uses 32mb for its shm guest-OS inter-communication.
so, in the case i'm setting up, that's five virtual machines (only one
of which can get away with having just 32mb of ram; the rest require
64mb each), plus five lots of 256mbyte swap files.
the memory usage is the major concern: i only have 256mb of ram, and
you've probably by now added up that the above comes to 320mbytes
(4 x 64mb + 32mb for the fifth guest + xen's 32mb).
so i started looking at ways to minimise the memory usage:
first, reducing each machine to only 32mb of ram, and secondly,
on the host, creating a MASSIVE swap file (1gbyte), mounting a MASSIVE
shmfs/tmpfs partition (1gbyte) and then creating the guest swap files
inside the tmpfs partition!!!
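in concrete terms, the host-side setup would look roughly like this
(sizes and paths are just examples, not recommendations):

  # host-side setup, run at each boot
  dd if=/dev/zero of=/var/swapfile-host bs=1M count=1024   # 1gbyte host swap file
  mkswap /var/swapfile-host
  swapon /var/swapfile-host
  mkdir -p /mnt/guestswap
  mount -t tmpfs -o size=1024m tmpfs /mnt/guestswap        # 1gbyte tmpfs for guest swapfiles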
the reasoning behind doing this is quite straightforward: by placing the
swapfiles in a tmpfs, presumably when one of the guest OSes requires
some memory, RAM on the host OS will be used until the amount of RAM
requested exceeds the host OS's physical memory, and only then will it
spill out into swap-space.
this is presumed to be infinitely better than forcing the swapspace to
be always on disk, especially with the guests only being allocated
32mbyte of physical RAM.
here are the problems:
1) tmpfs doesn't support sparse files
2) files created in tmpfs don't support block devices (???)
3) as a workaround i have to create a swap image in a 256mb file
(dd if=/dev/zero of=/mnt/swapfile bs=1M count=256 and do mkswap on it)
and then copy the ENTIRE file into the tmpfs-mounted partition.
on every boot-up.
per swapfile needed.
eeeuw, yuk.
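in script form, that workaround comes out something like this, per
guest swapfile ("guest1" and the paths are just example names):

  # build the swap image on disk (this part could be done just once)
  dd if=/dev/zero of=/var/swap-images/guest1-swap bs=1M count=256
  mkswap /var/swap-images/guest1-swap
  # ...but the copy into tmpfs has to be redone on every boot, because
  # the tmpfs contents vanish at shutdown
  cp /var/swap-images/guest1-swap /mnt/guestswap/guest1-swap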
so, my question is a strategic one:
* in what other ways could the same result be achieved?
in other words, in what other ways can i publish block
devices from the master OS (and they must be block
devices for the XEN guest OSes to be able to see them)
that can be used as swap space and that will live in RAM
where possible, bearing in mind that they can be recreated
at boot time, i.e. they don't need to be persistent?
ta,
l.
--
http://lkcl.net
On Sun, Jan 02, 2005 at 04:26:52PM +0000, Luke Kenneth Casson Leighton wrote:
[...]
> this is presumed to be infinitely better than forcing the swapspace to
> be always on disk, especially with the guests only being allocated
> 32mbyte of physical RAM.
I'd be interested in knowing how a tmpfs that's gone far into swap
performs compared to a more normal on-disk fs. I don't know if anyone
has ever looked into it. Is it comparable, or is tmpfs's ability to
swap more a last-resort escape hatch?
This is the part where I would add something valuable to this
conversation, if I were going to do that. (But no.)
--
Joseph Fannin
[email protected]
On Mon, Jan 03, 2005 at 01:31:34PM -0500, Joseph Fannin wrote:
> On Sun, Jan 02, 2005 at 04:26:52PM +0000, Luke Kenneth Casson Leighton wrote:
> [...]
> > this is presumed to be infinitely better than forcing the swapspace to
> > be always on disk, especially with the guests only being allocated
> > 32mbyte of physical RAM.
>
> I'd be interested in knowing how a tmpfs that's gone far into swap
> performs compared to a more normal on-disk fs. I don't know if anyone
> has ever looked into it. Is it comparable, or is tmpfs's ability to
> swap more a last-resort escape hatch?
>
> This is the part where I would add something valuable to this
> conversation, if I were going to do that. (But no.)
:)
okay.
some kind person from ibm pointed out that if you use a file-backed
swap file (in xen terminology,
disk=['file:/xen/guest1-swapfile,/dev/sda2,rw'], which means "publish
guest1-swapfile on the DOM0 VM as the /dev/sda2 hard drive on the
guest1 VM") then you end up going through the linux filesystem cache
on DOM0, which is of course RAM-based.
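concretely, a minimal version of that setup might look like this (the
file path and sizes are just examples; double-check the disk= syntax
against your xen version):

  # on DOM0: create and format the guest's swap image
  dd if=/dev/zero of=/xen/guest1-swapfile bs=1M count=256
  mkswap /xen/guest1-swapfile

  # then in the guest1 domain config:
  #   disk = [ 'file:/xen/guest1-swapfile,/dev/sda2,rw' ]
  # and in guest1's /etc/fstab:
  #   /dev/sda2  none  swap  sw  0  0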
so this tends to suggest a strategy where you allocate as
much memory as you can afford to the DOM0 VM, and as little
as you can afford to the guests, and make the guest swap
files bigger to compensate.
... and i thought it was going to need some wacky wacko non-sharing
shared-memory virtual-memory pseudo-tmpfs block-based filesystem
driver. dang.
l.
On Mon, 3 Jan 2005, Luke Kenneth Casson Leighton wrote:
> On Mon, Jan 03, 2005 at 01:31:34PM -0500, Joseph Fannin wrote:
> > On Sun, Jan 02, 2005 at 04:26:52PM +0000, Luke Kenneth Casson Leighton wrote:
> > [...]
> > > this is presumed to be infinitely better than forcing the swapspace to
> > > be always on disk, especially with the guests only being allocated
> > > 32mbyte of physical RAM.
> >
> > I'd be interested in knowing how a tmpfs that's gone far into swap
> > performs compared to a more normal on-disk fs. I don't know if anyone
> > has ever looked into it. Is it comparable, or is tmpfs's ability to
> > swap more a last-resort escape hatch?
> >
> > This is the part where I would add something valuable to this
> > conversation, if I were going to do that. (But no.)
>
> :)
>
> okay.
>
> some kind person from ibm pointed out that if you use a file-backed
> swap file (in xen terminology,
> disk=['file:/xen/guest1-swapfile,/dev/sda2,rw'], which means "publish
> guest1-swapfile on the DOM0 VM as the /dev/sda2 hard drive on the
> guest1 VM") then you end up going through the linux filesystem cache
> on DOM0, which is of course RAM-based.
>
> so this tends to suggest a strategy where you allocate as
> much memory as you can afford to the DOM0 VM, and as little
> as you can afford to the guests, and make the guest swap
> files bigger to compensate.
But the guest kernels need real ram to run programs in.
The problem with dom0 doing the caching, is that dom0 has no idea about the
usage pattern for the swap. It's just a plain file to dom0. Only each guest
kernel knows how to combine swap reads/writes correctly.
> so this tends to suggest a strategy where you allocate as
> much memory as you can afford to the DOM0 VM, and as little
> as you can afford to the guests, and make the guest swap
> files bigger to compensate.
This is essentially what the mainframe folks are already doing, and have
been doing for some time, because the kernel VM has no external inputs
for saying "you are virtualised so be nice",
for doing opportunistic page recycling ("I dont need this page but when
I ask for it back please tell me if you trashed the content"), and for
hinting to the underlying VM which pages are best blasted out of
existence first, and how to communicate that so we don't page them back
in just by scanning them.
> for doing opportunistic page recycling ("I dont need this page but when
> I ask for it back please tell me if you trashed the content")
We've talked about doing this but AFAIK nobody has gotten round to it yet
because there hasn't been a pressing need (IIRC, it was on the todo list when
Xen 1.0 came out).
IMHO, it doesn't look terribly difficult but would require (hopefully small)
modifications to the architecture independent code, plus a little bit of
support code in Xen.
I'd quite like to look at this one fine day but I suspect there are more
useful things I should do first...
Cheers,
Mark
On Mon, Jan 03, 2005 at 03:07:42PM -0600, Adam Heath wrote:
> > so this tends to suggest a strategy where you allocate as
> > much memory as you can afford to the DOM0 VM, and as little
> > as you can afford to the guests, and make the guest swap
> > files bigger to compensate.
>
> But the guest kernels need real ram to run programs in.
>
> The problem with dom0 doing the caching, is that dom0 has no idea about the
> usage pattern for the swap. It's just a plain file to dom0. Only each guest
> kernel knows how to combine swap reads/writes correctly.
... hmm...
then that tends to suggest that this is an issue that should
really be dealt with by XEN.
that there needs to be coordination of swap management between the
virtual machines.
l.
--
http://lkcl.net
On Tue, 4 Jan 2005, Mark Williamson wrote:
>> for doing opportunistic page recycling ("I dont need this page but when
>> I ask for it back please tell me if you trashed the content")
>
> We've talked about doing this but AFAIK nobody has gotten round to it
> yet because there hasn't been a pressing need (IIRC, it was on the todo
> list when Xen 1.0 came out).
>
> IMHO, it doesn't look terribly difficult but would require (hopefully
> small) modifications to the architecture independent code, plus a little
> bit of support code in Xen.
The architecture independent changes are fine, since
they're also useful for S390(x), PPC64 and UML...
> I'd quite like to look at this one fine day but I suspect there are more
> useful things I should do first...
I wonder if the same effect could be achieved by just
measuring the VM pressure inside the guests and
ballooning the guests as required, letting them grow
and shrink with their workloads.
That wouldn't need many kernel changes, maybe just a
few extra statistics, or maybe all the needed stats
already exist. It would also allow more complex
policy to be done in userspace, eg. dealing with Xen
guests of different priority...
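As a very rough sketch of the guest-side half (the /proc/xen/balloon
control file and the value format written to it are assumptions here;
check the balloon driver in your Xen tree for the real interface):

  # inside a guest: crude memory-pressure check, then adjust the balloon target
  free_kb=$(awk '/^MemFree/ {print $2}' /proc/meminfo)
  if [ "$free_kb" -lt 4096 ]; then
      # under pressure: ask to grow back towards the domain's maximum
      echo 128M > /proc/xen/balloon
  fi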
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Tue, 4 Jan 2005, Luke Kenneth Casson Leighton wrote:
> then that tends to suggest that this is an issue that should
> really be dealt with by XEN.
Probably.
> that there needs to be coordination of swap management between the
> virtual machines.
I'd like to see the maximum security separation possible
between the unprivileged guests, though...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Tuesday 04 January 2005 04:04, Mark Williamson wrote:
> > for doing opportunistic page recycling ("I dont need this page but when
> > I ask for it back please tell me if you trashed the content")
>
> We've talked about doing this but AFAIK nobody has gotten round to it yet
> because there hasn't been a pressing need (IIRC, it was on the todo list when
> Xen 1.0 came out).
>
> IMHO, it doesn't look terribly difficult but would require (hopefully small)
> modifications to the architecture independent code, plus a little bit of
> support code in Xen.
>
> I'd quite like to look at this one fine day but I suspect there are more
> useful things I should do first...
There are two other alternatives that are already used on s390 for making
multi-level paging a little more pleasant:
- Pseudo faults: When Linux accesses a page that it believes to be present
but is actually swapped out in z/VM, the VM hypervisor causes a special
PFAULT exception. Linux can then choose either to ignore this exception
and continue, which will force VM to swap the page back in, or to do a
task switch and wait for the page to come back. At the point where
VM has read the page back from its swap device, it causes another
exception, after which Linux wakes up the sleeping process.
see arch/s390/mm/fault.c
- Ballooning:
z/VM has an interface (DIAG 10) for the OS to tell it about a page that
is currently unused. The kernel uses get_free_page to reserve a number
of pages, then calls DIAG10 to give them to z/VM. The number of pages to
give back to the hypervisor is determined by a system-wide workload
manager.
see arch/s390/mm/cmm.c
When you want to introduce some interface in Xen, you probably want
something more powerful than these, but it probably makes sense to
see them as a baseline of what can be done with practically no
common-code changes (if you don't do similar stuff already).
Arnd <><
On Tue, Jan 04, 2005 at 09:05:13AM -0500, Rik van Riel wrote:
> On Tue, 4 Jan 2005, Mark Williamson wrote:
>
> >>for doing opportunistic page recycling ("I dont need this page but when
> >>I ask for it back please tell me if you trashed the content")
> >
> >We've talked about doing this but AFAIK nobody has gotten round to it
> >yet because there hasn't been a pressing need (IIRC, it was on the todo
> >list when Xen 1.0 came out).
> >
> >IMHO, it doesn't look terribly difficult but would require (hopefully
> >small) modifications to the architecture independent code, plus a little
> >bit of support code in Xen.
>
> The architecture independent changes are fine, since
> they're also useful for S390(x), PPC64 and UML...
>
> >I'd quite like to look at this one fine day but I suspect there are more
> >useful things I should do first...
>
> I wonder if the same effect could be achieved by just
> measuring the VM pressure inside the guests and
> ballooning the guests as required, letting them grow
> and shrink with their workloads.
something along these lines in the domain config would be ideal:
mem = 64M-128M
target = 64M
"if needed, grow me to 128mb but if not, whittle down to 64".
mem = 64M-128M
target = 128M
"if you absolutely have to, steal some of my memory, but don't nick
any more than 64M".
i'm probably going to have to "manually" implement something like this.
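i.e. a crude dom0-side loop along these lines (the use of ssh to peek
at each guest's /proc/meminfo and the "xm balloon" subcommand name are
assumptions/examples - substitute whatever your xen tools provide):

  #!/bin/sh
  # crude manual balloon policy: keep each guest between MIN and MAX megabytes
  MIN=64; MAX=128
  for guest in guest1 guest2 guest3 guest4 guest5; do
      # assumed: guests are reachable over ssh by name, and report MemFree in kb
      free_kb=$(ssh root@$guest "awk '/^MemFree/ {print \$2}' /proc/meminfo")
      if [ "$free_kb" -lt 4096 ]; then
          target=$MAX        # guest is under pressure: let it grow
      else
          target=$MIN        # guest has slack: shrink it back down
      fi
      xm balloon $guest $target    # subcommand name may differ between xen versions
  done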
l.
On Wed, 5 Jan 2005, Arnd Bergmann wrote:
> - Pseudo faults:
These are a problem, because they turn what would be a single
pageout into a pageout, a pagein, and another pageout, in
effect tripling the amount of IO that needs to be done.
> - Ballooning:
Xen already has this. I wonder if it makes sense to
consolidate the various balloon approaches into a single
driver, and take the amount of ballooned memory into
account when reporting statistics in /proc/meminfo.
> When you want to introduce some interface in Xen, you probably want
> something more powerful than these,
Xen has a nice balloon driver, that can also be
controlled from outside the guest domain.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
> > - Pseudo faults:
>
> These are a problem, because they turn what would be a single
> pageout into a pageout, a pagein, and another pageout, in
> effect tripling the amount of IO that needs to be done.
The Disco VMM tackled this by detecting attempts to double-page using a
special virtual swap disk. Perhaps it would be possible to find some cleaner
way to avoid wasteful double-paging by adding some more hooks for virtualised
architectures...
In any case, for now Xen guests are not swapped onto disk storage at runtime -
they retain their physical memory reservation unless they alter it using the
balloon driver.
> Xen already has this. I wonder if it makes sense to
> consolidate the various balloon approaches into a single
> driver, and take the amount of ballooned memory into
> account when reporting statistics in /proc/meminfo.
If multiple platforms want to do this, we could refactor the code so that the
core of the balloon driver can be used in multiple archs. We could have an
arch_release/request_memory() that the core balloon driver can call into to
actually return memory to the VMM.
> > When you want to introduce some interface in Xen, you probably want
> > something more powerful than these,
>
> Xen has a nice balloon driver, that can also be
> controlled from outside the guest domain.
The Xen control interface made this fairly trivial to implement. Again, the
balloon driver core could be plumbed into whatever the preferred virtual
machine control interface for the platform is (I don't know if / how other
platforms tackle this).
Cheers,
Mark