2009-07-07 16:19:34

by Dan Magenheimer

[permalink] [raw]
Subject: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Tmem [PATCH 0/4] (Take 2): Transcendent memory
Transcendent memory - Take 2
Changes since take 1:
1) Patches can be applied serially; function names in diff (Rik van Riel)
2) Descriptions and diffstats for individual patches (Rik van Riel)
3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
4) Drop shared pools until security implications are understood (Pavel
Machek and Jeremy Fitzhardinge)
5) Documentation/transcendent-memory.txt added including API description
(see also below for API description).

Signed-off-by: Dan Magenheimer <[email protected]>

Normal memory is directly addressable by the kernel, of a known
normally-fixed size, synchronously accessible, and persistent (though
not across a reboot).

What if there was a class of memory that is of unknown and dynamically
variable size, is addressable only indirectly by the kernel, can be
configured either as persistent or as "ephemeral" (meaning it will be
around for awhile, but might disappear without warning), and is still
fast enough to be synchronously accessible?

We call this latter class "transcendent memory" and it provides an
interesting opportunity to more efficiently utilize RAM in a virtualized
environment. However this "memory but not really memory" may also have
applications in NON-virtualized environments, such as hotplug-memory
deletion, SSDs, and page cache compression. Others have suggested ideas
such as allowing use of highmem memory without a highmem kernel, or use
of spare video memory.

Transcendent memory, or "tmem" for short, provides a well-defined API to
access this unusual class of memory. (A summary of the API is provided
below.) The basic operations are page-copy-based and use a flexible
object-oriented addressing mechanism. Tmem assumes that some "privileged
entity" is capable of executing tmem requests and storing pages of data;
this entity is currently a hypervisor and operations are performed via
hypercalls, but the entity could be a kernel policy, or perhaps a
"memory node" in a cluster of blades connected by a high-speed
interconnect such as hypertransport or QPI.

Since tmem is not directly addressable and because page copying is done
to/from physical pageframes, it is more suitable for in-kernel memory needs
than for userland applications. However, there may be yet undiscovered
userland possibilities.

With the tmem concept outlined and its broader potential hinted at, we
now give an overview of two existing examples of how tmem can be used by the
kernel.

"Precache" can be thought of as a page-granularity victim cache for clean
pages that the kernel's pageframe replacement algorithm (PFRA) would like
to keep around, but can't since there isn't enough memory. So when the
PFRA "evicts" a page, it first puts it into the precache via a call to
tmem. And any time a filesystem reads a page from disk, it first attempts
to get the page from precache. If it's there, a disk access is eliminated.
If not, the filesystem just goes to the disk like normal. Precache is
"ephemeral" so whether a page is kept in precache (between the "put" and
the "get") is dependent on a number of factors that are invisible to
the kernel.
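
Roughly, the precache hooks look like the sketch below, using the tmem
accessors described in the API section further down. This is illustrative
only: the helper names, the tmem_ops structure name, the poolid field and
the way the handle is formed are assumptions, not necessarily what the
patch does.

/* Illustrative sketch only -- names and fields are assumptions. */
static void precache_put(struct address_space *mapping, struct page *page)
{
        int pool = mapping->host->i_sb->precache_poolid;  /* hypothetical field */

        if (pool < 0)
                return;
        /* Ephemeral put: the copy may be silently dropped later, so a
         * failure here (or a later loss of the page) is harmless. */
        (void)tmem_ops->put_page(pool, (u64)mapping->host->i_ino,
                                 (u32)page->index, page_to_pfn(page));
}

static int precache_get(struct address_space *mapping, struct page *page)
{
        int pool = mapping->host->i_sb->precache_poolid;  /* hypothetical field */

        if (pool < 0)
                return -1;
        /* 0 means the page was found and copied in: no disk read needed. */
        return tmem_ops->get_page(pool, (u64)mapping->host->i_ino,
                                  (u32)page->index, page_to_pfn(page));
}

The PFRA would call the put just before dropping a clean page, and the
filesystem read path would try the get before issuing the disk read.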

"Preswap" IS persistent, but for various reasons may not always be
available for use, again due to factors that may not be visible to the
kernel (but, briefly, if the kernel is being "good" and has shared its
resources nicely, then it will be able to use preswap, else it will not).
Once a page is put, a get on the page will always succeed. So when the
kernel finds itself in a situation where it needs to swap out a page, it
first attempts to use preswap. If the put works, a disk write and
(usually) a disk read are avoided. If it doesn't, the page is written
to swap as usual. Unlike precache, whether a page is stored in preswap
vs swap is recorded in kernel data structures, so when a page needs to
be fetched, the kernel does a get if it is in preswap and reads from
swap if it is not in preswap.
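
The swap-side flow can be sketched the same way. The preswap_map bitmap,
the pool id variable and the helper names below are illustrative
assumptions (the real bookkeeping lives in the mm/preswap.c and
mm/page_io.c changes):

/* Illustrative sketch only -- the bitmap and names are assumptions. */
static u32 preswap_poolid;                        /* persistent pool */
static unsigned long *preswap_map[MAX_SWAPFILES]; /* one bitmap per swap device */

static int preswap_put(unsigned type, pgoff_t offset, struct page *page)
{
        if (tmem_ops->put_page(preswap_poolid, (u64)type, (u32)offset,
                               page_to_pfn(page)))
                return -1;                   /* fall back to the swap device */
        set_bit(offset, preswap_map[type]);  /* remember the page is in preswap */
        return 0;
}

static int preswap_get(unsigned type, pgoff_t offset, struct page *page)
{
        if (!test_bit(offset, preswap_map[type]))
                return -1;                   /* not in preswap: read from swap */
        /* Persistent pool: a successful put guarantees this get succeeds.
         * The get is destructive (private pool), so clear our record too. */
        if (tmem_ops->get_page(preswap_poolid, (u64)type, (u32)offset,
                               page_to_pfn(page)))
                return -1;
        clear_bit(offset, preswap_map[type]);
        return 0;
}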

Both precache and preswap may optionally be compressed, trading roughly
a 2x space reduction for roughly 10x slower access. Precache also has a
sharing feature, which allows different nodes in a "virtual cluster"
to share a local page cache.

Tmem has some similarity to IBM's Collaborative Memory Management, but
creates more of a partnership between the kernel and the "privileged
entity" and is not very invasive. Tmem may be applicable for KVM and
containers; there is some disagreement on the extent of its value.
Tmem is highly complementary to ballooning (aka page granularity hot
plug) and memory deduplication (aka transparent content-based page
sharing) but still has value when neither are present.

Performance is difficult to quantify because some benchmarks respond
very favorably to increases in memory and tmem may do quite well on
those, depending on how much tmem is available, which may vary widely
and dynamically, depending on conditions completely outside of the
system being measured. Ideas on how best to provide useful metrics
would be appreciated.

Tmem is now supported in Xen's unstable tree (targeted for the Xen 3.5
release) and in Xen's Linux 2.6.18-xen source tree. Again, Xen is not
necessarily a requirement, but currently provides the only existing
implementation of tmem.

Lots more information about tmem can be found at:
http://oss.oracle.com/projects/tmem and there will be
a talk about it on the first day of Linux Symposium in July 2009.
Tmem is the result of a group effort, including Dan Magenheimer,
Chris Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful
input from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
Joel Becker, and Jan Beulich.

THE TRANSCENDENT MEMORY API

Transcendent memory is made up of a set of pools. Each pool is made
up of a set of objects. And each object contains a set of pages.
The combination of a 32-bit pool id, a 64-bit object id, and a 32-bit
page id uniquely identifies a page of tmem data, and this tuple is called
a "handle." Commonly, the three parts of a handle are used to address
a filesystem, a file within that filesystem, and a page within that file;
however an OS can use any values as long as they uniquely identify
a page of data.
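
Expressed as a data structure, a handle is just the following tuple (a
purely illustrative declaration; the patch need not define such a struct):

/* Illustrative only: the tmem handle described above. */
struct tmem_handle {
        u32 pool_id;    /* which pool, e.g. one per mounted filesystem */
        u64 object_id;  /* which object, e.g. an inode number */
        u32 index;      /* which page within the object, e.g. a page offset */
};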

When a tmem pool is created, it is given certain attributes: It can
be private or shared, and it can be persistent or ephemeral. Each
combination of these attributes provides a different set of useful
functionality and also defines a slightly different set of semantics
for the various operations on the pool. Other pool attributes include
the size of the page and a version number.

Once a pool is created, operations are performed on the pool. Pages
are copied between the OS and tmem and are addressed using a handle.
Pages and/or objects may also be flushed from the pool. When all
operations are completed, a pool can be destroyed.

The specific tmem functions are called in Linux through a set of
accessor functions:

int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
int (*destroy_pool)(u32 pool_id);
int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*flush_page)(u32 pool_id, u64 object, u32 index);
int (*flush_object)(u32 pool_id, u64 object);

The new_pool accessor creates a new pool and returns a pool id
which is a non-negative 32-bit integer. If the flags parameter
specifies that the pool is to be shared, the uuid is a 128-bit "shared
secret" else it is ignored. The destroy_pool accessor destroys the pool.
(Note: shared pools are not supported until security implications
are better understood.)

The put_page accessor copies a page of data from the specified pageframe
and associates it with the specified handle.

The get_page accessor looks up a page of data in tmem associated with
the specified handle and, if found, copies it to the specified pageframe.

The flush_page accessor ensures that subsequent gets of a page with
the specified handle will fail. The flush_object accessor ensures
that subsequent gets of any page matching the pool id and object
will fail.
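
Taken together, a client's use of a private pool looks roughly like the
following sketch. Error handling is elided; obj, index and page stand for
an already-formed handle and pageframe, and the flags encoding, the
tmem_ops name and the zero-on-success convention are assumptions made
for illustration.

/* Illustrative lifecycle of a private, ephemeral pool. */
struct tmem_pool_uuid uuid = { 0 };      /* ignored for private pools */
int pool = tmem_ops->new_pool(uuid, 0);  /* assumed flags: private, ephemeral */

if (pool >= 0) {
        /* The put may be rejected; the client must tolerate that. */
        (void)tmem_ops->put_page(pool, obj, index, page_to_pfn(page));

        /* A miss means fall back to the normal (disk) path. */
        if (tmem_ops->get_page(pool, obj, index, page_to_pfn(page)) != 0)
                ;  /* recreate the page from its backing store */

        tmem_ops->flush_object(pool, obj);  /* e.g. on truncate */
        tmem_ops->destroy_pool(pool);       /* e.g. on unmount */
}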

There are many subtle but critical behaviors for get_page and put_page:
- Any put_page (with one notable exception) may be rejected and the client
must be prepared to deal with that failure. A put_page copies, NOT moves,
data; that is, the data exists in both places. Linux is responsible for
destroying or overwriting its own copy, or alternately managing any
coherency between the copies.
- Every page successfully put to a persistent pool must be found by a
subsequent get_page that specifies the same handle. A page successfully
put to an ephemeral pool has an indeterminate lifetime and even an
immediately subsequent get_page may fail.
- A get_page to a private pool is destructive; that is, it behaves as if
the get_page were atomically followed by a flush_page. A get_page
to a shared pool is non-destructive. A flush_page behaves just like
a get_page to a private pool except the data is thrown away.
- Put-put-get coherency is guaranteed. For example, after the sequence:
put_page(ABC,D1);
put_page(ABC,D2);
get_page(ABC,E)
E may never contain the data from D1. However, even for a persistent
pool, the get_page may fail if the second put_page indicates failure.
- Get-get coherency is guaranteed. For example, in the sequence:
put_page(ABC,D);
get_page(ABC,E1);
get_page(ABC,E2)
if the first get_page fails, the second must also fail.
- A tmem implementation provides no serialization guarantees (e.g. to
an SMP Linux). So if different Linux threads are putting and flushing
the same page, the results are indeterminate; ordering is not
guaranteed and must be synchronized by Linux.
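
Because tmem itself provides no serialization, the caller has to supply
its own. A minimal sketch of the idea, assuming the page lock is a
suitable lock (the actual patch may serialize differently):

/* Illustrative: serialize put vs. flush on the same handle in Linux. */
static void client_put(struct page *page, u32 pool, u64 obj, u32 index)
{
        lock_page(page);   /* any lock covering this handle will do */
        (void)tmem_ops->put_page(pool, obj, index, page_to_pfn(page));
        unlock_page(page);
}

static void client_flush(struct page *page, u32 pool, u64 obj, u32 index)
{
        lock_page(page);
        tmem_ops->flush_page(pool, obj, index);
        unlock_page(page);
}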

Changed core kernel files:
fs/buffer.c | 5 +
fs/ext3/super.c | 2
fs/mpage.c | 8 ++
fs/super.c | 5 +
include/linux/fs.h | 7 ++
include/linux/swap.h | 57 +++++++++++++++++++++
include/linux/sysctl.h | 1
kernel/sysctl.c | 12 ++++
mm/Kconfig | 26 +++++++++
mm/Makefile | 3 +
mm/filemap.c | 11 ++++
mm/page_io.c | 12 ++++
mm/swapfile.c | 46 ++++++++++++++--
mm/truncate.c | 10 +++
14 files changed, 199 insertions(+), 6 deletions(-)

Newly added core kernel files:
Documentation/transcendent-memory.txt | 175 +++++++++++++
include/linux/tmem.h | 88 ++++++
mm/precache.c | 134 ++++++++++
mm/preswap.c | 273 +++++++++++++++++++++
4 files changed, 670 insertions(+)

Changed xen-specific files:
arch/x86/include/asm/xen/hypercall.h | 8 +++
drivers/xen/Makefile | 1
include/xen/interface/tmem.h | 43 +++++++++++++++++++++
include/xen/interface/xen.h | 22 ++++++++++
4 files changed, 74 insertions(+)

Newly added xen-specific files:
drivers/xen/tmem.c | 97 +++++++++++++++++++++
include/xen/interface/tmem.h | 43 +++++++++
2 files changed, 140 insertions(+)


2009-07-07 17:28:50

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:

> "Preswap" IS persistent, but for various reasons may not always be
> available for use, again due to factors that may not be visible to the
> kernel (but, briefly, if the kernel is being "good" and has shared its
> resources nicely, then it will be able to use preswap, else it will not).
> Once a page is put, a get on the page will always succeed.

What happens when all of the free memory on a system
has been consumed by preswap by a few guests?

Will the system be unable to start another guest,
or is there some way to free the preswap memory?

2009-07-07 19:54:48

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> From: Rik van Riel [mailto:[email protected]]

> Dan Magenheimer wrote:
> > "Preswap" IS persistent, but for various reasons may not always be
> > available for use, again due to factors that may not be
> visible to the
> > kernel (but, briefly, if the kernel is being "good" and has
> shared its
> > resources nicely, then it will be able to use preswap, else
> it will not).
> > Once a page is put, a get on the page will always succeed.
>
> What happens when all of the free memory on a system
> has been consumed by preswap by a few guests?
> Will the system be unable to start another guest,

The default policy (and only policy implemented as of now) is
that no guest is allowed to use more than max_mem for the
sum of directly-addressable memory (e.g. RAM) and persistent
tmem (e.g. preswap). So if a guest is using its default
memory==max_mem and is doing no ballooning, nothing can
be put in preswap by that guest.
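
In other words, the hypervisor-side admission check for a persistent
(preswap) put amounts to something like the sketch below. The struct and
field names are made up for illustration; this is not the actual Xen code.

/* Illustrative admission check for a persistent (preswap) put. */
static bool persistent_put_allowed(struct guest *g, unsigned long nr_pages)
{
        /* Directly addressable RAM plus persistent tmem may not exceed
         * max_mem, so a guest that has not ballooned down below its
         * memory==max_mem allocation gets nothing from preswap. */
        return g->ram_pages + g->persistent_tmem_pages + nr_pages
                <= g->max_mem_pages;
}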

> or is there some way to free the preswap memory?

Yes and no. There is no way externally to free preswap
memory, but an in-guest userland root service can write to sysfs
to affect preswap size. This essentially does a partial
swapoff on preswap if there is sufficient (directly addressable)
guest RAM available. (I have this prototyped as part of
the xenballoond self-ballooning service in xen-unstable.)
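
For what it's worth, the shrink amounts to walking the preswap map and
pulling pages back into guest RAM while free pages are available. A very
rough sketch of the idea (not the prototype code), reusing the
illustrative preswap_map/preswap_poolid bookkeeping sketched earlier in
the thread:

/* Illustrative only: shrink preswap while free guest RAM is available. */
static void preswap_shrink(unsigned type, unsigned long max_offset)
{
        pgoff_t offset;

        for (offset = 0; offset < max_offset; offset++) {
                struct page *page;

                if (!test_bit(offset, preswap_map[type]))
                        continue;
                page = alloc_page(GFP_KERNEL);     /* needs free guest RAM */
                if (!page)
                        break;                     /* no room: stop shrinking */
                /* Persistent pool: the get is guaranteed to succeed, and
                 * being a private pool it also removes the tmem copy. */
                if (!tmem_ops->get_page(preswap_poolid, (u64)type,
                                        (u32)offset, page_to_pfn(page))) {
                        clear_bit(offset, preswap_map[type]);
                        /* ...reinsert the page into the swap cache, as a
                         * partial swapoff would, before moving on... */
                } else {
                        __free_page(page);
                }
        }
}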

Dan

2009-07-08 22:57:18

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:
> Tmem [PATCH 0/4] (Take 2): Transcendent memory
> Transcendent memory - Take 2
> Changes since take 1:
> 1) Patches can be applied serially; function names in diff (Rik van Riel)
> 2) Descriptions and diffstats for individual patches (Rik van Riel)
> 3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
> 4) Drop shared pools until security implications are understood (Pavel
> Machek and Jeremy Fitzhardinge)
> 5) Documentation/transcendent-memory.txt added including API description
> (see also below for API description).
>
> Signed-off-by: Dan Magenheimer <[email protected]>
>
> Normal memory is directly addressable by the kernel, of a known
> normally-fixed size, synchronously accessible, and persistent (though
> not across a reboot).
>
> What if there was a class of memory that is of unknown and dynamically
> variable size, is addressable only indirectly by the kernel, can be
> configured either as persistent or as "ephemeral" (meaning it will be
> around for awhile, but might disappear without warning), and is still
> fast enough to be synchronously accessible?

I have trouble mapping this to a VMM capable of overcommit without just
coming back to CMM2.

In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
marked in the volatile state, no?

It seems to me that an architecture built around hinting would be more
robust than having to use separate memory pools for this type of memory
(especially since you are requiring a copy to/from the pool).

For instance, you can mark data DMA'd from disk (perhaps by read-ahead)
as volatile without ever bringing it into the CPU cache. With tmem, if
you wanted to use a tmem pool for all of the page cache, you'd likely
suffer significant overhead due to copying.

Regards,

Anthony Liguori

2009-07-08 23:33:45

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Hi Anthony --

Thanks for the comments.

> I have trouble mapping this to a VMM capable of overcommit
> without just coming back to CMM2.
>
> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
> marked in the volatile state, no?

They are similar in concept, but a volatile-marked kernel page
is still a kernel page, can be changed by a kernel (or user)
store instruction, and counts as part of the memory used
by the VM. An ephemeral tmem page cannot be directly written
by a kernel (or user) store, can only be read via a "get" (which
may or may not succeed), and doesn't count against the memory
used by the VM (even though it likely contains -- for awhile --
data useful to the VM).

> It seems to me that an architecture built around hinting
> would be more
> robust than having to use separate memory pools for this type
> of memory
> (especially since you are requiring a copy to/from the pool).

Depends on what you mean by robust, I suppose. Once you
understand the basics of tmem, it is very simple and this
is borne out in the low invasiveness of the Linux patch.
Simplicity is another form of robustness.

> For instance, you can mark data DMA'd from disk (perhaps by
> read-ahead)
> as volatile without ever bringing it into the CPU cache.
> With tmem, if
> you wanted to use a tmem pool for all of the page cache, you'd likely
> suffer significant overhead due to copying.

The copy may be expensive on an older machine, but on newer
machines copying a page is relatively inexpensive. On a reasonable
multi-VM-kernbench-like benchmark I'll be presenting at Linux
Symposium next week, the overhead is on the order of 0.01%
for a fairly significant savings in IOs.

2009-07-08 23:57:51

by Anthony Liguori

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:
> Hi Anthony --
>
> Thanks for the comments.
>
>
>> I have trouble mapping this to a VMM capable of overcommit
>> without just coming back to CMM2.
>>
>> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
>> marked in the volatile state, no?
>>
>
> They are similar in concept, but a volatile-marked kernel page
> is still a kernel page, can be changed by a kernel (or user)
> store instruction, and counts as part of the memory used
> by the VM. An ephemeral tmem page cannot be directly written
> by a kernel (or user) store,

Why does tmem require a special store?

A VMM can trap write operations; pages can be stored on disk
transparently by the VMM if necessary. I guess that's the bit I'm missing.

>> It seems to me that an architecture built around hinting
>> would be more
>> robust than having to use separate memory pools for this type
>> of memory
>> (especially since you are requiring a copy to/from the pool).
>>
>
> Depends on what you mean by robust, I suppose. Once you
> understand the basics of tmem, it is very simple and this
> is borne out in the low invasiveness of the Linux patch.
> Simplicity is another form of robustness.
>

The main disadvantage I see is that you need to explicitly convert
portions of the kernel to use a data copying API. That seems like an
invasive change to me. Hinting on the other hand can be done in a
less-invasive way.

I'm not really arguing against tmem, just the need to have explicit
get/put mechanisms for the transcendent memory areas.

> The copy may be expensive on an older machine, but on newer
> machines copying a page is relatively inexpensive.

I don't think that's a true statement at all :-) If you had a workload
where data never came into the CPU cache (zero-copy) and now you
introduce a copy, even with new system, you're going to see a
significant performance hit.

> On a reasonable
> multi-VM-kernbench-like benchmark I'll be presenting at Linux
> Symposium next week, the overhead is on the order of 0.01%
> for a fairly significant savings in IOs.
>
But how would something like specweb do where you should be doing
zero-copy IO from the disk to the network? This is the area where I
would be concerned. For something like kernbench, you're already
bringing the disk data into the CPU cache anyway so I can appreciate
that the copy could get lost in the noise.

Regards,

Anthony Liguori

2009-07-09 00:17:28

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/08/09 16:57, Anthony Liguori wrote:
> Why does tmem require a special store?
>
> A VMM can trap write operations pages can be stored on disk
> transparently by the VMM if necessary. I guess that's the bit I'm
> missing.

tmem doesn't store anything to disk. It's more about making sure that
free host memory can be quickly and efficiently be handed out to guests
as they need it; to increase "memory liquidity" as it were. Guests need
to explicitly ask to use tmem, rather than having the host/hypervisor
try to intuit what to do based on access patterns and hints; typically
they'll use tmem as the first line storage for memory which they were
about to swap out anyway. There's no point in making tmem swappable,
because the guest is perfectly capable of swapping its own memory.

The copying interface avoids a lot of the delicate corners of the CMM
code, in which subtle races can lurk in fairly hard-to-test-for ways.

>> The copy may be expensive on an older machine, but on newer
>> machines copying a page is relatively inexpensive.
>
> I don't think that's a true statement at all :-) If you had a
> workload where data never came into the CPU cache (zero-copy) and now
> you introduce a copy, even with new system, you're going to see a
> significant performance hit.

If the copy helps avoid physical disk IO, then it is cheap at the
price. A guest generally wouldn't push a page into tmem unless it was
about to evict it anyway, so it has already determined the page is
cold/unwanted, and the copy isn't a great cost. Hot/busy pages
shouldn't be anywhere near tmem; if they are, it suggests you've cut
your domain's memory too aggressively.

J

2009-07-09 00:27:57

by Anthony Liguori

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Jeremy Fitzhardinge wrote:
> On 07/08/09 16:57, Anthony Liguori wrote:
>
>> Why does tmem require a special store?
>>
>> A VMM can trap write operations pages can be stored on disk
>> transparently by the VMM if necessary. I guess that's the bit I'm
>> missing.
>>
>
> tmem doesn't store anything to disk. It's more about making sure that
> free host memory can be quickly and efficiently be handed out to guests
> as they need it; to increase "memory liquidity" as it were. Guests need
> to explicitly ask to use tmem, rather than having the host/hypervisor
> try to intuit what to do based on access patterns and hints; typically
> they'll use tmem as the first line storage for memory which they were
> about to swap out anyway.

If the primary use of tmem is to avoid swapping when memory pressure
would have forced it, how is this different using ballooning along with
a shrinker callback?

With virtio-balloon, a guest can touch any of the memory it's ballooned
to immediately reclaim that memory. I think the main difference with
tmem is that you can also mark a page as being volatile. The hypervisor
can then reclaim that page without swapping it (it can always reclaim
memory and swap it) and generate a special fault to the guest if it
attempts to access it.

You can fail to put with tmem, right? You can also fail to get? In
both cases though, these failures can be handled because Linux is able
to recreate the page on its own (by doing disk IO). So why not just
generate a special fault instead of having to introduce special accessors?

Regards,

Anthony Liguori

2009-07-09 01:20:58

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Anthony Liguori wrote:

> I have trouble mapping this to a VMM capable of overcommit without just
> coming back to CMM2.

Same for me. CMM2 has a more complex mechanism, but way
easier policy than anything else out there.

> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
> marked in the volatile state, no?

Basically.

> It seems to me that an architecture built around hinting would be more
> robust than having to use separate memory pools for this type of memory
> (especially since you are requiring a copy to/from the pool).

I agree. Something along the lines of CMM2 needs more
infrastructure, but will be infinitely easier to get right
from the policy side.

Automatic ballooning is an option too, with fairly simple
infrastructure, but potentially insanely complex policy
issues to sort out...

--
All rights reversed.

2009-07-09 21:11:28

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > I have trouble mapping this to a VMM capable of overcommit
> without just
> > coming back to CMM2.
>
> Same for me. CMM2 has a more complex mechanism, but way
> easier policy than anything else out there.

Although tmem and CMM2 have similar conceptual objectives,
let me try to describe what I see as a fundamental
difference in approach.

The primary objective of both is to utilize RAM more
efficiently. Both are ideally complemented with some
longer term "memory shaping" mechanism such as automatic
ballooning or hotplug.

CMM2's focus is on increasing the number of VM's that
can run on top of the hypervisor. To do this, it
depends on hints provided by Linux to surreptitiously
steal memory away from Linux. The stolen memory still
"belongs" to Linux and if Linux goes to use it but the
hypervisor has already given it to another Linux, the
hypervisor must jump through hoops to give it back.
If it guesses wrong and overcommits too aggressively,
the hypervisor must swap some memory to a "hypervisor
swap disk" (which btw has some policy challenges).
IMHO this is more of a "mainframe" model.

Tmem's focus is on helping Linux to aggressively manage
the amount of memory it uses (and thus reduce the amount
of memory it would get "billed" for using). To do this, it
provides two "safety valve" services, one to reduce the
cost of "refaults" (Rik's term) and the other to reduce
the cost of swapping. Both services are almost
always available, but if the memory of the physical
machine get overcommitted, the most aggressive Linux
guests must fall back to using their disks (because the
hypervisor does not have a "hypervisor swap disk"). But
when physical memory is undercommitted, it is still being
used usefully without compromising "memory liquidity".
(I like this term Jeremy!) IMHO this is more of a "cloud"
model.

In other words, CMM2, despite its name, is more of a
"subservient" memory management system (Linux is
subservient to the hypervisor) and tmem is more
collaborative (Linux and the hypervisor share the
responsibilities and the benefits/costs).

I'm not saying either one is bad or good -- and I'm sure
each can be adapted to approximately deliver the value
of the other -- they are just approaching the same problem
from different perspectives.

2009-07-09 21:28:21

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:

> I'm not saying either one is bad or good -- and I'm sure
> each can be adapted to approximately deliver the value
> of the other -- they are just approaching the same problem
> from different perspectives.

Indeed. Tmem and auto-ballooning have a simple mechanism,
but the policy required to make it work right could well
be too complex to ever get right.

CMM2 has a more complex mechanism, but the policy is
absolutely trivial.

CMM2 and auto-ballooning seem to give about similar
performance gains on zSystem.

I suspect that for Xen and KVM, we'll want to choose
for the approach that has the simpler policy, because
relying on different versions of different operating
systems to all get the policy of auto-ballooning or
tmem right is likely to result in bad interactions
between guests and other intractable issues.

--
All rights reversed.

2009-07-09 21:41:53

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:
> CMM2's focus is on increasing the number of VM's that
> can run on top of the hypervisor. To do this, it
> depends on hints provided by Linux to surreptitiously
> steal memory away from Linux. The stolen memory still
> "belongs" to Linux and if Linux goes to use it but the
> hypervisor has already given it to another Linux, the
> hypervisor must jump through hoops to give it back.
>

It depends on how you define "jump through hoops".

> If it guesses wrong and overcommits too aggressively,
> the hypervisor must swap some memory to a "hypervisor
> swap disk" (which btw has some policy challenges).
> IMHO this is more of a "mainframe" model.
>

No, not at all. A guest marks a page as being "volatile", which tells
the hypervisor it never needs to swap that page. It can discard it
whenever it likes.

If the guest later tries to access that page, it will get a special
"discard fault". For a lot of types of memory, the discard fault
handler can then restore that page transparently to the code that
generated the discard fault.

AFAICT, ephemeral tmem has the exact same characteristics as volatile
CMM2 pages. The difference is that tmem introduces an API to explicitly
manage this memory behind a copy interface whereas CMM2 uses hinting and
a special fault handler to allow any piece of memory to be marked in
this way.

> In other words, CMM2, despite its name, is more of a
> "subservient" memory management system (Linux is
> subservient to the hypervisor) and tmem is more
> collaborative (Linux and the hypervisor share the
> responsibilities and the benefits/costs).
>

I don't really agree with your analysis of CMM2. We can map CMM2
operations directly to ephemeral tmem interfaces so tmem is a subset of
CMM2, no?

What's appealing to me about CMM2 is that it doesn't change the guest
semantically but rather just gives the VMM more information about how
the VMM is using its memory. This suggests that it allows greater
flexibility in the long term to the VMM and more importantly, provides
an easier implementation across a wide range of guests.

Regards,

Anthony Liguori

2009-07-09 21:50:27

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > I'm not saying either one is bad or good -- and I'm sure
> > each can be adapted to approximately deliver the value
> > of the other -- they are just approaching the same problem
> > from different perspectives.
>
> Indeed. Tmem and auto-ballooning have a simple mechanism,
> but the policy required to make it work right could well
> be too complex to ever get right.
>
> CMM2 has a more complex mechanism, but the policy is
> absolutely trivial.

Could you elaborate a bit more on what policy you
are referring to and what decisions the policies are
trying to guide? And are you looking at the policies
in Linux or in the hypervisor or the sum of both?

The Linux-side policies in the tmem patch seem trivial
to me and the Xen-side implementation is certainly
working correctly, though "working right" is a hard
objective to measure. But depending on how you define
"working right", the pageframe replacement algorithm
in Linux may also be "too complex to ever get right"
but it's been working well enough for a long time.

> CMM2 and auto-ballooning seem to give about similar
> performance gains on zSystem.

Tmem provides a huge advantage over my self-ballooning
implementation, but maybe that's because it is more
aggressive than the CMM auto-ballooning, resulting
in more refaults that must be "fixed".

> I suspect that for Xen and KVM, we'll want to choose
> for the approach that has the simpler policy, because
> relying on different versions of different operating
> systems to all get the policy of auto-ballooning or
> tmem right is likely to result in bad interactions
> between guests and other intractable issues.

Again, not sure what tmem policy in Linux you are referring
to or what bad interactions you foresee. Could you
clarify?

Auto-ballooning policy is certainly a challenge, but
that's true whether CMM or tmem, right?

Thanks,
Dan

2009-07-09 22:36:30

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > If it guesses wrong and overcommits too aggressively,
> > the hypervisor must swap some memory to a "hypervisor
> > swap disk" (which btw has some policy challenges).
> > IMHO this is more of a "mainframe" model.
>
> No, not at all. A guest marks a page as being "volatile",
> which tells
> the hypervisor it never needs to swap that page. It can discard it
> whenever it likes.
>
> If the guest later tries to access that page, it will get a special
> "discard fault". For a lot of types of memory, the discard fault
> handler can then restore that page transparently to the code that
> generated the discard fault.

But this means that either the content of that page must have been
preserved somewhere or the discard fault handler has sufficient
information to go back and get the content from the source (e.g.
the filesystem). Or am I misunderstanding?

With tmem, the equivalent of the "failure to access a discarded page"
is inline and synchronous, so if the tmem access "fails", the
normal code immediately executes.

> AFAICT, ephemeral tmem has the exact same characteristics as volatile
> CMM2 pages. The difference is that tmem introduces an API to
> explicitly
> manage this memory behind a copy interface whereas CMM2 uses
> hinting and
> a special fault handler to allow any piece of memory to be marked in
> this way.
> :
> I don't really agree with your analysis of CMM2. We can map CMM2
> operations directly to ephemeral tmem interfaces so tmem is a
> subset of CMM2, no?

Not really. I suppose one *could* use tmem that way, immediately
writing every page read from disk into tmem, though that would
probably cause some real coherency challenges. But the patch as
proposed only puts ready-to-be-replaced pages (as determined by
Linux's PFRA) into ephemeral tmem.

The two services provided to Linux (in the proposed patch) by
tmem are:

1) "I have a page of memory that I'm about to throw away because
I'm not sure I need it any more and I have a better use for
that pageframe right now. Mr Tmem might you have someplace
you can squirrel it away for me in case I need it again?
Oh, and by the way, if you can't or you lose it, no big deal
as I can go get it from disk if I need to."
2) "I'm out of memory and have to put this page somewhere. Mr
Tmem, can you take it? But if you do take it, you have to
promise to give it back when I ask for it! If you can't
promise, never mind, I'll find something else to do with it."

> > In other words, CMM2, despite its name, is more of a
> > "subservient" memory management system (Linux is
> > subservient to the hypervisor) and tmem is more
> > collaborative (Linux and the hypervisor share the
> > responsibilities and the benefits/costs).
>
> What's appealing to me about CMM2 is that it doesn't change the guest
> semantically but rather just gives the VMM more information about how
> the VMM is using it's memory. This suggests that it allows greater
> flexibility in the long term to the VMM and more importantly,
> provides an easier implementation across a wide range of guests.

I suppose changing Linux to utilize the two tmem services
as described above is a semantic change. But to me it
seems no more of a semantic change than requiring a new
special page fault handler because a page of memory might
disappear behind the OS's back.

But IMHO this is a corollary of the fundamental difference. CMM2's
is more the "VMware" approach which is that OS's should never have
to be modified to run in a virtual environment. (Oh, but maybe
modified just slightly to make the hypervisor a little less
clueless about the OS's resource utilization.) Tmem asks: If an
OS is going to often run in a virtualized environment, what
can be done to share the responsibility for resource management
so that the OS does what it can with the knowledge that it has
and the hypervisor can most flexibly manage resources across
all the guests? I do agree that adding an additional API
binds the user and provider of the API less flexibly than without
the API, but as long as the API is optional (as it is for both
tmem and CMM2), I don't see why CMM2 provides more flexibility.

Thanks,
Dan

2009-07-09 22:46:23

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:

> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem). Or am I misunderstanding?

The latter. Only pages which can be fetched from
source again are marked as volatile.

> But IMHO this is a corollary of the fundamental difference. CMM2's
> is more the "VMware" approach which is that OS's should never have
> to be modified to run in a virtual environment.

Actually, the CMM2 mechanism is quite invasive in
the guest operating system's kernel.

> I don't see why CMM2 provides more flexibility.

I don't think anyone is arguing that. One thing
that people have argued is that CMM2 can be more
efficient, and easier to get the policy right in
the face of multiple guest operating systems.

--
All rights reversed.

2009-07-09 23:33:28

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:
> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem). Or am I misunderstanding?
>

As Rik said, it's the latter.

> With tmem, the equivalent of the "failure to access a discarded page"
> is inline and synchronous, so if the tmem access "fails", the
> normal code immediately executes.
>

Yup. This is the main difference AFAICT. It's really just API
semantics within Linux.

You could clearly use the volatile state of CMM2 to implement tmem as an
API in Linux. The get/put functions would set a flag such that if the
discard handler was invoked while that operation was in progress, the
operation could safely fail. That's why I claimed tmem is a subset of CMM2.

> I suppose changing Linux to utilize the two tmem services
> as described above is a semantic change. But to me it
> seems no more of a semantic change than requiring a new
> special page fault handler because a page of memory might
> disappear behind the OS's back.
>
> But IMHO this is a corollary of the fundamental difference. CMM2's
> is more the "VMware" approach which is that OS's should never have
> to be modified to run in a virtual environment. (Oh, but maybe
> modified just slightly to make the hypervisor a little less
> clueless about the OS's resource utilization.)

While I always enjoy a good holy war, I'd like to avoid one here because
I want to stay on the topic at hand.

If there was one change to tmem that would make it more palatable, for
me it would be changing the way pools are "allocated". Instead of
getting an opaque handle from the hypervisor, I would force the guest to
allocate it's own memory and to tell the hypervisor that it's a tmem
pool. You could then introduce semantics about whether the guest was
allowed to directly manipulate the memory as long as it was in the
pool. It would be required to access the memory via get/put functions
that under Xen, would end up being a hypercall and a copy. Presumably
you would do some tricks with ballooning to allocate empty memory in Xen
and then use those addresses as tmem pools. On KVM, we could do
something more clever.

The big advantage of keeping the tmem pool part of the normal set of
guest memory is that you don't introduce new challenges with respect to
memory accounting. Whether or not tmem is directly accessible from the
guest, it is another memory resource. I'm certain that you'll want to
do accounting of how much tmem is being consumed by each guest, and I
strongly suspect that you'll want to do tmem accounting on a per-process
basis. I also suspect that doing tmem limiting for things like cgroups
would be desirable.

That all points to making tmem normal memory so that all that
infrastructure can be reused. I'm not sure how well this maps to Xen
guests, but it works out fine when the VMM is capable of presenting
memory to the guest without actually allocating it (via overcommit).

Regards,

Anthony Liguori

2009-07-10 15:25:28

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > But IMHO this is a corollary of the fundamental difference. CMM2's
> > is more the "VMware" approach which is that OS's should never have
> > to be modified to run in a virtual environment. (Oh, but maybe
> > modified just slightly to make the hypervisor a little less
> > clueless about the OS's resource utilization.)
>
> While I always enjoy a good holy war, I'd like to avoid one
> here because
> I want to stay on the topic at hand.

Oops, sorry, I guess that was a bit inflammatory. What I meant to
say is that inferring resource utilization efficiency is a very
hard problem and VMware (and I'm sure IBM too) has done a fine job
with it; CMM2 explicitly provides some very useful information from
within the OS to the hypervisor so that it doesn't have to infer
that information; but tmem is trying to go a step further by making
the cooperation between the OS and hypervisor more explicit
and directly beneficial to the OS.

> If there was one change to tmem that would make it more
> palatable, for
> me it would be changing the way pools are "allocated". Instead of
> getting an opaque handle from the hypervisor, I would force
> the guest to
> allocate it's own memory and to tell the hypervisor that it's a tmem
> pool.

An interesting idea but one of the nice advantages of tmem being
completely external to the OS is that the tmem pool may be much
larger than the total memory available to the OS. As an extreme
example, assume you have one 1GB guest on a physical machine that
has 64GB physical RAM. The guest now has 1GB of directly-addressable
memory and 63GB of indirectly-addressable memory through tmem.
That 63GB requires no page structs or other data structures in the
guest. And in the current (external) implementation, the size
of each pool is constantly changing, sometimes dramatically so
the guest would have to be prepared to handle this. I also wonder
if this would make shared-tmem-pools more difficult.

I can see how it might be useful for KVM though. Once the
core API and all the hooks are in place, a KVM implementation of
tmem could attempt something like this.

> The big advantage of keeping the tmem pool part of the normal set of
> guest memory is that you don't introduce new challenges with
> respect to memory accounting. Whether or not tmem is directly
> accessible from the guest, it is another memory resource. I'm
> certain that you'll want to do accounting of how much tmem is being
> consumed by each guest

Yes, the Xen implementation of tmem does accounting on a per-pool
and a per-guest basis and exposes the data via a privileged
"tmem control" hypercall.

> and I strongly suspect that you'll want to do tmem accounting on a
> per-process
> basis. I also suspect that doing tmem limiting for things
> like cgroups would be desirable.

This can be done now if each process or cgroup creates a different
tmem pool. The proposed patch doesn't do this, but it certainly
seems possible.

Dan

2009-07-12 09:18:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/10/2009 06:23 PM, Dan Magenheimer wrote:
>> If there was one change to tmem that would make it more
>> palatable, for
>> me it would be changing the way pools are "allocated". Instead of
>> getting an opaque handle from the hypervisor, I would force
>> the guest to
>> allocate it's own memory and to tell the hypervisor that it's a tmem
>> pool.
>>
>
> An interesting idea but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS. As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM. The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest. And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically so
> the guest would have to be prepared to handle this. I also wonder
> if this would make shared-tmem-pools more difficult.
>

Having no struct pages is also a downside; for example this guest cannot
have more than 1GB of anonymous memory without swapping like mad.
Swapping to tmem is fast but still a lot slower than having the memory
available.

tmem makes life a lot easier to the hypervisor and to the guest, but
also gives up a lot of flexibility. There's a difference between memory
and a very fast synchronous backing store.

> I can see how it might be useful for KVM though. Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.
>

My worry is that tmem for kvm leaves a lot of niftiness on the table,
since it was designed for a hypervisor with much simpler memory
management. kvm can already use spare memory for backing guest swap,
and can already convert unused guest memory to free memory (by swapping
it). tmem doesn't really integrate well with these capabilities.


--
error compiling committee.c: too many arguments to function

2009-07-12 13:28:46

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Dan Magenheimer wrote:
> Oops, sorry, I guess that was a bit inflammatory. What I meant to
> say is that inferring resource utilization efficiency is a very
> hard problem and VMware (and I'm sure IBM too) has done a fine job
> with it; CMM2 explicitly provides some very useful information from
> within the OS to the hypervisor so that it doesn't have to infer
> that information; but tmem is trying to go a step further by making
> the cooperation between the OS and hypervisor more explicit
> and directly beneficial to the OS.
>

KVM definitely falls into the camp of trying to minimize modification to
the guest.

>> If there was one change to tmem that would make it more
>> palatable, for
>> me it would be changing the way pools are "allocated". Instead of
>> getting an opaque handle from the hypervisor, I would force
>> the guest to
>> allocate it's own memory and to tell the hypervisor that it's a tmem
>> pool.
>>
>
> An interesting idea but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS. As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM. The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest. And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically so
> the guest would have to be prepared to handle this. I also wonder
> if this would make shared-tmem-pools more difficult.
>
> I can see how it might be useful for KVM though. Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.
>

It's the core API that is really the issue. The semantics of tmem
(external memory pool with copy interface) is really what is problematic.

The basic concept, notifying the VMM about memory that can be recreated
by the guest to avoid the VMM having to swap before reclaim, is great
and I'd love to see Linux support it in some way.

>> The big advantage of keeping the tmem pool part of the normal set of
>> guest memory is that you don't introduce new challenges with
>> respect to memory accounting. Whether or not tmem is directly
>> accessible from the guest, it is another memory resource. I'm
>> certain that you'll want to do accounting of how much tmem is being
>> consumed by each guest
>>
>
> Yes, the Xen implementation of tmem does accounting on a per-pool
> and a per-guest basis and exposes the data via a privileged
> "tmem control" hypercall.
>

I was talking about accounting within the guest. It's not just a matter
of accounting within the mm, it's also about accounting in userspace. A
lot of software out there depends on getting detailed statistics from
Linux about how much memory is in use in order to determine things like
memory pressure. If you introduce a new class of memory, you need a new
class of statistics to expose to userspace and all those tools need
updating.

Regards,

Anthony Liguori

2009-07-12 16:22:20

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > that information; but tmem is trying to go a step further by making
> > the cooperation between the OS and hypervisor more explicit
> > and directly beneficial to the OS.
>
> KVM definitely falls into the camp of trying to minimize
> modification to the guest.

No argument there. Well, maybe one :-) Yes, but KVM
also heavily encourages unmodified guests. Tmem is
philosophically in favor of finding a balance between
things that work well with no changes to any OS (and
thus work just fine regardless of whether the OS is
running in a virtual environment or not), and things
that could work better if the OS is knowledgable that
it is running in a virtual environment.

For those that believe virtualization is a flash-in-
the-pan, no modifications to the OS is the right answer.
For those that believe it will be pervasive in the
future, finding the right balance is a critical step
in operating system evolution.

(Sorry for the Sunday morning evangelizing :-)

> >> If there was one change to tmem that would make it more
> >> palatable, for
> >> me it would be changing the way pools are "allocated". Instead of
> >> getting an opaque handle from the hypervisor, I would force
> >> the guest to
> >> allocate it's own memory and to tell the hypervisor that
> it's a tmem
> >> pool.
> >
> > I can see how it might be useful for KVM though. Once the
> > core API and all the hooks are in place, a KVM implementation of
> > tmem could attempt something like this.
>
> It's the core API that is really the issue. The semantics of tmem
> (external memory pool with copy interface) is really what is
> problematic.
> The basic concept, notifying the VMM about memory that can be
> recreated
> by the guest to avoid the VMM having to swap before reclaim, is great
> and I'd love to see Linux support it in some way.

Is it the tmem API or the precache/preswap API layered on
top of it that is problematic? Both currently assume copying
but perhaps the precache/preswap API could, with minor
modifications, meet KVM's needs better?

> > Yes, the Xen implementation of tmem does accounting on a per-pool
> > and a per-guest basis and exposes the data via a privileged
> > "tmem control" hypercall.
>
> I was talking about accounting within the guest. It's not
> just a matter
> of accounting within the mm, it's also about accounting in
> userspace. A
> lot of software out there depends on getting detailed statistics from
> Linux about how much memory is in use in order to determine
> things like
> memory pressure. If you introduce a new class of memory, you
> need a new
> class of statistics to expose to userspace and all those tools need
> updating.

OK, I see.

Well, first, tmem's very name means memory that is "beyond the
range of normal perception". This is certainly not the first class
of memory in use in data centers that can't be accounted at
process granularity. I'm thinking disk array caches as the
primary example. Also lots of tools that work great in a
non-virtualized OS are worthless or misleading in a virtual
environment.

Second, CPUs are getting much more complicated with massive
pipelines, many layers of caches each with different characteristics,
etc, and it's getting increasingly difficult to accurately and
reproducibly measure performance at a very fine granularity.
One could only expect that other resources, such as memory,
would move in that direction.

2009-07-12 16:31:02

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> > That 63GB requires no page structs or other data structures in the
> > guest. And in the current (external) implementation, the size
> > of each pool is constantly changing, sometimes dramatically so
> > the guest would have to be prepared to handle this. I also wonder
> > if this would make shared-tmem-pools more difficult.
>
> Having no struct pages is also a downside; for example this
> guest cannot
> have more than 1GB of anonymous memory without swapping like mad.
> Swapping to tmem is fast but still a lot slower than having
> the memory
> available.

Yes, true. Tmem offers little additional advantage for workloads
that have a huge variation in working set size that is primarily
anonymous memory. That larger scale "memory shaping" is left to
ballooning and hotplug.

> tmem makes life a lot easier to the hypervisor and to the guest, but
> also gives up a lot of flexibility. There's a difference
> between memory
> and a very fast synchronous backing store.

I don't see that it gives up that flexibility. System administrators
are still free to size their guests properly. Tmem's contribution
is in environments that are highly dynamic, where the only
alternative is really sizing memory maximally (and thus wasting
it for the vast majority of time in which the working set is smaller).

> > I can see how it might be useful for KVM though. Once the
> > core API and all the hooks are in place, a KVM implementation of
> > tmem could attempt something like this.
> >
>
> My worry is that tmem for kvm leaves a lot of niftiness on the table,
> since it was designed for a hypervisor with much simpler memory
> management. kvm can already use spare memory for backing guest swap,
> and can already convert unused guest memory to free memory
> (by swapping
> it). tmem doesn't really integrate well with these capabilities.

I'm certainly open to identifying compromises and layer modifications
that help meet the needs of both Xen and KVM (and others). For
example, if we can determine that the basic hook placement for
precache/preswap (or even just precache for KVM) can be built
on different underlying layers, that would be great!

Dan

2009-07-12 17:16:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/12/2009 07:20 PM, Dan Magenheimer wrote:
>>> that information; but tmem is trying to go a step further by making
>>> the cooperation between the OS and hypervisor more explicit
>>> and directly beneficial to the OS.
>>>
>> KVM definitely falls into the camp of trying to minimize
>> modification to the guest.
>>
>
> No argument there. Well, maybe one :-) Yes, but KVM
> also heavily encourages unmodified guests. Tmem is
> philosophically in favor of finding a balance between
> things that work well with no changes to any OS (and
> thus work just fine regardless of whether the OS is
> running in a virtual environment or not), and things
> that could work better if the OS is knowledgable that
> it is running in a virtual environment.
>


CMM2 and tmem are not any different in this regard; both require OS
modification, and both make information available to the hypervisor. In
fact CMM2 is much more intrusive (but on the other hand provides much
more information).

> For those that believe virtualization is a flash-in-
> the-pan, no modifications to the OS is the right answer.
> For those that believe it will be pervasive in the
> future, finding the right balance is a critical step
> in operating system evolution.
>

You're arguing for CMM2 here IMO.

> Is it the tmem API or the precache/preswap API layered on
> top of it that is problematic? Both currently assume copying
> but perhaps the precache/preswap API could, with minor
> modifications, meet KVM's needs better?
>
>

My take on this is that precache (predecache?) / preswap can be
implemented even without tmem by using write-through backing for the
virtual disk. For swap this is actually slightly more efficient than
tmem preswap, for preuncache slightly less efficient (since there will
be some double caching). So I'm more interested in other use cases of
tmem/CMM2.

> Well, first, tmem's very name means memory that is "beyond the
> range of normal perception". This is certainly not the first class
> of memory in use in data centers that can't be accounted at
> process granularity. I'm thinking disk array caches as the
> primary example. Also lots of tools that work great in a
> non-virtualized OS are worthless or misleading in a virtual
> environment.
>
>

Right, the transient uses of tmem when applied to disk objects
(swap/pagecache) are very similar to disk caches. Which is why you can
get a very similar effect when caching your virtual disks; this can be
done without any guest modification.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-07-12 17:28:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/12/2009 07:28 PM, Dan Magenheimer wrote:
>> Having no struct pages is also a downside; for example this
>> guest cannot
>> have more than 1GB of anonymous memory without swapping like mad.
>> Swapping to tmem is fast but still a lot slower than having
>> the memory
>> available.
>>
>
> Yes, true. Tmem offers little additional advantage for workloads
> that have a huge variation in working set size that is primarily
> anonymous memory. That larger scale "memory shaping" is left to
> ballooning and hotplug.
>

And this is where the policy problems erupt. When do you balloon in
favor of tmem? Which guest do you balloon? Do you leave it to the
administrator? There's the host's administrator and the guests'
administrators.

CMM2 solves this neatly by providing information to the host. The host
can pick the least recently used page (or a better algorithm) and evict
it using information from the guest, either dropping it or swapping it.
It also provides information back to the guest when it references an
evicted page: either the guest needs to recreate the page or it just
needs to wait.
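
To make that concrete, here is a minimal sketch of the kind of
cooperation being described; the page states and hypercalls below are
hypothetical placeholders, not the actual CMM2 interface:

/* Hypothetical sketch of CMM2-style guest/host cooperation.  The guest
 * hints that a clean, reconstructible page is "volatile" (the host may
 * discard it), and the fault path asks what happened to an evicted
 * page: recreate it if the host dropped it, wait if the host is still
 * swapping it back in. */

struct page;

enum guest_page_hint  { PAGE_STABLE, PAGE_VOLATILE };
enum host_page_status { PAGE_PRESENT, PAGE_DISCARDED, PAGE_IN_FLIGHT };

/* Placeholder hypercalls and helpers, not a real interface. */
extern void hv_set_page_hint(unsigned long pfn, enum guest_page_hint hint);
extern enum host_page_status hv_query_page(unsigned long pfn);
extern int reread_from_backing_store(struct page *page);
extern int wait_for_host_swapin(unsigned long pfn);

/* Guest: a clean pagecache page can be reclaimed by the host for free. */
static void mark_page_volatile(unsigned long pfn)
{
        hv_set_page_hint(pfn, PAGE_VOLATILE);
}

/* Guest fault path for a page the host has evicted. */
static int handle_evicted_page(unsigned long pfn, struct page *page)
{
        switch (hv_query_page(pfn)) {
        case PAGE_DISCARDED:
                return reread_from_backing_store(page); /* recreate contents */
        case PAGE_IN_FLIGHT:
                return wait_for_host_swapin(pfn);       /* host will restore it */
        default:
                return 0;                               /* nothing to do */
        }
}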

>> tmem makes life a lot easier to the hypervisor and to the guest, but
>> also gives up a lot of flexibility. There's a difference
>> between memory
>> and a very fast synchronous backing store.
>>
>
> I don't see that it gives up that flexibility. System administrators
> are still free to size their guests properly. Tmem's contribution
> is in environments that are highly dynamic, where the only
> alternative is really sizing memory maximally (and thus wasting
> it for the vast majority of time in which the working set is smaller).
>

I meant that once a page is converted to tmem, there's a limited number
of things you can do with it compared to normal memory. For example,
tmem won't help with a dcache-intensive workload.

> I'm certainly open to identifying compromises and layer modifications
> that help meet the needs of both Xen and KVM (and others). For
> example, if we can determine that the basic hook placement for
> precache/preswap (or even just precache for KVM) can be built
> on different underlying layers, that would be great!
>

I'm not sure preswap/precache by itself justifies tmem since it can be
emulated by backing the disk with a cached file. What I'm missing in
tmem is the ability for the hypervisor to take a global view on memory;
instead it's forced to look at memory and tmem separately. That's fine
for Xen since it can't really make any decisions on normal memory
(lacking swap); on the other hand kvm doesn't map well to tmem since
"free memory" is already used by the host pagecache.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-07-12 19:34:39

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Avi Kivity wrote:
>
> In fact CMM2 is much more intrusive (but on the other hand provides
> much more information).
I don't think this will remain true long term. CMM2 touches a lot of
core mm code and certainly qualifies as intrusive. However, the result
is that the VMM has a tremendous amount of insight into how the guest is
using its memory and can implement all sorts of fancy policy for
reclaim. Since the reclaim policy can evolve without any additional
assistance from the guest, the guest doesn't have to change as policy
evolves.

Since tmem requires that reclaim policy is implemented within the guest,
I think in the long term, tmem will have to touch a broad number of
places within Linux. Besides the core mm, the first round of patches
already touches filesystems (just ext3 to start out with). To truly be
effective, tmem would have to be a first-class kernel citizen and I
suspect a lot of code would have to be aware of it.

So while CMM2 touches a lot of code no one wants to touch, I think in the
long term it would remain relatively well contained compared to tmem,
which will steadily increase in complexity within the guest.

Regards,

Anthony Liguori

2009-07-12 20:41:06

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> CMM2 and tmem are not any different in this regard; both require OS
> modification, and both make information available to the
> hypervisor. In
> fact CMM2 is much more intrusive (but on the other hand provides much
> more information).
>
> > For those that believe it will be pervasive in the
> > future, finding the right balance is a critical step
> > in operating system evolution.
>
> You're arguing for CMM2 here IMO.

I'm arguing that both are a good thing and a step in
the right direction. In some ways, tmem is a bigger
step and in some ways CMM2 is a bigger step.

> My take on this is that precache (predecache?) / preswap can be
> implemented even without tmem by using write-through backing for the
> virtual disk. For swap this is actually slightly more efficient than
> tmem preswap, for preuncache slightly less efficient (since
> there will
> be some double caching). So I'm more interested in other use
> cases of tmem/CMM2.
>
> Right, the transient uses of tmem when applied to disk objects
> (swap/pagecache) are very similar to disk caches. Which is
> why you can
> get a very similar effect when caching your virtual disks;
> this can be
> done without any guest modification.

Write-through backing and virtual disk caching offer a
similar effect, but it is far from the same.

2009-07-12 20:43:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/12/2009 11:39 PM, Dan Magenheimer wrote:
>> Right, the transient uses of tmem when applied to disk objects
>> (swap/pagecache) are very similar to disk caches. Which is
>> why you can
>> get a very similar effect when caching your virtual disks;
>> this can be
>> done without any guest modification.
>>
>
> Write-through backing and virtual disk cacheing offer a
> similar effect, but it is far from the same.
>

Can you explain how it differs for the swap case? Maybe I don't
understand how tmem preswap works.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-07-12 21:01:01

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux


> > anonymous memory. That larger scale "memory shaping" is left to
> > ballooning and hotplug.
>
> And this is where the policy problems erupt. When do you balloon in
> favor of tmem? which guest do you balloon? do you leave it to the
> administrator? there's the host's administrator and the guests'
> administrators.
> :
> CMM2 solves this neatly by providing information to the host.

As with CMM2, ballooning is for larger scale memory shaping.
Tmem provides a safety valve if the shaping is too aggressive
(and thus encourages more aggressive ballooning). So they
are complementary. Tmem also provides plenty of information
to the host that can be used to fine tune ballooning policy
if desired (and this can be done in userland and/or management
tools).
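
As a rough illustration of what such userland fine-tuning could look
like, a sketch with entirely hypothetical statistics and balloon-control
calls (the heuristic here is an assumption, not something from the
patches):

/* Hypothetical userland policy sketch: a guest doing heavy tmem traffic
 * is probably ballooned too aggressively, so give some memory back; a
 * guest whose tmem is idle is a candidate for further ballooning.  The
 * stat and control calls are placeholders, not an existing interface. */

struct guest_tmem_stats {
        unsigned long precache_puts;    /* clean pagecache pages offered */
        unsigned long preswap_puts;     /* swap pages offered */
};

extern int get_guest_tmem_stats(int guest_id, struct guest_tmem_stats *s);
extern int adjust_balloon_target(int guest_id, long delta_kb);

#define HEAVY_PUTS_PER_INTERVAL 10000UL
#define IDLE_PUTS_PER_INTERVAL  100UL

static void tune_guest_balloon(int guest_id)
{
        struct guest_tmem_stats s;
        unsigned long puts;

        if (get_guest_tmem_stats(guest_id, &s) < 0)
                return;

        puts = s.precache_puts + s.preswap_puts;

        if (puts > HEAVY_PUTS_PER_INTERVAL)
                adjust_balloon_target(guest_id, +64 * 1024);  /* relieve pressure */
        else if (puts < IDLE_PUTS_PER_INTERVAL)
                adjust_balloon_target(guest_id, -16 * 1024);  /* reclaim headroom */
}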

> > I don't see that it gives up that flexibility. System administrators
> > are still free to size their guests properly. Tmem's contribution
> > is in environments that are highly dynamic, where the only
> > alternative is really sizing memory maximally (and thus wasting
> > it for the vast majority of time in which the working set
> is smaller).
>
> I meant that once a page is converted to tmem, there's a
> limited amount
> of things you can do with it compared to normal memory. For example
> tmem won't help with a dcache intensive workload.

Yes that's true. But that's part of the point of tmem. Tmem
isn't just providing benefits to a single guest. It's
providing "memory liquidity" (Jeremy's term, but I like it)
which benefits the collective of guests on a machine and
across the data center. For KVM+CMM2, I suppose this might be
less valuable because of the more incestuous relationship
between the host and guests.

> > I'm certainly open to identifying compromises and layer
> modifications
> > that help meet the needs of both Xen and KVM (and others). For
> > example, if we can determine that the basic hook placement for
> > precache/preswap (or even just precache for KVM) can be built
> > on different underlying layers, that would be great!
>
> I'm not sure preswap/precache by itself justifies tmem since
> it can be
> emulated by backing the disk with a cached file.

I don't see that it can... though perhaps it can in the KVM
world.

> What I'm missing in
> tmem is the ability for the hypervisor to take a global view
> on memory;
> instead it's forced to look at memory and tmem separately.

Again, I guess I see this as one of the key values of tmem.
Memory *does* have different attributes and calling out the
differences in some cases allows more flexibility to the
whole collective of guests with very little impact to any
one guest.

P.S. I have to mostly disconnect from this discussion for
a few days except for short replies.

2009-07-12 21:10:30

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

> >> Right, the transient uses of tmem when applied to disk objects
> >> (swap/pagecache) are very similar to disk caches. Which is
> >> why you can
> >> get a very similar effect when caching your virtual disks;
> >> this can be
> >> done without any guest modification.
> >
> > Write-through backing and virtual disk cacheing offer a
> > similar effect, but it is far from the same.
>
> Can you explain how it differs for the swap case? Maybe I don't
> understand how tmem preswap works.

The key differences I see are the "please may I store something"
API and the fact that the reply (yes or no) can vary across time
depending on the state of the collective of guests. Virtual
disk caching requires the host to always say yes and always
deliver persistence. I can see that this is less of a concern
for KVM because the host can swap... though doesn't this hide
information from the guest and potentially have split-brain
swapping issues?

(thanks for the great discussion so far... going offline mostly now
for a few days)

Dan

2009-07-13 11:31:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/13/2009 12:08 AM, Dan Magenheimer wrote:
>> Can you explain how it differs for the swap case? Maybe I don't
>> understand how tmem preswap works.
>>
>
> The key differences I see are the "please may I store something"
> API and the fact that the reply (yes or no) can vary across time
> depending on the state of the collective of guests. Virtual
> disk cacheing requires the host to always say yes and always
> deliver persistence.

We need to compare tmem+swap to swap+cache, not just tmem to cache.
Here's how I see it:

tmem+swap swapout:
- guest copies page to tmem (may fail)
- guest writes page to disk

cached drive swapout:
- guest writes page to disk
- host copies page to cache

tmem+swap swapin:
- guest reads page from tmem (may fail)
- on tmem failure, guest reads swap from disk
- guest drops tmem page

cached drive swapin:
- guest reads page from disk
- host may satisfy read from cache

tmem+swap ageing:
- host may drop tmem page at any time

cached drive ageing:
- host may drop cached page at any time

So they're pretty similar. The main difference is that tmem can drop
the page on swapin. It could be made to work with swap by supporting
the TRIM command.
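
Spelled out as code, the guest side of those tmem+swap steps might look
roughly like this; the preswap_* hooks are placeholder names, and
whether a successful put also goes to the swap device is treated here as
a policy choice:

/* Sketch of the guest-side tmem+swap flows enumerated above.  The
 * preswap_* hooks are placeholders; each host-side operation is allowed
 * to refuse (put) or miss (get) at any time. */

struct page;

extern int  preswap_put(unsigned long offset, struct page *page);  /* may fail */
extern int  preswap_get(unsigned long offset, struct page *page);  /* may miss */
extern void preswap_flush(unsigned long offset);
extern void swap_write_to_disk(unsigned long offset, struct page *page);
extern int  swap_read_from_disk(unsigned long offset, struct page *page);

/* Swapout: offer the page to tmem; if the host declines, the ordinary
 * swap write is the only copy that will exist. */
static void preswap_swapout(unsigned long offset, struct page *page)
{
        if (preswap_put(offset, page) != 0)
                swap_write_to_disk(offset, page);
}

/* Swapin: try tmem first, fall back to the swap device on a miss, and
 * drop the tmem copy once the page is back in guest memory. */
static int preswap_swapin(unsigned long offset, struct page *page)
{
        if (preswap_get(offset, page) == 0) {
                preswap_flush(offset);
                return 0;
        }
        return swap_read_from_disk(offset, page);
}

The "may fail" return from preswap_put() is the "please may I store
something" property Dan describes above; a write to a
write-through-cached virtual disk has no equivalent way of being
refused.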

> I can see that this is less of a concern
> for KVM because the host can swap... though doesn't this hide
> information from the guest and potentially have split-brain
> swapping issues?
>

Double swap is bad for performance, yes. CMM2 addresses it nicely.
Tmem doesn't address it at all - it assumes you have excess memory.

> (thanks for the great discussion so far... going offline mostly now
> for a few days)
>

I'm going offline too so it cancels out.

--
error compiling committee.c: too many arguments to function

2009-07-13 20:18:51

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On Sun, Jul 12, 2009 at 02:34:25PM -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
>>
>> In fact CMM2 is much more intrusive (but on the other hand provides
>> much more information).
> I don't think this will remain true long term. CMM2 touches a lot of
> core mm code and certainly qualifies as intrusive. However the result
> is that the VMM has a tremendous amount of insight into how the guest is
> using its memory and can implement all sorts of fancy policy for
> reclaim. Since the reclaim policy can evolve without any additional
> assistance from the guest, the guest doesn't have to change as policy
> evolves.
>
> Since tmem requires that reclaim policy is implemented within the guest,
> I think in the long term, tmem will have to touch a broad number of
> places within Linux. Beside the core mm, the first round of patches
> already touch filesystems (just ext3 to start out with). To truly be
> effective, tmem would have to be a first class kernel citizen and I
> suspect a lot of code would have to be aware of it.

This depends on the extent to which tmem is integrated into the VM. For
filesystem usage, the hooks are relatively simple because we already
have a lot of code sharing in this area. Basically tmem is concerned
with when we free a clean page and when the contents of a particular
offset in the file are no longer valid.

The nice part about tmem is that any time a given corner case gets
tricky, you can just invalidate that offset in tmem and move on.
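
For illustration only, the hook placement being described might look
roughly like this; the precache_* names are placeholders rather than the
exact functions in the patch series:

/* Sketch of the filesystem-side hook placement: offer a clean page to
 * tmem when the VM frees it, and invalidate the tmem copy whenever the
 * contents at that file offset stop being valid. */

struct page;
struct address_space;

extern void precache_put(struct address_space *mapping, unsigned long index,
                         struct page *page);
extern int  precache_get(struct address_space *mapping, unsigned long index,
                         struct page *page);
extern void precache_flush(struct address_space *mapping, unsigned long index);
extern int  read_page_from_disk(struct address_space *mapping,
                                unsigned long index, struct page *page);

/* Reclaim is dropping a clean pagecache page: offer it to tmem.  The
 * host may keep it or silently discard it. */
static void evict_clean_page(struct address_space *mapping,
                             unsigned long index, struct page *page)
{
        precache_put(mapping, index, page);
}

/* Truncate/invalidate: the old contents must never come back, so flush
 * the tmem copy (the "when in doubt, invalidate" rule). */
static void invalidate_offset(struct address_space *mapping,
                              unsigned long index)
{
        precache_flush(mapping, index);
}

/* Read path: a tmem hit avoids the disk access entirely. */
static int read_page(struct address_space *mapping, unsigned long index,
                     struct page *page)
{
        if (precache_get(mapping, index, page) == 0)
                return 0;
        return read_page_from_disk(mapping, index, page);
}

Keeping the hooks down to a put on eviction, a get on the read path, and
a flush whenever validity is in doubt is what keeps the filesystem-side
changes small.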

-chris

2009-07-13 20:38:51

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Chris Mason wrote:
> This depends on the extent to which tmem is integrated into the VM. For
> filesystem usage, the hooks are relatively simple because we already
> have a lot of code sharing in this area. Basically tmem is concerned
> with when we free a clean page and when the contents of a particular
> offset in the file are no longer valid.
>

But filesystem usage is perhaps the least interesting part of tmem.

The VMM already knows which pages in the guest are the result of disk IO
(it's the one that put them there, after all). It also knows when those
pages have been invalidated (or it can tell based on write-faulting).

The VMM also knows when the disk IO has been rerequested by tracking
previous requests. It can keep the old IO requests cached in memory and
use that to satisfy re-reads as long as the memory isn't needed for
something else. Basically, we have tmem today with kvm and we use it by
default by using the host page cache to do I/O caching (via
cache=writethrough).

The difference with our "tmem" is that instead of providing an
interface where the guest explicitly says, "I'm throwing away this
memory, I may need it later", and then asking again for it, the guest
throws away the page and then we can later satisfy the disk I/O request
that results from re-requesting the page instantaneously.

This transparent approach is far superior, too, because it enables
transparent sharing across multiple guests. This works well for CoW
images and would work really well if we had a file system capable of
block-level deduplication... :-)

Regards,

Anthony Liguori

2009-07-13 21:02:06

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
> Chris Mason wrote:
>> This depends on the extent to which tmem is integrated into the VM. For
>> filesystem usage, the hooks are relatively simple because we already
>> have a lot of code sharing in this area. Basically tmem is concerned
>> with when we free a clean page and when the contents of a particular
>> offset in the file are no longer valid.
>>
>
> But filesystem usage is perhaps the least interesting part of tmem.
>
> The VMM already knows which pages in the guest are the result of disk IO
> (it's the one that put them there, after all). It also knows when those
> pages have been invalidated (or it can tell based on write-faulting).
>
> The VMM also knows when the disk IO has been rerequested by tracking
> previous requests. It can keep the old IO requests cached in memory and
> use that to satisfy re-reads as long as the memory isn't needed for
> something else. Basically, we have tmem today with kvm and we use it by
> default by using the host page cache to do I/O caching (via
> cache=writethrough).

I'll definitely grant that caching with writethrough adds more caching,
but it does need trim support before it is similar to tmem. The caching
is transparent to the guest, but it is also transparent to qemu, and so
it is harder to manage and size (or even get a stat for how big it
currently is).

>
> The difference between our "tmem" is that instead of providing an
> interface where the guest explicitly says, "I'm throwing away this
> memory, I may need it later", and then asking again for it, the guest
> throws away the page and then we can later satisfy the disk I/O request
> that results from re-requesting the page instantaneously.
>
> This transparent approach is far superior too because it enables
> transparent sharing across multiple guests. This works well for CoW
> images and would work really well if we had a file system capable of
> block-level deduplification... :-)

Grin, I'm afraid that even if someone were to jump in and write the
perfect CoW-based filesystem and then find a willing contributor to code
up a dedup implementation, each CoW image would be a different file
and so it would have its own address space.

Dedup and CoW are an easy way to have hints about which pages are
supposed to have the same contents, but they would have to go with
some other duplicate page sharing scheme.

-chris

2009-07-13 21:17:16

by Anthony Liguori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

Chris Mason wrote:
> On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
>
> I'll definitely grant that caching with writethrough adds more caching,
> but it does need trim support before it is similar to tmem.

I think trim is somewhat orthogonal but even if you do need it, the nice
thing about implementing ATA trim support versus a paravirtualization is
that it works with a wide variety of guests.

From the perspective of the VMM, it seems like a good thing.

> The caching
> is transparent to the guest, but it is also transparent to qemu, and so
> it is harder to manage and size (or even get a stat for how big it
> currently is).
>

That's certainly a fixable problem though. We could expose statistics
to userspace and then further expose those to guests. I think the first
question to answer though is what you would use those statistics for.

>> The difference between our "tmem" is that instead of providing an
>> interface where the guest explicitly says, "I'm throwing away this
>> memory, I may need it later", and then asking again for it, the guest
>> throws away the page and then we can later satisfy the disk I/O request
>> that results from re-requesting the page instantaneously.
>>
>> This transparent approach is far superior too because it enables
>> transparent sharing across multiple guests. This works well for CoW
>> images and would work really well if we had a file system capable of
>> block-level deduplification... :-)
>>
>
> Grin, I'm afraid that even if someone were to jump in and write the
> perfect cow based filesystem and then find a willing contributor to code
> up a dedup implementation, each cow image would be a different file
> and so it would have its own address space.
>
> Dedup and COW are an easy way to have hints about which pages are
> supposed to have the same contents, but they would have to go with
> some other duplicate page sharing scheme.
>

Yes. We have the information we need to dedup this memory though. We
just need a way to track non-dirty pages that result from DMA, map the
host page cache directly into the guest, and then CoW when the guest
tries to dirty that memory.

We don't quite have the right infrastructure in Linux yet to do this
effectively, but this is entirely an issue with the host. The guest
doesn't need any changes here.

Regards,

Anthony Liguori

2009-07-26 14:57:16

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

On 07/14/2009 12:17 AM, Anthony Liguori wrote:
> Chris Mason wrote:
>> On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
>> I'll definitely grant that caching with writethrough adds more caching,
>> but it does need trim support before it is similar to tmem.
>
> I think trim is somewhat orthogonal but even if you do need it, the
> nice thing about implementing ATA trim support versus a
> paravirtualization is that it works with a wide variety of guests.
>
> From the perspective of the VMM, it seems like a good thing.

Trim is also lovely in that images will no longer grow monotonically even
when guest disk usage is constant or shrinking.

--
error compiling committee.c: too many arguments to function