2009-12-18 00:38:24

by Dan Magenheimer

Subject: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Tmem [PATCH 0/5] (Take 3): Transcendent memory
Transcendent memory

Changes since RFC "Take 2" posting (7 July 2009) reviewed at
http://lwn.net/Articles/340080/

1) Refreshed to 2.6.32
2) Support added for btrfs and ext4
3) "Precache" and "preswap" renamed to "cleancache" and "frontswap"
in response to feedback that "pre" is overloaded and too generic.
4) Most important usage statistics now provided via sysfs, under
/sys/vm/tmem/cleancache and /sys/vm/tmem/frontswap.
5) Shared pools security issue resolved by external administrative
tools; shared pool support and ocfs2 support added back again.
6) Some performance measurement done (on a linux-compile workload)
and presented at OLS; in short, showed savings of ~300 IO/sec
at an approximate cost of 0.1%-0.2% of one CPU.
And FYI, tmem support is now released as a technology preview in Oracle's
Xen-based Oracle VM 2.2 product and will be released with Xen 4.0
early in 2010. Both of these provide full save/restore/live migration
support for tmem-enabled VMs and a small set of console-oriented
management tools to view detailed tmem usage across all domains
that use tmem.

(Transcendent memory documentation follows below diffstats.)

Signed-off-by: Dan Magenheimer <[email protected]>

Changed core kernel files:
fs/btrfs/extent_io.c | 9 +++
fs/btrfs/super.c | 2
fs/buffer.c | 5 ++
fs/ext3/super.c | 2
fs/ext4/super.c | 2
fs/mpage.c | 8 +++
fs/ocfs2/super.c | 2
fs/super.c | 6 ++
include/linux/fs.h | 7 ++
include/linux/swap.h | 51 +++++++++++++++++++++
include/linux/sysctl.h | 1
kernel/sysctl.c | 11 ++++
mm/Kconfig | 26 ++++++++++
mm/Makefile | 3 +
mm/filemap.c | 11 ++++
mm/page_io.c | 12 ++++
mm/swapfile.c | 43 +++++++++++++++--
mm/truncate.c | 10 ++++
18 files changed, 204 insertions(+), 7 deletions(-)

Newly added core kernel files:
Documentation/transcendent-memory.txt | 176 +++++++++++
include/linux/tmem.h | 88 +++++
mm/cleancache.c | 184 ++++++++++++
mm/frontswap.c | 319 +++++++++++++++++++++
4 files changed, 767 insertions(+)

Changed xen-specific files:
arch/x86/include/asm/xen/hypercall.h | 8 +++
drivers/xen/Makefile | 1
include/xen/interface/tmem.h | 43 +++++++++++++++++++++
include/xen/interface/xen.h | 22 ++++++++++
4 files changed, 74 insertions(+)

Newly added xen-specific files:
drivers/xen/tmem.c | 97 +++++++++++++++++++++
include/xen/interface/tmem.h | 43 +++++++++
2 files changed, 140 insertions(+)

Normal memory is directly addressable by the kernel, of a known
normally-fixed size, synchronously accessible, and persistent (though
not across a reboot).

What if there was a class of memory that is of unknown and dynamically
variable size, is addressable only indirectly by the kernel, can be
configured either as persistent or as "ephemeral" (meaning it will be
around for a while, but might disappear without warning), and is still
fast enough to be synchronously accessible?

We call this latter class "transcendent memory" and it provides an
interesting opportunity to more efficiently utilize RAM in a virtualized
environment. However, this "memory but not really memory" may also have
applications in NON-virtualized environments, such as hotplug-memory
deletion, SSDs, and page cache compression. Others have suggested ideas
such as allowing use of highmem memory without a highmem kernel, or use
of spare video memory.

Transcendent memory, or "tmem" for short, provides a well-defined API to
access this unusual class of memory. (A summary of the API is provided
below.) The basic operations are page-copy-based and use a flexible
object-oriented addressing mechanism. Tmem assumes that some "privileged
entity" is capable of executing tmem requests and storing pages of data;
this entity is currently a hypervisor and operations are performed via
hypercalls, but the entity could be a kernel policy, or perhaps a
"memory node" in a cluster of blades connected by a high-speed
interconnect such as hypertransport or QPI.

Since tmem is not directly accessible and because page copying is done
to/from physical pageframes, it is more suitable for in-kernel memory needs
than for userland applications. However, there may be as-yet-undiscovered
userland possibilities.

With the tmem concept outlined vaguely and its broader potential hinted at,
here is an overview of two existing examples of how tmem can be used by the
kernel.

"Cleancache" can be thought of as a page-granularity victim cache for clean
pages that the kernel's pageframe replacement algorithm (PFRA) would like
to keep around, but can't since there isn't enough memory. So when the
PFRA "evicts" a page, it first puts it into the cleancache via a call to
tmem. And any time a filesystem reads a page from disk, it first attempts
to get the page from cleancache. If it's there, a disk access is eliminated.
If not, the filesystem just goes to the disk like normal. Cleancache is
"ephemeral" so whether a page is kept in cleancache (between the "put" and
the "get") is dependent on a number of factors that are invisible to
the kernel.
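
The put-on-evict / get-before-read flow described above can be sketched in
userspace C. This is only an illustration under stated assumptions: every
toy_* name, the two-slot store, and the fake disk read are hypothetical and
are not taken from the patch. The tiny store silently displaces pages to
mimic "ephemeral" behavior, and a hit removes the page, following the
private-pool get semantics in the API summary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS 2              /* deliberately tiny so pages get dropped */

struct toy_slot {
    bool used;
    uint64_t object;             /* e.g. an inode number */
    uint32_t index;              /* page index within the file */
    unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_slot toy_slots[TOY_SLOTS];
static unsigned toy_next_victim;
static unsigned toy_disk_reads;  /* counts simulated disk accesses */

/* Called when a clean page is evicted: copy it into the ephemeral store. */
static void toy_cleancache_put(uint64_t object, uint32_t index, const void *page)
{
    struct toy_slot *s = NULL;

    assert(page != NULL);
    for (unsigned i = 0; i < TOY_SLOTS; i++)
        if (toy_slots[i].used && toy_slots[i].object == object &&
            toy_slots[i].index == index)
            s = &toy_slots[i];   /* same handle: overwrite in place */
    if (!s) {                    /* otherwise displace some other page */
        s = &toy_slots[toy_next_victim];
        toy_next_victim = (toy_next_victim + 1) % TOY_SLOTS;
    }
    s->used = true;
    s->object = object;
    s->index = index;
    memcpy(s->data, page, TOY_PAGE_SIZE);
}

/* Returns true and fills `page` on a hit; a miss is a normal outcome. */
static bool toy_cleancache_get(uint64_t object, uint32_t index, void *page)
{
    for (unsigned i = 0; i < TOY_SLOTS; i++) {
        struct toy_slot *s = &toy_slots[i];
        if (s->used && s->object == object && s->index == index) {
            memcpy(page, s->data, TOY_PAGE_SIZE);
            s->used = false;     /* get from a private pool is destructive */
            return true;
        }
    }
    return false;
}

/* Stand-in for a real disk read; fills the page with a recognizable byte. */
static void toy_disk_read(uint64_t object, uint32_t index, void *page)
{
    toy_disk_reads++;
    memset(page, (int)((object + index) & 0xff), TOY_PAGE_SIZE);
}

/* Filesystem read path: try cleancache first, fall back to the disk. */
static void toy_read_page(uint64_t object, uint32_t index, void *page)
{
    if (!toy_cleancache_get(object, index, page))
        toy_disk_read(object, index, page);
}
```

The point of the sketch is that the read path never needs to know why a get
missed; it just goes to the disk like normal.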

"Frontswap" is so named because it can be thought of as the opposite of
a "backing store". Frontswap IS persistent, but for various reasons may not
always be available for use, again due to factors that may not be visible to
the kernel. (But, briefly, if the kernel is being "good" and has shared its
resources nicely, then it will be able to use frontswap, else it will not.)
Once a page is put, a get on the page will always succeed. So when the
kernel finds itself in a situation where it needs to swap out a page, it
first attempts to use frontswap. If the put works, a disk write and
(usually) a disk read are avoided. If it doesn't, the page is written
to swap as usual. Unlike cleancache, whether a page is stored in frontswap
vs swap is recorded in kernel data structures, so when a page needs to
be fetched, the kernel does a get if it is in frontswap and reads from
swap if it is not in frontswap.
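
As a rough userspace sketch of the swap-out and swap-in paths just
described (not the code from the patch): the toy_* names, the fixed
capacity after which puts are rejected, and the small page size are all
assumptions made for illustration. The in_frontswap array stands in for
the kernel data structure that records where each page went.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64         /* small page, for illustration only */
#define TOY_SWAP_SLOTS 8
#define TOY_TMEM_CAPACITY 2      /* backend accepts this many pages, then rejects */

static unsigned char toy_tmem_store[TOY_SWAP_SLOTS][TOY_PAGE_SIZE];
static unsigned char toy_disk_store[TOY_SWAP_SLOTS][TOY_PAGE_SIZE];
static bool in_frontswap[TOY_SWAP_SLOTS];   /* kernel-side record per swap slot */
static unsigned toy_tmem_used, toy_disk_writes, toy_disk_reads;

/* Hypothetical backend put: persistent once accepted, but may be rejected. */
static int toy_tmem_put(unsigned slot, const void *page)
{
    if (toy_tmem_used >= TOY_TMEM_CAPACITY)
        return -1;
    memcpy(toy_tmem_store[slot], page, TOY_PAGE_SIZE);
    toy_tmem_used++;
    return 0;
}

/* Hypothetical backend get: always succeeds for a page that was put. */
static void toy_tmem_get(unsigned slot, void *page)
{
    memcpy(page, toy_tmem_store[slot], TOY_PAGE_SIZE);
    toy_tmem_used--;             /* destructive get frees the space */
}

/* Swap-out: try frontswap first, write to the swap device if rejected.
 * The toy assumes `slot` is not currently holding a page. */
static void toy_swap_out(unsigned slot, const void *page)
{
    assert(slot < TOY_SWAP_SLOTS);
    if (toy_tmem_put(slot, page) == 0) {
        in_frontswap[slot] = true;          /* no disk write needed */
        return;
    }
    in_frontswap[slot] = false;
    memcpy(toy_disk_store[slot], page, TOY_PAGE_SIZE);
    toy_disk_writes++;
}

/* Swap-in: the recorded bit decides between a get and a disk read. */
static void toy_swap_in(unsigned slot, void *page)
{
    assert(slot < TOY_SWAP_SLOTS);
    if (in_frontswap[slot]) {
        toy_tmem_get(slot, page);
        in_frontswap[slot] = false;
    } else {
        memcpy(page, toy_disk_store[slot], TOY_PAGE_SIZE);
        toy_disk_reads++;
    }
}
```

With a capacity of two, swapping out three pages sends the first two to the
toy backend and the third to the simulated swap device; swapping them back
in touches the disk only for the third.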

Both cleancache and frontswap may be optionally compressed, trading off 2x
space reduction vs 10x performance for access. Cleancache also has a
sharing feature, which allows different nodes in a "virtual cluster"
to share a local page cache.

Tmem has some similarity to IBM's Collaborative Memory Management, but
creates more of a partnership between the kernel and the "privileged
entity" and is not very invasive. Tmem may be applicable for KVM and
containers; there is some disagreement on the extent of its value.
Tmem is highly complementary to ballooning (aka page granularity hot
plug) and memory deduplication (aka transparent content-based page
sharing) but still has value when neither is present.

Performance is difficult to quantify. Some benchmarks respond very
favorably to increases in memory, and tmem may do quite well on those,
but the result depends on how much tmem is available, which may vary
widely and dynamically with conditions completely outside of the
system being measured. Ideas on how best to provide useful metrics
would be appreciated.

Tmem is supported starting in Xen 4.0 and is in Xen's Linux 2.6.18-xen
source tree. It is also released as a technology preview in Oracle's
Xen-based virtualization product, Oracle VM 2.2. Again, Xen is not
necessarily a requirement, but currently provides the only existing
implementation of tmem.

Lots more information about tmem can be found at:
http://oss.oracle.com/projects/tmem
and there was a talk about it on the first day of Linux Symposium in
July 2009; an updated talk is planned at linux.conf.au in January 2010.
Tmem is the result of a group effort, including Dan Magenheimer,
Chris Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful
input from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
Joel Becker, and Jan Beulich.

THE TRANSCENDENT MEMORY API

Transcendent memory is made up of a set of pools. Each pool is made
up of a set of objects. And each object contains a set of pages.
The combination of a 32-bit pool id, a 64-bit object id, and a 32-bit
page id uniquely identifies a page of tmem data, and this tuple is called
a "handle." Commonly, the three parts of a handle are used to address
a filesystem, a file within that filesystem, and a page within that file;
however an OS can use any values as long as they uniquely identify
a page of data.
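
For illustration, the handle tuple and the common filesystem mapping might
look like the following in C. The struct name, field names, and helper
functions are hypothetical; the text above fixes only the three field
widths. The 4 KiB page shift is also an assumption, since the page size is
a pool attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Hypothetical layout of a handle: pool id, object id, page id. */
struct toy_tmem_handle {
    uint32_t pool_id;       /* commonly: one pool per filesystem */
    uint64_t object_id;     /* commonly: a file within that filesystem */
    uint32_t page_id;       /* commonly: a page within that file */
};

/* Build the handle for the page containing byte `offset` of a file. */
static struct toy_tmem_handle toy_handle_for_offset(uint32_t pool_id,
                                                    uint64_t inode,
                                                    uint64_t offset)
{
    uint64_t page_index = offset >> TOY_PAGE_SHIFT;
    struct toy_tmem_handle h;

    /* The page id is only 32 bits wide, so larger indexes cannot be addressed. */
    assert(page_index <= UINT32_MAX);
    h.pool_id = pool_id;
    h.object_id = inode;
    h.page_id = (uint32_t)page_index;
    return h;
}

/* Two handles name the same page only if all three parts match. */
static bool toy_handle_equal(struct toy_tmem_handle a, struct toy_tmem_handle b)
{
    return a.pool_id == b.pool_id &&
           a.object_id == b.object_id &&
           a.page_id == b.page_id;
}
```

Any two byte offsets that fall within the same page of the same file map to
the same handle under this scheme.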

When a tmem pool is created, it is given certain attributes: It can
be private or shared, and it can be persistent or ephemeral. Each
combination of these attributes provides a different set of useful
functionality and also defines a slightly different set of semantics
for the various operations on the pool. Other pool attributes include
the size of the page and a version number.

Once a pool is created, operations are performed on the pool. Pages
are copied between the OS and tmem and are addressed using a handle.
Pages and/or objects may also be flushed from the pool. When all
operations are completed, a pool can be destroyed.

The specific tmem functions are called in Linux through a set of
accessor functions:

int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
int (*destroy_pool)(u32 pool_id);
int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*flush_page)(u32 pool_id, u64 object, u32 index);
int (*flush_object)(u32 pool_id, u64 object);

The new_pool accessor creates a new pool and returns a pool id
which is a non-negative 32-bit integer. If the flags parameter
specifies that the pool is to be shared, the uuid is a 128-bit "shared
secret" else it is ignored. The destroy_pool accessor destroys the pool.
(Note: shared pools are not supported until security implications
are better understood.)

The put_page accessor copies a page of data from the specified pageframe
and associates it with the specified handle.

The get_page accessor looks up a page of data in tmem associated with
the specified handle and, if found, copies it to the specified pageframe.

The flush_page accessor ensures that subsequent gets of a page with
the specified handle will fail. The flush_object accessor ensures
that subsequent gets of any page matching the pool id and object
will fail.

There are many subtle but critical behaviors for get_page and put_page:
- Any put_page (with one notable exception) may be rejected and the client
must be prepared to deal with that failure. A put_page copies, NOT moves,
data; that is the data exists in both places. Linux is responsible for
destroying or overwriting its own copy, or alternately managing any
coherency between the copies.
- Every page successfully put to a persistent pool must be found by a
subsequent get_page that specifies the same handle. A page successfully
put to an ephemeral pool has an indeterminate lifetime and even an
immediately subsequent get_page may fail.
- A get_page to a private pool is destructive, that is it behaves as if
the get_page were atomically followed by a flush_page. A get_page
to a shared pool is non-destructive. A flush_page behaves just like
a get_page to a private pool except the data is thrown away.
- Put-put-get coherency is guaranteed. For example, after the sequence:
put_page(ABC,D1);
put_page(ABC,D2);
get_page(ABC,E)
E may never contain the data from D1. However, even for a persistent
pool, the get_page may fail if the second put_page indicates failure.
- Get-get coherency is guaranteed. For example, in the sequence:
put_page(ABC,D);
get_page(ABC,E1);
get_page(ABC,E2)
if the first get_page fails, the second must also fail.
- A tmem implementation provides no serialization guarantees (e.g. to
an SMP Linux). So if different Linux threads are putting and flushing
the same page, the results are indeterminate.
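
Some of the behaviors listed above can be made concrete with a toy,
single-threaded model of one private, persistent pool. This is a sketch
under assumptions, not the Xen implementation: the toy_* names, the
four-page capacity, and the omission of pool ids and pageframe numbers are
all choices made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64     /* small page, for illustration only */
#define TOY_MAX_PAGES 4      /* puts beyond this are rejected */

struct toy_entry {
    bool used;
    uint64_t object;
    uint32_t index;
    unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_entry toy_pool[TOY_MAX_PAGES];

static struct toy_entry *toy_find(uint64_t object, uint32_t index)
{
    for (unsigned i = 0; i < TOY_MAX_PAGES; i++)
        if (toy_pool[i].used && toy_pool[i].object == object &&
            toy_pool[i].index == index)
            return &toy_pool[i];
    return NULL;
}

/* Copies (does not move) the page. Returns 0 on success, -1 if rejected. */
static int toy_put_page(uint64_t object, uint32_t index, const void *src)
{
    struct toy_entry *e = toy_find(object, index);

    assert(src != NULL);
    for (unsigned i = 0; !e && i < TOY_MAX_PAGES; i++)
        if (!toy_pool[i].used)
            e = &toy_pool[i];
    if (!e)
        return -1;           /* pool full: the caller must cope */
    e->used = true;
    e->object = object;
    e->index = index;
    memcpy(e->data, src, TOY_PAGE_SIZE);   /* a second put overwrites the first */
    return 0;
}

/* Private-pool get: behaves like a get followed atomically by a flush. */
static int toy_get_page(uint64_t object, uint32_t index, void *dst)
{
    struct toy_entry *e = toy_find(object, index);

    if (!e)
        return -1;
    memcpy(dst, e->data, TOY_PAGE_SIZE);
    e->used = false;
    return 0;
}

/* After this, a get with the same handle fails. */
static void toy_flush_page(uint64_t object, uint32_t index)
{
    struct toy_entry *e = toy_find(object, index);

    if (e)
        e->used = false;
}

/* After this, a get of any page of the object fails. */
static void toy_flush_object(uint64_t object)
{
    for (unsigned i = 0; i < TOY_MAX_PAGES; i++)
        if (toy_pool[i].used && toy_pool[i].object == object)
            toy_pool[i].used = false;
}
```

In this model, put-put-get coherency holds because a second put to the same
handle overwrites the stored copy in place, and get-get coherency holds
because a failed get changes nothing. The model has no locking, which
matches the last point above: serialization is the caller's problem.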


2009-12-18 08:06:41

by Pavel Machek

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Hi!

> Performance is difficult to quantify because some benchmarks respond
> very favorably to increases in memory and tmem may do quite well on
> those, depending on how much tmem is available which may vary widely
> and dynamically, depending on conditions completely outside of the
> system being measured. Ideas on how best to provide useful metrics
> would be appreciated.

So... take 1GB system, run your favourite benchmark. Then reserve
512MB for tmem, rerun your benchmark, then run the system with
512MB/512MB swap, rerun the benchmark?

Tune the sizes so that first to last run differ by 100% or so, and see
how much first and second differs? If it is in 1% range, you are
probably doing good...?
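
The comparison suggested here can be written down as a small calculation.
This is only a sketch: the function names are hypothetical, and the 50%
and 1% cut-offs are assumptions loosely following the numbers above, not
measured thresholds.

```c
#include <assert.h>
#include <stdbool.h>

/* Percentage by which `other_secs` is slower than `base_secs`
 * (negative if it is faster). */
static double toy_pct_slower(double base_secs, double other_secs)
{
    assert(base_secs > 0.0);
    return 100.0 * (other_secs - base_secs) / base_secs;
}

/*
 * full_ram:  run 1, all memory given to the system
 * with_tmem: run 2, half of the memory reserved for tmem
 * with_swap: run 3, half of the memory replaced by swap
 */
static bool toy_tmem_looks_good(double full_ram, double with_tmem,
                                double with_swap)
{
    /* The test only means something if losing memory hurts the workload. */
    bool memory_sensitive = toy_pct_slower(full_ram, with_swap) >= 50.0;
    /* Tmem is doing well if run 2 stays close to run 1. */
    bool close_to_full_ram = toy_pct_slower(full_ram, with_tmem) <= 1.0;

    return memory_sensitive && close_to_full_ram;
}
```

For example, run times of 100s, 101s and 200s would pass this check, while
100s, 130s and 200s would not.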
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-21 13:48:34

by Nitin Gupta

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Hi Dan,

(I'm not sure if gmane.org interface sends mail to everyone in CC list, so
sending again. Sorry if you are getting duplicate mail).


Dan Magenheimer <dan.magenheimer <at> oracle.com> writes:

>
> Tmem [PATCH 0/5] (Take 3): Transcendent memory
> Transcendent memory
<snip>
>
> Normal memory is directly addressable by the kernel, of a known
> normally-fixed size, synchronously accessible, and persistent (though
> not across a reboot).
>
> What if there was a class of memory that is of unknown and dynamically
> variable size, is addressable only indirectly by the kernel, can be
> configured either as persistent or as "ephemeral" (meaning it will be
> around for awhile, but might disappear without warning), and is still
> fast enough to be synchronously accessible?
>

I really like the idea of allocating cache memory from hypervisor directly. This
is much more flexible than assigning fixed size memory to guests.

>
> "Frontswap" is so named because it can be thought of as the opposite of
> a "backing store". Frontswap IS persistent, but for various reasons may not
> always be available for use, again due to factors that may not be visible to
> the kernel. (But, briefly, if the kernel is being "good" and has shared its
> resources nicely, then it will be able to use frontswap, else it will not.)
> Once a page is put, a get on the page will always succeed. So when the
> kernel finds itself in a situation where it needs to swap out a page, it
> first attempts to use frontswap. If the put works, a disk write and
> (usually) a disk read are avoided. If it doesn't, the page is written
> to swap as usual. Unlike cleancache, whether a page is stored in frontswap
> vs swap is recorded in kernel data structures, so when a page needs to
> be fetched, the kernel does a get if it is in frontswap and reads from
> swap if it is not in frontswap.
>

I think 'frontswap' part seriously overlaps the functionality provided by
'ramzswap' which is a virtual block device driver recently added to
drivers/staging/ramzswap/. This device acts as a swap disk which compresses and
stores pages in memory itself.

To provide frontswap functionality, ramzswap needs only a few changes:
instead of:
compress --> alloc and store within guest.
do:
compress --> send out to hypervisor (tmem_put_page).

Also, ramzswap driver supports multiple /dev/ramzswap{0,1,2...} devices. Each of
these devices can have separate backing partition/file which is used to flush
out incompressible pages or when (per-device) memory limit is exceeded.
When used on native systems, it uses custom xvmalloc allocator which is
specially designed to handle these compressed pages.

We can use all this by just a minor change in ramzswap as mentioned above.

> "Cleancache" can be thought of as a page-granularity victim cache for clean
> pages that the kernel's pageframe replacement algorithm (PFRA) would like
> to keep around, but can't since there isn't enough memory. So when the
> PFRA "evicts" a page, it first puts it into the cleancache via a call to
> tmem. And any time a filesystem reads a page from disk, it first attempts
> to get the page from cleancache. If it's there, a disk access is eliminated.
> If not, the filesystem just goes to the disk like normal. Cleancache is
> "ephemeral" so whether a page is kept in cleancache (between the "put" and
> the "get") is dependent on a number of factors that are invisible to
> the kernel.

Just an idea: as an alternate approach, we can create an 'in-memory compressed
storage' backend for FS-Cache. This way, all filesystems modified to use
fs-cache can benefit from this backend. To make it virtualization friendly like
tmem, we can again provide (per-cache?) option to allocate from hypervisor i.e.
tmem_{put,get}_page() or use [compress]+alloc natively.

For guest<-->hypervisor interface, maybe we can use virtio so that all
hypervisors can benefit? Not quite sure about this one.

Thanks,
Nitin

2009-12-21 23:48:19

by Dan Magenheimer

Subject: RE: Tmem [PATCH 0/5] (Take 3): Transcendent memory

> From: Nitin Gupta [mailto:[email protected]]

> Hi Dan,

Hi Nitin --

Thanks for your review!

> (I'm not sure if gmane.org interface sends mail to everyone
> in CC list, so
> sending again. Sorry if you are getting duplicate mail).

FWIW, I only got this one copy (at least so far)!

> I really like the idea of allocating cache memory from
> hypervisor directly. This
> is much more flexible than assigning fixed size memory to guests.

Thanks!

> I think 'frontswap' part seriously overlaps the functionality
> provided by 'ramzswap'

Could be, but I suspect there's a subtle difference.
A key part of the tmem frontswap api is that any
"put" at any time can be rejected. There's no way
for the kernel to know a priori whether the put
will be rejected or not, and the kernel must be able
to react by writing the page to a "true" swap device
and must keep track of which pages were put
to tmem frontswap and which were written to disk.
As a result, tmem frontswap cannot be configured or
used as a true swap "device".

This is critical to achieve the flexibility you
commented above that you like. Only the hypervisor
knows if a free page is available "now" because
it is flexibly managing tmem requests from multiple
guest kernels.

If my understanding of ramzswap is incorrect or you
have some clever solution that I misunderstood,
please let me know.

>> Cleancache is
> > "ephemeral" so whether a page is kept in cleancache
> (between the "put" and
> > the "get") is dependent on a number of factors that are invisible to
> > the kernel.
>
> Just an idea: as an alternate approach, we can create an
> 'in-memory compressed
> storage' backend for FS-Cache. This way, all filesystems
> modified to use
> fs-cache can benefit from this backend. To make it
> virtualization friendly like
> tmem, we can again provide (per-cache?) option to allocate
> from hypervisor i.e.
> tmem_{put,get}_page() or use [compress]+alloc natively.

I looked at FS-Cache and cachefiles and thought I understood
that it is not restricted to clean pages only, thus
not a good match for tmem cleancache.

Again, if I'm wrong (or if it is easy to tell FS-Cache that
pages may "disappear" underneath it), let me know.

BTW, pages put to tmem (both frontswap and cleancache) can
be optionally compressed.

> For guest<-->hypervisor interface, maybe we can use virtio so that all
> hypervisors can benefit? Not quite sure about this one.

I'm not very familiar with virtio, but the existence of "I/O"
in the name concerns me because tmem is entirely synchronous.

Also, tmem is well-layered so very little work needs to be
done on the Linux side for other hypervisors to benefit.
Of course these other hypervisors would need to implement
the hypervisor-side of tmem as well, but there is a well-defined
API to guide other hypervisor-side implementations... and the
opensource tmem code in Xen has a clear split between the
hypervisor-dependent and hypervisor-independent code, which
should simplify implementation for other opensource hypervisors.

I realize in "Take 3" I didn't provide the URL for more information:
http://oss.oracle.com/projects/tmem

2009-12-23 06:28:25

by Nitin Gupta

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Hi Dan,

(mail to Rusty [at] rcsinet15.oracle.com was failing, so I removed
this address from CC list).

On Tue, Dec 22, 2009 at 5:16 AM, Dan Magenheimer
<[email protected]> wrote:
>> From: Nitin Gupta [mailto:[email protected]]

>
>> I think 'frontswap' part seriously overlaps the functionality
>> provided by 'ramzswap'
>
> Could be, but I suspect there's a subtle difference.
> A key part of the tmem frontswap api is that any
> "put" at any time can be rejected. There's no way
> for the kernel to know a priori whether the put
> will be rejected or not, and the kernel must be able
> to react by writing the page to a "true" swap device
> and must keep track of which pages were put
> to tmem frontswap and which were written to disk.
> As a result, tmem frontswap cannot be configured or
> used as a true swap "device".
>
> This is critical to achieve the flexibility you
> commented above that you like. Only the hypervisor
> knows if a free page is available "now" because
> it is flexibly managing tmem requests from multiple
> guest kernels.
>

A ramzswap device can easily track which pages it sent
to the hypervisor, which pages are in the backing swap (physical) disk,
and which are in (compressed) memory. It's simply a matter
of adding some more flags. The latter two are already done in this
driver.

So, to gain flexibility of frontswap, we can have hypervisor
send the driver a callback whenever it wants to discard swap
pages under its domain. If you want to avoid even this callback,
then kernel will have to keep a copy within guest, which I think
defeats the whole purpose of swapping to hypervisor. Such
"ephemeral" pools should be used only for clean fs cache and
not for swap.

Swapping to hypervisor is mainly useful to overcome
'static partitioning' problem you mentioned in article:
http://oss.oracle.com/projects/tmem/
...such 'para-swap' can shrink/expand outside of VM constraints.


>
>>> Cleancache is
>> > "ephemeral" so whether a page is kept in cleancache
>> (between the "put" and
>> > the "get") is dependent on a number of factors that are invisible to
>> > the kernel.
>>
>> Just an idea: as an alternate approach, we can create an
>> 'in-memory compressed
>> storage' backend for FS-Cache. This way, all filesystems
>> modified to use
>> fs-cache can benefit from this backend. To make it
>> virtualization friendly like
>> tmem, we can again provide (per-cache?) option to allocate
>> from hypervisor i.e.
>> tmem_{put,get}_page() or use [compress]+alloc natively.
>
> I looked at FS-Cache and cachefiles and thought I understood
> that it is not restricted to clean pages only, thus
> not a good match for tmem cleancache.
>
> Again, if I'm wrong (or if it is easy to tell FS-Cache that
> pages may "disappear" underneath it), let me know.
>

fs-cache backend can keep 'dirty' pages within guest and forward
clean pages to hypervisor. These clean pages can be added to
ephemeral pools which can be reclaimed at any time by hypervisor.
BTW, I have not yet started work on any such fs-cache backend, so
we might later encounter some hidden/dangerous problems :)


> BTW, pages put to tmem (both frontswap and cleancache) can
> be optionally compressed.
>

If ramzswap is extended for this virtualization case, then enforcing
compression might not be good. We can then throw out pages to the hypervisor
even before the compression stage. All such changes to ramzswap are IMHO
pretty straightforward to do.


>> For guest<-->hypervisor interface, maybe we can use virtio so that all
>> hypervisors can benefit? Not quite sure about this one.
>
> I'm not very familiar with virtio, but the existence of "I/O"
> in the name concerns me because tmem is entirely synchronous.
>

Is synchronous working a *requirement* for tmem to work correctly?


> Also, tmem is well-layered so very little work needs to be
> done on the Linux side for other hypervisors to benefit.
> Of course these other hypervisors would need to implement
> the hypervisor-side of tmem as well, but there is a well-defined
> API to guide other hypervisor-side implementations... and the
> opensource tmem code in Xen has a clear split between the
> hypervisor-dependent and hypervisor-independent code, which
> should simplify implementation for other opensource hypervisors.
>

As I mentioned, I really like the idea behind tmem. All I am proposing
is that we should probably explore some alternatives to achieve this using
some existing infrastructure in kernel. I also don't have experience working
on virtio[1] or virtual-bus[2] but I have the feeling that once guest
to hypervisor channels are created, both ramzswap extension and fs-cache backend
can share the same code.

[1] virtio: http://portal.acm.org/citation.cfm?id=1400097.1400108
[2] virtual-bus: http://developer.novell.com/wiki/index.php/Virtual-bus


Thanks,
Nitin

2009-12-23 17:27:34

by Dan Magenheimer

Subject: RE: Tmem [PATCH 0/5] (Take 3): Transcendent memory

> As I mentioned, I really like the idea behind tmem. All I am proposing
> is that we should probably explore some alternatives to achieve this using
> some existing infrastructure in kernel.

Hi Nitin --

Sorry if I sounded overly negative... too busy around the holidays.

I'm definitely OK with exploring alternatives. I just think that
existing kernel mechanisms are very firmly rooted in the notion
that either the kernel owns the memory/cache or an asynchronous
device owns it. Tmem falls somewhere in between and is very
carefully designed to maximize memory flexibility *outside* of
the kernel -- across all guests in a virtualized environment --
with minimal impact to the kernel, while still providing the
kernel with the ability to use -- but not own, directly address,
or control -- additional memory when conditions allow. And
these conditions are not only completely invisible to the kernel,
but change frequently and asynchronously from the kernel,
unlike most external devices for which the kernel can "reserve"
space and use it asynchronously later.

Maybe ramzswap and FS-cache could be augmented to have similar
advantages in a virtualized environment, but I suspect they'd
end up with something very similar to tmem. Since the objective
of both is to optimize memory that IS owned (used, directly
addressable, and controlled) by the kernel, they are entirely
complementary with tmem.

> Is synchronous working a *requirement* for tmem to work correctly?

Yes. Asynchronous behavior would introduce lots of race
conditions between the hypervisor and kernel which would
greatly increase complexity and reduce performance. And
tmem then essentially becomes an I/O device, which defeats
its purpose, especially compared to a fast SSD.

> Swapping to hypervisor is mainly useful to overcome
> 'static partitioning' problem you mentioned in article:
> http://oss.oracle.com/projects/tmem/
> ...such 'para-swap' can shrink/expand outside of VM constraints.

Frontswap is very different from "hypervisor swapping" as done by
VMware as a side-effect of transparent page-sharing. With
frontswap, the kernel still decides which pages are swapped out.
If frontswap says there is space, the swap goes "fast" to tmem;
if not, the kernel writes it to its own swapdisk. So there's
no "double paging" or random page selection/swapping. On
the downside, kernels must have real swap configured and,
to avoid DoS issues, frontswap is limited by the same constraint
as ballooning (ie. can NOT expand outside of VM constraints).

Thanks,
Dan

P.S. If you want to look at implementing FS-cache or ramzswap
on top of tmem, I'd be happy to help, but I'll bet your concern:

> we might later encounter some hidden/dangerous problems :)

will prove to be correct.

> -----Original Message-----
> From: Nitin Gupta [mailto:[email protected]]
> Sent: Tuesday, December 22, 2009 11:28 PM
> To: Dan Magenheimer
> Cc: Nick Piggin; Andrew Morton; [email protected];
> [email protected]; [email protected];
> Rusty Russell;
> Rik van Riel; Dave Mccracken; Sunil Mushran; Avi Kivity; Schwidefsky;
> Balbir Singh; Marcelo Tosatti; Alan Cox; Chris Mason; Pavel Machek;
> linux-mm; linux-kernel
> Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory
>
>
> Hi Dan,
>
> (mail to Rusty [at] rcsinet15.oracle.com was failing, so I removed
> this address from CC list).
>
> On Tue, Dec 22, 2009 at 5:16 AM, Dan Magenheimer
> <[email protected]> wrote:
> >> From: Nitin Gupta [mailto:[email protected]]
>
> >
> >> I think 'frontswap' part seriously overlaps the functionality
> >> provided by 'ramzswap'
> >
> > Could be, but I suspect there's a subtle difference.
> > A key part of the tmem frontswap api is that any
> > "put" at any time can be rejected. There's no way
> > for the kernel to know a priori whether the put
> > will be rejected or not, and the kernel must be able
> > to react by writing the page to a "true" swap device
> > and must keep track of which pages were put
> > to tmem frontswap and which were written to disk.
> > As a result, tmem frontswap cannot be configured or
> > used as a true swap "device".
> >
> > This is critical to acheive the flexibility you
> > commented above that you like. Only the hypervisor
> > knows if a free page is available "now" because
> > it is flexibly managing tmem requests from multiple
> > guest kernels.
> >
>
> ramzswap devices can easily track which pages it sent
> to hypervisor, which pages are in backing swap (physical) disk
> and which are in (compressed) memory. Its simply a matter
> of adding some more flags. Latter two are already done in this
> driver.
>
> So, to gain flexibility of frontswap, we can have hypervisor
> send the driver a callback whenever it wants to discard swap
> pages under its domain. If you want to avoid even this callback,
> then kernel will have to keep a copy within guest, which I think
> defeats the whole purpose of swapping to hypervisor. Such
> "ephemeral" pools should be used only for clean fs cache and
> not for swap.
>
> Swapping to hypervisor is mainly useful to overcome
> 'static partitioning' problem you mentioned in article:
> http://oss.oracle.com/projects/tmem/
> ...such 'para-swap' can shrink/expand outside of VM constraints.
>
>
> >
> >>> Cleancache is
> >> > "ephemeral" so whether a page is kept in cleancache
> >> (between the "put" and
> >> > the "get") is dependent on a number of factors that are
> invisible to
> >> > the kernel.
> >>
> >> Just an idea: as an alternate approach, we can create an
> >> 'in-memory compressed
> >> storage' backend for FS-Cache. This way, all filesystems
> >> modified to use
> >> fs-cache can benefit from this backend. To make it
> >> virtualization friendly like
> >> tmem, we can again provide (per-cache?) option to allocate
> >> from hypervisor i.e.
> >> tmem_{put,get}_page() or use [compress]+alloc natively.
> >
> > I looked at FS-Cache and cachefiles and thought I understood
> > that it is not restricted to clean pages only, thus
> > not a good match for tmem cleancache.
> >
> > Again, if I'm wrong (or if it is easy to tell FS-Cache that
> > pages may "disappear" underneath it), let me know.
> >
>
> fs-cache backend can keep 'dirty' pages within guest and forward
> clean pages to hypervisor. These clean pages can be added to
> ephemeral pools which can be reclaimed at any time by hypervisor.
> BTW, I have not yet started work on any such fs-cache backend, so
> we might later encounter some hidder/dangerous problems :)
>
>
> > BTW, pages put to tmem (both frontswap and cleancache) can
> > be optionally compressed.
> >
>
> If ramzswap is extended for this virtualization case, then enforcing
> compression might not be good. We can then throw out pages to hvisor
> even before compression stage. All such changes to ramzswap are IMHO
> pretty straight forward to do.
>
>
> >> For guest<-->hypervisor interface, maybe we can use virtio so
> >> that all hypervisors can benefit? Not quite sure about this one.
> >
> > I'm not very familiar with virtio, but the existence of "I/O"
> > in the name concerns me because tmem is entirely synchronous.
> >
>
> Is synchronous working a *requirement* for tmem to work correctly?
>
>
> > Also, tmem is well-layered so very little work needs to be
> > done on the Linux side for other hypervisors to benefit.
> > Of course these other hypervisors would need to implement
> > the hypervisor-side of tmem as well, but there is a well-defined
> > API to guide other hypervisor-side implementations... and the
> > opensource tmem code in Xen has a clear split between the
> > hypervisor-dependent and hypervisor-independent code, which
> > should simplify implementation for other opensource hypervisors.
> >
>
> As I mentioned, I really like the idea behind tmem. All I am proposing
> is that we should probably explore some alternatives to achieve this
> using some existing infrastructure in kernel. I also don't have
> experience working on virtio[1] or virtual-bus[2] but I have the
> feeling that once guest to hvisor channels are created, both ramzswap
> extension and fs-cache backend can share the same code.
>
> [1] virtio: http://portal.acm.org/citation.cfm?id=1400097.1400108
> [2] virtual-bus: http://developer.novell.com/wiki/index.php/Virtual-bus


Thanks,
Nitin

2009-12-24 03:29:12

by Nitin Gupta

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Hi Dan,

On 12/23/2009 10:45 PM, Dan Magenheimer wrote:


> I'm definitely OK with exploring alternatives. I just think that
> existing kernel mechanisms are very firmly rooted in the notion
> that either the kernel owns the memory/cache or an asynchronous
> device owns it. Tmem falls somewhere in between and is very
> carefully designed to maximize memory flexibility *outside* of
> the kernel -- across all guests in a virtualized environment --
> with minimal impact to the kernel, while still providing the
> kernel with the ability to use -- but not own, directly address,
> or control -- additional memory when conditions allow. And
> these conditions are not only completely invisible to the kernel,
> but change frequently and asynchronously from the kernel,
> unlike most external devices for which the kernel can "reserve"
> space and use it asynchronously later.
>
> Maybe ramzswap and FS-cache could be augmented to have similar
> advantages in a virtualized environment, but I suspect they'd
> end up with something very similar to tmem. Since the objective
> of both is to optimize memory that IS owned (used, directly
> addressable, and controlled) by the kernel, they are entirely
> complementary with tmem.
>

What we want is surely tmem, but the attempt is to integrate it better
with existing infrastructure. Please give me a few days while I try to
develop a prototype.

>> Swapping to hypervisor is mainly useful to overcome
>> 'static partitioning' problem you mentioned in article:
>> http://oss.oracle.com/projects/tmem/
>> ...such 'para-swap' can shrink/expand outside of VM constraints.
>
> Frontswap is very different than "hypervisor swapping" as what's
> done by VMware as a side-effect of transparent page-sharing. With
> frontswap, the kernel still decides which pages are swapped out.
> If frontswap says there is space, the swap goes "fast" to tmem;
> if not, the kernel writes it to its own swapdisk. So there's
> no "double paging" or random page selection/swapping. On
> the downside, kernels must have real swap configured and,
> to avoid DoS issues, frontswap is limited by the same constraint
> as ballooning (ie. can NOT expand outside of VM constraints).
>

I think I did not explain my point regarding "para-swap" correctly.
What I meant was a virtual swap device which appears as a swap disk
to the kernel. The kernel swaps to this disk as usual (hence the kernel
decides which pages to swap out). This device tries to send these pages
to the hvisor, but if that fails, it falls back to swapping inside the
guest only. There is no double swapping. I think this correctly explains
the purpose of Frontswap.
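
A minimal sketch of that fallback path in C. All names here (hv_pool,
hv_put_page, paraswap_write) are hypothetical and are not taken from
ramzswap or the tmem patches; the hypervisor is modeled only as a count
of pages it is still willing to accept.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of hypervisor-side space: pages it will still accept. */
struct hv_pool {
    size_t free_pages;
};

/* Where a swapped-out page ended up. */
enum swap_dest {
    DEST_HYPERVISOR,  /* the "put" to the hypervisor succeeded */
    DEST_GUEST_SWAP   /* fell back to swapping inside the guest */
};

/* Hypothetical stand-in for a "put"; fails when the hypervisor has no room. */
static bool hv_put_page(struct hv_pool *hv)
{
    if (hv->free_pages == 0)
        return false;
    hv->free_pages--;
    return true;
}

/*
 * The kernel has already chosen which page to swap out; the device only
 * decides where that page lands, so there is no second round of page
 * selection.
 */
static enum swap_dest paraswap_write(struct hv_pool *hv)
{
    return hv_put_page(hv) ? DEST_HYPERVISOR : DEST_GUEST_SWAP;
}
```

With one free hypervisor page, the first write goes to the hypervisor and
the next one falls back to the guest's own swap.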

>> ...such 'para-swap' can shrink/expand outside of VM constraints.
<snip>
> frontswap is limited by the same constraint
> as ballooning (ie. can NOT expand outside of VM constraints).
>

What I meant was: a VM can have 512M of memory while this "para swap" disk
can have any size, say 1G. Now the kernel can swap out 1G worth of data to
this swap device, which can (potentially) send all these pages to the hvisor.
In future, if we want even more RAM for this VM, we can add another such swap
device to the guest. Thus, in a way, we are able to overcome the rigid static
partitioning of VMs w.r.t. the memory resource.


>
> P.S. If you want to look at implementing FS-cache or ramzswap
> on top of tmem, I'd be happy to help, but I'll bet your concern:
>
>> we might later encounter some hidden/dangerous problems :)
>
> will prove to be correct.
>

Please allow me a few days to experiment with 'virtswap', which will
be a virtualization-aware ramzswap driver. This will help us understand
the problems we might face with such an approach. I am new to virtio,
so it might take some time.

If we find virtswap to be feasible with virtio, we can go for the fs-cache
backend we talked about.


Thanks,
Nitin

2009-12-24 20:52:35

by Dan Magenheimer

Subject: RE: Tmem [PATCH 0/5] (Take 3): Transcendent memory

> What we want is surely tmem, but the attempt is to integrate it better
> with existing infrastructure. Please give me a few days while I try to
> develop a prototype.

Sounds good. I have a few more comments but will switch
to cc'ing just the tmem-devel* list as most people and lists are
probably not interested in this level of detail, but it will be
archived on tmem-devel in case anyone else does want to
follow it.

Thanks,
Dan

* http://oss.oracle.com/pipermail/tmem-devel/

2009-12-25 19:18:58

by Pavel Machek

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

On Wed 2009-12-23 09:15:27, Dan Magenheimer wrote:
> > As I mentioned, I really like the idea behind tmem. All I am proposing
> > is that we should probably explore some alternatives to achieve this using
> > some existing infrastructure in kernel.
>
> Hi Nitin --
>
> Sorry if I sounded overly negative... too busy around the holidays.
>
> I'm definitely OK with exploring alternatives. I just think that
> existing kernel mechanisms are very firmly rooted in the notion
> that either the kernel owns the memory/cache or an asynchronous
> device owns it. Tmem falls somewhere in between and is very

Well... compcache seems to be very similar to preswap: in preswap case
you don't know if hypervisor will have space, in ramzswap you don't
know if data are compressible.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-28 15:59:00

by Dan Magenheimer

Subject: RE: Tmem [PATCH 0/5] (Take 3): Transcendent memory


> From: Pavel Machek [mailto:[email protected]]
> > > As I mentioned, I really like the idea behind tmem. All I am
> > > proposing is that we should probably explore some alternatives
> > > to achieve this using some existing infrastructure in kernel.
> >
> > Hi Nitin --
> >
> > Sorry if I sounded overly negative... too busy around the holidays.
> >
> > I'm definitely OK with exploring alternatives. I just think that
> > existing kernel mechanisms are very firmly rooted in the notion
> > that either the kernel owns the memory/cache or an asynchronous
> > device owns it. Tmem falls somewhere in between and is very
>
> Well... compcache seems to be very similar to preswap: in preswap case
> you don't know if hypervisor will have space, in ramzswap you don't
> know if data are compressible.

Hi Pavel --

Yes there are definitely similarities too. In fact, I started
prototyping preswap (now called frontswap) with Nitin's
compcache code. IIRC I ran into some problems with compcache's
difficulties in dealing with failed "puts" due to dynamic
changes in size of hypervisor-available-memory.

Nitin may have addressed this in later versions of ramzswap.

One feature of frontswap which is different than ramzswap is
that frontswap acts as a "fronting store" for all configured
swap devices, including SAN/NAS swap devices. It doesn't
need to be separately configured as a "highest priority" swap
device. In many installations, and depending on how ramzswap
is configured, this distinction probably doesn't matter much,
though.

Thanks,
Dan

2009-12-28 20:51:25

by Pavel Machek

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

Hi!

> > achieve this using
> > > > some existing infrastructure in kernel.
> > >
> > > Hi Nitin --
> > >
> > > Sorry if I sounded overly negative... too busy around the holidays.
> > >
> > > I'm definitely OK with exploring alternatives. I just think that
> > > existing kernel mechanisms are very firmly rooted in the notion
> > > that either the kernel owns the memory/cache or an asynchronous
> > > device owns it. Tmem falls somewhere in between and is very
> >
> > Well... compcache seems to be very similar to preswap: in preswap case
> > you don't know if hypervisor will have space, in ramzswap you don't
> > know if data are compressible.
>
> Hi Pavel --
>
> Yes there are definitely similarities too. In fact, I started
> prototyping preswap (now called frontswap) with Nitin's
> compcache code. IIRC I ran into some problems with compcache's
> difficulties in dealing with failed "puts" due to dynamic
> changes in size of hypervisor-available-memory.
>
> Nitin may have addressed this in later versions of ramzswap.

That would be cool to find out.

> One feature of frontswap which is different than ramzswap is
> that frontswap acts as a "fronting store" for all configured
> swap devices, including SAN/NAS swap devices. It doesn't
> need to be separately configured as a "highest priority" swap
> device. In many installations and depending on how ramzswap

Ok, I'd call it a bug, not a feature :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-28 21:43:30

by Dan Magenheimer

Subject: RE: Tmem [PATCH 0/5] (Take 3): Transcendent memory

> > One feature of frontswap which is different than ramzswap is
> > that frontswap acts as a "fronting store" for all configured
> > swap devices, including SAN/NAS swap devices. It doesn't
> > need to be separately configured as a "highest priority" swap
> > device. In many installations and depending on how ramzswap
>
> Ok, I'd call it a bug, not a feature :-).
> Pavel

I agree it has little value (or might be considered a bug)
when managing Linux on a physical machine. But when
Linux is running in a virtual machine, it's one less thing
that a sysadmin needs to understand and configure.

2009-12-29 02:09:31

by Nitin Gupta

Subject: Re: Tmem [PATCH 0/5] (Take 3): Transcendent memory

On 12/28/2009 09:27 PM, Dan Magenheimer wrote:
>
>> From: Pavel Machek [mailto:[email protected]]

>>> I'm definitely OK with exploring alternatives. I just think that
>>> existing kernel mechanisms are very firmly rooted in the notion
>>> that either the kernel owns the memory/cache or an asynchronous
>>> device owns it. Tmem falls somewhere in between and is very
>>
>> Well... compcache seems to be very similar to preswap: in preswap case
>> you don't know if hypervisor will have space, in ramzswap you don't
>> know if data are compressible.
>
> Hi Pavel --
>
> Yes there are definitely similarities too. In fact, I started
> prototyping preswap (now called frontswap) with Nitin's
> compcache code. IIRC I ran into some problems with compcache's
> difficulties in dealing with failed "puts" due to dynamic
> changes in size of hypervisor-available-memory.
>
> Nitin may have addressed this in later versions of ramzswap.
>

Any kind of swap device that works entirely within the guest
(or in the native case) will always have problems with any write (put)
failure -- we want to reclaim a page, but because the write failed, we
can't. Problem! So ramzswap also cannot afford to have a lot of write
failures.

However, the story is different when ramzswap is "virtualization aware".
In this case, we can surely afford to have any number of "put" failures
to the hypervisor. When a put fails, we will either compress the page and
keep it in guest memory itself or forward it to the ramzswap backing swap
device (if present).
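
A rough sketch of that ordering in C. The struct, enum and function names
are hypothetical, and the rule for choosing between the in-guest
compressed pool and the backing device (whether the compressed page still
fits in the pool) is an assumption made for illustration; it is not
stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Where a page ended up after a put. */
enum put_result {
    PUT_HYPERVISOR,        /* accepted by the hypervisor */
    PUT_GUEST_COMPRESSED,  /* compressed and kept in guest memory */
    PUT_BACKING_DEV,       /* forwarded to the backing swap device */
    PUT_FAILED             /* nowhere left to store the page */
};

/* Hypothetical per-device state for a virtualization-aware ramzswap. */
struct virtswap_dev {
    size_t hv_free_pages;    /* pages the hypervisor will currently accept */
    size_t guest_pool_free;  /* bytes left in the in-guest compressed pool */
    bool   has_backing_dev;  /* backing swap device is optional */
};

/*
 * Try the hypervisor first; a failed put there is cheap because there are
 * still two places to go. compressed_len is the size the page would have
 * after compression (assumed to be known to the caller).
 */
static enum put_result virtswap_put(struct virtswap_dev *dev,
                                    size_t compressed_len)
{
    if (dev->hv_free_pages > 0) {
        dev->hv_free_pages--;
        return PUT_HYPERVISOR;
    }
    if (compressed_len <= dev->guest_pool_free) {
        dev->guest_pool_free -= compressed_len;
        return PUT_GUEST_COMPRESSED;
    }
    if (dev->has_backing_dev)
        return PUT_BACKING_DEV;
    return PUT_FAILED;
}
```

Only the last case is a real write failure from the kernel's point of
view; a refused put to the hypervisor never surfaces as one.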

Another side point is that we can achieve all this with the ramzswap approach
of virtual block devices without any kernel changes, as everything is a module.


> One feature of frontswap which is different than ramzswap is
> that frontswap acts as a "fronting store" for all configured
> swap devices, including SAN/NAS swap devices. It doesn't
> need to be separately configured as a "highest priority" swap
> device. In many installations and depending on how ramzswap
> is configured, this difference probably doesn't make much
> difference though.
>

Having a frontswap layer over *every* swap device might not be desirable. I think such
things should be completely out of the way when not desired. This was one of the primary
reasons to have the virtual block device approach for ramzswap. You can create any number
of such devices (/dev/ramzswap{0,1,2...}), each having a separate backing device (optional),
memory pools, buffers etc., which adds flexibility and helps with scalability.

On the downside, however, as you pointed out, managing all this can be a problem for sysadmins.
To ease this, some userspace magic could dynamically manage these virtual disks,
though I have not yet thought much in this direction.

Thanks,
Nitin