2010-04-22 13:45:09

by Dan Magenheimer

Subject: Frontswap [PATCH 0/4] (was Transcendent Memory): overview


Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent
Memory ("tmem") patchset. This patchset refocuses not on the underlying
technology (tmem) but on the useful functionality provided for Linux,
and defines a clean API so that frontswap can provide that functionality
via a Xen tmem driver OR completely independently of tmem.
For example: Nitin Gupta (of compcache and ramzswap fame) is implementing
an in-kernel compression "backend" for frontswap; some believe
frontswap will be a very nice interface for building RAM-like functionality
for pseudo-RAM devices such as SSD or phase-change memory; and a Pune
University team is looking at a backend for virtio (see OLS'2010).

A more complete description of frontswap can be found in the introductory
comment in mm/frontswap.c (in PATCH 2/4) which is included below
for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2
and will soon ship in a release of Oracle Enterprise Linux. Underlying
tmem technology is now shipping in Oracle VM 2.2 and was just released
in Xen 4.0 on April 15, 2010. (Search news.google.com for Transcendent
Memory.)

Signed-off-by: Dan Magenheimer <[email protected]>
Reviewed-by: Jeremy Fitzhardinge <[email protected]>

include/linux/frontswap.h | 98 ++++++++++++++
include/linux/swap.h | 2
include/linux/swapfile.h | 13 +
mm/Kconfig | 16 ++
mm/Makefile | 1
mm/frontswap.c | 301 ++++++++++++++++++++++++++++++++++++++++++++++
mm/page_io.c | 12 +
mm/swap.c | 4
mm/swapfile.c | 58 +++++++-
9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous concurrency-safe page-oriented pseudo-RAM device (such as
Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory,
aka "zmem", or other RAM-like devices) which is not directly accessible
or addressable by the kernel and is of unknown and possibly time-varying
size. This pseudo-RAM device links itself to frontswap by setting the
frontswap_ops pointer appropriately and the functions it provides must
conform to certain policies as follows:

An "init" prepares the pseudo-RAM to receive frontswap pages and returns
a non-negative pool id, used for all swap device numbers (aka "type").
A "put_page" will copy the page to pseudo-RAM and associate it with
the type and offset associated with the page. A "get_page" will copy the
page, if found, from pseudo-RAM into kernel memory, but will NOT remove
the page from pseudo-RAM. A "flush_page" will remove the page from
pseudo-RAM and a "flush_area" will remove ALL pages associated with the
swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse
further puts with that swap type.
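
For illustration only, the backend-facing interface sketched above might
look roughly like the following. The real definitions live in
include/linux/frontswap.h (PATCH 1/4); the exact prototypes shown here
are assumptions inferred from the description, not copied from the patch.

/* Sketch: one function pointer per operation described above. */
struct frontswap_ops {
	int  (*init)(unsigned type);	/* prepare pool, return non-negative pool id */
	int  (*put_page)(unsigned type, pgoff_t offset, struct page *page);
	int  (*get_page)(unsigned type, pgoff_t offset, struct page *page);
	void (*flush_page)(unsigned type, pgoff_t offset);
	void (*flush_area)(unsigned type);
};

A backend (the Xen tmem driver, an in-kernel compression backend, etc.)
fills in one of these and sets the frontswap_ops pointer to it when it
registers itself.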

Once a page is successfully put, a matching get on the page will always
succeed. So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap. If the put returns
non-zero, the data has been successfully saved to pseudo-RAM and
a disk write and, if the data is later read back, a disk read are avoided.
If a put returns zero, pseudo-RAM has rejected the data, and the page can
be written to swap as usual.

Note that if a page is put and the page already exists in pseudo-RAM
(a "duplicate" put), either the put succeeds and the data is overwritten,
or the put fails AND the page is flushed. This ensures stale data may
never be obtained from pseudo-RAM.
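
In pseudo-code, the swap-out decision described above amounts to roughly
the following. This is a sketch only, not the literal hook added in
mm/page_io.c; frontswap_put() and write_to_swap_device() are illustrative
names, not functions taken from the patch.

/* Sketch of the swap-out path: try pseudo-RAM first, fall back to disk. */
static int swap_out_page(struct page *page)
{
	if (frontswap_put(page) != 0) {
		/* Success: the page now lives in pseudo-RAM, no disk write
		 * is issued, and a later swap-in is a copy from pseudo-RAM
		 * rather than a disk read. */
		return 0;
	}
	/* Pseudo-RAM rejected the page (or a duplicate put failed and the
	 * stale copy was flushed); write to the swap device as usual. */
	return write_to_swap_device(page);
}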


2010-04-22 15:29:09

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/22/2010 04:42 PM, Dan Magenheimer wrote:
> Frontswap is so named because it can be thought of as the opposite of
> a "backing" store for a swap device. The storage is assumed to be
> a synchronous concurrency-safe page-oriented pseudo-RAM device (such as
> Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory,
> aka "zmem", or other RAM-like devices) which is not directly accessible
> or addressable by the kernel and is of unknown and possibly time-varying
> size. This pseudo-RAM device links itself to frontswap by setting the
> frontswap_ops pointer appropriately and the functions it provides must
> conform to certain policies as follows:
>

How baked in is the synchronous requirement? Memory, for example, can
be asynchronous if it is copied by a dma engine, and since there are
hardware encryption engines, there may be hardware compression engines
in the future.


--
error compiling committee.c: too many arguments to function

2010-04-22 15:49:45

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > a synchronous concurrency-safe page-oriented pseudo-RAM device (such
> > :
> > conform to certain policies as follows:
>
> How baked in is the synchronous requirement? Memory, for example, can
> be asynchronous if it is copied by a dma engine, and since there are
> hardware encryption engines, there may be hardware compression engines
> in the future.

Thanks for the comment!

Synchronous is required, but likely could be simulated by ensuring all
coherency (and concurrency) requirements are met by some intermediate
"buffering driver" -- at the cost of an extra page copy into a buffer
and overhead of tracking the handles (poolid/inode/index) of pages in
the buffer that are "in flight". This is an approach we are considering
to implement an SSD backend, but hasn't been tested yet so, ahem, the
proof will be in the put'ing. ;-)
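
To make the idea concrete, the "in flight" tracking for such a buffering
driver might be little more than a list of handles, each with a copy of
the page awaiting asynchronous I/O. The structure below is purely
hypothetical; as noted, nothing like this has been written or tested.

/* Hypothetical bookkeeping for pages buffered between the synchronous
 * frontswap interface and an asynchronous backend. */
struct inflight_page {
	s32 poolid;		/* pool id returned by the backend's "init" */
	u64 inode;		/* object id (the swap "type" for frontswap) */
	u64 index;		/* page offset within the object */
	struct page *copy;	/* extra copy awaiting asynchronous I/O */
	struct list_head list;	/* chained on the driver's in-flight list */
};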

Dan

2010-04-22 16:14:04

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/22/2010 06:48 PM, Dan Magenheimer wrote:
>>> a synchronous concurrency-safe page-oriented pseudo-RAM device (such
>>> :
>>> conform to certain policies as follows:
>>>
>> How baked in is the synchronous requirement? Memory, for example, can
>> be asynchronous if it is copied by a dma engine, and since there are
>> hardware encryption engines, there may be hardware compression engines
>> in the future.
>>
> Thanks for the comment!
>
> Synchronous is required, but likely could be simulated by ensuring all
> coherency (and concurrency) requirements are met by some intermediate
> "buffering driver" -- at the cost of an extra page copy into a buffer
> and overhead of tracking the handles (poolid/inode/index) of pages in
> the buffer that are "in flight". This is an approach we are considering
> to implement an SSD backend, but hasn't been tested yet so, ahem, the
> proof will be in the put'ing. ;-)
>

Well, copying memory so you can use a zero-copy dma engine is
counterproductive.

Much easier to simulate an asynchronous API with a synchronous backend.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-22 20:18:27

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > Synchronous is required, but likely could be simulated by ensuring
> all
> > coherency (and concurrency) requirements are met by some intermediate
> > "buffering driver" -- at the cost of an extra page copy into a buffer
> > and overhead of tracking the handles (poolid/inode/index) of pages in
> > the buffer that are "in flight". This is an approach we are
> considering
> > to implement an SSD backend, but hasn't been tested yet so, ahem, the
> > proof will be in the put'ing. ;-)
>
> Much easier to simulate an asynchronous API with a synchronous backend.

Indeed. But an asynchronous API is not appropriate for frontswap
(or cleancache). The reason the hooks are so simple is because they
are assumed to be synchronous so that the page can be immediately
freed/reused.

> Well, copying memory so you can use a zero-copy dma engine is
> counterproductive.

Yes, but for something like an SSD where copying can be used to
build up a full 64K write, the cost of copying memory may not be
counterproductive.

2010-04-23 09:49:22

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/22/2010 11:15 PM, Dan Magenheimer wrote:
>>
>> Much easier to simulate an asynchronous API with a synchronous backend.
>>
> Indeed. But an asynchronous API is not appropriate for frontswap
> (or cleancache). The reason the hooks are so simple is because they
> are assumed to be synchronous so that the page can be immediately
> freed/reused.
>

Swapping is inherently asynchronous, so we'll have to wait for that to
complete anyway (as frontswap does not guarantee swap-in will succeed).
I don't doubt it makes things simpler, but also less flexible and useful.

Something else that bothers me is the double swapping. Sure we're
making swapin faster, but we're still loading the io subsystem with
writes. Much better to make swap-to-ram authoritative (and have the
hypervisor swap it to disk if it needs the memory).

>> Well, copying memory so you can use a zero-copy dma engine is
>> counterproductive.
>>
> Yes, but for something like an SSD where copying can be used to
> build up a full 64K write, the cost of copying memory may not be
> counterproductive.
>

I don't understand. Please clarify.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-23 13:48:40

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> >> Much easier to simulate an asynchronous API with a synchronous
> backend.
> >>
> > Indeed. But an asynchronous API is not appropriate for frontswap
> > (or cleancache). The reason the hooks are so simple is because they
> > are assumed to be synchronous so that the page can be immediately
> > freed/reused.
> >
>
> Swapping is inherently asynchronous, so we'll have to wait for that to
> complete anyway (as frontswap does not guarantee swap-in will succeed).
> I don't doubt it makes things simpler, but also less flexible and
> useful.
>
> Something else that bothers me is the double swapping. Sure we're
> making swapin faster, but we we're still loading the io subsystem with
> writes. Much better to make swap-to-ram authoritative (and have the
> hypervisor swap it to disk if it needs the memory).

Hmmm.... I now realize you are thinking of applying frontswap to
a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
will succeed, never double-swaps, and doesn't load the io subsystem
with writes. This all works very nicely today with a fully
synchronous "backend" (e.g. with tmem in Xen 4.0).

So, I agree, hiding a truly asynchronous interface behind
frontswap's synchronous interface may have some thorny issues.
I wasn't recommending that it should be done, just speculating
how it might be done. This doesn't make frontswap any less
useful with a fully synchronous "backend".

> >> Well, copying memory so you can use a zero-copy dma engine is
> >> counterproductive.
> >>
> > Yes, but for something like an SSD where copying can be used to
> > build up a full 64K write, the cost of copying memory may not be
> > counterproductive.
>
> I don't understand. Please clarify.

If I understand correctly, SSDs work much more efficiently when
writing 64KB blocks. So much more efficiently in fact that waiting
to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
will be faster than page-at-a-time DMA'ing them. If so, the
frontswap interface, backed by an asynchronous "buffering layer"
which collects 16 pages before writing to the SSD, may work
very nicely. Again this is still just speculation... I was
only pointing out that zero-copy DMA may not always be the best
solution.

Thanks,
Dan

2010-04-23 13:57:56

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 04:47 PM, Dan Magenheimer wrote:
>>>> Much easier to simulate an asynchronous API with a synchronous
>>>>
>> backend.
>>
>>>>
>>> Indeed. But an asynchronous API is not appropriate for frontswap
>>> (or cleancache). The reason the hooks are so simple is because they
>>> are assumed to be synchronous so that the page can be immediately
>>> freed/reused.
>>>
>>>
>> Swapping is inherently asynchronous, so we'll have to wait for that to
>> complete anyway (as frontswap does not guarantee swap-in will succeed).
>> I don't doubt it makes things simpler, but also less flexible and
>> useful.
>>
>> Something else that bothers me is the double swapping. Sure we're
>> making swapin faster, but we we're still loading the io subsystem with
>> writes. Much better to make swap-to-ram authoritative (and have the
>> hypervisor swap it to disk if it needs the memory).
>>
> Hmmm.... I now realize you are thinking of applying frontswap to
> a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
> hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
> will succeed, never double-swaps, and doesn't load the io subsystem
> with writes. This all works very nicely today with a fully
> synchronous "backend" (e.g. with tmem in Xen 4.0).
>

Perhaps I misunderstood. Isn't frontswap in front of the normal swap
device? So we do have double swapping, first to frontswap (which is in
memory, yes, but still a nonzero cost), then the normal swap device.
The io subsystem is loaded with writes; you only save the reads.

Better to swap to the hypervisor, and make it responsible for committing
to disk on overcommit or keeping in RAM when memory is available. This
way we avoid the write to disk if memory is in fact available (or at
least defer it until later). This way you avoid both reads and writes
if memory is available.

>>>> Well, copying memory so you can use a zero-copy dma engine is
>>>> counterproductive.
>>>>
>>>>
>>> Yes, but for something like an SSD where copying can be used to
>>> build up a full 64K write, the cost of copying memory may not be
>>> counterproductive.
>>>
>> I don't understand. Please clarify.
>>
> If I understand correctly, SSDs work much more efficiently when
> writing 64KB blocks. So much more efficiently in fact that waiting
> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> will be faster than page-at-a-time DMA'ing them. If so, the
> frontswap interface, backed by an asynchronous "buffering layer"
> which collects 16 pages before writing to the SSD, may work
> very nicely. Again this is still just speculation... I was
> only pointing out that zero-copy DMA may not always be the best
> solution.
>

The guest can easily (and should) issue 64k dmas using scatter/gather.
No need for copying.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-23 14:44:15

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> >> Something else that bothers me is the double swapping. Sure we're
> >> making swapin faster, but we we're still loading the io subsystem
> with
> >> writes. Much better to make swap-to-ram authoritative (and have the
> >> hypervisor swap it to disk if it needs the memory).
> >>
> > Hmmm.... I now realize you are thinking of applying frontswap to
> > a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
> > hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
> > will succeed, never double-swaps, and doesn't load the io subsystem
> > with writes. This all works very nicely today with a fully
> > synchronous "backend" (e.g. with tmem in Xen 4.0).
>
> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
> device? So we do have double swapping, first to frontswap (which is in
> memory, yes, but still a nonzero cost), then the normal swap device.
> The io subsystem is loaded with writes; you only save the reads.
> Better to swap to the hypervisor, and make it responsible for
> committing
> to disk on overcommit or keeping in RAM when memory is available. This
> way we avoid the write to disk if memory is in fact available (or at
> least defer it until later). This way you avoid both reads and writes
> if memory is available.

Each page is either in frontswap OR on the normal swap device,
never both. So, yes, both reads and writes are avoided if memory
is available and there is no write issued to the io subsystem if
memory is available. The is_memory_available decision is determined
by the hypervisor dynamically for each page when the guest attempts
a "frontswap_put". So, yes, you are indeed "swapping to the
hypervisor" but, at least in the case of Xen, the hypervisor
never swaps any memory to disk so there is never double swapping.

> > If I understand correctly, SSDs work much more efficiently when
> > writing 64KB blocks. So much more efficiently in fact that waiting
> > to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> > will be faster than page-at-a-time DMA'ing them. If so, the
> > frontswap interface, backed by an asynchronous "buffering layer"
> > which collects 16 pages before writing to the SSD, may work
> > very nicely. Again this is still just speculation... I was
> > only pointing out that zero-copy DMA may not always be the best
> > solution.
>
> The guest can easily (and should) issue 64k dmas using scatter/gather.
> No need for copying.

In many cases, this is true. For the swap subsystem, it may not always
be true, though I see recent signs that it may be headed in that
direction. In any case, unless you see this SSD discussion as
critical to the proposed acceptance of the frontswap patchset,
let's table it until there's some prototyping done.

Thanks,
Dan

2010-04-23 14:54:15

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>
>> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
>> device? So we do have double swapping, first to frontswap (which is in
>> memory, yes, but still a nonzero cost), then the normal swap device.
>> The io subsystem is loaded with writes; you only save the reads.
>> Better to swap to the hypervisor, and make it responsible for
>> committing
>> to disk on overcommit or keeping in RAM when memory is available. This
>> way we avoid the write to disk if memory is in fact available (or at
>> least defer it until later). This way you avoid both reads and writes
>> if memory is available.
>>
> Each page is either in frontswap OR on the normal swap device,
> never both. So, yes, both reads and writes are avoided if memory
> is available and there is no write issued to the io subsystem if
> memory is available. The is_memory_available decision is determined
> by the hypervisor dynamically for each page when the guest attempts
> a "frontswap_put". So, yes, you are indeed "swapping to the
> hypervisor" but, at least in the case of Xen, the hypervisor
> never swaps any memory to disk so there is never double swapping.
>

I see. So why not implement this as an ordinary swap device, with a
higher priority than the disk device? this way we reuse an API and keep
things asynchronous, instead of introducing a special purpose API.

Doesn't this commit the hypervisor to retain this memory? If so, isn't
it simpler to give the page to the guest (so now it doesn't need to swap
at all)?

What about live migration? do you live migrate frontswap pages?

>>> If I understand correctly, SSDs work much more efficiently when
>>> writing 64KB blocks. So much more efficiently in fact that waiting
>>> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
>>> will be faster than page-at-a-time DMA'ing them. If so, the
>>> frontswap interface, backed by an asynchronous "buffering layer"
>>> which collects 16 pages before writing to the SSD, may work
>>> very nicely. Again this is still just speculation... I was
>>> only pointing out that zero-copy DMA may not always be the best
>>> solution.
>>>
>> The guest can easily (and should) issue 64k dmas using scatter/gather.
>> No need for copying.
>>
> In many cases, this is true. For the swap subsystem, it may not always
> be true, though I see recent signs that it may be headed in that
> direction.

I think it will be true in an overwhelming number of cases. Flash is
new enough that most devices support scatter/gather.

> In any case, unless you see this SSD discussion as
> critical to the proposed acceptance of the frontswap patchset,
> let's table it until there's some prototyping done.
>

It isn't particularly related.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-23 15:01:46

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 05:52 PM, Avi Kivity wrote:
>
> I see. So why not implement this as an ordinary swap device, with a
> higher priority than the disk device? this way we reuse an API and
> keep things asynchronous, instead of introducing a special purpose API.
>

Ok, from your original post:

> An "init" prepares the pseudo-RAM to receive frontswap pages and returns
> a non-negative pool id, used for all swap device numbers (aka "type").
> A "put_page" will copy the page to pseudo-RAM and associate it with
> the type and offset associated with the page. A "get_page" will copy the
> page, if found, from pseudo-RAM into kernel memory, but will NOT remove
> the page from pseudo-RAM. A "flush_page" will remove the page from
> pseudo-RAM and a "flush_area" will remove ALL pages associated with the
> swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse
> further puts with that swap type.
>
> Once a page is successfully put, a matching get on the page will always
> succeed. So when the kernel finds itself in a situation where it needs
> to swap out a page, it first attempts to use frontswap. If the put returns
> non-zero, the data has been successfully saved to pseudo-RAM and
> a disk write and, if the data is later read back, a disk read are avoided.
> If a put returns zero, pseudo-RAM has rejected the data, and the page can
> be written to swap as usual.
>
> Note that if a page is put and the page already exists in pseudo-RAM
> (a "duplicate" put), either the put succeeds and the data is overwritten,
> or the put fails AND the page is flushed. This ensures stale data may
> never be obtained from pseudo-RAM.
>

Looks like "init" == open, "put_page" == write, "get_page" == read,
"flush_page|flush_area" == trim. The only difference seems to be that
an overwriting put_page may fail. Doesn't seem to be much of a win,
since a guest can simply avoid issuing the duplicate put_page, so the
hypervisor is still committed to holding this memory for the guest.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-23 15:57:19

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > Each page is either in frontswap OR on the normal swap device,
> > never both. So, yes, both reads and writes are avoided if memory
> > is available and there is no write issued to the io subsystem if
> > memory is available. The is_memory_available decision is determined
> > by the hypervisor dynamically for each page when the guest attempts
> > a "frontswap_put". So, yes, you are indeed "swapping to the
> > hypervisor" but, at least in the case of Xen, the hypervisor
> > never swaps any memory to disk so there is never double swapping.
>
> I see. So why not implement this as an ordinary swap device, with a
> higher priority than the disk device? this way we reuse an API and
> keep
> things asynchronous, instead of introducing a special purpose API.

Because the swapping API doesn't adapt well to dynamic changes in
the size and availability of the underlying "swap" device, which
is very useful for swap to (bare-metal) hypervisor.

> Doesn't this commit the hypervisor to retain this memory? If so, isn't
> it simpler to give the page to the guest (so now it doesn't need to
> swap at all)?

Yes the hypervisor is committed to retain the memory. In
some ways, giving a page of memory to a guest (via ballooning)
is simpler and in some ways not. When a guest "owns" a page,
it can do whatever it wants with it, independent of what is best
for the "whole" virtualized system. When the hypervisor
"owns" the page on behalf of the guest but the guest can't
directly address it, the hypervisor has more flexibility.
For example, tmem optionally compresses all frontswap pages,
effectively doubling the size of its available memory.
In the future, knowing that a guest application can never
access the pages directly, it might store all frontswap pages in
(slower but still synchronous) phase change memory or "far NUMA"
memory.

> What about live migration? do you live migrate frontswap pages?

Yes, fully supported in Xen 4.0. And as another example of
flexibility, note that "lazy migration" of frontswap'ed pages
might be quite reasonable.

> >> The guest can easily (and should) issue 64k dmas using
> scatter/gather.
> >> No need for copying.
> >>
> > In many cases, this is true. For the swap subsystem, it may not
> always
> > be true, though I see recent signs that it may be headed in that
> > direction.
>
> I think it will be true in an overwhelming number of cases. Flash is
> new enough that most devices support scatter/gather.

I wasn't referring to hardware capability but to the availability
and timing constraints of the pages that need to be swapped.

> > In any case, unless you see this SSD discussion as
> > critical to the proposed acceptance of the frontswap patchset,
> > let's table it until there's some prototyping done.
>
> It isn't particularly related.

Agreed.

2010-04-23 16:28:30

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > If a put returns zero, pseudo-RAM has rejected the data, and the page
> can
> > be written to swap as usual.
> >
> > Note that if a page is put and the page already exists in pseudo-RAM
> > (a "duplicate" put), either the put succeeds and the data is
> overwritten,
> > or the put fails AND the page is flushed. This ensures stale data
> may
> > never be obtained from pseudo-RAM.
>
> Looks like "init" == open, "put_page" == write, "get_page" == read,
> "flush_page|flush_area" == trim. The only difference seems to be that
> an overwriting put_page may fail. Doesn't seem to be much of a win,

No, ANY put_page can fail, and this is a critical part of the API
that provides all of the flexibility for the hypervisor and all
the guests. (See previous reply.)

The "duplicate put" semantics are carefully specified as there
are some coherency corner cases that are very difficult to handle
in the "backend" but very easy to handle in the kernel. So the
specification explicitly punts these to the kernel.

2010-04-23 16:35:19

by Jiahua

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Fri, Apr 23, 2010 at 6:47 AM, Dan Magenheimer
<[email protected]> wrote:

> If I understand correctly, SSDs work much more efficiently when
> writing 64KB blocks. So much more efficiently in fact that waiting
> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> will be faster than page-at-a-time DMA'ing them. If so, the
> frontswap interface, backed by an asynchronous "buffering layer"
> which collects 16 pages before writing to the SSD, may work
> very nicely. Again this is still just speculation... I was
> only pointing out that zero-copy DMA may not always be the best
> solution.

I guess you are talking about the write amplification issue of SSDs. In
fact, most of the new generation drives have already solved the problem
with a log-like structure. Even with older drives, the size of the
writes depends on the size of the erase block, which is not
necessarily 64KB.

Jiahua

2010-04-24 01:52:48

by Nitin Gupta

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 08:22 PM, Avi Kivity wrote:
> On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>>
>>> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
>>> device? So we do have double swapping, first to frontswap (which is in
>>> memory, yes, but still a nonzero cost), then the normal swap device.
>>> The io subsystem is loaded with writes; you only save the reads.
>>> Better to swap to the hypervisor, and make it responsible for
>>> committing
>>> to disk on overcommit or keeping in RAM when memory is available. This
>>> way we avoid the write to disk if memory is in fact available (or at
>>> least defer it until later). This way you avoid both reads and writes
>>> if memory is available.
>>>
>> Each page is either in frontswap OR on the normal swap device,
>> never both. So, yes, both reads and writes are avoided if memory
>> is available and there is no write issued to the io subsystem if
>> memory is available. The is_memory_available decision is determined
>> by the hypervisor dynamically for each page when the guest attempts
>> a "frontswap_put". So, yes, you are indeed "swapping to the
>> hypervisor" but, at least in the case of Xen, the hypervisor
>> never swaps any memory to disk so there is never double swapping.
>>
>
> I see. So why not implement this as an ordinary swap device, with a
> higher priority than the disk device? this way we reuse an API and keep
> things asynchronous, instead of introducing a special purpose API.
>

ramzswap is exactly this: an ordinary swap device which stores every page
in (compressed) memory and it's enabled as the highest priority swap. Currently,
it stores these compressed chunks in guest memory itself but it is not very
difficult to send these chunks out to host/hypervisor using virtio.

However, it suffers from unnecessary block I/O layer overhead and requires
weird hooks in swap code, say to get notification when a swap slot is freed.
OTOH, the frontswap approach gets rid of any such artifacts and overheads.
(ramzswap: http://code.google.com/p/compcache/)

Thanks,
Nitin

2010-04-24 18:22:39

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 06:56 PM, Dan Magenheimer wrote:
>>> Each page is either in frontswap OR on the normal swap device,
>>> never both. So, yes, both reads and writes are avoided if memory
>>> is available and there is no write issued to the io subsystem if
>>> memory is available. The is_memory_available decision is determined
>>> by the hypervisor dynamically for each page when the guest attempts
>>> a "frontswap_put". So, yes, you are indeed "swapping to the
>>> hypervisor" but, at least in the case of Xen, the hypervisor
>>> never swaps any memory to disk so there is never double swapping.
>>>
>> I see. So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device? this way we reuse an API and
>> keep
>> things asynchronous, instead of introducing a special purpose API.
>>
> Because the swapping API doesn't adapt well to dynamic changes in
> the size and availability of the underlying "swap" device, which
> is very useful for swap to (bare-metal) hypervisor.
>

Can we extend it? Adding new APIs is easy, but harder to maintain in
the long term.

>> Doesn't this commit the hypervisor to retain this memory? If so, isn't
>> it simpler to give the page to the guest (so now it doesn't need to
>> swap at all)?
>>
> Yes the hypervisor is committed to retain the memory. In
> some ways, giving a page of memory to a guest (via ballooning)
> is simpler and in some ways not. When a guest "owns" a page,
> it can do whatever it wants with it, independent of what is best
> for the "whole" virtualized system. When the hypervisor
> "owns" the page on behalf of the guest but the guest can't
> directly address it, the hypervisor has more flexibility.
> For example, tmem optionally compresses all frontswap pages,
> effectively doubling the size of its available memory.
> In the future, knowing that a guest application can never
> access the pages directly, it might store all frontswap pages in
> (slower but still synchronous) phase change memory or "far NUMA"
> memory.
>

Ok. For non traditional RAM uses I really think an async API is
needed. If the API is backed by a cpu synchronous operation is fine,
but once it isn't RAM, it can be all kinds of interesting things.

Note that even if you do give the page to the guest, you still control
how it can access it, through the page tables. So for example you can
easily compress a guest's pages without telling it about it; whenever it
touches them you decompress them on the fly.

>> I think it will be true in an overwhelming number of cases. Flash is
>> new enough that most devices support scatter/gather.
>>
> I wasn't referring to hardware capability but to the availability
> and timing constraints of the pages that need to be swapped.
>

I have a feeling we're talking past each other here. Swap has no timing
constraints, it is asynchronous and usually to slow devices.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-24 18:25:22

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/23/2010 07:26 PM, Dan Magenheimer wrote:
>>
>> Looks like "init" == open, "put_page" == write, "get_page" == read,
>> "flush_page|flush_area" == trim. The only difference seems to be that
>> an overwriting put_page may fail. Doesn't seem to be much of a win,
>>
> No, ANY put_page can fail, and this is a critical part of the API
> that provides all of the flexibility for the hypervisor and all
> the guests. (See previous reply.)
>

The guest isn't required to do any put_page()s. It can issue lots of
them when memory is available, and keep them in the hypervisor forever.
Failing new put_page()s isn't enough for a dynamic system, you need to
be able to force the guest to give up some of its tmem.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-24 18:28:05

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>
>> I see. So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device? this way we reuse an API and keep
>> things asynchronous, instead of introducing a special purpose API.
>>
>>
> ramzswap is exactly this: an ordinary swap device which stores every page
> in (compressed) memory and its enabled as highest priority swap. Currently,
> it stores these compressed chunks in guest memory itself but it is not very
> difficult to send these chunks out to host/hypervisor using virtio.
>
> However, it suffers from unnecessary block I/O layer overhead and requires
> weird hooks in swap code, say to get notification when a swap slot is freed.
>

Isn't that TRIM?

> OTOH frontswap approach gets rid of any such artifacts and overheads.
> (ramzswap: http://code.google.com/p/compcache/)
>

Maybe we should optimize these overheads instead. Swap used to always
be to slow devices, but swap-to-flash has the potential to make swap act
like an extension of RAM.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-25 00:31:44

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> >> I see. So why not implement this as an ordinary swap device, with a
> >> higher priority than the disk device? this way we reuse an API and
> >> keep
> >> things asynchronous, instead of introducing a special purpose API.
> >>
> > Because the swapping API doesn't adapt well to dynamic changes in
> > the size and availability of the underlying "swap" device, which
> > is very useful for swap to (bare-metal) hypervisor.
>
> Can we extend it? Adding new APIs is easy, but harder to maintain in
> the long term.

Umm... I think the difference between a "new" API and extending
an existing one here is a choice of semantics. As designed, frontswap
is an extremely simple, only-very-slightly-intrusive set of hooks that
allows swap pages to, under some conditions, go to pseudo-RAM instead
of an asynchronous disk-like device. It works today with at least
one "backend" (Xen tmem), is shipping today in real distros, and is
extremely easy to enable/disable via CONFIG or module... meaning
no impact on anyone other than those who choose to benefit from it.

"Extending" the existing swap API, which has largely been untouched for
many years, seems like a significantly more complex and error-prone
undertaking that will affect nearly all Linux users with a likely long
bug tail. And, by the way, there is no existence proof that it
will be useful.

Seems like a no-brainer to me.

> Ok. For non traditional RAM uses I really think an async API is
> needed. If the API is backed by a cpu synchronous operation is fine,
> but once it isn't RAM, it can be all kinds of interesting things.

Well, we shall see. It may also be the case that the existing
asynchronous swap API will work fine for some non traditional RAM;
and it may also be the case that frontswap works fine for some
non traditional RAM. I agree there is fertile ground for exploration
here. But let's not allow our speculation on what may or may
not work in the future halt forward progress of something that works
today.

> Note that even if you do give the page to the guest, you still control
> how it can access it, through the page tables. So for example you can
> easily compress a guest's pages without telling it about it; whenever
> it
> touches them you decompress them on the fly.

Yes, at a much larger, more invasive cost to the kernel. Frontswap
and cleancache and tmem are all well-layered for a good reason.

> >> I think it will be true in an overwhelming number of cases. Flash
> is
> >> new enough that most devices support scatter/gather.
> >>
> > I wasn't referring to hardware capability but to the availability
> > and timing constraints of the pages that need to be swapped.
> >
>
> I have a feeling we're talking past each other here.

Could be.

> Swap has no timing
> constraints, it is asynchronous and usually to slow devices.

What I was referring to is that the existing swap code DOES NOT
always have the ability to collect N scattered pages before
initiating an I/O write suitable for a device (such as an SSD)
that is optimized for writing N pages at a time. That is what
I meant by a timing constraint. See references to page_cluster
in the swap code (and this is for contiguous pages, not scattered).

Dan

2010-04-25 00:41:58

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > No, ANY put_page can fail, and this is a critical part of the API
> > that provides all of the flexibility for the hypervisor and all
> > the guests. (See previous reply.)
>
> The guest isn't required to do any put_page()s. It can issue lots of
> them when memory is available, and keep them in the hypervisor forever.
> Failing new put_page()s isn't enough for a dynamic system, you need to
> be able to force the guest to give up some of its tmem.

Yes, indeed, this is true. That is why it is important for any
policy implemented behind frontswap to "bill" the guest if it
is attempting to keep frontswap pages in the hypervisor forever
and to prod the guest to reclaim them when it no longer needs
super-fast emergency swap space. The frontswap patch already includes
the kernel mechanism to enable this and the prodding can be implemented
by a guest daemon (of which there already exists an existence proof).

(While devil's advocacy is always welcome, frontswap is NOT a
cool academic science project where these issues have not been
considered or tested.)

2010-04-25 03:14:16

by Nitin Gupta

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/24/2010 11:57 PM, Avi Kivity wrote:
> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>
>>> I see. So why not implement this as an ordinary swap device, with a
>>> higher priority than the disk device? this way we reuse an API and keep
>>> things asynchronous, instead of introducing a special purpose API.
>>>
>>>
>> ramzswap is exactly this: an ordinary swap device which stores every page
>> in (compressed) memory and its enabled as highest priority swap.
>> Currently,
>> it stores these compressed chunks in guest memory itself but it is not
>> very
>> difficult to send these chunks out to host/hypervisor using virtio.
>>
>> However, it suffers from unnecessary block I/O layer overhead and
>> requires
>> weird hooks in swap code, say to get notification when a swap slot is
>> freed.
>>
>
> Isn't that TRIM?

No: trim or discard is not useful. The problem is that we require a callback
_as soon as_ a page (swap slot) is freed. Otherwise, stale data quickly
accumulates in memory, defeating the whole purpose of in-memory compressed
swap devices (like ramzswap).

Increasing the frequency of discards is also not an option:
- Creating discard bio requests itself needs memory, and these swap devices
come into the picture only under low-memory conditions.
- We need to regularly scan swap_map to issue these discards. Increasing discard
frequency also means more frequent scanning (which will still not be fast enough
for ramzswap needs).

>
>> OTOH frontswap approach gets rid of any such artifacts and overheads.
>> (ramzswap: http://code.google.com/p/compcache/)
>>
>
> Maybe we should optimize these overheads instead. Swap used to always
> be to slow devices, but swap-to-flash has the potential to make swap act
> like an extension of RAM.
>

Spending a lot of effort optimizing an overhead which can be completely avoided
is probably not worth it.

Also, I think the choice of a synchronous style API for frontswap and cleancache
is justified as they want to send pages to host *RAM*. If you want to use other
devices like SSDs, then these should be just added as another swap device as
we do currently -- these should not be used as frontswap storage directly.

Thanks,
Nitin

2010-04-25 12:06:37

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 03:41 AM, Dan Magenheimer wrote:
>>> No, ANY put_page can fail, and this is a critical part of the API
>>> that provides all of the flexibility for the hypervisor and all
>>> the guests. (See previous reply.)
>>>
>> The guest isn't required to do any put_page()s. It can issue lots of
>> them when memory is available, and keep them in the hypervisor forever.
>> Failing new put_page()s isn't enough for a dynamic system, you need to
>> be able to force the guest to give up some of its tmem.
>>
> Yes, indeed, this is true. That is why it is important for any
> policy implemented behind frontswap to "bill" the guest if it
> is attempting to keep frontswap pages in the hypervisor forever
> and to prod the guest to reclaim them when it no longer needs
> super-fast emergency swap space. The frontswap patch already includes
> the kernel mechanism to enable this and the prodding can be implemented
> by a guest daemon (of which there already exists an existence proof).
>

In this case you could use the same mechanism to stop new put_page()s?

Seems frontswap is like a reverse balloon, where the balloon is in
hypervisor space instead of the guest space.

> (While devil's advocacy is always welcome, frontswap is NOT a
> cool academic science project where these issues have not been
> considered or tested.)
>


Good to know.

--
error compiling committee.c: too many arguments to function

2010-04-25 12:12:10

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 03:30 AM, Dan Magenheimer wrote:
>>>> I see. So why not implement this as an ordinary swap device, with a
>>>> higher priority than the disk device? this way we reuse an API and
>>>> keep
>>>> things asynchronous, instead of introducing a special purpose API.
>>>>
>>>>
>>> Because the swapping API doesn't adapt well to dynamic changes in
>>> the size and availability of the underlying "swap" device, which
>>> is very useful for swap to (bare-metal) hypervisor.
>>>
>> Can we extend it? Adding new APIs is easy, but harder to maintain in
>> the long term.
>>
> Umm... I think the difference between a "new" API and extending
> an existing one here is a choice of semantics. As designed, frontswap
> is an extremely simple, only-very-slightly-intrusive set of hooks that
> allows swap pages to, under some conditions, go to pseudo-RAM instead
> of an asynchronous disk-like device. It works today with at least
> one "backend" (Xen tmem), is shipping today in real distros, and is
> extremely easy to enable/disable via CONFIG or module... meaning
> no impact on anyone other than those who choose to benefit from it.
>
> "Extending" the existing swap API, which has largely been untouched for
> many years, seems like a significantly more complex and error-prone
> undertaking that will affect nearly all Linux users with a likely long
> bug tail. And, by the way, there is no existence proof that it
> will be useful.
>
> Seems like a no-brainer to me.
>

My issue is with the API's synchronous nature. Both RAM and more exotic
memories can be used with DMA instead of copying. A synchronous
interface gives this up.

>> Ok. For non traditional RAM uses I really think an async API is
>> needed. If the API is backed by a cpu synchronous operation is fine,
>> but once it isn't RAM, it can be all kinds of interesting things.
>>
> Well, we shall see. It may also be the case that the existing
> asynchronous swap API will work fine for some non traditional RAM;
> and it may also be the case that frontswap works fine for some
> non traditional RAM. I agree there is fertile ground for exploration
> here. But let's not allow our speculation on what may or may
> not work in the future halt forward progress of something that works
> today.
>

Let's not allow the urge to merge prevent us from doing the right thing.

>
>
>> Note that even if you do give the page to the guest, you still control
>> how it can access it, through the page tables. So for example you can
>> easily compress a guest's pages without telling it about it; whenever
>> it
>> touches them you decompress them on the fly.
>>
> Yes, at a much larger more invasive cost to the kernel. Frontswap
> and cleancache and tmem are all well-layered for a good reason.
>

No need to change the kernel at all; the hypervisor controls the page
tables.

>> Swap has no timing
>> constraints, it is asynchronous and usually to slow devices.
>>
> What I was referring to is that the existing swap code DOES NOT
> always have the ability to collect N scattered pages before
> initiating an I/O write suitable for a device (such as an SSD)
> that is optimized for writing N pages at a time. That is what
> I meant by a timing constraint. See references to page_cluster
> in the swap code (and this is for contiguous pages, not scattered).
>

I see. Given that swap-to-flash will soon be way more common than
frontswap, it needs to be solved (either in flash or in the swap code).

--
error compiling committee.c: too many arguments to function

2010-04-25 12:17:00

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 06:11 AM, Nitin Gupta wrote:
> On 04/24/2010 11:57 PM, Avi Kivity wrote:
>
>> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>
>>>
>>>> I see. So why not implement this as an ordinary swap device, with a
>>>> higher priority than the disk device? this way we reuse an API and keep
>>>> things asynchronous, instead of introducing a special purpose API.
>>>>
>>>>
>>>>
>>> ramzswap is exactly this: an ordinary swap device which stores every page
>>> in (compressed) memory and its enabled as highest priority swap.
>>> Currently,
>>> it stores these compressed chunks in guest memory itself but it is not
>>> very
>>> difficult to send these chunks out to host/hypervisor using virtio.
>>>
>>> However, it suffers from unnecessary block I/O layer overhead and
>>> requires
>>> weird hooks in swap code, say to get notification when a swap slot is
>>> freed.
>>>
>>>
>> Isn't that TRIM?
>>
> No: trim or discard is not useful. The problem is that we require a callback
> _as soon as_ a page (swap slot) is freed. Otherwise, stale data quickly accumulates
> in memory defeating the whole purpose of in-memory compressed swap devices (like ramzswap).
>

Doesn't flash have similar requirements? The earlier you discard, the
likelier you are to reuse an erase block (or reduce the amount of copying).

> Increasing the frequency of discards is also not an option:
> - Creating discard bio requests themselves need memory and these swap devices
> come into picture only under low memory conditions.
>

That's fine, swap works under low memory conditions by using reserves.

> - We need to regularly scan swap_map to issue these discards. Increasing discard
> frequency also means more frequent scanning (which will still not be fast enough
> for ramzswap needs).
>

How does frontswap do this? Does it maintain its own data structures?

>> Maybe we should optimize these overheads instead. Swap used to always
>> be to slow devices, but swap-to-flash has the potential to make swap act
>> like an extension of RAM.
>>
>>
> Spending lot of effort optimizing an overhead which can be completely avoided
> is probably not worth it.
>

I'm not sure. Swap-to-flash will soon be everywhere. If it's slow,
people will feel it a lot more than ramzswap slowness.

> Also, I think the choice of a synchronous style API for frontswap and cleancache
> is justified as they want to send pages to host *RAM*. If you want to use other
> devices like SSDs, then these should be just added as another swap device as
> we do currently -- these should not be used as frontswap storage directly.
>

Even for copying to RAM an async API is wanted, so you can dma it
instead of copying.

--
error compiling committee.c: too many arguments to function

2010-04-25 13:15:28

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> On 04/25/2010 03:41 AM, Dan Magenheimer wrote:
> >>> No, ANY put_page can fail, and this is a critical part of the API
> >>> that provides all of the flexibility for the hypervisor and all
> >>> the guests. (See previous reply.)
> >>>
> >> The guest isn't required to do any put_page()s. It can issue lots
> of
> >> them when memory is available, and keep them in the hypervisor
> forever.
> >> Failing new put_page()s isn't enough for a dynamic system, you need
> to
> >> be able to force the guest to give up some of its tmem.
> >>
> > Yes, indeed, this is true. That is why it is important for any
> > policy implemented behind frontswap to "bill" the guest if it
> > is attempting to keep frontswap pages in the hypervisor forever
> > and to prod the guest to reclaim them when it no longer needs
> > super-fast emergency swap space. The frontswap patch already
> includes
> > the kernel mechanism to enable this and the prodding can be
> implemented
> > by a guest daemon (of which there already exists an existence proof).
>
> In this case you could use the same mechanism to stop new put_page()s?

You are suggesting the hypervisor communicate dynamically-rapidly-changing
physical memory availability information to a userland daemon in each guest,
and each daemon communicate this information to each respective kernel
to notify the kernel that hypervisor memory is not available?

Seems very convoluted to me, and anyway it doesn't eliminate the need
for a hook placed exactly where the frontswap_put hook is placed.

> Seems frontswap is like a reverse balloon, where the balloon is in
> hypervisor space instead of the guest space.

That's a reasonable analogy. Frontswap serves nicely as an
emergency safety valve when a guest has given up (too) much of
its memory via ballooning but unexpectedly has an urgent need
that can't be serviced quickly enough by the balloon driver.

2010-04-25 13:19:19

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 04:12 PM, Dan Magenheimer wrote:
>>
>> In this case you could use the same mechanism to stop new put_page()s?
>>
> You are suggesting the hypervisor communicate dynamically-rapidly-changing
> physical memory availability information to a userland daemon in each guest,
> and each daemon communicate this information to each respective kernel
> to notify the kernel that hypervisor memory is not available?
>
> Seems very convoluted to me, and anyway it doesn't eliminate the need
> for a hook placed exactly where the frontswap_put hook is placed.
>

Yeah, it's pretty ugly. Balloons typically communicate without a daemon
too.

>> Seems frontswap is like a reverse balloon, where the balloon is in
>> hypervisor space instead of the guest space.
>>
> That's a reasonable analogy. Frontswap serves nicely as an
> emergency safety valve when a guest has given up (too) much of
> its memory via ballooning but unexpectedly has an urgent need
> that can't be serviced quickly enough by the balloon driver.
>

(or ordinary swap)

--
error compiling committee.c: too many arguments to function

2010-04-25 13:40:11

by Dan Magenheimer

Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> My issue is with the API's synchronous nature. Both RAM and more
> exotic
> memories can be used with DMA instead of copying. A synchronous
> interface gives this up.
> :
> Let's not allow the urge to merge prevent us from doing the right
> thing.
> :
> I see. Given that swap-to-flash will soon be way more common than
> frontswap, it needs to be solved (either in flash or in the swap code).

While I admit that I started this whole discussion by implying
that frontswap (and cleancache) might be useful for SSDs, I think
we are going far astray here. Frontswap is synchronous for a
reason: It uses real RAM, but RAM that is not directly addressable
by a (guest) kernel. SSD's (at least today) are still I/O devices;
even though they may be very fast, they still live on a PCI (or
slower) bus and use DMA. Frontswap is not intended for use with
I/O devices.

Today's memory technologies are either RAM that can be addressed
by the kernel, or I/O devices that sit on an I/O bus. The
exotic memories that I am referring to may be a hybrid:
memory that is fast enough to live on a QPI/hypertransport,
but slow enough that you wouldn't want to randomly mix and
hand out to userland apps some pages from "exotic RAM" and some
pages from "normal RAM". Such memory makes no sense today
because OS's wouldn't know what to do with it. But it MAY
make sense with frontswap (and cleancache).

Nevertheless, frontswap works great today with a bare-metal
hypervisor. I think it stands on its own merits, regardless
of one's vision of future SSD/memory technologies.

2010-04-25 14:15:41

by Avi Kivity

Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 04:37 PM, Dan Magenheimer wrote:
>> My issue is with the API's synchronous nature. Both RAM and more
>> exotic
>> memories can be used with DMA instead of copying. A synchronous
>> interface gives this up.
>> :
>> Let's not allow the urge to merge prevent us from doing the right
>> thing.
>> :
>> I see. Given that swap-to-flash will soon be way more common than
>> frontswap, it needs to be solved (either in flash or in the swap code).
>>
> While I admit that I started this whole discussion by implying
> that frontswap (and cleancache) might be useful for SSDs, I think
> we are going far astray here. Frontswap is synchronous for a
> reason: It uses real RAM, but RAM that is not directly addressable
> by a (guest) kernel. SSD's (at least today) are still I/O devices;
> even though they may be very fast, they still live on a PCI (or
> slower) bus and use DMA. Frontswap is not intended for use with
> I/O devices.
>
> Today's memory technologies are either RAM that can be addressed
> by the kernel, or I/O devices that sit on an I/O bus. The
> exotic memories that I am referring to may be a hybrid:
> memory that is fast enough to live on a QPI/hypertransport,
> but slow enough that you wouldn't want to randomly mix and
> hand out to userland apps some pages from "exotic RAM" and some
> pages from "normal RAM". Such memory makes no sense today
> because OS's wouldn't know what to do with it. But it MAY
> make sense with frontswap (and cleancache).
>
> Nevertheless, frontswap works great today with a bare-metal
> hypervisor. I think it stands on its own merits, regardless
> of one's vision of future SSD/memory technologies.
>

Even when frontswapping to RAM on a bare metal hypervisor it makes sense
to use an async API, in case you have a DMA engine on board.

--
error compiling committee.c: too many arguments to function

2010-04-25 15:30:56

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > While I admit that I started this whole discussion by implying
> > that frontswap (and cleancache) might be useful for SSDs, I think
> > we are going far astray here. Frontswap is synchronous for a
> > reason: It uses real RAM, but RAM that is not directly addressable
> > by a (guest) kernel. SSD's (at least today) are still I/O devices;
> > even though they may be very fast, they still live on a PCI (or
> > slower) bus and use DMA. Frontswap is not intended for use with
> > I/O devices.
> >
> > Today's memory technologies are either RAM that can be addressed
> > by the kernel, or I/O devices that sit on an I/O bus. The
> > exotic memories that I am referring to may be a hybrid:
> > memory that is fast enough to live on a QPI/hypertransport,
> > but slow enough that you wouldn't want to randomly mix and
> > hand out to userland apps some pages from "exotic RAM" and some
> > pages from "normal RAM". Such memory makes no sense today
> > because OS's wouldn't know what to do with it. But it MAY
> > make sense with frontswap (and cleancache).
> >
> > Nevertheless, frontswap works great today with a bare-metal
> > hypervisor. I think it stands on its own merits, regardless
> > of one's vision of future SSD/memory technologies.
>
> Even when frontswapping to RAM on a bare metal hypervisor it makes
> sense
> to use an async API, in case you have a DMA engine on board.

When pages are 2MB, this may be true. When pages are 4KB and
copied individually, it may take longer to program a DMA engine
than to just copy 4KB.
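
(Rough numbers, just to illustrate the order of magnitude: a CPU copy of
4KB at a few GB/s takes roughly a microsecond or less, while filling in a
DMA descriptor, ringing the doorbell, and taking and servicing the
completion interrupt costs a comparable amount of time by itself, so
per-page offload only pays off when transfers are large or heavily batched.)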

But in any case, frontswap works fine on all existing machines
today. If/when most commodity CPUs have an asynchronous RAM DMA
engine, an asynchronous API may be appropriate. Or the existing
swap API might be appropriate. Or the synchronous frontswap API
may work fine too. Speculating further about non-existent
hardware that might exist in the (possibly far) future is irrelevant
to the proposed patch, which works today on all existing x86 hardware
and on shipping software.

2010-04-25 16:08:39

by Nitin Gupta

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 05:46 PM, Avi Kivity wrote:
> On 04/25/2010 06:11 AM, Nitin Gupta wrote:
>> On 04/24/2010 11:57 PM, Avi Kivity wrote:
>>
>>> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>>
>>>>
>>>>> I see. So why not implement this as an ordinary swap device, with a
>>>>> higher priority than the disk device? this way we reuse an API and
>>>>> keep
>>>>> things asynchronous, instead of introducing a special purpose API.
>>>>>
>>>>>
>>>>>
>>>> ramzswap is exactly this: an ordinary swap device which stores every
>>>> page
>>>> in (compressed) memory and its enabled as highest priority swap.
>>>> Currently,
>>>> it stores these compressed chunks in guest memory itself but it is not
>>>> very
>>>> difficult to send these chunks out to host/hypervisor using virtio.
>>>>
>>>> However, it suffers from unnecessary block I/O layer overhead and
>>>> requires
>>>> weird hooks in swap code, say to get notification when a swap slot is
>>>> freed.
>>>>
>>>>
>>> Isn't that TRIM?
>>>
>> No: trim or discard is not useful. The problem is that we require a
>> callback
>> _as soon as_ a page (swap slot) is freed. Otherwise, stale data
>> quickly accumulates
>> in memory defeating the whole purpose of in-memory compressed swap
>> devices (like ramzswap).
>>
>
> Doesn't flash have similar requirements? The earlier you discard, the
> likelier you are to reuse an erase block (or reduce the amount of copying).
>

No. We do not want to issue a discard for every page as soon as it is freed.
I'm not a flash expert, but I guess issuing an erase is just too expensive to
do so frequently. OTOH, ramzswap needs a callback for every page, as soon as
it is freed.


>> Increasing the frequency of discards is also not an option:
>> - Creating discard bio requests themselves need memory and these
>> swap devices
>> come into picture only under low memory conditions.
>>
>
> That's fine, swap works under low memory conditions by using reserves.
>

OK, but still, all this bio allocation and block layer overhead seems
unnecessary and is easily avoidable. I think the frontswap code needs
cleanup, but at least it avoids all this bio overhead.

>> - We need to regularly scan swap_map to issue these discards.
>> Increasing discard
>> frequency also means more frequent scanning (which will still not be
>> fast enough
>> for ramzswap needs).
>>
>
> How does frontswap do this? Does it maintain its own data structures?
>

frontswap simply calls frontswap_flush_page() in swap_entry_free() i.e. as
soon as a swap slot is freed. No bio allocation etc.
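
Roughly, the hook sits at the point where the slot's use count drops to
zero; a simplified sketch of the idea (not the exact patch text):

	/* mm/swapfile.c, simplified */
	static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
	{
		unsigned long offset = swp_offset(entry);

		/* ... existing swap_map[offset] bookkeeping ... */

		/* tell the frontswap backend that (type, offset) is now
		 * stale, so it can free the pseudo-RAM immediately */
		frontswap_flush_page(swp_type(entry), offset);
	}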

>>> Maybe we should optimize these overheads instead. Swap used to always
>>> be to slow devices, but swap-to-flash has the potential to make swap act
>>> like an extension of RAM.
>>>
>>>
>> Spending lot of effort optimizing an overhead which can be completely
>> avoided
>> is probably not worth it.
>>
>
> I'm not sure. Swap-to-flash will soon be everywhere. If it's slow,
> people will feel it a lot more than ramzswap slowness.
>

Optimizing swap-to-flash is surely desirable, but that problem is separate
from ramzswap or frontswap optimization. For the latter, I think dealing
with bios and going through the block layer is plain overhead.

>> Also, I think the choice of a synchronous style API for frontswap and
>> cleancache
>> is justified as they want to send pages to host *RAM*. If you want to
>> use other
>> devices like SSDs, then these should be just added as another swap
>> device as
>> we do currently -- these should not be used as frontswap storage
>> directly.
>>
>
> Even for copying to RAM an async API is wanted, so you can dma it
> instead of copying.
>

Maybe incremental development is better? Stabilize and refine the existing
code and gradually move to an async API, if required in the future?

Thanks,
Nitin

2010-04-26 06:02:18

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 06:29 PM, Dan Magenheimer wrote:
>>> While I admit that I started this whole discussion by implying
>>> that frontswap (and cleancache) might be useful for SSDs, I think
>>> we are going far astray here. Frontswap is synchronous for a
>>> reason: It uses real RAM, but RAM that is not directly addressable
>>> by a (guest) kernel. SSD's (at least today) are still I/O devices;
>>> even though they may be very fast, they still live on a PCI (or
>>> slower) bus and use DMA. Frontswap is not intended for use with
>>> I/O devices.
>>>
>>> Today's memory technologies are either RAM that can be addressed
>>> by the kernel, or I/O devices that sit on an I/O bus. The
>>> exotic memories that I am referring to may be a hybrid:
>>> memory that is fast enough to live on a QPI/hypertransport,
>>> but slow enough that you wouldn't want to randomly mix and
>>> hand out to userland apps some pages from "exotic RAM" and some
>>> pages from "normal RAM". Such memory makes no sense today
>>> because OS's wouldn't know what to do with it. But it MAY
>>> make sense with frontswap (and cleancache).
>>>
>>> Nevertheless, frontswap works great today with a bare-metal
>>> hypervisor. I think it stands on its own merits, regardless
>>> of one's vision of future SSD/memory technologies.
>>>
>> Even when frontswapping to RAM on a bare metal hypervisor it makes
>> sense
>> to use an async API, in case you have a DMA engine on board.
>>
> When pages are 2MB, this may be true. When pages are 4KB and
> copied individually, it may take longer to program a DMA engine
> than to just copy 4KB.
>

Of course, you have to use a batching API, like virtio or Xen's rings,
to avoid the overhead.
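
For instance (purely hypothetical layout, not an existing ABI), the guest
could fill a shared ring with small descriptors and kick the host once per
batch rather than once per page:

	/* hypothetical batched frontswap request descriptor */
	struct frontswap_req {
		__u16 op;	/* PUT, GET, FLUSH_PAGE, FLUSH_AREA */
		__u16 type;	/* swap "type" / pool id */
		__u64 offset;	/* page offset within that swap area */
		__u64 gfn;	/* guest frame to copy from/into (or DMA) */
	};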

> But in any case, frontswap works fine on all existing machines
> today. If/when most commodity CPUs have an asynchronous RAM DMA
> engine, an asynchronous API may be appropriate. Or the existing
> swap API might be appropriate. Or the synchronous frontswap API
> may work fine too. Speculating further about non-existent
> hardware that might exist in the (possibly far) future is irrelevant
> to the proposed patch, which works today on all existing x86 hardware
> and on shipping software.
>

dma engines are present on commodity hardware now:

http://en.wikipedia.org/wiki/I/O_Acceleration_Technology

I don't know if consumer machines have them, but servers certainly do.
modprobe ioatdma.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-26 06:06:37

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 07:05 PM, Nitin Gupta wrote:
>
>>> Increasing the frequency of discards is also not an option:
>>> - Creating discard bio requests themselves need memory and these
>>> swap devices
>>> come into picture only under low memory conditions.
>>>
>>>
>> That's fine, swap works under low memory conditions by using reserves.
>>
>>
> Ok, but still all this bio allocation and block layer overhead seems
> unnecessary and is easily avoidable. I think frontswap code needs
> clean up but at least it avoids all this bio overhead.
>

Ok. I agree it is silly to go through the block layer and end up
servicing it within the kernel.

>>> - We need to regularly scan swap_map to issue these discards.
>>> Increasing discard
>>> frequency also means more frequent scanning (which will still not be
>>> fast enough
>>> for ramzswap needs).
>>>
>>>
>> How does frontswap do this? Does it maintain its own data structures?
>>
>>
> frontswap simply calls frontswap_flush_page() in swap_entry_free() i.e. as
> soon as a swap slot is freed. No bio allocation etc.
>

The same code could also issue the discard?

>> Even for copying to RAM an async API is wanted, so you can dma it
>> instead of copying.
>>
>>
> Maybe incremental development is better? Stabilize and refine existing
> code and gradually move to async API, if required in future?
>

Incremental development is fine, especially for ramzswap where the APIs
are all internal. I'm more worried about external interfaces; these
stick around a lot longer, and if not done right they're a pain forever.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-26 12:47:35

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> dma engines are present on commodity hardware now:
>
> http://en.wikipedia.org/wiki/I/O_Acceleration_Technology
>
> I don't know if consumer machines have them, but servers certainly do.
> modprobe ioatdma.

They don't seem to have gained much ground in the FIVE YEARS
since the patch was first posted to Linux, have they?

Maybe it's because memory-to-memory copy using a CPU
is so fast (especially for page-ish quantities of data)
and is a small percentage of CPU utilization these days?

2010-04-26 12:51:46

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > Maybe incremental development is better? Stabilize and refine
> existing
> > code and gradually move to async API, if required in future?
>
> Incremental development is fine, especially for ramzswap where the APIs
> are all internal. I'm more worried about external interfaces, these
> stick around a lot longer and if not done right they're a pain forever.

Well if you are saying that your primary objection to the
frontswap synchronous API is that it is exposed to modules via
some EXPORT_SYMBOLs, we can certainly fix that, at least
unless/until there are other pseudo-RAM devices that can use it.

Would that resolve your concerns?

2010-04-26 13:43:46

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/26/2010 03:50 PM, Dan Magenheimer wrote:
>>> Maybe incremental development is better? Stabilize and refine
>>>
>> existing
>>
>>> code and gradually move to async API, if required in future?
>>>
>> Incremental development is fine, especially for ramzswap where the APIs
>> are all internal. I'm more worried about external interfaces, these
>> stick around a lot longer and if not done right they're a pain forever.
>>
> Well if you are saying that your primary objection to the
> frontswap synchronous API is that it is exposed to modules via
> some EXPORT_SYMBOLs, we can certainly fix that, at least
> unless/until there are other pseudo-RAM devices that can use it.
>
> Would that resolve your concerns?
>

By external interfaces I mean the guest/hypervisor interface.
EXPORT_SYMBOL is an internal interface as far as I'm concerned.

Now, the frontswap interface is also an internal interface, but it's
close to the external one. I'd feel much better if it was asynchronous.

--
error compiling committee.c: too many arguments to function

2010-04-26 13:48:44

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/26/2010 03:45 PM, Dan Magenheimer wrote:
>> dma engines are present on commodity hardware now:
>>
>> http://en.wikipedia.org/wiki/I/O_Acceleration_Technology
>>
>> I don't know if consumer machines have them, but servers certainly do.
>> modprobe ioatdma.
>>
> They don't seem to have gained much ground in the FIVE YEARS
> since the patch was first posted to Linux, have they?
>

Why do you say this? Servers have them and AFAIK networking uses them.
There are other uses of the API in the code, but I don't know how much
of this is for bulk copies.

> Maybe it's because memory-to-memory copy using a CPU
> is so fast (especially for page-ish quantities of data)
> and is a small percentage of CPU utilization these days?
>

Copies take a small percentage of CPU time because a lot of care goes into
avoiding them, or into placing them close to where the copied data is used.
They certainly show up in high-speed networking.

A page-sized copy is small, but many of them will be expensive.

--
error compiling committee.c: too many arguments to function

2010-04-26 13:50:09

by Nitin Gupta

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/26/2010 11:36 AM, Avi Kivity wrote:
> On 04/25/2010 07:05 PM, Nitin Gupta wrote:
>>
>>>> Increasing the frequency of discards is also not an option:
>>>> - Creating discard bio requests themselves need memory and these
>>>> swap devices
>>>> come into picture only under low memory conditions.
>>>>
>>>>
>>> That's fine, swap works under low memory conditions by using reserves.
>>>
>>>
>> Ok, but still all this bio allocation and block layer overhead seems
>> unnecessary and is easily avoidable. I think frontswap code needs
>> clean up but at least it avoids all this bio overhead.
>>
>
> Ok. I agree it is silly to go through the block layer and end up
> servicing it within the kernel.
>
>>>> - We need to regularly scan swap_map to issue these discards.
>>>> Increasing discard
>>>> frequency also means more frequent scanning (which will still not be
>>>> fast enough
>>>> for ramzswap needs).
>>>>
>>>>
>>> How does frontswap do this? Does it maintain its own data structures?
>>>
>>>
>> frontswap simply calls frontswap_flush_page() in swap_entry_free()
>> i.e. as
>> soon as a swap slot is freed. No bio allocation etc.
>>
>
> The same code could also issue the discard?
>


No, we cannot issue a discard bio at this point, since the swap_lock
spinlock is held.
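
To illustrate the calling context (simplified, the real code has more
detail):

	/* mm/swapfile.c, simplified */
	void swap_free(swp_entry_t entry)
	{
		struct swap_info_struct *p = swap_info_get(entry); /* takes swap_lock */

		if (p) {
			/* a synchronous, non-sleeping hook like
			 * frontswap_flush_page() is fine here... */
			swap_entry_free(p, entry);
			/* ...but allocating and submitting a discard bio
			 * (blkdev_issue_discard) may sleep, which is not
			 * allowed while swap_lock is held */
			spin_unlock(&swap_lock);
		}
	}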


Thanks,
Nitin

2010-04-27 00:49:18

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/25/2010 05:11 AM, Avi Kivity wrote:
> No need to change the kernel at all; the hypervisor controls the page
> tables.

Not in Xen PV guests (the hypervisor vets guest updates, but it can't
safely make its own changes to the pagetables). (It's kind of annoying.)

J

2010-04-27 08:31:33

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > Well if you are saying that your primary objection to the
> > frontswap synchronous API is that it is exposed to modules via
> > some EXPORT_SYMBOLs, we can certainly fix that, at least
> > unless/until there are other pseudo-RAM devices that can use it.
> >
> > Would that resolve your concerns?
> >
>
> By external interfaces I mean the guest/hypervisor interface.
> EXPORT_SYMBOL is an internal interface as far as I'm concerned.
>
> Now, the frontswap interface is also an internal interface, but it's
> close to the external one. I'd feel much better if it was
> asynchronous.

OK, so on the one hand, you think that the proposed synchronous
interface for frontswap is insufficiently extensible for other
uses (presumably including KVM). On the other hand, you agree
that using the existing I/O subsystem is unnecessarily heavyweight.
On the third hand, Nitin has answered your questions and spent
a good part of three years finding that extending the existing swap
interface to efficiently support swap-to-pseudo-RAM requires
some kind of in-kernel notification mechanism to which Linus
has already objected.

So you are instead proposing some new guest-to-host asynchronous
notification mechanism that doesn't use the existing bio
mechanism (and so presumably not irqs), imitates or can
utilize a DMA engine, and uses fewer CPU cycles than copying
pages. AND, for long-term maintainability, you'd like to avoid
creating a new guest-host API that does all this, even one that
is as simple and lightweight as the proposed frontswap hooks.

Does that summarize your objection well?

2010-04-27 09:22:33

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/27/2010 11:29 AM, Dan Magenheimer wrote:
>
> OK, so on the one hand, you think that the proposed synchronous
> interface for frontswap is insufficiently extensible for other
> uses (presumably including KVM). On the other hand, you agree
> that using the existing I/O subsystem is unnecessarily heavyweight.
> On the third hand, Nitin has answered your questions and spent
> a good part of three years finding that extending the existing swap
> interface to efficiently support swap-to-pseudo-RAM requires
> some kind of in-kernel notification mechanism to which Linus
> has already objected.
>
> So you are instead proposing some new guest-to-host asynchronous
> notification mechanism that doesn't use the existing bio
> mechanism (and so presumably not irqs),

(any notification mechanism has to use irqs if it exits the guest)

> imitates or can
> utilize a dma engine, and uses less cpu cycles than copying
> pages. AND, for long-term maintainability, you'd like to avoid
> creating a new guest-host API that does all this, even one that
> is as simple and lightweight as the proposed frontswap hooks.
>
> Does that summarize your objection well?
>

No. Adding a new async API that parallels the block layer would be
madness. My first preference would be to completely avoid new APIs. I
think that would work for swap-to-hypervisor but probably not for
compcache. Second preference is the synchronous API, third is a new
async API.

--
error compiling committee.c: too many arguments to function

2010-04-27 11:53:07

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Sun, 25 Apr 2010 06:37:30 PDT, Dan Magenheimer said:

> While I admit that I started this whole discussion by implying
> that frontswap (and cleancache) might be useful for SSDs, I think
> we are going far astray here. Frontswap is synchronous for a
> reason: It uses real RAM, but RAM that is not directly addressable
> by a (guest) kernel.

Are there any production boxes that actually do this currently? I know IBM had
'expanded storage' on the 3090 series 20 years ago, haven't checked if the
Z-series still do that. Was very cool at the time - supported 900+ users with
128M of main memory and 256M of expanded storage, because you got the first
3,000 or so page faults per second for almost free. Oh, and the 3090 had 2
special opcodes for "move page to/from expanded", so it was a very fast but
still synchronous move (for whatever that's worth).



2010-04-27 12:55:18

by Pavel Machek

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Hi!

> > Can we extend it? Adding new APIs is easy, but harder to maintain in
> > the long term.
>
> Umm... I think the difference between a "new" API and extending
> an existing one here is a choice of semantics. As designed, frontswap
> is an extremely simple, only-very-slightly-intrusive set of hooks that
> allows swap pages to, under some conditions, go to pseudo-RAM instead
...
> "Extending" the existing swap API, which has largely been untouched for
> many years, seems like a significantly more complex and error-prone
> undertaking that will affect nearly all Linux users with a likely long
> bug tail. And, by the way, there is no existence proof that it
> will be useful.

> Seems like a no-brainer to me.

Stop right here. Instead of improving the existing swap API, you just
create a new one because it is less work.

We do not want APIs to accumulate; please just fix the existing one.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-27 12:56:44

by Pavel Machek

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Hi!

> > > Nevertheless, frontswap works great today with a bare-metal
> > > hypervisor. I think it stands on its own merits, regardless
> > > of one's vision of future SSD/memory technologies.
> >
> > Even when frontswapping to RAM on a bare metal hypervisor it makes
> > sense
> > to use an async API, in case you have a DMA engine on board.
>
> When pages are 2MB, this may be true. When pages are 4KB and
> copied individually, it may take longer to program a DMA engine
> than to just copy 4KB.
>
> But in any case, frontswap works fine on all existing machines
> today. If/when most commodity CPUs have an asynchronous RAM DMA
> engine, an asynchronous API may be appropriate. Or the existing
> swap API might be appropriate. Or the synchronous frontswap API
> may work fine too. Speculating further about non-existent
> hardware that might exist in the (possibly far) future is irrelevant
> to the proposed patch, which works today on all existing x86 hardware
> and on shipping software.

If we added all the APIs that worked when proposed, we'd have had an
unmaintainable mess by about 1996.

Why can't frontswap just use the existing swap API?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-27 14:32:46

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> Stop right here. Instead of improving existing swap api, you just
> create one because it is less work.
>
> We do not want apis to cummulate; please just fix the existing one.

> If we added all the apis that worked when proposed, we'd have
> unmaintanable mess by about 1996.
>
> Why can't frontswap just use existing swap api?

Hi Pavel!

The existing swap API as it stands is inadequate for an efficient
synchronous interface (e.g. for swapping to RAM). Both Nitin
and I independently have found this to be true. But swap-to-RAM
is very useful in some cases (swap-to-kernel-compressed-RAM
and swap-to-hypervisor-RAM and maybe others) that were not even
conceived many years ago at the time the existing swap API was
designed for swap-to-disk. Swap-to-RAM can relieve memory
pressure faster and more resource-efficiently than swap-to-device,
but it must assume that the RAM available for swap-to-RAM is dynamic
(not fixed in size). (And swap-to-SSD, when the SSD is an
I/O device on an I/O bus, is NOT the same as swap-to-RAM.)

In my opinion, frontswap is NOT a new API, but the simplest
possible extension of the existing swap API to allow for
efficient swap-to-RAM. Avi's comments about a new API
(as he explained later in the thread) refer to a new API
between kernel and hypervisor, what is essentially the
Transcendent Memory interface. Frontswap was separated from
the tmem dependency to enable Nitin's swap-to-kernel-compressed-RAM
and the possibility that there may be other interesting
swap-to-RAM uses.

Does this help?

Dan

2010-04-27 14:46:52

by Nitin Gupta

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/27/2010 06:25 PM, Pavel Machek wrote:
>
>>> Can we extend it? Adding new APIs is easy, but harder to maintain in
>>> the long term.
>>
>> Umm... I think the difference between a "new" API and extending
>> an existing one here is a choice of semantics. As designed, frontswap
>> is an extremely simple, only-very-slightly-intrusive set of hooks that
>> allows swap pages to, under some conditions, go to pseudo-RAM instead
> ...
>> "Extending" the existing swap API, which has largely been untouched for
>> many years, seems like a significantly more complex and error-prone
>> undertaking that will affect nearly all Linux users with a likely long
>> bug tail. And, by the way, there is no existence proof that it
>> will be useful.
>
>> Seems like a no-brainer to me.
>
> Stop right here. Instead of improving existing swap api, you just
> create one because it is less work.
>
> We do not want apis to cummulate; please just fix the existing one.


I'm a bit confused: what do you mean by 'existing swap API'?
Frontswap simply hooks into swap_readpage() and swap_writepage() to
call frontswap_{get,put}_page() respectively. To avoid a hardcoded
implementation of these functions, it introduces struct frontswap_ops,
so that custom implementations of the frontswap get/put/etc. functions
can be provided. This allows easy implementation of swap-to-hypervisor,
in-memory compressed swapping, etc. with a common set of hooks.
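
Concretely, the shape of it is roughly the following (simplified sketch;
the exact names and signatures are in the posted patch):

	/* include/linux/frontswap.h, simplified sketch */
	struct frontswap_ops {
		int  (*init)(unsigned type);	/* prepare the backend for this swap "type" */
		int  (*put_page)(unsigned type, pgoff_t offset, struct page *page);
		int  (*get_page)(unsigned type, pgoff_t offset, struct page *page);
		void (*flush_page)(unsigned type, pgoff_t offset);
		void (*flush_area)(unsigned type);	/* e.g. at swapoff */
	};

	/* mm/page_io.c, simplified: try the backend first, else the normal bio path */
	int swap_writepage(struct page *page, struct writeback_control *wbc)
	{
		if (frontswap_put_page(page)) {
			/* backend accepted the page: no bio, no disk I/O */
			set_page_writeback(page);
			unlock_page(page);
			end_page_writeback(page);
			return 0;
		}
		/* otherwise fall through to the usual get_swap_bio()/submit_bio() path */
		...
	}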

So how can the frontswap approach be seen as introducing a new API?

Thanks,
Nitin





2010-04-30 16:59:47

by Dave Hansen

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Wed, 2010-04-28 at 07:55 +0200, Pavel Machek wrote:
> > > Seems frontswap is like a reverse balloon, where the balloon is in
> > > hypervisor space instead of the guest space.
> >
> > That's a reasonable analogy. Frontswap serves nicely as an
> > emergency safety valve when a guest has given up (too) much of
> > its memory via ballooning but unexpectedly has an urgent need
> > that can't be serviced quickly enough by the balloon driver.
>
> wtf? So lets fix the ballooning driver instead?
>
> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.

Frontswap and things like CMM2[1] have some fundamental advantages over
swapping and ballooning. First of all, there are serious limits on
ballooning. It's difficult for a guest to span a very wide range of
memory sizes without also including memory hotplug in the mix. The ~1%
'struct page' penalty alone causes issues here.

A large portion of CMM2's gain came from the fact that you could take
memory away from guests without _them_ doing any work. If the system is
experiencing a load spike, you increase load even more by making the
guests swap. If you can just take some of their memory away, you can
smooth that spike out. CMM2 and frontswap do that. The guests
explicitly give up page contents that the hypervisor can then discard
without first consulting the guest.

[1] http://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf

-- Dave

2010-04-30 17:10:58

by Dave Hansen

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Fri, 2010-04-30 at 09:43 -0700, Dan Magenheimer wrote:
> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host? That
> eliminates guest swapping to directIO/SRIOV devices doesn't it?

If you have a single swap device, sure. But, I can also see a case
where you have a "fast" swap and "slow" swap.

The part of the frontswap argument that I like is the lack of
sizing exposed to the guest. When you're dealing with swap-only, you
are stuck adding or removing swap devices if you want to "grow/shrink"
the memory footprint. If the host (or whatever is backing the
frontswap) wants to change the sizes, they're fairly free to.

The part that bothers me is that it just pushes the problem
elsewhere. For KVM, we still have to figure out _somewhere_ what to do
with all those pages. It's nice that the host would have the freedom to
either swap or keep them around, but it doesn't really fix the problem.

I do see the lack of sizing exposed to the guest as being a bad thing,
too. Let's say we saved 25% of system RAM to back a frontswap-type
device on a KVM host. The first time a user boots up their set of VMs
and 25% of their RAM is gone, they're going to start complaining,
despite the fact that their 25% smaller systems may end up being faster.

I think I'd be more convinced if we saw this thing actually get used
somehow. How is a ram-backed frontswap better than a /dev/ramX-backed
swap file in practice?

-- Dave

2010-04-30 17:14:10

by Pavel Machek

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Tue 2010-04-27 20:13:39, Nitin Gupta wrote:
> On 04/27/2010 06:25 PM, Pavel Machek wrote:
> >
> >>> Can we extend it? Adding new APIs is easy, but harder to maintain in
> >>> the long term.
> >>
> >> Umm... I think the difference between a "new" API and extending
> >> an existing one here is a choice of semantics. As designed, frontswap
> >> is an extremely simple, only-very-slightly-intrusive set of hooks that
> >> allows swap pages to, under some conditions, go to pseudo-RAM instead
> > ...
> >> "Extending" the existing swap API, which has largely been untouched for
> >> many years, seems like a significantly more complex and error-prone
> >> undertaking that will affect nearly all Linux users with a likely long
> >> bug tail. And, by the way, there is no existence proof that it
> >> will be useful.
> >
> >> Seems like a no-brainer to me.
> >
> > Stop right here. Instead of improving existing swap api, you just
> > create one because it is less work.
> >
> > We do not want apis to cummulate; please just fix the existing one.
>
>
> I'm a bit confused: What do you mean by 'existing swap API'?
> Frontswap simply hooks in swap_readpage() and swap_writepage() to
> call frontswap_{get,put}_page() respectively. Now to avoid a hardcoded
> implementation of these function, it introduces struct frontswap_ops
> so that custom implementations fronswap get/put/etc. functions can be
> provided. This allows easy implementation of swap-to-hypervisor,
> in-memory-compressed-swapping etc. with common set of hooks.

Yes, and that set of hooks is new API, right?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-30 17:48:45

by Pavel Machek

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Hi!

> > Seems frontswap is like a reverse balloon, where the balloon is in
> > hypervisor space instead of the guest space.
>
> That's a reasonable analogy. Frontswap serves nicely as an
> emergency safety valve when a guest has given up (too) much of
> its memory via ballooning but unexpectedly has an urgent need
> that can't be serviced quickly enough by the balloon driver.

wtf? So lets fix the ballooning driver instead?

There's no reason it could not be as fast as frontswap, right?
Actually I'd expect it to be faster -- it can deal with big chunks.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-30 17:51:47

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Hi Pavel --

The whole concept of RAM that _might_ be available to the
kernel and is _not_ directly addressable by the kernel takes
some thinking to wrap your mind around, but I assure you
there are very good use cases for it. RAM owned and managed
by a hypervisor (using controls unknowable to the kernel)
is one example; this is Transcendent Memory. RAM which
has been compressed is another example; Nitin is working
on this using the frontswap approach because of some
issues that arise with ramzswap (see elsewhere on this
thread). There are likely more use cases.

So in that context, let me answer your questions, combined
into a single reply.

> > That's a reasonable analogy. Frontswap serves nicely as an
> > emergency safety valve when a guest has given up (too) much of
> > its memory via ballooning but unexpectedly has an urgent need
> > that can't be serviced quickly enough by the balloon driver.
>
> wtf? So lets fix the ballooning driver instead?
>
> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.

If this was possible by fixing the balloon driver, VMware would
have done it years ago. The problem is that the balloon driver
is acting on very limited information, namely ONLY what THIS
kernel wants; every kernel is selfish and (eventually) uses every
bit of RAM it can get. This is especially true when swapping
is required (under memory pressure).

So, in general, ballooning is NOT faster because a balloon
request to "get" RAM must wait for some other balloon driver
in some other kernel to "give" RAM. OR some other entity
must periodically scan every kernel's memory and guess at which
kernels are using memory inefficiently and steal it away before
a "needy" kernel asks for it.

While this does indeed "work" today in VMware, if you talk to
VMware customers that use it, many are very unhappy with the
anomalous performance problems that occur.

> > The existing swap API as it stands is inadequate for an efficient
> > synchronous interface (e.g. for swapping to RAM). Both Nitin
> > and I independently have found this to be true. But swap-to-RAM
>
> So... how much slower is swapping to RAM over current interface when
> compared to proposed interface, and how much is that slower than just
> using the memory directly?

Simply copying RAM from one page owned by the kernel to another
page owned by the kernel is pretty pointless as far as swapping
is concerned because it does nothing to reduce memory pressure,
so the comparison is a bit irrelevant. But...

In my measurements, the overhead of managing "pseudo-RAM" pages
is in the same ballpark as copying the page. Compression or
deduplication of course has additional costs. See the
performance results at the end of the following two presentations
for some performance information when "pseudo-RAM" is Transcendent
Memory.

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryLinuxConfAu2010.pdf

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf

(the latter will be presented later today)

> > I'm a bit confused: What do you mean by 'existing swap API'?
> > Frontswap simply hooks in swap_readpage() and swap_writepage() to
> > call frontswap_{get,put}_page() respectively. Now to avoid a
> hardcoded
> > implementation of these function, it introduces struct frontswap_ops
> > so that custom implementations fronswap get/put/etc. functions can be
> > provided. This allows easy implementation of swap-to-hypervisor,
> > in-memory-compressed-swapping etc. with common set of hooks.
>
> Yes, and that set of hooks is new API, right?

Well, no, if you define API as "application programming interface"
this is NOT exposed to userland. If you define API as a new
in-kernel function call, yes, these hooks are a new API, but that
is true of virtually any new code in the kernel. If you define
API as some new interface between the kernel and a hypervisor,
yes, this is a new API, but it is "optional" at several levels
so that any hypervisor (e.g. KVM) can completely ignore it.

So please let's not argue about whether the code is a "new API"
or not, but instead consider whether the concept is useful or not
and if useful, if there is or is not a cleaner way to implement it.

Thanks,
Dan

2010-04-30 17:52:36

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 09:16 AM, Avi Kivity wrote:
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed. This way you avoid a pointless vmexit
> when you're out of memory. Since it's disk backed it needs to be
> asynchronous and batched.

I'd argue the opposite. There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality. You end up having two IO paths for each
guest, along with the resulting problems of trying to account for the IO,
rate-limit it, etc. If you can simply say "all guest disk IO happens
via this single interface", it's much easier to manage.

If frontswap has value, it's because it's providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.

It seems to me the great strengths of the synchronous interface are:

* it matches the needs of an existing implementation (tmem in Xen)
* it is simple to understand within the context of the kernel code
it's used in

Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization. One of the problems with CMM2 was that it put a lot of
intricate constraints on the mm code which could easily be broken, and the
breakage would only become apparent in subtle edge cases in a CMM2-using
environment. An additional async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.

The downside is that it may not match some implementation in which the
get/put operations could take a long time (ie, physical IO to a slow
mechanical device). But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.

Do you think that you would be able to use frontswap in kvm if it were
an async interface, but not otherwise? Or are you arguing a hypothetical?

> At this point we're back with the ordinary swap API. Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest code.

Yes, that's comfortably within the "guests page themselves" model.
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward. But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.

That said, it does concern me that the host/hypervisor is left holding
the bag on frontswapped pages. An evil/uncooperative/lazy guest can just pump
a whole lot of pages into the frontswap pool and leave them there. I
guess this is mitigated by the fact that the API is designed such that
they can't update or read the data without also allowing the hypervisor
to drop the page (updates can fail destructively, and reads are also
destructive), so the guest can't use it as a clumsy extension of their
normal dedicated memory.

J

2010-04-30 18:11:17

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 07:43 PM, Dan Magenheimer wrote:
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed. This way you avoid a pointless vmexit
>> when you're out of memory. Since it's disk backed it needs to be
>> asynchronous and batched.
>>
>> At this point we're back with the ordinary swap API. Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest .
>>
> I think you are making a number of possibly false assumptions here:
> 1) The host [the frontswap backend may not even be a hypervisor]
>

True. My remarks only apply to frontswap-to-hypervisor; for internally
consumed frontswap the situation is different.

> 2) can back it with disk storage [not if it is a bare-metal hypervisor]
>

So it seems a bare-metal hypervisor has less access to the bare metal
than a non-bare-metal hypervisor?

Seriously, leave the bare-metal FUD to Simon. People on this list know
that kvm and Xen have exactly the same access to the hardware (well
actually Xen needs to use privileged guests to access some of its hardware).

> 3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
>

There's still an exit. It's much faster than a vmx/svm vmexit but still
nontrivial.

But why are we optimizing for 5 year old hardware?

> 4) when you're out of memory [how can this be determined outside of
> the hypervisor?]
>

It's determined by the hypervisor, same as with tmem. The guest swaps
to a virtual disk, the hypervisor places the data in RAM if it's
available, or on disk if it isn't. Write-back caching in all its glory.
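
For kvm this is mostly just configuration; something like the following
(illustrative, exact syntax depends on the qemu version) already behaves
that way, with the host page cache playing the role of the pseudo-RAM:

	qemu -drive file=guest-swap.img,if=virtio,cache=writeback ...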

> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host? That
> eliminates guest swapping to directIO/SRIOV devices doesn't it?
>

You can have multiple swap devices.

wrt SR-IOV, you'll see synchronous frontswap reduce throughput. SR-IOV
will swap with <1 exit/page and DMA guest pages, while frontswap/tmem
will carry a 1 exit/page hit (even if no swap actually happens) and the
copy cost (if it does).

The API really, really wants to be asynchronous.

> Anyway, I think we can see now why frontswap might not be a good
> match for a hosted hypervisor (KVM), but that doesn't make it
> any less useful for a bare-metal hypervisor (or TBD for in-kernel
> compressed swap and TBD for possible future pseudo-RAM technologies).
>

In-kernel compressed swap does seem to be a good match for a synchronous
API. For future memory devices, or even bare-metal buzzword-compliant
hypervisors, I disagree. An asynchronous API is required for
efficiency, and they'll all have swap capability sooner or later (kvm,
vmware, and I believe xen 4 already do).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-30 17:29:11

by Pavel Machek

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Hi!

> > Stop right here. Instead of improving existing swap api, you just
> > create one because it is less work.
> >
> > We do not want apis to cummulate; please just fix the existing one.
>
> > If we added all the apis that worked when proposed, we'd have
> > unmaintanable mess by about 1996.
> >
> > Why can't frontswap just use existing swap api?
>
> Hi Pavel!
>
> The existing swap API as it stands is inadequate for an efficient
> synchronous interface (e.g. for swapping to RAM). Both Nitin
> and I independently have found this to be true. But swap-to-RAM

So... how much slower is swapping to RAM over the current interface when
compared to the proposed interface, and how much is that slower than just
using the memory directly?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-30 18:17:43

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

(I'll back down on the CMM2 comparisons until I can go
back and read the paper :-)

> >> [frontswap is] really
> >> not very different from a synchronous swap device.
> >>
> > Not to beat a dead horse, but there is a very key difference:
> > The size and availability of frontswap is entirely dynamic;
> > any page-to-be-swapped can be rejected at any time even if
> > a page was previously successfully swapped to the same index.
> > Every other swap device is much more static so the swap code
> > assumes a static device. Existing swap code can account for
> > "bad blocks" on a static device, but this is far from sufficient
> > to handle the dynamicity needed by frontswap.
>
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed. This way you avoid a pointless vmexit
> when you're out of memory. Since it's disk backed it needs to be
> asynchronous and batched.
>
> At this point we're back with the ordinary swap API. Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest .

I think you are making a number of possibly false assumptions here:
1) The host [the frontswap backend may not even be a hypervisor]
2) can back it with disk storage [not if it is a bare-metal hypervisor]
3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
4) when you're out of memory [how can this be determined outside of
the hypervisor?]

And, importantly, "have your host expose a device which is write
cached by host memory"... you are implying that all guest swapping
should be done to a device managed/controlled by the host? That
eliminates guest swapping to directIO/SRIOV devices, doesn't it?

Anyway, I think we can see now why frontswap might not be a good
match for a hosted hypervisor (KVM), but that doesn't make it
any less useful for a bare-metal hypervisor (or TBD for in-kernel
compressed swap and TBD for possible future pseudo-RAM technologies).

Dan

2010-04-30 18:23:51

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/29/2010 09:59 PM, Avi Kivity wrote:
>
> I'm convinced it's useful. The API is so close to a block device
> (read/write with key/value vs read/write with sector/value) that we
> should make the effort not to introduce a new API.
>

Plus, of course, the asynchrony and batching of the block layer. Even
if you don't use a DMA engine, you improve performance by exiting once
per several dozen pages instead of for every page, perhaps enough to
allow the hypervisor to justify copying the memory with non-temporal moves.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-30 18:25:38

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/29/2010 05:42 PM, Dan Magenheimer wrote:
>>
>> Yes, and that set of hooks is new API, right?
>>
> Well, no, if you define API as "application programming interface"
> this is NOT exposed to userland. If you define API as a new
> in-kernel function call, yes, these hooks are a new API, but that
> is true of virtually any new code in the kernel. If you define
> API as some new interface between the kernel and a hypervisor,
> yes, this is a new API, but it is "optional" at several levels
> so that any hypervisor (e.g. KVM) can completely ignore it.
>

The concern is not with the hypervisor, but with Linux. More external
APIs reduce our flexibility to change things.

> So please let's not argue about whether the code is a "new API"
> or not, but instead consider whether the concept is useful or not
> and if useful, if there is or is not a cleaner way to implement it.
>

I'm convinced it's useful. The API is so close to a block device
(read/write with key/value vs read/write with sector/value) that we
should make the effort not to introduce a new API.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-30 18:26:45

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 08:52 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 09:16 AM, Avi Kivity wrote:
>
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed. This way you avoid a pointless vmexit
>> when you're out of memory. Since it's disk backed it needs to be
>> asynchronous and batched.
>>
> I'd argue the opposite. There's no point in having the host do swapping
> on behalf of guests if guests can do it themselves; it's just a
> duplication of functionality.

The problem with relying on the guest to swap is that it's voluntary.
The guest may not be able to do it. When the hypervisor needs memory
and guests don't cooperate, it has to swap.

But I'm not suggesting that the host swap on behalf of the guest.
Rather, the guest swaps to (what it sees as) a device with a large
write-back cache; the host simply manages that cache.

> You end up having two IO paths for each
> guest, and the resulting problems in trying to account for the IO,
> rate-limit it, etc. If you can simply say "all guest disk IO happens
> via this single interface", its much easier to manage.
>

With tmem you have to account for that memory, make sure it's
distributed fairly, claim it back when you need it (requiring guest
cooperation), live migrate and save/restore it. It's a much larger
change than introducing a write-back device for swapping (which has the
benefit of working with unmodified guests).

> If frontswap has value, it's because its providing a new facility to
> guests that doesn't already exist and can't be easily emulated with
> existing interfaces.
>
> It seems to me the great strengths of the synchronous interface are:
>
> * it matches the needs of an existing implementation (tmem in Xen)
> * it is simple to understand within the context of the kernel code
> it's used in
>
> Simplicity is important, because it allows the mm code to be understood
> and maintained without having to have a deep understanding of
> virtualization.

If we use the existing paths, things are even simpler, and we match more
needs (hypervisors with dma engines, the ability to reclaim memory
without guest cooperation).

> One of the problems with CMM2 was that it puts a lot of
> intricate constraints on the mm code which can be easily broken, which
> would only become apparent in subtle edge cases in a CMM2-using
> environment. An addition async frontswap-like interface - while not as
> complex as CMM2 - still makes things harder for mm maintainers.
>

No doubt CMM2 is hard to swallow.

> The downside is that it may not match some implementation in which the
> get/put operations could take a long time (ie, physical IO to a slow
> mechanical device). But a general Linux principle is not to overdesign
> interfaces for hypothetical users, only for real needs.
>

> Do you think that you would be able to use frontswap in kvm if it were
> an async interface, but not otherwise? Or are you arguing a hypothetical?
>

For kvm (or Xen, with some modifications) all of the benefits of
frontswap/tmem can be achieved with the ordinary swap. It would need
trim/discard support to avoid writing back freed data, but that's good
for flash as well.

The advantages are:
- just works
- old guests
- <1 exit/page (since it's batched)
- no extra overhead if no free memory
- can use dma engine (since it's asynchronous)

>> At this point we're back with the ordinary swap API. Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest code.
>>
> Yes, that's comfortably within the "guests page themselves" model.
> Setting up a block device for the domain which is backed by pagecache
> (something we usually try hard to avoid) is pretty straightforward. But
> it doesn't work well for Xen unless the blkback domain is sized so that
> it has all of Xen's free memory in its pagecache.
>

Could be easily achieved with ballooning?

> That said, it does concern me that the host/hypervisor is left holding
> the bag on frontswapped pages. A evil/uncooperative/lazy can just pump
> a whole lot of pages into the frontswap pool and leave them there. I
> guess this is mitigated by the fact that the API is designed such that
> they can't update or read the data without also allowing the hypervisor
> to drop the page (updates can fail destructively, and reads are also
> destructive), so the guest can't use it as a clumsy extension of their
> normal dedicated memory.
>

Eventually you'll have to swap frontswap pages, or kill uncooperative
guests. At which point all of the simplicity is gone.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-30 18:36:57

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 04:45 AM, Dave Hansen wrote:
>
> A large portion of CMM2's gain came from the fact that you could take
> memory away from guests without _them_ doing any work. If the system is
> experiencing a load spike, you increase load even more by making the
> guests swap. If you can just take some of their memory away, you can
> smooth that spike out. CMM2 and frontswap do that. The guests
> explicitly give up page contents that the hypervisor does not have to
> first consult with the guest before discarding.
>

Frontswap does not do this. Once a page has been frontswapped, the host
is committed to retaining it until the guest releases it. It's really
not very different from a synchronous swap device.

I think cleancache allows the hypervisor to drop pages without the
guest's immediate knowledge, but I'm not sure.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-30 18:38:31

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 06:59 PM, Dan Magenheimer wrote:
>>
>>> experiencing a load spike, you increase load even more by making the
>>> guests swap. If you can just take some of their memory away, you can
>>> smooth that spike out. CMM2 and frontswap do that. The guests
>>> explicitly give up page contents that the hypervisor does not have to
>>> first consult with the guest before discarding.
>>>
>> Frontswap does not do this. Once a page has been frontswapped, the
>> host
>> is committed to retaining it until the guest releases it.
>>
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor.

But those are the guest's pages in the first place; that's not a new
commitment. CMM2 gives the hypervisor alternatives to swapping a
page out. Frontswap gives the guest alternatives to swapping a page out.

> The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.
>

They are not directly comparable. In fact for dirty pages CMM2 is
mostly a no-op - the host is forced to swap them out if it wants them.
CMM2 brings value for demand zero or clean pages which can be restored
by the guest without requiring swapin.

I think for dirty pages what CMM2 brings is the ability to discard them
if the host has swapped them out but the guest doesn't need them.

> In fact, Avi, CMM2 is probably a fairly good approximation of what
> the asynchronous interface you are suggesting might look like.
> In other words,

CMM2 is more directly comparable to ballooning than to
frontswap. Frontswap (and cleancache) work with storage that is
external to the guest, and say nothing about the guest's page itself.

> feasible but much much more complex than frontswap.
>

The swap API (e.g. the block layer) itself is an asynchronous batched
version of frontswap. The complexity in CMM2 comes from the fact that
it is communicating information about guest pages to the host, and from
the fact that communication is two-way and asynchronous in both directions.
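
For illustration only, here is a minimal userspace sketch of the two shapes
being compared: a synchronous put that must answer accept/reject before it
returns, versus requests queued and submitted as a batch with completion
reported later through a callback. Every identifier below (sync_put,
async_write, pseudo_ram_full, ...) is invented for the example; none of it
is the actual patch or block-layer API.

/* sketch.c -- illustrative only; every identifier here is invented */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BATCH_MAX 4

/* Synchronous model: the caller learns accept/reject before returning,
 * so the backend must decide on the spot whether it can take the page. */
static bool pseudo_ram_full;        /* stands in for "the backend may say no" */

static bool sync_put(unsigned long offset, const void *page)
{
    (void)offset; (void)page;
    return !pseudo_ram_full;        /* on reject, the caller writes to disk */
}

/* Asynchronous, batched model (ordinary swap/block-layer style): requests
 * are queued, submitted together, and completion arrives via callback. */
struct swap_req {
    unsigned long offset;
    void (*complete)(struct swap_req *req, int err);
};

static struct swap_req *batch[BATCH_MAX];
static size_t batch_len;

static void submit_batch(void)
{
    /* in a real stack these completions would arrive later, asynchronously */
    for (size_t i = 0; i < batch_len; i++)
        batch[i]->complete(batch[i], 0);
    batch_len = 0;
}

static void async_write(struct swap_req *req)
{
    batch[batch_len++] = req;
    if (batch_len == BATCH_MAX)
        submit_batch();
}

static void done(struct swap_req *req, int err)
{
    printf("offset %lu written, err=%d\n", req->offset, err);
}

int main(void)
{
    struct swap_req reqs[BATCH_MAX];

    pseudo_ram_full = true;
    if (!sync_put(0, NULL))
        puts("synchronous put rejected; fall back to the disk path");

    for (unsigned long i = 0; i < BATCH_MAX; i++) {
        reqs[i].offset = i;
        reqs[i].complete = done;
        async_write(&reqs[i]);
    }
    return 0;
}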


>
>> [frontswap is] really
>> not very different from a synchronous swap device.
>>
> Not to beat a dead horse, but there is a very key difference:
> The size and availability of frontswap is entirely dynamic;
> any page-to-be-swapped can be rejected at any time even if
> a page was previously successfully swapped to the same index.
> Every other swap device is much more static so the swap code
> assumes a static device. Existing swap code can account for
> "bad blocks" on a static device, but this is far from sufficient
> to handle the dynamicity needed by frontswap.
>

Given that whenever frontswap fails you need to swap anyway, it is
better for the host to never fail a frontswap request and instead back
it with disk storage if needed. This way you avoid a pointless vmexit
when you're out of memory. Since it's disk backed it needs to be
asynchronous and batched.

At this point we're back with the ordinary swap API. Simply have your
host expose a device which is write cached by host memory, you'll have
all the benefits of frontswap with none of the disadvantages, and with
no changes to guest code.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-30 18:59:55

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/30/2010 11:24 AM, Avi Kivity wrote:
>> I'd argue the opposite. There's no point in having the host do swapping
>> on behalf of guests if guests can do it themselves; it's just a
>> duplication of functionality.
>
> The problem with relying on the guest to swap is that it's voluntary.
> The guest may not be able to do it. When the hypervisor needs memory
> and guests don't cooperate, it has to swap.

Or fail whatever operation it's trying to do. You can only use
overcommit to fake unlimited resources for so long before you need a
government bailout.

>> You end up having two IO paths for each
>> guest, and the resulting problems in trying to account for the IO,
>> rate-limit it, etc. If you can simply say "all guest disk IO happens
>> via this single interface", it's much easier to manage.
>>
>
> With tmem you have to account for that memory, make sure it's
> distributed fairly, claim it back when you need it (requiring guest
> cooperation), live migrate and save/restore it. It's a much larger
> change than introducing a write-back device for swapping (which has
> the benefit of working with unmodified guests).

Well, with caveats. To be useful with migration the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (ssd) swap device. And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap. Sure, these are deployment issues rather than code
ones, but they're still issues.

>> If frontswap has value, it's because its providing a new facility to
>> guests that doesn't already exist and can't be easily emulated with
>> existing interfaces.
>>
>> It seems to me the great strengths of the synchronous interface are:
>>
>> * it matches the needs of an existing implementation (tmem in Xen)
>> * it is simple to understand within the context of the kernel code
>> it's used in
>>
>> Simplicity is important, because it allows the mm code to be understood
>> and maintained without having to have a deep understanding of
>> virtualization.
>
> If we use the existing paths, things are even simpler, and we match
> more needs (hypervisors with dma engines, the ability to reclaim
> memory without guest cooperation).

Well, you still can't reclaim memory; you can write it out to storage.
It may be cheaper per byte, but it's still a resource dedicated to the
guest. But that's just a consequence of allowing overcommit, and to
what extent you're happy to allow it.

What kind of DMA engine do you have in mind? Are there practical
memory->memory DMA engines that would be useful in this context?

>>> At this point we're back with the ordinary swap API. Simply have your
>>> host expose a device which is write cached by host memory, you'll have
>>> all the benefits of frontswap with none of the disadvantages, and with
>>> no changes to guest code.
>>>
>> Yes, that's comfortably within the "guests page themselves" model.
>> Setting up a block device for the domain which is backed by pagecache
>> (something we usually try hard to avoid) is pretty straightforward. But
>> it doesn't work well for Xen unless the blkback domain is sized so that
>> it has all of Xen's free memory in its pagecache.
>>
>
> Could be easily achieved with ballooning?

It could be achieved with ballooning, but it isn't completely trivial.
It wouldn't work terribly well with a driver domain setup, unless all
the swap devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand). The partitioning of the pagecache among the guests would be at
the mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling
around with [fm]advise could get you some control over that).
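
As a concrete example of the knob hinted at in that parenthetical,
posix_fadvise() with POSIX_FADV_DONTNEED asks the kernel to drop clean
cached pages for a file range; whether that gives enough per-domain control
is exactly the open question. A self-contained sketch (the backing-file
argument is hypothetical):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Ask the kernel to drop clean cached pages for the whole file
     * (dirty pages need writeback first, e.g. fdatasync); a management
     * tool could use this to trim one backing file's share of the host
     * pagecache. len == 0 means "to the end of the file". */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    close(fd);
    return 0;
}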

>
>> That said, it does concern me that the host/hypervisor is left holding
>> the bag on frontswapped pages. An evil/uncooperative/lazy guest can just pump
>> a whole lot of pages into the frontswap pool and leave them there. I
>> guess this is mitigated by the fact that the API is designed such that
>> they can't update or read the data without also allowing the hypervisor
>> to drop the page (updates can fail destructively, and reads are also
>> destructive), so the guest can't use it as a clumsy extension of their
>> normal dedicated memory.
>>
>
> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests. At which point all of the simplicity is gone.

Killing guests is pretty simple. Presumably the oom killer will get kvm
processes like anything else?

J

2010-04-30 19:16:58

by Avi Kivity

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On 04/28/2010 08:55 AM, Pavel Machek wrote:
>
>> That's a reasonable analogy. Frontswap serves nicely as an
>> emergency safety valve when a guest has given up (too) much of
>> its memory via ballooning but unexpectedly has an urgent need
>> that can't be serviced quickly enough by the balloon driver.
>>
> wtf? So let's fix the ballooning driver instead?
>

You can't have a negative balloon size. The two models are not equivalent.

Balloon allows you to give up a page for which you have a struct page.
Frontswap (and swap) allows you to gain a page for which you don't have
a struct page, but you can't access it directly. The similarity is that
in both cases the host may want the guest to give up a page, but cannot
force it.
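
To spell that distinction out, a toy userspace model (everything here is
invented for illustration; nothing is the real balloon or frontswap
interface): ballooning hands back a page the guest owns and can address,
while a frontswap-style store only ever copies a page's worth of data into,
and maybe back out of, storage the guest cannot map.

/* toy model only; no real guest/hypervisor interfaces are used here */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define STORE_SLOTS 16

struct page { unsigned char data[PAGE_SIZE]; };  /* guest-addressable memory */

/* Ballooning: the guest gives up a page it owns and can address. */
static void balloon_give_up(struct page *p)
{
    /* a real driver would hand the frame to the hypervisor here */
    free(p);
}

/* Frontswap-style store: the guest never gets a pointer into it; it can
 * only copy a page's worth of data in, and maybe copy it back out later. */
static unsigned char hidden_store[STORE_SLOTS][PAGE_SIZE]; /* opaque to a real guest */

static int store_put(unsigned long offset, const struct page *p)
{
    if (offset >= STORE_SLOTS)
        return 0;                             /* rejected */
    memcpy(hidden_store[offset], p->data, PAGE_SIZE);
    return 1;                                 /* accepted */
}

static int store_get(unsigned long offset, struct page *p)
{
    if (offset >= STORE_SLOTS)
        return 0;
    memcpy(p->data, hidden_store[offset], PAGE_SIZE);
    return 1;
}

int main(void)
{
    struct page *p = calloc(1, sizeof(*p));
    if (!p)
        return 1;

    p->data[0] = 42;
    if (store_put(3, p))                  /* copy out; p itself stays ours */
        puts("page contents stored, but not addressable");

    balloon_give_up(p);                   /* the page itself is gone now */

    struct page back;
    if (store_get(3, &back))
        printf("got data back: %d\n", back.data[0]);
    return 0;
}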

> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.
>

There's no reason for swapping and ballooning to behave differently when
swap backing storage is RAM (they probably do now since swap was tuned
for disks, not flash, but that's a bug if it's true).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-30 19:24:06

by Dave Hansen

[permalink] [raw]
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Fri, 2010-04-30 at 10:13 +0300, Avi Kivity wrote:
> On 04/30/2010 04:45 AM, Dave Hansen wrote:
> >
> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work. If the system is
> > experiencing a load spike, you increase load even more by making the
> > guests swap. If you can just take some of their memory away, you can
> > smooth that spike out. CMM2 and frontswap do that. The guests
> > explicitly give up page contents that the hypervisor does not have to
> > first consult with the guest before discarding.
> >
>
> Frontswap does not do this. Once a page has been frontswapped, the host
> is committed to retaining it until the guest releases it. It's really
> not very different from a synchronous swap device.
>
> I think cleancache allows the hypervisor to drop pages without the
> guest's immediate knowledge, but I'm not sure.

Gah. You're right. I'm reading the two threads and confusing the
concepts. I'm a bit less mystified why the discussion is revolving
around the swap device so much. :)

-- Dave

2010-04-30 19:28:56

by Dave Hansen

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor. The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.

Once pages were dirtied (or I guess just slightly before), they became
volatile, and I don't think the hypervisor could do anything with them.
It could still swap them out like usual, but none of the CMM-specific
optimizations could be performed.

CC'ing Martin since he's the expert. :)

-- Dave

2010-04-30 19:39:27

by Dan Magenheimer

[permalink] [raw]
Subject: RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work. If the system is
> > experiencing a load spike, you increase load even more by making the
> > guests swap. If you can just take some of their memory away, you can
> > smooth that spike out. CMM2 and frontswap do that. The guests
> > explicitly give up page contents that the hypervisor does not have to
> > first consult with the guest before discarding.
>
> Frontswap does not do this. Once a page has been frontswapped, the host
> is committed to retaining it until the guest releases it.

Dave or others can correct me if I am wrong, but I think CMM2 also
handles dirty pages that must be retained by the hypervisor. The
difference between CMM2 (for dirty pages) and frontswap is that
CMM2 sets hints that can be handled asynchronously while frontswap
provides explicit hooks that synchronously succeed/fail.

In fact, Avi, CMM2 is probably a fairly good approximation of what
the asynchronous interface you are suggesting might look like.
In other words, feasible but much much more complex than frontswap.

> [frontswap is] really
> not very different from a synchronous swap device.

Not to beat a dead horse, but there is a very key difference:
The size and availability of frontswap is entirely dynamic;
any page-to-be-swapped can be rejected at any time even if
a page was previously successfully swapped to the same index.
Every other swap device is much more static so the swap code
assumes a static device. Existing swap code can account for
"bad blocks" on a static device, but this is far from sufficient
to handle the dynamicity needed by frontswap.
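
To make "dynamic" concrete, a sketch of the backend side only (all names
invented; this is not the patch code): the amount of pseudo-RAM can shrink
between calls, so a put can be rejected for an offset that was accepted a
moment earlier, and the caller must always be ready to fall back to an
ordinary swap write.

/* backend sketch; all identifiers invented for illustration */
#include <stdbool.h>
#include <stdio.h>

/* how much pseudo-RAM the backend currently has; in real life this can
 * change at any time, e.g. because the host gave the memory to someone else */
static unsigned long pseudo_ram_pages = 2;

static bool backend_put(unsigned long offset, const void *data)
{
    (void)offset; (void)data;
    if (pseudo_ram_pages == 0)
        return false;               /* reject: caller writes the page to disk */
    pseudo_ram_pages--;             /* accept and consume a slot */
    return true;
}

int main(void)
{
    /* the same offset can succeed now and fail later (or vice versa) */
    for (int attempt = 0; attempt < 4; attempt++) {
        bool ok = backend_put(7, NULL);
        printf("put of offset 7, attempt %d: %s\n",
               attempt, ok ? "accepted" : "rejected -> disk");
    }
    return 0;
}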

> I think cleancache allows the hypervisor to drop pages without the
> guest's immediate knowledge, but I'm not sure.

Yes, cleancache can drop pages at any time because (as the
name implies) only clean pages can be put into cleancache.