2009-06-19 23:54:59

by Dan Magenheimer

Subject: [RFC] transcendent memory for Linux

Normal memory is directly addressable by the kernel,
of a known normally-fixed size, synchronously accessible,
and persistent (though not across a reboot).

What if there was a class of memory that is of unknown
and dynamically variable size, is addressable only indirectly
by the kernel, can be configured either as persistent or
as "ephemeral" (meaning it will be around for awhile, but
might disappear without warning), and is still fast enough
to be synchronously accessible?

We call this latter class "transcendent memory" and it
provides an interesting opportunity to more efficiently
utilize RAM in a virtualized environment. However this
"memory but not really memory" may also have applications
in NON-virtualized environments, such as hotplug-memory
deletion, SSDs, and page cache compression. Others have
suggested ideas such as allowing use of highmem memory
without a highmem kernel, or use of spare video memory.

Transcendent memory, or "tmem" for short, provides a
well-defined API to access this unusual class of memory.
The basic operations are page-copy-based and use a flexible
object-oriented addressing mechanism. Tmem assumes
that some "privileged entity" is capable of executing
tmem requests and storing pages of data; this entity
is currently a hypervisor and operations are performed
via hypercalls, but the entity could be a kernel policy,
or perhaps a "memory node" in a cluster of blades connected
by a high-speed interconnect such as HyperTransport or QPI.

Since tmem is not directly accessible and because page
copying is done to/from physical pageframes, it is more
suitable for in-kernel memory needs than for userland applications.
However, there may be yet undiscovered userland possibilities.

With the tmem concept outlined and its broader potential
hinted at, we now give an overview of two existing examples
of how tmem can be used by the kernel. These examples are
implemented in the attached (2.6.30-based) patches.

"Precache" can be thought of as a page-granularity victim
cache for clean pages that the kernel's pageframe replacement
algorithm (PFRA) would like to keep around, but can't since
there isn't enough memory. So when the PFRA "evicts" a page,
it first puts it into the precache via a call to tmem. And
any time a filesystem reads a page from disk, it first attempts
to get the page from precache. If it's there, a disk access
is eliminated. If not, the filesystem just goes to the disk
like normal. Precache is "ephemeral" so whether a page is kept
in precache (between the "put" and the "get") is dependent on
a number of factors that are invisible to the kernel.

"Preswap" IS persistent, but for various reasons may not always
be available for use, again due to factors that may not be
visible to the kernel (but, briefly, if the kernel is being
"good" and has shared its resources nicely, then it will be
able to use preswap, else it will not). Once a page is put,
a get on the page will always succeed. So when the kernel
finds itself in a situation where it needs to swap out a page,
it first attempts to use preswap. If the put works, a disk
write and (usually) a disk read are avoided. If it doesn't,
the page is written to swap as usual. Unlike precache, whether
a page is stored in preswap vs swap is recorded in kernel data
structures, so when a page needs to be fetched, the kernel does
a get if it is in preswap and reads from swap if it is not in
preswap.

Both precache and preswap may optionally be compressed,
trading roughly a 2x reduction in space for roughly 10x slower access.
Precache also has a sharing feature, which allows different nodes
in a "virtual cluster" to share a local page cache.
(In the attached patch, precache is only implemented for
ext3 and shared precache is only implemented for ocfs2.)

Tmem has some similarity to IBM's Collaborative Memory Management,
but creates more of a partnership between the kernel and the
"privileged entity" and is not very invasive. Tmem may be
applicable for KVM and containers; there is some disagreement on
the extent of its value. Tmem is highly complementary to ballooning
(aka page granularity hot plug) and memory deduplication (aka
transparent content-based page sharing) but still has value
when neither are present.

Performance is difficult to quantify: some benchmarks respond
very favorably to increases in memory, and tmem may do quite
well on those, but how much tmem is available may vary widely
and dynamically, depending on conditions completely outside
of the system being measured. I'd appreciate
ideas on how best to provide useful metrics.

Tmem is now supported in Xen's unstable tree and in
Xen's 2.6.18-xen source tree. Again, Xen is not necessarily
a requirement, but currently provides the only existing
implementation of tmem.

Lots more information about tmem can be found at:
http://oss.oracle.com/projects/tmem and there will be
a talk about it on the first day of Linux Symposium
next month. Tmem is the result of a group effort,
including Chris Mason, Dave McCracken, Kurt Hackel
and Zhigang Wang, with helpful input from Jeremy
Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
and Joel Becker.

Patches are as follows (organized for review, not for
sequential application):
tmeminf.patch infrastructure for tmem layer and API
precache.patch precache implementation (layered on tmem)
preswap.patch preswap implementation (layered on tmem)
tmemxen.patch interface code for tmem on top of Xen

Diffstat below, reorganized to show changed vs new files,
and core kernel vs xen. (Also attached in case the
formatting gets messed up.)

Any feedback appreciated!

Thanks,
Dan Magenheimer


Changed core kernel files:
fs/buffer.c | 5 +
fs/ext3/super.c | 2
fs/mpage.c | 8 ++
fs/ocfs2/super.c | 2
fs/super.c | 5 +
include/linux/fs.h | 7 ++
include/linux/swap.h | 57 +++++++++++++++++++++
include/linux/sysctl.h | 1
kernel/sysctl.c | 12 ++++
mm/Kconfig | 27 +++++++++
mm/Makefile | 2
mm/filemap.c | 11 ++++
mm/page_io.c | 12 ++++
mm/swapfile.c | 41 ++++++++++++---
mm/truncate.c | 10 +++
15 files changed, 196 insertions(+), 6 deletions(-)

Newly added core kernel files:
include/linux/tmem.h | 22 +
mm/precache.c | 146 +++++++++++
mm/preswap.c | 274 +++++++++++++++++++++
3 files changed, 442 insertions(+)

Changed xen-specific files:
arch/x86/include/asm/xen/hypercall.h | 8 +++
drivers/xen/Makefile | 1
include/xen/interface/tmem.h | 43 +++++++++++++++++++++
include/xen/interface/xen.h | 22 ++++++++++
4 files changed, 74 insertions(+)

Newly added xen-specific files:
drivers/xen/tmem.c | 106 +++++++++++++++++++++
include/xen/interface/tmem.h | 43 ++++++++
2 files changed, 149 insertions(+)


Attachments:
tmeminf.patch (2.47 kB)
precache.patch (11.92 kB)
preswap.patch (15.91 kB)
tmemxen.patch (6.44 kB)
reorg-diffstat (1.60 kB)

2009-06-20 01:36:23

by Dan Magenheimer

Subject: [RFC PATCH 0/4] transcendent memory ("tmem") for Linux

Apologies for the breach of netiquette with attachments.
Following up with inline patches in separate emails...


2009-06-20 01:36:35

by Dan Magenheimer

Subject: [RFC PATCH 1/4] tmem: infrastructure for tmem layer

--- linux-2.6.30/mm/Kconfig 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/Kconfig 2009-06-19 09:36:41.000000000 -0600
@@ -253,3 +253,30 @@
of 1 says that all excess pages should be trimmed.

See Documentation/nommu-mmap.txt for more information.
+
+#
+# support for transcendent memory
+#
+config TMEM
+ bool "Transcendent memory support"
+ depends on XEN # but in future may work without XEN
+ help
+ In a virtualized environment, allows unused and underutilized
+ system physical memory to be made accessible through a narrow
+ well-defined page-copy-based API. If unsure, say Y.
+
+config PRECACHE
+ bool "Cache clean pages in transcendent memory"
+ depends on TMEM
+ help
+ Allows the transcendent memory pool to be used to store clean
+ page-cache pages which, under some circumstances, will greatly
+ reduce paging and thus improve performance. If unsure, say Y.
+
+config PRESWAP
+ bool "Swap pages to transcendent memory"
+ depends on TMEM
+ help
+ Allows the transcendent memory pool to be used as a pseudo-swap
+ device which, under some circumstances, will greatly reduce
+ swapping and thus improve performance. If unsure, say Y.
--- linux-2.6.30/mm/Makefile 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/Makefile 2009-06-19 09:33:59.000000000 -0600
@@ -16,6 +16,8 @@
obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_PRESWAP) += preswap.o
+obj-$(CONFIG_PRECACHE) += precache.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
--- linux-2.6.30/include/linux/tmem.h 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/include/linux/tmem.h 2009-06-19 11:21:58.000000000 -0600
@@ -0,0 +1,22 @@
+/*
+ * linux/tmem.h
+ *
+ * Interface to transcendent memory, used by mm/precache.c and mm/preswap.c
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+struct tmem_ops {
+ int (*new_pool)(u64 uuid_lo, u64 uuid_hi, u32 flags);
+ int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long gmfn);
+ int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long gmfn);
+ int (*flush_page)(u32 pool_id, u64 object, u32 index);
+ int (*flush_object)(u32 pool_id, u64 object);
+ int (*destroy_pool)(u32 pool_id);
+};
+
+extern struct tmem_ops *tmem_ops;
+
+/* flags for tmem_ops.new_pool */
+#define TMEM_POOL_PERSIST 1
+#define TMEM_POOL_SHARED 2

2009-06-20 01:36:47

by Dan Magenheimer

Subject: [RFC PATCH 2/4] tmem: precache implementation (layered on tmem)

--- linux-2.6.30/fs/super.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/fs/super.c 2009-06-19 09:33:59.000000000 -0600
@@ -39,6 +39,7 @@
#include <linux/mutex.h>
#include <linux/file.h>
#include <linux/async.h>
+#include <linux/precache.h>
#include <asm/uaccess.h>
#include "internal.h"

@@ -110,6 +111,9 @@
s->s_qcop = sb_quotactl_ops;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
+#ifdef CONFIG_PRECACHE
+ s->precache_poolid = -1;
+#endif
}
out:
return s;
@@ -200,6 +204,7 @@
vfs_dq_off(s, 0);
down_write(&s->s_umount);
fs->kill_sb(s);
+ precache_flush_filesystem(s);
put_filesystem(fs);
put_super(s);
}
--- linux-2.6.30/fs/ext3/super.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/fs/ext3/super.c 2009-06-19 09:33:59.000000000 -0600
@@ -37,6 +37,7 @@
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/log2.h>
+#include <linux/precache.h>

#include <asm/uaccess.h>

@@ -1306,6 +1307,7 @@
} else {
printk("internal journal\n");
}
+ precache_init(sb);
return res;
}

--- linux-2.6.30/fs/ocfs2/super.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/fs/ocfs2/super.c 2009-06-19 09:33:59.000000000 -0600
@@ -42,6 +42,7 @@
#include <linux/mount.h>
#include <linux/seq_file.h>
#include <linux/quotaops.h>
+#include <linux/precache.h>

#define MLOG_MASK_PREFIX ML_SUPER
#include <cluster/masklog.h>
@@ -2162,6 +2163,7 @@
mlog_errno(status);
goto bail;
}
+ shared_precache_init(sb, &di->id2.i_super.s_uuid[0]);

bail:
mlog_exit(status);
--- linux-2.6.30/include/linux/fs.h 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/include/linux/fs.h 2009-06-19 09:33:59.000000000 -0600
@@ -1377,6 +1377,13 @@
* storage for asynchronous operations
*/
struct list_head s_async_list;
+
+#ifdef CONFIG_PRECACHE
+ /*
+ * saved pool identifier for precache (-1 means none)
+ */
+ u32 precache_poolid;
+#endif
};

extern struct timespec current_fs_time(struct super_block *sb);
--- linux-2.6.30/fs/buffer.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/fs/buffer.c 2009-06-19 09:33:59.000000000 -0600
@@ -41,6 +41,7 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/precache.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);

@@ -271,6 +272,10 @@

invalidate_bh_lrus();
invalidate_mapping_pages(mapping, 0, -1);
+ /* 99% of the time, we don't need to flush the precache on the bdev.
+ * But, for the strange corners, let's be cautious
+ */
+ precache_flush_inode(mapping);
}

/*
--- linux-2.6.30/fs/mpage.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/fs/mpage.c 2009-06-19 09:33:59.000000000 -0600
@@ -26,6 +26,7 @@
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/precache.h>

/*
* I/O completion handler for multipage BIOs.
@@ -285,6 +286,13 @@
SetPageMappedToDisk(page);
}

+ if (fully_mapped &&
+ blocks_per_page == 1 && !PageUptodate(page) &&
+ precache_get(page->mapping, page->index, page) == 1) {
+ SetPageUptodate(page);
+ goto confused;
+ }
+
/*
* This page will go to BIO. Do we need to send this BIO off first?
*/
--- linux-2.6.30/mm/truncate.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/truncate.c 2009-06-19 09:37:42.000000000 -0600
@@ -18,6 +18,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/buffer_head.h> /* grr. try_to_release_page,
do_invalidatepage */
+#include <linux/precache.h>
#include "internal.h"


@@ -50,6 +51,7 @@
static inline void truncate_partial_page(struct page *page, unsigned partial)
{
zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+ precache_flush(page->mapping, page->index);
if (page_has_private(page))
do_invalidatepage(page, partial);
}
@@ -107,6 +109,10 @@
clear_page_mlock(page);
remove_from_page_cache(page);
ClearPageMappedToDisk(page);
+ /* this must be after the remove_from_page_cache which
+ * calls precache_put
+ */
+ precache_flush(mapping, page->index);
page_cache_release(page); /* pagecache ref */
}

@@ -168,6 +174,7 @@
pgoff_t next;
int i;

+ precache_flush_inode(mapping);
if (mapping->nrpages == 0)
return;

@@ -251,6 +258,7 @@
}
pagevec_release(&pvec);
}
+ precache_flush_inode(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);

@@ -398,6 +406,7 @@
int did_range_unmap = 0;
int wrapped = 0;

+ precache_flush_inode(mapping);
pagevec_init(&pvec, 0);
next = start;
while (next <= end && !wrapped &&
@@ -454,6 +463,7 @@
pagevec_release(&pvec);
cond_resched();
}
+ precache_flush_inode(mapping);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
--- linux-2.6.30/mm/filemap.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/filemap.c 2009-06-19 09:33:59.000000000 -0600
@@ -34,6 +34,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/precache.h>
#include "internal.h"

/*
@@ -116,6 +117,16 @@
{
struct address_space *mapping = page->mapping;

+ /*
+ * if we're uptodate, flush out into the precache, otherwise
+ * invalidate any existing precache entries. We can't leave
+ * stale data around in the precache once our page is gone
+ */
+ if (PageUptodate(page))
+ precache_put(page->mapping, page->index, page);
+ else
+ precache_flush(page->mapping, page->index);
+
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
--- linux-2.6.30/include/linux/precache.h 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/include/linux/precache.h 2009-06-19 09:33:59.000000000 -0600
@@ -0,0 +1,55 @@
+#ifndef _LINUX_PRECACHE_H
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#ifdef CONFIG_PRECACHE
+extern void precache_init(struct super_block *sb);
+extern void shared_precache_init(struct super_block *sb, char *uuid);
+extern int precache_get(struct address_space *mapping, unsigned long index,
+ struct page *empty_page);
+extern int precache_put(struct address_space *mapping, unsigned long index,
+ struct page *page);
+extern int precache_flush(struct address_space *mapping, unsigned long index);
+extern int precache_flush_inode(struct address_space *mapping);
+extern int precache_flush_filesystem(struct super_block *s);
+#else
+static inline void precache_init(struct super_block *sb)
+{
+}
+
+static inline void shared_precache_init(struct super_block *sb, char *uuid)
+{
+}
+
+static inline int precache_get(struct address_space *mapping,
+ unsigned long index, struct page *empty_page)
+{
+ return 0;
+}
+
+static inline int precache_put(struct address_space *mapping,
+ unsigned long index, struct page *page)
+{
+ return 0;
+}
+
+static inline int precache_flush(struct address_space *mapping,
+ unsigned long index)
+{
+ return 0;
+}
+
+static inline int precache_flush_inode(struct address_space *mapping)
+{
+ return 0;
+}
+
+static inline int precache_flush_filesystem(struct super_block *s)
+{
+ return 0;
+}
+#endif
+
+#define _LINUX_PRECACHE_H
+#endif /* _LINUX_PRECACHE_H */
--- linux-2.6.30/mm/precache.c 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/mm/precache.c 2009-06-19 15:03:32.000000000 -0600
@@ -0,0 +1,146 @@
+/*
+ * linux/mm/precache.c
+ *
+ * Implements "precache" for filesystems/pagecache on top of transcendent
+ * memory ("tmem") API. A filesystem creates an "ephemeral tmem pool"
+ * and retains the returned pool_id in its superblock. Clean pages evicted
+ * from pagecache may be "put" into the pool and associated with a "handle"
+ * consisting of the pool_id, an object (inode) id, and an index (page offset).
+ * Note that the page is copied to tmem; no kernel mappings are changed.
+ * If the page is later needed, the filesystem (or VFS) issues a "get", passing
+ * the same handle and an empty pageframe. If successful, the page is copied
+ * into the pageframe and a disk read is avoided. But since the tmem pool
+ * is of indeterminate size, a "put" page has indeterminate longevity
+ * ("ephemeral"), and the "get" may fail, in which case the filesystem must
+ * read the page from disk as before. Note that the filesystem/pagecache are
+ * responsible for maintaining coherency between the pagecache, precache,
+ * and the disk, for which "flush page" and "flush object" actions are
+ * provided. And when a filesystem is unmounted, it must "destroy" the pool.
+ *
+ * Two types of pools may be created for a precache: "private" or "shared".
+ * For a private pool, a successful "get" always flushes, implementing
+ * exclusive semantics; for a "shared" pool (which is intended for use by
+ * co-resident nodes of a cluster filesystem), the "flush" is not guaranteed.
+ * In either case, a failed "duplicate" put (overwrite) always guarantees
+ * the old data is flushed.
+ *
+ * Note also that multiple accesses to a tmem pool may be concurrent and any
+ * ordering must be guaranteed by the caller.
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#include <linux/precache.h>
+#include <linux/module.h>
+#include <linux/tmem.h>
+
+static int precache_auto_allocate; /* set to 1 to auto_allocate */
+
+int precache_put(struct address_space *mapping, unsigned long index,
+ struct page *page)
+{
+ u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+ u64 obj = (unsigned long) mapping->host->i_ino;
+ u32 ind = (u32) index;
+ unsigned long pfn = page_to_pfn(page);
+ int ret;
+
+ if ((s32)tmem_pool < 0) {
+ if (!precache_auto_allocate)
+ return 0;
+ /* a put on a non-existent precache may auto-allocate one */
+ if (tmem_ops == NULL)
+ return 0;
+ ret = (*tmem_ops->new_pool)(0, 0, 0);
+ if (ret < 0)
+ return 0;
+ printk(KERN_INFO
+ "Mapping superblock for s_id=%s to precache_id=%d\n",
+ mapping->host->i_sb->s_id, tmem_pool);
+ mapping->host->i_sb->precache_poolid = tmem_pool;
+ }
+ if (ind != index)
+ return 0;
+ mb(); /* ensure page is quiescent; tmem may address it with an alias */
+ return (*tmem_ops->put_page)(tmem_pool, obj, ind, pfn);
+}
+
+int precache_get(struct address_space *mapping, unsigned long index,
+ struct page *empty_page)
+{
+ u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+ u64 obj = (unsigned long) mapping->host->i_ino;
+ u32 ind = (u32) index;
+ unsigned long pfn = page_to_pfn(empty_page);
+
+ if ((s32)tmem_pool < 0)
+ return 0;
+ if (ind != index)
+ return 0;
+
+ return (tmem_ops->get_page)(tmem_pool, obj, ind, pfn);
+}
+EXPORT_SYMBOL(precache_get);
+
+int precache_flush(struct address_space *mapping, unsigned long index)
+{
+ u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+ u64 obj = (unsigned long) mapping->host->i_ino;
+ u32 ind = (u32) index;
+
+ if ((s32)tmem_pool < 0)
+ return 0;
+ if (ind != index)
+ return 0;
+
+ return (*tmem_ops->flush_page)(tmem_pool, obj, ind);
+}
+EXPORT_SYMBOL(precache_flush);
+
+int precache_flush_inode(struct address_space *mapping)
+{
+ u32 tmem_pool = mapping->host->i_sb->precache_poolid;
+ u64 obj = (unsigned long) mapping->host->i_ino;
+
+ if ((s32)tmem_pool < 0)
+ return 0;
+
+ return (*tmem_ops->flush_object)(tmem_pool, obj);
+}
+EXPORT_SYMBOL(precache_flush_inode);
+
+int precache_flush_filesystem(struct super_block *sb)
+{
+ u32 tmem_pool = sb->precache_poolid;
+ int ret;
+
+ if ((s32)tmem_pool < 0)
+ return 0;
+ ret = (*tmem_ops->destroy_pool)(tmem_pool);
+ if (!ret)
+ return 0;
+ printk(KERN_INFO
+ "Unmapping superblock for s_id=%s from precache_id=%d\n",
+ sb->s_id, ret);
+ sb->precache_poolid = 0;
+ return 1;
+}
+EXPORT_SYMBOL(precache_flush_filesystem);
+
+void precache_init(struct super_block *sb)
+{
+ if (tmem_ops != NULL)
+ sb->precache_poolid = (*tmem_ops->new_pool)(0, 0, 0);
+}
+EXPORT_SYMBOL(precache_init);
+
+void shared_precache_init(struct super_block *sb, char *uuid)
+{
+ u64 uuid_lo = *(u64 *)uuid;
+ u64 uuid_hi = *(u64 *)(&uuid[8]);
+
+ if (tmem_ops != NULL)
+ sb->precache_poolid =(*tmem_ops->new_pool)(uuid_lo, uuid_hi,
+ TMEM_POOL_SHARED);
+}
+EXPORT_SYMBOL(shared_precache_init);

2009-06-20 01:37:05

by Dan Magenheimer

Subject: [RFC PATCH 3/4] tmem: preswap implementation (layered on tmem)

--- linux-2.6.30/mm/page_io.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/page_io.c 2009-06-19 09:33:59.000000000 -0600
@@ -102,6 +102,12 @@
unlock_page(page);
goto out;
}
+ if (preswap_put(page) == 1) {
+ set_page_writeback(page);
+ unlock_page(page);
+ end_page_writeback(page);
+ goto out;
+ }
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -134,6 +140,12 @@
ret = -ENOMEM;
goto out;
}
+ if (preswap_get(page) == 1) {
+ SetPageUptodate(page);
+ unlock_page(page);
+ bio_put(bio);
+ goto out;
+ }
count_vm_event(PSWPIN);
submit_bio(READ, bio);
out:
--- linux-2.6.30/mm/swapfile.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/mm/swapfile.c 2009-06-19 16:20:14.000000000 -0600
@@ -35,7 +35,7 @@
#include <linux/swapops.h>
#include <linux/page_cgroup.h>

-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
long nr_swap_pages;
long total_swap_pages;
@@ -47,7 +47,7 @@
static const char Bad_offset[] = "Bad swap offset entry ";
static const char Unused_offset[] = "Unused swap offset entry ";

-static struct swap_list_t swap_list = {-1, -1};
+struct swap_list_t swap_list = {-1, -1};

static struct swap_info_struct swap_info[MAX_SWAPFILES];

@@ -488,6 +488,7 @@
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ preswap_flush(p - swap_info, offset);
mem_cgroup_uncharge_swap(ent);
}
}
@@ -864,7 +865,7 @@
* Recycle to start on reaching the end, returning 0 when empty.
*/
static unsigned int find_next_to_unuse(struct swap_info_struct *si,
- unsigned int prev)
+ unsigned int prev, unsigned int preswap)
{
unsigned int max = si->max;
unsigned int i = prev;
@@ -890,6 +891,12 @@
prev = 0;
i = 1;
}
+ if (preswap) {
+ if (preswap_test(si, i))
+ break;
+ else
+ continue;
+ }
count = si->swap_map[i];
if (count && count != SWAP_MAP_BAD)
break;
@@ -901,8 +908,12 @@
* We completely avoid races by reading each swap page in advance,
* and then search for the process using it. All the necessary
* page table adjustments can then be made atomically.
+ *
+ * if the boolean preswap is true, only unuse pages_to_unuse pages;
+ * pages_to_unuse==0 means all pages
*/
-static int try_to_unuse(unsigned int type)
+int try_to_unuse(unsigned int type, unsigned int preswap,
+ unsigned long pages_to_unuse)
{
struct swap_info_struct * si = &swap_info[type];
struct mm_struct *start_mm;
@@ -938,7 +949,7 @@
* one pass through swap_map is enough, but not necessarily:
* there are races when an instance of an entry might be missed.
*/
- while ((i = find_next_to_unuse(si, i)) != 0) {
+ while ((i = find_next_to_unuse(si, i, preswap)) != 0) {
if (signal_pending(current)) {
retval = -EINTR;
break;
@@ -1124,6 +1135,8 @@
* interactive performance.
*/
cond_resched();
+ if (preswap && pages_to_unuse && !--pages_to_unuse)
+ break;
}

mmput(start_mm);
@@ -1448,7 +1461,7 @@
spin_unlock(&swap_lock);

current->flags |= PF_SWAPOFF;
- err = try_to_unuse(type);
+ err = try_to_unuse(type, 0, 0);
current->flags &= ~PF_SWAPOFF;

if (err) {
@@ -1497,9 +1510,14 @@
swap_map = p->swap_map;
p->swap_map = NULL;
p->flags = 0;
+ preswap_flush_area(p - swap_info);
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+#ifdef CONFIG_PRESWAP
+ if (p->preswap_map)
+ vfree(p->preswap_map);
+#endif
/* Destroy swap account informatin */
swap_cgroup_swapoff(type);

@@ -1812,6 +1830,11 @@
}

memset(swap_map, 0, maxpages * sizeof(short));
+#ifdef CONFIG_PRESWAP
+ p->preswap_map = vmalloc(maxpages / sizeof(long));
+ if (p->preswap_map)
+ memset(p->preswap_map, 0, maxpages / sizeof(long));
+#endif
for (i = 0; i < swap_header->info.nr_badpages; i++) {
int page_nr = swap_header->info.badpages[i];
if (page_nr <= 0 || page_nr >= swap_header->info.last_page) {
@@ -1886,6 +1909,7 @@
} else {
swap_info[prev].next = p - swap_info;
}
+ preswap_init(p - swap_info);
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
error = 0;
@@ -2002,6 +2026,8 @@

si = &swap_info[swp_type(entry)];
target = swp_offset(entry);
+ if (preswap_test(si, target))
+ return 0;
base = (target >> our_page_cluster) << our_page_cluster;
end = base + (1 << our_page_cluster);
if (!base) /* first page is swap header */
@@ -2018,6 +2044,9 @@
break;
if (si->swap_map[toff] == SWAP_MAP_BAD)
break;
+ /* Don't read in preswap pages */
+ if (preswap_test(si, toff))
+ break;
}
/* Count contiguous allocated slots below our target */
for (toff = target; --toff >= base; nr_pages++) {
--- linux-2.6.30/include/linux/swap.h 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/include/linux/swap.h 2009-06-19 12:51:55.000000000 -0600
@@ -8,6 +8,7 @@
#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/node.h>
+#include <linux/vmalloc.h>

#include <asm/atomic.h>
#include <asm/page.h>
@@ -154,8 +155,62 @@
unsigned int max;
unsigned int inuse_pages;
unsigned int old_block_size;
+#ifdef CONFIG_PRESWAP
+ unsigned long *preswap_map;
+ unsigned int preswap_pages;
+#endif
};

+#ifdef CONFIG_PRESWAP
+
+#include <linux/sysctl.h>
+extern int preswap_sysctl_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+extern const unsigned long preswap_zero, preswap_infinity;
+
+extern void preswap_shrink(unsigned long);
+extern int preswap_test(struct swap_info_struct *, unsigned long);
+extern void preswap_init(unsigned);
+extern int preswap_put(struct page *);
+extern int preswap_get(struct page *);
+extern void preswap_flush(unsigned, unsigned long);
+extern void preswap_flush_area(unsigned);
+/* in swapfile.c */
+extern int try_to_unuse(unsigned int, unsigned int, unsigned long);
+#else
+static inline void preswap_shrink(unsigned long target_pages)
+{
+}
+
+static inline int preswap_test(struct swap_info_struct *sis,
+ unsigned long offset)
+{
+ return 0;
+}
+
+static inline void preswap_init(unsigned type)
+{
+}
+
+static inline int preswap_put(struct page *page)
+{
+ return 0;
+}
+
+static inline int preswap_get(struct page *page)
+{
+ return 0;
+}
+
+static inline void preswap_flush(unsigned type, unsigned long offset)
+{
+}
+
+static inline void preswap_flush_area(unsigned type)
+{
+}
+#endif /* CONFIG_PRESWAP */
+
struct swap_list_t {
int head; /* head of priority-ordered swapfile list */
int next; /* swapfile to be used next */
@@ -312,6 +367,8 @@
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;
+extern struct swap_list_t swap_list;
+extern spinlock_t swap_lock;

/* linux/mm/thrash.c */
extern struct mm_struct * swap_token_mm;
--- linux-2.6.30/mm/preswap.c 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/mm/preswap.c 2009-06-19 14:55:16.000000000 -0600
@@ -0,0 +1,274 @@
+/*
+ * linux/mm/preswap.c
+ *
+ * Implements a fast "preswap" on top of the transcendent memory ("tmem") API.
+ * When a swapdisk is enabled (with swapon), a "private persistent tmem pool"
+ * is created along with a bit-per-page preswap_map. When swapping occurs
+ * and a page is about to be written to disk, a "put" into the pool may first
+ * be attempted by passing the pageframe to be swapped, along with a "handle"
+ * consisting of a pool_id, an object id, and an index. Since the pool is of
+ * indeterminate size, the "put" may be rejected, in which case the page
+ * is swapped to disk as normal. If the "put" is successful, the page is
+ * copied to tmem and the preswap_map records the success. Later, when
+ * the page needs to be swapped in, the preswap_map is checked and, if set,
+ * the page may be obtained with a "get" operation. Note that the swap
+ * subsystem is responsible for: maintaining coherency between the swapcache,
+ * preswap, and the swapdisk; for evicting stale pages from preswap; and for
+ * emptying preswap when swapoff is performed. The "flush page" and "flush
+ * object" actions are provided for this.
+ *
+ * Note that if a "duplicate put" is performed to overwrite a page and
+ * the "put" operation fails, the page (and old data) is flushed and lost.
+ * Also note that multiple accesses to a tmem pool may be concurrent and
+ * any ordering must be guaranteed by the caller.
+ *
+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sysctl.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/proc_fs.h>
+#include <linux/security.h>
+#include <linux/capability.h>
+#include <linux/uaccess.h>
+#include <linux/tmem.h>
+
+static u32 preswap_poolid = -1; /* if negative, preswap will never call tmem */
+
+const unsigned long preswap_zero = 0, preswap_infinity = ~0UL; /* for sysctl */
+
+/*
+ * Swizzling increases objects per swaptype, increasing tmem concurrency
+ * for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS
+ */
+#define SWIZ_BITS 4
+#define SWIZ_MASK ((1 << SWIZ_BITS) - 1)
+#define oswiz(_type, _ind) ((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
+#define iswiz(_ind) (_ind >> SWIZ_BITS)
+
+/*
+ * preswap_map test/set/clear operations (must be atomic)
+ */
+
+int preswap_test(struct swap_info_struct *sis, unsigned long offset)
+{
+ if (!sis->preswap_map)
+ return 0;
+ return test_bit(offset % BITS_PER_LONG,
+ &sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+static inline void preswap_set(struct swap_info_struct *sis,
+ unsigned long offset)
+{
+ if (!sis->preswap_map)
+ return;
+ set_bit(offset % BITS_PER_LONG,
+ &sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+static inline void preswap_clear(struct swap_info_struct *sis,
+ unsigned long offset)
+{
+ if (!sis->preswap_map)
+ return;
+ clear_bit(offset % BITS_PER_LONG,
+ &sis->preswap_map[offset/BITS_PER_LONG]);
+}
+
+/*
+ * preswap tmem operations
+ */
+
+/* returns 1 if the page was successfully put into preswap, 0 if the page
+ * was declined, and -ERRNO for a specific error */
+int preswap_put(struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page), };
+ unsigned type = swp_type(entry);
+ pgoff_t offset = swp_offset(entry);
+ u64 ind64 = (u64)offset;
+ u32 ind = (u32)offset;
+ unsigned long pfn = page_to_pfn(page);
+ struct swap_info_struct *sis = get_swap_info_struct(type);
+ int dup = 0, ret;
+
+ if ((s32)preswap_poolid < 0)
+ return 0;
+ if (ind64 != ind)
+ return 0;
+ if (preswap_test(sis, offset))
+ dup = 1;
+ mb(); /* ensure page is quiescent; tmem may address it with an alias */
+ ret = (*tmem_ops->put_page)(preswap_poolid, oswiz(type, ind),
+ iswiz(ind), pfn);
+ if (ret == 1) {
+ preswap_set(sis, offset);
+ if (!dup)
+ sis->preswap_pages++;
+ } else if (dup) {
+ /* failed dup put always results in an automatic flush of
+ * the (older) page from preswap */
+ preswap_clear(sis, offset);
+ sis->preswap_pages--;
+ }
+ return ret;
+}
+
+/* returns 1 if the page was successfully gotten from preswap, 0 if the page
+ * was not present (should never happen!), and -ERRNO for a specific error */
+int preswap_get(struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page), };
+ unsigned type = swp_type(entry);
+ pgoff_t offset = swp_offset(entry);
+ u64 ind64 = (u64)offset;
+ u32 ind = (u32)offset;
+ unsigned long pfn = page_to_pfn(page);
+ struct swap_info_struct *sis = get_swap_info_struct(type);
+ int ret;
+
+ if ((s32)preswap_poolid < 0)
+ return 0;
+ if (ind64 != ind)
+ return 0;
+ if (!preswap_test(sis, offset))
+ return 0;
+ ret = (*tmem_ops->get_page)(preswap_poolid, oswiz(type, ind),
+ iswiz(ind), pfn);
+ return ret;
+}
+
+/* flush a single page from preswap */
+void preswap_flush(unsigned type, unsigned long offset)
+{
+ u64 ind64 = (u64)offset;
+ u32 ind = (u32)offset;
+ struct swap_info_struct *sis = get_swap_info_struct(type);
+ int ret = 1;
+
+ if ((s32)preswap_poolid < 0)
+ return;
+ if (ind64 != ind)
+ return;
+ if (preswap_test(sis, offset)) {
+ ret = (*tmem_ops->flush_page)(preswap_poolid,
+ oswiz(type, ind), iswiz(ind));
+ sis->preswap_pages--;
+ preswap_clear(sis, offset);
+ }
+}
+
+/* flush all pages from the passed swaptype */
+void preswap_flush_area(unsigned type)
+{
+ struct swap_info_struct *sis = get_swap_info_struct(type);
+ int ind;
+
+ if ((s32)preswap_poolid < 0)
+ return;
+ for (ind = SWIZ_MASK; ind >= 0; ind--)
+ (void)(*tmem_ops->flush_object)(preswap_poolid,
+ oswiz(type, ind));
+ sis->preswap_pages = 0;
+}
+
+void preswap_init(unsigned type)
+{
+ /* only need one tmem pool for all swap types */
+ if ((s32)preswap_poolid >= 0)
+ return;
+ if (tmem_ops == NULL)
+ return;
+ preswap_poolid = (*tmem_ops->new_pool)(0, 0, TMEM_POOL_PERSIST);
+}
+
+/*
+ * preswap infrastructure functions
+ */
+
+/* code structure leveraged from sys_swapoff */
+void preswap_shrink(unsigned long target_pages)
+{
+ struct swap_info_struct *si = NULL;
+ unsigned long total_pages = 0, total_pages_to_unuse;
+ unsigned long pages = 0, unuse_pages = 0;
+ int type;
+ int wrapped = 0;
+
+ do {
+ /*
+ * we don't want to hold swap_lock while doing a very
+ * lengthy try_to_unuse, but swap_list may change
+ * so restart scan from swap_list.head each time
+ */
+ spin_lock(&swap_lock);
+ total_pages = 0;
+ for (type = swap_list.head; type >= 0; type = si->next) {
+ si = get_swap_info_struct(type);
+ total_pages += si->preswap_pages;
+ }
+ if (total_pages <= target_pages) {
+ spin_unlock(&swap_lock);
+ return;
+ }
+ total_pages_to_unuse = total_pages - target_pages;
+ for (type = swap_list.head; type >= 0; type = si->next) {
+ si = get_swap_info_struct(type);
+ if (total_pages_to_unuse < si->preswap_pages)
+ pages = unuse_pages = total_pages_to_unuse;
+ else {
+ pages = si->preswap_pages;
+ unuse_pages = 0; /* unuse all */
+ }
+ if (security_vm_enough_memory(pages))
+ continue;
+ vm_unacct_memory(pages);
+ break;
+ }
+ spin_unlock(&swap_lock);
+ if (type < 0)
+ return;
+ current->flags |= PF_SWAPOFF;
+ (void)try_to_unuse(type, 1, unuse_pages);
+ current->flags &= ~PF_SWAPOFF;
+ wrapped++;
+ } while (wrapped <= 3);
+}
+
+
+#ifdef CONFIG_SYSCTL
+/* cat /proc/sys/vm/preswap provides total number of pages in preswap
+ * across all swaptypes. echo N > /proc/sys/vm/preswap attempts to shrink
+ * preswap page usage to N (usually 0) */
+int preswap_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ unsigned long npages;
+ int type;
+ unsigned long totalpages = 0;
+ struct swap_info_struct *si = NULL;
+
+ /* modeled after hugetlb_sysctl_handler in mm/hugetlb.c */
+ if (!write) {
+ spin_lock(&swap_lock);
+ for (type = swap_list.head; type >= 0; type = si->next) {
+ si = get_swap_info_struct(type);
+ totalpages += si->preswap_pages;
+ }
+ spin_unlock(&swap_lock);
+ npages = totalpages;
+ }
+ table->data = &npages;
+ table->maxlen = sizeof(unsigned long);
+ proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+ if (write)
+ preswap_shrink(npages);
+
+ return 0;
+}
+#endif
--- linux-2.6.30/include/linux/sysctl.h 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/include/linux/sysctl.h 2009-06-19 09:33:59.000000000 -0600
@@ -205,6 +205,7 @@
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_PRESWAP_PAGES=36, /* pages/target_pages in preswap */
};


--- linux-2.6.30/kernel/sysctl.c 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/kernel/sysctl.c 2009-06-19 09:33:59.000000000 -0600
@@ -1282,6 +1282,18 @@
.proc_handler = &scan_unevictable_handler,
},
#endif
+#ifdef CONFIG_PRESWAP
+ {
+ .ctl_name = VM_PRESWAP_PAGES,
+ .procname = "preswap",
+ .data = NULL,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &preswap_sysctl_handler,
+ .extra1 = (void *)&preswap_zero,
+ .extra2 = (void *)&preswap_infinity,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt

2009-06-20 01:37:29

by Dan Magenheimer

[permalink] [raw]
Subject: [RFC PATCH 4/4] tmem: interface code for tmem on top of xen

--- linux-2.6.30/arch/x86/include/asm/xen/hypercall.h 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/arch/x86/include/asm/xen/hypercall.h 2009-06-19 13:49:04.000000000 -0600
@@ -45,6 +45,7 @@
#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
#include <xen/interface/physdev.h>
+#include <xen/interface/tmem.h>

/*
* The hypercall asms have to meet several constraints:
@@ -417,6 +418,13 @@
return _hypercall2(int, nmi_op, op, arg);
}

+static inline int
+HYPERVISOR_tmem_op(
+ struct tmem_op *op)
+{
+ return _hypercall1(int, tmem_op, op);
+}
+
static inline void
MULTI_fpu_taskswitch(struct multicall_entry *mcl, int set)
{
--- linux-2.6.30/drivers/xen/Makefile 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/drivers/xen/Makefile 2009-06-19 09:33:59.000000000 -0600
@@ -3,5 +3,6 @@

obj-$(CONFIG_HOTPLUG_CPU) += cpu_hotplug.o
obj-$(CONFIG_XEN_XENCOMM) += xencomm.o
+obj-$(CONFIG_TMEM) += tmem.o
obj-$(CONFIG_XEN_BALLOON) += balloon.o
obj-$(CONFIG_XENFS) += xenfs/
\ No newline at end of file
--- linux-2.6.30/include/xen/interface/tmem.h 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/include/xen/interface/tmem.h 2009-06-19 11:21:24.000000000 -0600
@@ -0,0 +1,43 @@
+/*
+ * include/xen/interface/tmem.h
+ *
+ * Interface to Xen implementation of transcendent memory
+ *
+ * Copyright (C) 2009 Dan Magenheimer, Oracle Corp.
+ */
+
+#include <xen/interface/xen.h>
+
+#define TMEM_CONTROL 0
+#define TMEM_NEW_POOL 1
+#define TMEM_DESTROY_POOL 2
+#define TMEM_NEW_PAGE 3
+#define TMEM_PUT_PAGE 4
+#define TMEM_GET_PAGE 5
+#define TMEM_FLUSH_PAGE 6
+#define TMEM_FLUSH_OBJECT 7
+#define TMEM_READ 8
+#define TMEM_WRITE 9
+#define TMEM_XCHG 10
+
+/* Subops for HYPERVISOR_tmem_op(TMEM_CONTROL) */
+#define TMEMC_THAW 0
+#define TMEMC_FREEZE 1
+#define TMEMC_FLUSH 2
+#define TMEMC_DESTROY 3
+#define TMEMC_LIST 4
+#define TMEMC_SET_WEIGHT 5
+#define TMEMC_SET_CAP 6
+#define TMEMC_SET_COMPRESS 7
+
+/* Bits for HYPERVISOR_tmem_op(TMEM_NEW_POOL) */
+#define TMEM_POOL_PERSIST 1
+#define TMEM_POOL_SHARED 2
+#define TMEM_POOL_PAGESIZE_SHIFT 4
+#define TMEM_POOL_PAGESIZE_MASK 0xf
+#define TMEM_POOL_VERSION_SHIFT 24
+#define TMEM_POOL_VERSION_MASK 0xff
+
+/* Special errno values */
+#define EFROZEN 1000
+#define EEMPTY 1001
--- linux-2.6.30/include/xen/interface/xen.h 2009-06-09 21:05:27.000000000 -0600
+++ linux-2.6.30-tmem/include/xen/interface/xen.h 2009-06-19 14:39:15.000000000 -0600
@@ -58,6 +58,7 @@
#define __HYPERVISOR_event_channel_op 32
#define __HYPERVISOR_physdev_op 33
#define __HYPERVISOR_hvm_op 34
+#define __HYPERVISOR_tmem_op 38

/* Architecture-specific hypercall definitions. */
#define __HYPERVISOR_arch_0 48
@@ -461,6 +462,27 @@
#define __mk_unsigned_long(x) x ## UL
#define mk_unsigned_long(x) __mk_unsigned_long(x)

+struct tmem_op {
+ uint32_t cmd;
+ int32_t pool_id; /* private > 0; shared < 0; 0 is invalid */
+ union {
+ struct { /* for cmd == TMEM_NEW_POOL */
+ uint64_t uuid[2];
+ uint32_t flags;
+ } new;
+ struct {
+ uint64_t object;
+ uint32_t index;
+ uint32_t tmem_offset;
+ uint32_t pfn_offset;
+ uint32_t len;
+ GUEST_HANDLE(void) gmfn; /* guest machine page frame */
+ } gen;
+ } u;
+};
+typedef struct tmem_op tmem_op_t;
+DEFINE_GUEST_HANDLE_STRUCT(tmem_op_t);
+
#else /* __ASSEMBLY__ */

/* In assembly code we cannot use C numeric constant suffixes. */
--- linux-2.6.30/drivers/xen/tmem.c 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.30-tmem/drivers/xen/tmem.c 2009-06-19 14:54:53.000000000 -0600
@@ -0,0 +1,106 @@
+/*
+ * Xen implementation for transcendent memory (tmem)
+ *
+ * Dan Magenheimer <[email protected]> 2009
+ */
+
+#include <linux/types.h>
+#include <linux/tmem.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/tmem.h>
+#include <asm/xen/hypercall.h>
+#include <asm/xen/page.h>
+
+struct tmem_ops *tmem_ops = NULL;
+
+static inline int xen_tmem_op(u32 tmem_cmd, u32 tmem_pool, u64 object,
+ u32 index, unsigned long gmfn, u32 tmem_offset, u32 pfn_offset, u32 len)
+{
+ struct tmem_op op;
+ int rc = 0;
+
+ op.cmd = tmem_cmd;
+ op.pool_id = tmem_pool;
+ op.u.gen.object = object;
+ op.u.gen.index = index;
+ op.u.gen.tmem_offset = tmem_offset;
+ op.u.gen.pfn_offset = pfn_offset;
+ op.u.gen.len = len;
+ set_xen_guest_handle(op.u.gen.gmfn, (void *)gmfn);
+ rc = HYPERVISOR_tmem_op(&op);
+ return rc;
+}
+
+static inline int xen_tmem_new_pool(uint32_t tmem_cmd, uint64_t uuid_lo,
+ uint64_t uuid_hi, uint32_t flags)
+{
+ struct tmem_op op;
+ int rc = 0;
+
+ op.cmd = tmem_cmd;
+ op.u.new.uuid[0] = uuid_lo;
+ op.u.new.uuid[1] = uuid_hi;
+ op.u.new.flags = flags;
+ rc = HYPERVISOR_tmem_op(&op);
+ return rc;
+}
+
+static int tmem_put_page(u32 pool_id, u64 object, u32 index,
+ unsigned long pfn)
+{
+ unsigned long gmfn = pfn_to_mfn(pfn);
+
+ return xen_tmem_op(TMEM_PUT_PAGE, pool_id, object, index,
+ gmfn, 0, 0, 0);
+}
+
+static int tmem_get_page(u32 pool_id, u64 object, u32 index,
+ unsigned long pfn)
+{
+ unsigned long gmfn = pfn_to_mfn(pfn);
+
+ return xen_tmem_op(TMEM_GET_PAGE, pool_id, object, index,
+ gmfn, 0, 0, 0);
+}
+
+static int tmem_flush_page(u32 pool_id, u64 object, u32 index)
+{
+ return xen_tmem_op(TMEM_FLUSH_PAGE, pool_id, object, index,
+ 0, 0, 0, 0);
+}
+
+static int tmem_flush_object(u32 pool_id, u64 object)
+{
+ return xen_tmem_op(TMEM_FLUSH_OBJECT, pool_id, object, 0, 0, 0, 0, 0);
+}
+
+static int tmem_new_pool(u64 uuid_lo, u64 uuid_hi, u32 flags)
+{
+ flags |= (PAGE_SHIFT - 12) << TMEM_POOL_PAGESIZE_SHIFT;
+ return xen_tmem_new_pool(TMEM_NEW_POOL, uuid_lo, uuid_hi, flags);
+}
+
+static int tmem_destroy_pool(u32 pool_id)
+{
+ return xen_tmem_op(TMEM_DESTROY_POOL, pool_id, 0, 0, 0, 0, 0, 0);
+}
+
+static int __init xen_tmem_init(void)
+{
+ if (tmem_ops != NULL)
+ printk(KERN_WARNING "attempt to define multiple tmem_ops\n");
+ else
+ tmem_ops = kmalloc(sizeof(struct tmem_ops), GFP_KERNEL);
+
+ if (tmem_ops == NULL)
+ return -ENODEV;
+
+ tmem_ops->new_pool = tmem_new_pool;
+ tmem_ops->put_page = tmem_put_page;
+ tmem_ops->get_page = tmem_get_page;
+ tmem_ops->flush_page = tmem_flush_page;
+ tmem_ops->flush_object = tmem_flush_object;
+ tmem_ops->destroy_pool = tmem_destroy_pool;
+
+ return 0;
+}

2009-06-20 01:51:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] tmem: infrastructure for tmem layer

Dan Magenheimer wrote:

> --- linux-2.6.30/mm/Makefile 2009-06-09 21:05:27.000000000 -0600
> +++ linux-2.6.30-tmem/mm/Makefile 2009-06-19 09:33:59.000000000 -0600
> @@ -16,6 +16,8 @@
> obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
> obj-$(CONFIG_BOUNCE) += bounce.o
> obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
> +obj-$(CONFIG_PRESWAP) += preswap.o
> +obj-$(CONFIG_PRECACHE) += precache.o

This patch does not actually add preswap.c or precache.c,
so it would lead to an uncompilable changeset.

This in turn breaks git bisect.

Please make sure that every changeset that is applied results
in a compilable and bootable kernel.

--
All rights reversed.

2009-06-20 02:29:07

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] tmem: precache implementation (layered on tmem)

Dan Magenheimer wrote:

> @@ -110,6 +111,9 @@
> s->s_qcop = sb_quotactl_ops;
> s->s_op = &default_op;
> s->s_time_gran = 1000000000;
> +#ifdef CONFIG_PRECACHE
> + s->precache_poolid = -1;
> +#endif
> }
> out:
> return s;

Please generate your patches with -up so we can see
which functions are being modified by each patch hunk.
That makes it a lot easier to find the context and
see what you are trying to do.

--
All rights reversed.

2009-06-22 11:27:19

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [RFC] transcendent memory for Linux

On Fri, 19 Jun 2009 16:53:45 -0700 (PDT)
Dan Magenheimer <[email protected]> wrote:

> Tmem has some similarity to IBM's Collaborative Memory Management,
> but creates more of a partnership between the kernel and the
> "privileged entity" and is not very invasive. Tmem may be
> applicable for KVM and containers; there is some disagreement on
> the extent of its value. Tmem is highly complementary to ballooning
> (aka page granularity hot plug) and memory deduplication (aka
> transparent content-based page sharing) but still has value
> when neither are present.

The basic idea seems to be that you reduce the amount of memory
available to the guest and as a compensation give the guest some
tmem, no? If that is the case then the effect of tmem is somewhat
comparable to the volatile page cache pages.

The big advantage of this approach is its simplicity, but there
are down sides as well:
1) You need to copy the data between the tmem pool and the page
cache. At least temporarily there are two copies of the same
page around. That increases the total amount of used memory.
2) The guest has a smaller memory size. Either the memory is
large enough for the working set size in which case tmem is
ineffective, or the working set does not fit which increases
the memory pressure and the cpu cycles spent in the mm code.
3) There is an additional tuning knob, the size of the tmem pool
for the guest. I see the need for a clever algorithm to determine
the size for the different tmem pools.

Overall I would say its worthwhile to investigate the performance
impacts of the approach.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-06-22 14:31:41

by Chris Friesen

[permalink] [raw]
Subject: Re: [RFC] transcendent memory for Linux

Dan Magenheimer wrote:

> What if there was a class of memory that is of unknown
> and dynamically variable size, is addressable only indirectly
> by the kernel, can be configured either as persistent or
> as "ephemeral" (meaning it will be around for awhile, but
> might disappear without warning), and is still fast enough
> to be synchronously accessible?
>
> We call this latter class "transcendent memory"

While it is true that this memory is "exceeding usual limits", the more
important criterion is that it may disappear.

It might be clearer to just call it "ephemeral memory".

There is going to be some overhead due to the extra copying, and at
times there could be two copies of data in memory. It seems possible
that certain apps right at the borderline could end up running slower
because they can't fit in the regular+ephemeral memory due to the
duplication, while the same amount of memory used normally could have
been sufficient.

I suspect trying to optimize management of this could be difficult.

Chris

2009-06-22 20:42:50

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC] transcendent memory for Linux

> > Tmem has some similarity to IBM's Collaborative Memory Management,
> > but creates more of a partnership between the kernel and the
> > "privileged entity" and is not very invasive. Tmem may be
> > applicable for KVM and containers; there is some disagreement on
> > the extent of its value. Tmem is highly complementary to ballooning
> > (aka page granularity hot plug) and memory deduplication (aka
> > transparent content-based page sharing) but still has value
> > when neither are present.

Hi Martin --

Thanks much for taking the time to reply!

> The basic idea seems to be that you reduce the amount of memory
> available to the guest and as a compensation give the guest some
> tmem, no?

That's mostly right. Tmem's primary role is to help
with guests that have had their available memory reduced
(via ballooning or hotplug or some future mechanism).
However, tmem additionally provides a way to offer otherwise
unused-by-the-hypervisor ("fallow") memory to a guest,
essentially expanding a guest kernel's page cache if
no other guest is using the RAM anyway.

And "as a compensation GIVE the guest some tmem" is misleading,
because tmem (at least ephemeral tmem) is never "given"
to a guest. A better word might be "loaned" or "rented".
The guest gets to use some tmem for awhile but if it
doesn't use it effectively, the memory is "repossessed"
(or the guest is "evicted" from using that memory)
transparently so that it can be used more effectively
elsewhere.

> If that is the case then the effect of tmem is somewhat
> comparable to the volatile page cache pages.

There is definitely some similarity in that both are providing
useful information to the hypervisor. In CMM's case, the
guest is passively providing info; in tmem's case it is
actively providing info and making use of the info within
the kernel, not just in the hypervisor, which is why I described it
as "more of a partnership".

> The big advantage of this approach is its simplicity, but there
> are down sides as well:
> 1) You need to copy the data between the tmem pool and the page
> cache. At least temporarily there are two copies of the same
> page around. That increases the total amount of used memory.

Certainly this is theoretically true, but I think the increase
is small and transient. The kernel only puts the page into
precache when it has decided to use that page for another
purpose (due to memory pressure). Until it actually
"reprovisions" the page, the data is briefly duplicated.

On the other hand, copying eliminates the need for fancy
games with virtual mappings and TLB entries. Copying appears
to be getting much faster on recent CPUs; I'm not sure
if this is also true of TLB operations.

> 2) The guest has a smaller memory size. Either the memory is
> large enough for the working set size in which case tmem is
> ineffective...

Yes, if the kernel has memory to "waste" (e.g. never refaults and
never swaps), tmem is ineffective. The goal of tmem is to optimize
memory usage across an environment where there is contention
among multiple users (guests) for a limited resource (RAM).
If your environment always has enough RAM for every guest
and there's never any contention, you don't want tmem... but
I'd assert you've wasted money in your data center by buying
too much RAM!

> or the working set does not fit which increases
> the memory pressure and the cpu cycles spent in the mm code.

True, this is where preswap is useful. Without tmem/preswap,
"does not fit" means swap-to-disk or refaulting is required.
Preswap alleviates the memory pressure by using tmem to
essentially swap to "magic memory" and precache reduces the
need for refaulting.
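
To make that ordering concrete, here is a minimal userspace sketch of
the "preswap first" swap-out decision described above. The functions
preswap_put() and swap_to_disk() here are illustrative stubs standing in
for the real kernel paths, not the actual implementations from the
posted patches:

```c
#include <stdio.h>

/* stands in for the hypervisor's accept/decline decision */
static int pool_has_room = 1;

/* stub: returns 1 if the page was put into preswap, 0 if declined */
static int preswap_put(int page)
{
	return pool_has_room;
}

/* stub: the normal swap path; in the kernel this is a disk write */
static int swap_to_disk(int page)
{
	printf("page %d written to swap disk\n", page);
	return 0;
}

/* swap-out path: try preswap first; only on decline pay for disk I/O */
static int swap_out(int page)
{
	if (preswap_put(page) == 1)
		return 0;	/* disk write (and usually a later read) avoided */
	return swap_to_disk(page);
}
```

If the put is declined, the page simply follows the normal path, so the
worst case adds only the cost of the attempted put.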

> 3) There is an additional tuning knob, the size of the tmem pool
> for the guest. I see the need for a clever algorithm to determine
> the size for the different tmem pools.

Yes, some policy in the hypervisor is still required, essentially
a "memory scheduler". The working implementation (in Xen)
uses FIFO, but modified by admin-configurable "weight" values
to allow QoS and avoid DoS.
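
As a toy illustration of what "FIFO modified by weights" might look
like, here is a victim-selection sketch; the struct and names are
hypothetical and are not taken from the Xen implementation:

```c
/* hypothetical per-guest accounting; not the Xen data structures */
struct guest {
	int id;
	unsigned long pages;	/* tmem pages currently held */
	unsigned long weight;	/* admin-configured share */
};

/*
 * Pick the eviction victim: the guest using the most tmem relative
 * to its weight. Cross-multiplying avoids integer division when
 * comparing the pages/weight ratios.
 */
static int pick_victim(const struct guest *g, int n)
{
	int i, victim = 0;

	for (i = 1; i < n; i++)
		if (g[i].pages * g[victim].weight >
		    g[victim].pages * g[i].weight)
			victim = i;
	return victim;
}
```

Within the chosen guest's pool, eviction would then proceed in FIFO
order; the weights provide the QoS knob.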

> Overall I would say its worthwhile to investigate the performance
> impacts of the approach.

Thanks. I'd appreciate any thoughts or experience you have
in this area (onlist or offlist) as I don't think there are
any adequate benchmarks that aren't either myopic for a complex
environment or contrived (and thus misleading) to prove an
isolated point.

I would also guess that tmem is more beneficial on recent
multi-core processors, and more costly on older chips.

Thanks again,
Dan

2009-06-22 20:52:17

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [RFC] transcendent memory for Linux


> > What if there was a class of memory that is of unknown
> > and dynamically variable size, is addressable only indirectly
> > by the kernel, can be configured either as persistent or
> > as "ephemeral" (meaning it will be around for awhile, but
> > might disappear without warning), and is still fast enough
> > to be synchronously accessible?
> >
> > We call this latter class "transcendent memory"
>
> While true that this memory is "exceeding usual limits", the more
> important criteria is that it may disappear.
>
> It might be clearer to just call it "ephemeral memory".

Ephemeral tmem (precache) may be the most interesting, but there
is persistent tmem (preswap) as well. Both are working today
and both are included in the patches I posted.

Looking for a term encompassing both, I chose "transcendent".

> There is going to be some overhead due to the extra copying, and at
> times there could be two copies of data in memory. It seems possible
> that certain apps right at the borderline could end up running slower
> because they can't fit in the regular+ephemeral memory due to the
> duplication, while the same amount of memory used normally could have
> been sufficient.

This is likely true, but I expect the duplicates to be few
and transient and a very small fraction of the total memory cost for
virtualization (and similar abstraction technologies).

> I suspect trying to optimize management of this could be difficult.

True. Optimizing the management of ANY resource across many
consumers is difficult. But wasting the resource because it's
a pain to optimize doesn't seem to be a good answer either.

Thanks!
Dan

2009-06-27 11:28:55

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] transcendent memory for Linux

Hi!

This description (whole mail) needs to go into Documentation/, somewhere.

> Normal memory is directly addressable by the kernel,
> of a known normally-fixed size, synchronously accessible,
> and persistent (though not across a reboot).
...
> Transcendent memory, or "tmem" for short, provides a
> well-defined API to access this unusual class of memory.
> The basic operations are page-copy-based and use a flexible
> object-oriented addressing mechanism. Tmem assumes

Should this API be documented, somewhere? Is it in-kernel API or does
userland see it?

> "Preswap" IS persistent, but for various reasons may not always
> be available for use, again due to factors that may not be
> visible to the kernel (but, briefly, if the kernel is being
> "good" and has shared its resources nicely, then it will be
> able to use preswap, else it will not). Once a page is put,
> a get on the page will always succeed. So when the kernel
> finds itself in a situation where it needs to swap out a page,
> it first attempts to use preswap. If the put works, a disk
> write and (usually) a disk read are avoided. If it doesn't,
> the page is written to swap as usual. Unlike precache, whether

Ok, how much slower does this get in the worst case? A single hypercall to
find out that preswap is unavailable? I guess that compared to disk
access that's lost in the noise?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-27 13:19:07

by Linus Walleij

[permalink] [raw]
Subject: Re: [RFC] transcendent memory for Linux

2009/6/20 Dan Magenheimer <[email protected]>:

> We call this latter class "transcendent memory" and it
> provides an interesting opportunity to more efficiently
> utilize RAM in a virtualized environment. However this
> "memory but not really memory" may also have applications
> in NON-virtualized environments, such as hotplug-memory
> deletion, SSDs, and page cache compression. Others have
> suggested ideas such as allowing use of highmem memory
> without a highmem kernel, or use of spare video memory.

Here is what I consider may be a use case from the embedded
world: we have to save power as much as possible, so we need
to shut off entire banks of memory.

Currently people do things like put memory into self-refresh
and then sleep, but for long lapses of time you would
want to compress memory towards lower addresses and
turn as many banks as possible off.

So we have something like 4x16MB banks of RAM = 64MB RAM,
and the most necessary stuff easily fits in one of them.
If we can shut down 3x16MB we save 3 x power supply of the
RAMs.

However in embedded we don't have any swap, so we'd need
some call that would attempt to vacate a memory bank by
dropping clean code and data that has been demand-paged in
from the FS. Dirty pages cannot simply be dropped; these
should instead be moved down to memory which will be
retained, and the call should fail if we don't succeed in
migrating all dirty pages.

Would this be possible with transcendent memory?

Yours,
Linus Walleij

2009-06-28 07:42:19

by Avi Kivity

Subject: Re: [RFC] transcendent memory for Linux

On 06/27/2009 04:18 PM, Linus Walleij wrote:
> 2009/6/20 Dan Magenheimer<[email protected]>:
>
>
>> We call this latter class "transcendent memory" and it
>> provides an interesting opportunity to more efficiently
>> utilize RAM in a virtualized environment. However this
>> "memory but not really memory" may also have applications
>> in NON-virtualized environments, such as hotplug-memory
>> deletion, SSDs, and page cache compression. Others have
>> suggested ideas such as allowing use of highmem memory
>> without a highmem kernel, or use of spare video memory.
>>
>
> Here is what I consider may be a use case from the embedded
> world: we have to save power as much as possible, so we need
> to shut off entire banks of memory.
>
> Currently people do things like put memory into self-refresh
> and then sleep, but for long lapses of time you would
> want to compress memory towards lower addresses and
> turn as many banks as possible off.
>
> So we have something like 4x16MB banks of RAM = 64MB RAM,
> and the most necessary stuff easily fits in one of them.
> If we can shut down 3x16MB we save 3 x power supply of the
> RAMs.
>
> However in embedded we don't have any swap, so we'd need
> some call that would attempt to vacate a memory bank by
> dropping clean code and data that has been demand-paged in
> from the FS. Dirty pages cannot simply be dropped; these
> should instead be moved down to memory which will be
> retained, and the call should fail if we don't succeed in
> migrating all dirty pages.
>
> Would this be possible with transcendent memory?
>

You could do this with memory defragmentation, which is needed for
things like memory hotunplug anyway.

--
error compiling committee.c: too many arguments to function

2009-06-29 14:36:18

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

Hi Pavel --

Thanks for the feedback!

> This description (whole mail) needs to go into
> Documentation/, somewhere.

Good idea. I'll do that for the next time I post the patches.

> > Normal memory is directly addressable by the kernel,
> > of a known normally-fixed size, synchronously accessible,
> > and persistent (though not across a reboot).
> ...
> > Transcendent memory, or "tmem" for short, provides a
> > well-defined API to access this unusual class of memory.
> > The basic operations are page-copy-based and use a flexible
> > object-oriented addressing mechanism. Tmem assumes
>
> Should this API be documented, somewhere? Is it in-kernel API or does
> userland see it?

It is documented currently at:

http://oss.oracle.com/projects/tmem/documentation/api/

(just noticed I still haven't posted version 0.0.2 which
has a few minor changes).

I will add a briefer description of this API in Documentation/

It is in-kernel only because some of the operations have
a parameter that is a physical page frame number.

> > "Preswap" IS persistent, but for various reasons may not always
> > be available for use, again due to factors that may not be
> > visible to the kernel (but, briefly, if the kernel is being
> > "good" and has shared its resources nicely, then it will be
> > able to use preswap, else it will not). Once a page is put,
> > a get on the page will always succeed. So when the kernel
> > finds itself in a situation where it needs to swap out a page,
> > it first attempts to use preswap. If the put works, a disk
> > write and (usually) a disk read are avoided. If it doesn't,
> > the page is written to swap as usual. Unlike precache, whether
>
> Ok, how much slower this gets in the worst case? Single hypercall to
> find out that preswap is unavailable? I guess that compared to disk
> access that's lost in the noise?

Yes, the overhead of one hypercall per swap page is lost in
the noise.

Dan

2009-06-29 14:46:54

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux



> From: Linus Walleij [mailto:[email protected]]
> Sent: Saturday, June 27, 2009 7:19 AM
> Subject: Re: [RFC] transcendent memory for Linux
>
> > We call this latter class "transcendent memory" and it
> > provides an interesting opportunity to more efficiently
> > utilize RAM in a virtualized environment. However this
> > "memory but not really memory" may also have applications
> > in NON-virtualized environments, such as hotplug-memory
> > deletion, SSDs, and page cache compression. Others have
> > suggested ideas such as allowing use of highmem memory
> > without a highmem kernel, or use of spare video memory.
>
> Here is what I consider may be a use case from the embedded
> world: we have to save power as much as possible, so we need
> to shut off entire banks of memory.
>
> Currently people do things like put memory into self-refresh
> and then sleep, but for long lapses of time you would
> want to compress memory towards lower addresses and
> turn as many banks as possible off.
>
> So we have something like 4x16MB banks of RAM = 64MB RAM,
> and the most necessary stuff easily fits in one of them.
> If we can shut down 3x16MB we save 3 x power supply of the
> RAMs.
>
> However in embedded we don't have any swap, so we'd need
> some call that would attempt to vacate a memory bank by
> dropping clean code and data that has been demand-paged in
> from the FS. Dirty pages cannot simply be dropped; these
> should instead be moved down to memory which will be
> retained, and the call should fail if we don't succeed in
> migrating all dirty pages.
>
> Would this be possible with transcendent memory?

Yes, I think this would work nicely as a use case for tmem.

As Avi points out, you could do this with memory defragmentation,
but if you know in advance that you will be frequently
powering on and off a bank of RAM, you could put only
ephemeral memory into it (enforced by a kernel policy and
the tmem API), then defragmentation (and compression towards
lower addresses) would not be necessary, and you could power
off a bank with no loss of data.

One issue though: I would guess that copying pages of memory
could be very slow in an inexpensive embedded processor.

Dan

2009-06-29 20:36:31

by Pavel Machek

Subject: Re: [RFC] transcendent memory for Linux


> It is documented currently at:
>
> http://oss.oracle.com/projects/tmem/documentation/api/
>
> (just noticed I still haven't posted version 0.0.2 which
> has a few minor changes).
>
> I will add a briefer description of this API in Documentation/

Please do.

At least TMEM_NEW_POOL() looks quite ugly. Why uuid? Mixing flags into
size argument is strange.

> It is in-kernel only because some of the operations have
> a parameter that is a physical page frame number.

In-kernel API is probably better described as function prototypes.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-29 21:15:22

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

> > It is documented currently at:
> >
> > http://oss.oracle.com/projects/tmem/documentation/api/
> >
> > (just noticed I still haven't posted version 0.0.2 which
> > has a few minor changes).
> >
> > I will add a briefer description of this API in Documentation/
>
> Please do.

OK, will do.

> At least TMEM_NEW_POOL() looks quite ugly. Why uuid? Mixing flags into
> size argument is strange.

The uuid is only used for shared pools. If two different
"tmem clients" (guests) agree on a 128-bit "shared secret",
they can share a tmem pool. For ocfs2, the 128-bit uuid in
the on-disk superblock is used for this purpose to implement
shared precache. (Pages evicted by one cluster node
can be used by another cluster node that co-resides on
the same physical system.)

The (page)size argument is always fixed (at PAGE_SIZE) for
any given kernel. The underlying implementation can
be capable of supporting multiple pagesizes.

So for the basic precache and preswap uses, "new pool"
has a very simple interface.

> > It is in-kernel only because some of the operations have
> > a parameter that is a physical page frame number.
>
> In-kernel API is probably better described as function prototypes.

Good idea. I will do that.

Thanks,
Dan

2009-06-29 21:23:46

by Jeremy Fitzhardinge

Subject: Re: [RFC] transcendent memory for Linux

On 06/29/09 14:13, Dan Magenheimer wrote:
> The uuid is only used for shared pools. If two different
> "tmem clients" (guests) agree on a 128-bit "shared secret",
> they can share a tmem pool. For ocfs2, the 128-bit uuid in
> the on-disk superblock is used for this purpose to implement
> shared precache. (Pages evicted by one cluster node
> can be used by another cluster node that co-resides on
> the same physical system.)
>

What are the implications of some third party VM guessing the "uuid" of
a shared pool? Presumably they could view and modify the contents of
the pool. Is there any security model beyond making UUIDs unguessable?

> The (page)size argument is always fixed (at PAGE_SIZE) for
> any given kernel. The underlying implementation can
> be capable of supporting multiple pagesizes.
>

Pavel's other point was that merging the size field into the flags is a
bit unusual/ugly. But you can work around that by just defining the
"flag" values for each plausible page size, since there's a pretty small
bound: TMEM_PAGESZ_4K, 8K, etc.

Also, having an "API version number" is a very bad idea. Such version
numbers are very inflexible and basically don't work (esp if you're
expecting to have multiple independent implementations of this API).
Much better is to have feature flags; the caller asks for features on
the new pool, and pool creation either succeeds or doesn't (a call to
return the set of supported features is a good complement).

J

2009-06-29 21:58:38

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

> From: Jeremy Fitzhardinge [mailto:[email protected]]
>
> On 06/29/09 14:13, Dan Magenheimer wrote:
> > The uuid is only used for shared pools. If two different
> > "tmem clients" (guests) agree on a 128-bit "shared secret",
> > they can share a tmem pool. For ocfs2, the 128-bit uuid in
> > the on-disk superblock is used for this purpose to implement
> > shared precache. (Pages evicted by one cluster node
> > can be used by another cluster node that co-resides on
> > the same physical system.)
>
> What are the implications of some third party VM guessing the
> "uuid" of
> a shared pool? Presumably they could view and modify the contents of
> the pool. Is there any security model beyond making UUIDs
> unguessable?

Interesting question. But, more than the 128-bit UUID must
be guessed... a valid 64-bit object id and a valid 32-bit
page index must also be guessed (though most instances of
the page index are small numbers so easy to guess). Once
192 bits are guessed though, yes, the pages could be viewed
and modified. I suspect there are much more easily targeted
security holes in most data centers than guessing 192 (or
even 128) bits.

Now this only affects shared pools, and shared-precache is still
experimental and not really part of this patchset. Does "mount"
of an accessible disk/filesystem have a better security model?
Perhaps there are opportunities to leverage that?

> > The (page)size argument is always fixed (at PAGE_SIZE) for
> > any given kernel. The underlying implementation can
> > be capable of supporting multiple pagesizes.
>
> Pavel's other point was that merging the size field into the
> flags is a
> bit unusual/ugly. But you can work around that by just defining the
> "flag" values for each plausible page size, since there's a
> pretty small
> bound: TMEM_PAGESZ_4K, 8K, etc.

OK I see. Yes the point (and the workaround) are valid.

> Also, having an "API version number" is a very bad idea. Such version
> numbers are very inflexible and basically don't work (esp if you're
> expecting to have multiple independent implementations of this API).
> Much better is to have feature flags; the caller asks for features on
> the new pool, and pool creation either succeeds or doesn't (a call to
> return the set of supported features is a good complement).

Yes. Perhaps all the non-flag bits should just be reserved for
future use. Today, the implementation just checks for (and implements)
only zero anyway and nothing is defined anywhere except the 4K
pagesize at the lowest levels of the (currently xen-only) API.

Thanks,
Dan

2009-06-29 22:16:02

by Jeremy Fitzhardinge

Subject: Re: [RFC] transcendent memory for Linux

On 06/29/09 14:57, Dan Magenheimer wrote:
> Interesting question. But, more than the 128-bit UUID must
> be guessed... a valid 64-bit object id and a valid 32-bit
> page index must also be guessed (though most instances of
> the page index are small numbers so easy to guess). Once
> 192 bits are guessed though, yes, the pages could be viewed
> and modified. I suspect there are much more easily targeted
> security holes in most data centers than guessing 192 (or
> even 128) bits.
>

If it's possible to verify the uuid is valid before trying to find a
valid oid+page, then it's much easier (since you can concentrate on the
uuid first). If the uuid is derived from something like the
filesystem's uuid - which wouldn't normally be considered sensitive
information - then it's not a search of the full 128-bit space.
And even if it were secret, uuids are not generally 128 randomly chosen
bits.

You also have to consider the case of a domain which was once part of
the ocfs cluster, but now is not - it may still know the uuid, but not
be otherwise allowed to use the cluster.

> Now this only affects shared pools, and shared-precache is still
> experimental and not really part of this patchset. Does "mount"
> of an accessible disk/filesystem have a better security model?
> Perhaps there are opportunities to leverage that?
>

Well, a domain is allowed to access any block device you give it access
to. I'm not sure what the equivalent model for tmem would be.

Anyway, it sounds like you need to think a fair bit more about shared
tmem's security model before it can be considered for use.

> Yes. Perhaps all the non-flag bits should just be reserved for
> future use. Today, the implementation just checks for (and implements)
> only zero anyway and nothing is defined anywhere except the 4K
> pagesize at the lowest levels of the (currently xen-only) API.
>

Yes. It should fail if it sees any unknown flags set in a guest request.

J

2009-06-30 21:23:05

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

> From: Jeremy Fitzhardinge [mailto:[email protected]]
> On 06/29/09 14:57, Dan Magenheimer wrote:
> > Interesting question. But, more than the 128-bit UUID must
> > be guessed... a valid 64-bit object id and a valid 32-bit
> > page index must also be guessed (though most instances of
> > the page index are small numbers so easy to guess). Once
> > 192 bits are guessed though, yes, the pages could be viewed
> > and modified. I suspect there are much more easily targeted
> > security holes in most data centers than guessing 192 (or
> > even 128) bits.
>
> If it's possible to verify the uuid is valid before trying to find a
> valid oid+page, then it's much easier (since you can concentrate on the
> uuid first).

No, the uuid can't be verified. Tmem gives no indication
as to whether a newly-created pool is already in use (shared)
by another guest. So without both the 128-bit uuid and an
already-in-use 64-bit object id and 32-bit page index, no data
is readable or writable by the attacker.

> You also have to consider the case of a domain which was once part of
> the ocfs cluster, but now is not - it may still know the uuid, but not
> be otherwise allowed to use the cluster.
> If the uuid is derived from something like the
> filesystem's uuid - which wouldn't normally be considered sensitive
> information - then its not like its a search of the full
> 128-bit space.
> And even if it were secret, uuids are not generally 128
> randomly chosen bits.

Hmmm... that is definitely a thornier problem. I guess the
security angle definitely deserves more design. But, again,
this affects only shared precache, which is not intended
to be part of the proposed initial tmem patchset, so this is
a futures issue.

Thanks again for the feedback!
Dan

2009-06-30 22:47:02

by Jeremy Fitzhardinge

Subject: Re: [RFC] transcendent memory for Linux

On 06/30/09 14:21, Dan Magenheimer wrote:
> No, the uuid can't be verified. Tmem gives no indication
> as to whether a newly-created pool is already in use (shared)
> by another guest. So without both the 128-bit uuid and an
> already-in-use 64-bit object id and 32-bit page index, no data
> is readable or writable by the attacker.
>

You have to consider things like timing attacks as well (for example, a
tmem hypercall might return faster if the uuid already exists).

Besides, you can tell whether a uuid exists, by at least a couple of
mechanisms (from a quick read of the source, so I might have overlooked
something):

1. You can create new shared pools until it starts failing as a
result of hitting the MAX_GLOBAL_SHARED_POOLS limit with junk
uuids. If you then successfully "create" a shared pool while
searching, you know it already existed.
2. The returned pool id will increase unless the pool already exists,
in which case you'll get a smaller id back (ignoring wraparound).


> Hmmm... that is definitely a thornier problem. I guess the
> security angle definitely deserves more design. But, again,
> this affects only shared precache, which is not intended
> to be part of the proposed initial tmem patchset, so this is
> a futures issue.

Yeah, a shared namespace of accessible objects is an entirely new thing
in the Xen universe. I would also drop Xen support until there's a good
security story about how they can be used.

J

2009-07-01 03:41:16

by Roland Dreier

Subject: Re: [RFC] transcendent memory for Linux


> One issue though: I would guess that copying pages of memory
> could be very slow in an inexpensive embedded processor.

And copying memory could very easily burn enough power by keeping the
CPU busy that you lose the incremental gain of turning the memory off
vs. just going to self refresh. (And the copying latency would easily
be as bad as the transition latency to/from self-refresh).

- R.

2009-07-01 23:04:13

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

> From: Jeremy Fitzhardinge [mailto:[email protected]]
> On 06/30/09 14:21, Dan Magenheimer wrote:
> > No, the uuid can't be verified. Tmem gives no indication
> > as to whether a newly-created pool is already in use (shared)
> > by another guest. So without both the 128-bit uuid and an
> > already-in-use 64-bit object id and 32-bit page index, no data
> > is readable or writable by the attacker.
>
> You have to consider things like timing attacks as well (for
> example, a
> tmem hypercall might return faster if the uuid already exists).
>
> Besides, you can tell whether a uuid exists, by at least a couple of
> mechanisms (from a quick read of the source, so I might have
> overlooked something):

All of these still require a large number of guesses
across a 128-bit space of possible uuids, right?
It should be easy to implement "guess limits" in xen
that disable tmem use by a guest if it fails too many guesses.
I'm a bit more worried about:

> You also have to consider the case of a domain which was once part of
> the ocfs cluster, but now is not - it may still know the uuid, but not
> be otherwise allowed to use the cluster.

But on the other hand, the security model here can be that
if a trusted entity becomes untrusted, you have to change
the locks.

> Yeah, a shared namespace of accessible objects is an entirely
> new thing
> in the Xen universe. I would also drop Xen support until
> there's a good
> security story about how they can be used.

While I agree that the security is not bulletproof, I wonder
if this position might be a bit extreme. Certainly, the NSA
should not turn on tmem in a cluster, but that doesn't mean that
nobody should be allowed to. I really suspect that there are
less costly / more rewarding attack vectors at several layers
in the hardware/software stack of most clusters.

Dan

2009-07-01 23:31:38

by Jeremy Fitzhardinge

Subject: Re: [RFC] transcendent memory for Linux

On 07/01/09 16:02, Dan Magenheimer wrote:
> All of these still require a large number of guesses
> across a 128-bit space of possible uuids, right?
> It should be easy to implement "guess limits" in xen
> that disable tmem use by a guest if it fails too many guesses.
>

How does Xen distinguish between someone "guessing" uuids and a normal
user which wants to create lots of pools?

>> You also have to consider the case of a domain which was once part of
>> the ocfs cluster, but now is not - it may still know the uuid, but not
>> be otherwise allowed to use the cluster.
>>
>
> But on the other hand, the security model here can be that
> if a trusted entity becomes untrusted, you have to change
> the locks.
>

Revocation is one of the big problems with capabilities-based systems.

>> Yeah, a shared namespace of accessible objects is an entirely
>> new thing
>> in the Xen universe. I would also drop Xen support until
>> there's a good
>> security story about how they can be used.
>>
>
> While I agree that the security is not bulletproof, I wonder
> if this position might be a bit extreme. Certainly, the NSA
> should not turn on tmem in a cluster, but that doesn't mean that
> nobody should be allowed to. I really suspect that there are
> less costly / more rewarding attack vectors at several layers
> in the hardware/software stack of most clusters.
>

Well, I think you can define any security model you like, but I think
you need to have a defined security model before making it an available
API. At the moment the model is defined by whatever you currently have
implemented, and anyone using the API as-is - without special
consideration of its security properties - is going to end up vulnerable.

In an earlier mail I said "a shared namespace of accessible objects is
an entirely new thing in the Xen universe", which is obviously not true:
we have Xenbus.

It seems to me that a better approach to shared tmem pools should be
moderated via Xenbus, which in turn allows dom0/xenstored/tmemd/etc to
apply arbitrary policies to who gets to see what handles, revoke them, etc.

You don't need to deal with "uuids" at the tmem hypercall level.
Instead, you have a well-defined xenbus path corresponding to the
resource; reading it will return a handle number, which you can then use
with your hypercalls. If your access is denied or revoked, then the
read will fail (or the current handle will stop working if revoked).
This requires some privileged hypercalls to establish and remove tmem
handles for a particular domain.

I'm assuming that the job of managing and balancing tmem resources will
need to happen in a tmem-domain rather than trying to build all that
policy into Xen itself, so putting a bit more logic in there to manage
shared access rules doesn't add much complexity to the system.

J

2009-07-02 06:38:24

by Pavel Machek

Subject: Re: [RFC] transcendent memory for Linux


> > Yeah, a shared namespace of accessible objects is an entirely
> > new thing
> > in the Xen universe. I would also drop Xen support until
> > there's a good
> > security story about how they can be used.
>
> While I agree that the security is not bulletproof, I wonder
> if this position might be a bit extreme. Certainly, the NSA
> should not turn on tmem in a cluster, but that doesn't mean that
> nobody should be allowed to. I really suspect that there are

This has more problems than "just" security, and yes, security should
be really solved at design time...
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-07-02 14:06:29

by Dan Magenheimer

Subject: RE: [RFC] transcendent memory for Linux

OK, OK, I give up. I will ensure all code for shared pools
is removed from the next version of the patch.

Though for future reference, I am interested in what
problems it has other than "just" security (offlist
if you want).

> -----Original Message-----
> From: Pavel Machek [mailto:[email protected]]
>
> > > Yeah, a shared namespace of accessible objects is an entirely
> > > new thing
> > > in the Xen universe. I would also drop Xen support until
> > > there's a good
> > > security story about how they can be used.
> >
> > While I agree that the security is not bulletproof, I wonder
> > if this position might be a bit extreme. Certainly, the NSA
> > should not turn on tmem in a cluster, but that doesn't mean that
> > nobody should be allowed to. I really suspect that there are
>
> This has more problems than "just" security, and yes, security should
> be really solved at design time...
>
> Pavel