2010-05-28 17:37:13

by Dan Magenheimer

Subject: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

[PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

Cleancache core files.

Credits: Cleancache_ops design derived from Jeremy Fitzhardinge's
design for tmem; sysfs code modelled after mm/ksm.c

Note that CONFIG_CLEANCACHE defaults to on; all hooks devolve
to a compare-pointer-to-NULL so performance impact should
be negligible, but can be reduced to zero impact if config'ed off.

Signed-off-by: Dan Magenheimer <[email protected]>

Diffstat:
include/linux/cleancache.h | 90 +++++++++
mm/Kconfig | 22 ++
mm/Makefile | 1
mm/cleancache.c | 203 +++++++++++++++++++++
4 files changed, 316 insertions(+)

--- linux-2.6.34/include/linux/cleancache.h 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.34-cleancache/include/linux/cleancache.h 2010-05-24 18:14:33.000000000 -0600
@@ -0,0 +1,90 @@
+#ifndef _LINUX_CLEANCACHE_H
+#define _LINUX_CLEANCACHE_H
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#define CLEANCACHE_GET_PAGE_SUCCESS 1
+
+struct cleancache_ops {
+ int (*init_fs)(size_t);
+ int (*init_shared_fs)(char *uuid, size_t);
+ int (*get_page)(int, ino_t, pgoff_t, struct page *);
+ int (*put_page)(int, ino_t, pgoff_t, struct page *);
+ int (*flush_page)(int, ino_t, pgoff_t);
+ int (*flush_inode)(int, ino_t);
+ void (*flush_fs)(int);
+};
+
+extern struct cleancache_ops *cleancache_ops;
+extern int __cleancache_get_page(struct page *);
+extern int __cleancache_put_page(struct page *);
+extern int __cleancache_flush_page(struct address_space *, struct page *);
+extern int __cleancache_flush_inode(struct address_space *);
+
+#ifndef CONFIG_CLEANCACHE
+#define cleancache_ops ((struct cleancache_ops *)NULL)
+#endif
+
+static inline int cleancache_init_fs(size_t pagesize)
+{
+ int ret = -1;
+
+ if (cleancache_ops)
+ ret = (*cleancache_ops->init_fs)(pagesize);
+ return ret;
+}
+
+static inline int cleancache_init_shared_fs(char *uuid, size_t pagesize)
+{
+ int ret = -1;
+
+ if (cleancache_ops)
+ ret = (*cleancache_ops->init_shared_fs)(uuid, pagesize);
+ return ret;
+}
+
+static inline int cleancache_get_page(struct page *page)
+{
+ int ret = 0;
+
+ if (cleancache_ops)
+ ret = __cleancache_get_page(page);
+ return ret;
+}
+
+static inline int cleancache_put_page(struct page *page)
+{
+ int ret = 0;
+
+ if (cleancache_ops)
+ ret = __cleancache_put_page(page);
+ return ret;
+}
+
+static inline int cleancache_flush_page(struct address_space *mapping,
+ struct page *page)
+{
+ int ret = 0;
+
+ if (cleancache_ops)
+ ret = __cleancache_flush_page(mapping, page);
+ return ret;
+}
+
+static inline int cleancache_flush_inode(struct address_space *mapping)
+{
+ int ret = 0;
+
+ if (cleancache_ops)
+ ret = __cleancache_flush_inode(mapping);
+ return ret;
+}
+
+static inline void cleancache_flush_fs(int pool_id)
+{
+ if (cleancache_ops && pool_id >= 0)
+ (*cleancache_ops->flush_fs)(pool_id);
+}
+
+#endif /* _LINUX_CLEANCACHE_H */
--- linux-2.6.34/mm/cleancache.c 1969-12-31 17:00:00.000000000 -0700
+++ linux-2.6.34-cleancache/mm/cleancache.c 2010-05-24 18:07:11.000000000 -0600
@@ -0,0 +1,203 @@
+/* mm/cleancache.c
+
+ Copyright (C) 2009-2010 Oracle Corp. All rights reserved.
+ Author: Dan Magenheimer
+
+ Cleancache can be thought of as a page-granularity victim cache for clean
+ pages that the kernel's pageframe replacement algorithm (PFRA) would like
+ to keep around, but can't since there isn't enough memory. So when the
+ PFRA "evicts" a page, it first attempts to put it into a synchronous
+ concurrency-safe page-oriented pseudo-RAM device (such as Xen's Transcendent
+ Memory, aka "tmem", or in-kernel compressed memory, aka "zmem", or other
+ RAM-like devices) which is not directly accessible or addressable by the
+ kernel and is of unknown and possibly time-varying size. And when a
+ cleancache-enabled filesystem wishes to access a page in a file on disk,
+ it first checks cleancache to see if it already contains it; if it does,
+ the page is copied into the kernel and a disk access is avoided.
+ This pseudo-RAM device links itself to cleancache by setting the
+ cleancache_ops pointer appropriately and the functions it provides must
+ conform to certain semantics as follows:
+
+ Most important, cleancache is "ephemeral". Pages which are copied into
+ cleancache have an indefinite lifetime which is completely unknowable
+ by the kernel and so may or may not still be in cleancache at any later time.
+ Thus, as its name implies, cleancache is not suitable for dirty pages. The
+ pseudo-RAM has complete discretion over what pages to preserve and what
+ pages to discard and when.
+
+ A filesystem calls "init_fs" to obtain a pool id which, if positive, must be
+ saved in the filesystem's superblock; a negative return value indicates
+ failure. A "put_page" will copy a (presumably about-to-be-evicted) page into
+ pseudo-RAM and associate it with the pool id, the file inode, and a page
+ index into the file. (The combination of a pool id, an inode, and an index
+ is called a "handle".) A "get_page" will copy the page, if found, from
+ pseudo-RAM into kernel memory. A "flush_page" will ensure the page no longer
+ is present in pseudo-RAM; a "flush_inode" will flush all pages associated
+ with the specified inode; and a "flush_fs" will flush all pages in all
+ inodes specified by the given pool id.
+
+ A "init_shared_fs", like init, obtains a pool id but tells the pseudo-RAM
+ to treat the pool as shared using a 128-bit UUID as a key. On systems
+ that may run multiple kernels (such as hard partitioned or virtualized
+ systems) that may share a clustered filesystem, and where the pseudo-RAM
+ may be shared among those kernels, calls to init_shared_fs that specify the
+ same UUID will receive the same pool id, thus allowing the pages to
+ be shared. Note that any security requirements must be imposed outside
+ of the kernel (e.g. by "tools" that control the pseudo-RAM). Or a
+ pseudo-RAM implementation can simply disable init_shared_fs by always
+ returning a negative value.
+
+ If a get_page is successful on a non-shared pool, the page is flushed (thus
+ making cleancache an "exclusive" cache). On a shared pool, the page
+ is NOT flushed on a successful get_page so that it remains accessible to
+ other sharers. The kernel is responsible for ensuring coherency between
+ cleancache (shared or not), the page cache, and the filesystem, using
+ cleancache flush operations as required.
+
+ Note that the pseudo-RAM must enforce put-put-get coherency and get-get
+ coherency. For the former, if two puts are made to the same handle but
+ with different data, say AAA by the first put and BBB by the second, a
+ subsequent get can never return the stale data (AAA). For get-get coherency,
+ if a get for a given handle fails, subsequent gets for that handle will
+ never succeed unless preceded by a successful put with that handle.
+
+ Last, pseudo-RAM provides no SMP serialization guarantees; if two
+ different Linux threads are putting and flushing a page with the same
+ handle, the results are indeterminate.
+
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/cleancache.h>
+
+struct cleancache_ops *cleancache_ops;
+EXPORT_SYMBOL(cleancache_ops);
+
+/* useful stats available via /sys/kernel/mm/cleancache */
+static unsigned long succ_gets;
+static unsigned long failed_gets;
+static unsigned long puts;
+static unsigned long flushes;
+
+int __cleancache_get_page(struct page *page)
+{
+ int ret = 0;
+ int pool_id = page->mapping->host->i_sb->cleancache_poolid;
+
+ if (pool_id >= 0) {
+ ret = (*cleancache_ops->get_page)(pool_id,
+ page->mapping->host->i_ino,
+ page->index,
+ page);
+ if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
+ succ_gets++;
+ else
+ failed_gets++;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__cleancache_get_page);
+
+int __cleancache_put_page(struct page *page)
+{
+ int ret = 0;
+ int pool_id = page->mapping->host->i_sb->cleancache_poolid;
+
+ if (pool_id >= 0) {
+ ret = (*cleancache_ops->put_page)(pool_id,
+ page->mapping->host->i_ino,
+ page->index,
+ page);
+ puts++;
+ }
+ return ret;
+}
+
+int __cleancache_flush_page(struct address_space *mapping, struct page *page)
+{
+ int ret = 0;
+ int pool_id = mapping->host->i_sb->cleancache_poolid;
+
+ if (pool_id >= 0) {
+ ret = (*cleancache_ops->flush_page)(pool_id,
+ mapping->host->i_ino,
+ page->index);
+ flushes++;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__cleancache_flush_page);
+
+int __cleancache_flush_inode(struct address_space *mapping)
+{
+ int ret = 0;
+ int pool_id = mapping->host->i_sb->cleancache_poolid;
+
+ if (pool_id >= 0) {
+ ret = (*cleancache_ops->flush_inode)(pool_id,
+ mapping->host->i_ino);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(__cleancache_flush_inode);
+
+#ifdef CONFIG_SYSFS
+
+#define CLEANCACHE_ATTR_RO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+static ssize_t succ_gets_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", succ_gets);
+}
+CLEANCACHE_ATTR_RO(succ_gets);
+
+static ssize_t failed_gets_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", failed_gets);
+}
+CLEANCACHE_ATTR_RO(failed_gets);
+
+static ssize_t puts_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", puts);
+}
+CLEANCACHE_ATTR_RO(puts);
+
+static ssize_t flushes_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", flushes);
+}
+CLEANCACHE_ATTR_RO(flushes);
+
+static struct attribute *cleancache_attrs[] = {
+ &succ_gets_attr.attr,
+ &failed_gets_attr.attr,
+ &puts_attr.attr,
+ &flushes_attr.attr,
+ NULL,
+};
+
+static struct attribute_group cleancache_attr_group = {
+ .attrs = cleancache_attrs,
+ .name = "cleancache",
+};
+
+#endif /* CONFIG_SYSFS */
+
+static int __init init_cleancache(void)
+{
+#ifdef CONFIG_SYSFS
+ int err;
+
+ err = sysfs_create_group(mm_kobj, &cleancache_attr_group);
+#endif /* CONFIG_SYSFS */
+ return 0;
+}
+module_init(init_cleancache)
--- linux-2.6.34/mm/Kconfig 2010-05-16 15:17:36.000000000 -0600
+++ linux-2.6.34-cleancache/mm/Kconfig 2010-05-24 12:14:44.000000000 -0600
@@ -287,3 +287,25 @@ config NOMMU_INITIAL_TRIM_EXCESS
of 1 says that all excess pages should be trimmed.

See Documentation/nommu-mmap.txt for more information.
+
+config CLEANCACHE
+ bool "Enable cleancache pseudo-RAM driver to cache clean pages"
+ default y
+ help
+ Cleancache can be thought of as a page-granularity victim cache
+ for clean pages that the kernel's pageframe replacement algorithm
+ (PFRA) would like to keep around, but can't since there isn't enough
+ memory. So when the PFRA "evicts" a page, it first attempts to put
+ it into a synchronous concurrency-safe page-oriented pseudo-RAM
+ device (such as Xen's Transcendent Memory, aka "tmem") which is not
+ directly accessible or addressable by the kernel and is of unknown
+ (and possibly time-varying) size. And when a cleancache-enabled
+ filesystem wishes to access a page in a file on disk, it first
+ checks cleancache to see if it already contains it; if it does,
+ the page is copied into the kernel and a disk access is avoided.
+ When a pseudo-RAM device is available, a significant I/O reduction
+ may be achieved. When none is available, all cleancache calls
+ are reduced to a single pointer-compare-against-NULL resulting
+ in a negligible performance hit.
+
+ If unsure, say Y to enable cleancache
--- linux-2.6.34/mm/Makefile 2010-05-16 15:17:36.000000000 -0600
+++ linux-2.6.34-cleancache/mm/Makefile 2010-05-24 12:14:44.000000000 -0600
@@ -44,3 +44,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_CLEANCACHE) += cleancache.o
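For illustration only (the actual hook placements land in later patches of
this series), here is a rough sketch of how a cleancache-enabled filesystem
might use the API above; "myfs" and its slow path are made-up names:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/cleancache.h>

static int myfs_read_from_disk(struct page *page);	/* hypothetical */

/* at mount time: opt in; a negative pool id simply means "disabled" */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
	sb->cleancache_poolid = cleancache_init_fs(PAGE_SIZE);
	return 0;
}

/* on a page cache miss: consult cleancache before issuing disk I/O;
 * the page is locked and not yet uptodate at this point */
static int myfs_readpage(struct file *file, struct page *page)
{
	if (cleancache_get_page(page) == CLEANCACHE_GET_PAGE_SUCCESS) {
		SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}
	return myfs_read_from_disk(page);
}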


2010-06-02 19:29:00

by Andrew Morton

Subject: Re: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

On Fri, 28 May 2010 10:35:50 -0700
Dan Magenheimer <[email protected]> wrote:

> [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files
>
> Cleancache core files.
>
> Credits: Cleancache_ops design derived from Jeremy Fitzhardinge's
> design for tmem; sysfs code modelled after mm/ksm.c
>
> Note that CONFIG_CLEANCACHE defaults to on; all hooks devolve
> to a compare-pointer-to-NULL so performance impact should
> be negligible, but can be reduced to zero impact if config'ed off.
>
> ...
>
> --- linux-2.6.34/include/linux/cleancache.h 1969-12-31 17:00:00.000000000 -0700
> +++ linux-2.6.34-cleancache/include/linux/cleancache.h 2010-05-24 18:14:33.000000000 -0600
> @@ -0,0 +1,90 @@
> +#ifndef _LINUX_CLEANCACHE_H
> +#define _LINUX_CLEANCACHE_H
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +#define CLEANCACHE_GET_PAGE_SUCCESS 1
> +
> +struct cleancache_ops {
> + int (*init_fs)(size_t);
> + int (*init_shared_fs)(char *uuid, size_t);
> + int (*get_page)(int, ino_t, pgoff_t, struct page *);
> + int (*put_page)(int, ino_t, pgoff_t, struct page *);
> + int (*flush_page)(int, ino_t, pgoff_t);
> + int (*flush_inode)(int, ino_t);
> + void (*flush_fs)(int);
> +};
> +
> +extern struct cleancache_ops *cleancache_ops;

Why does this exist? If there's only ever one cleancache_ops
system-wide then we'd be better off doing

(*cleancache_ops.init_fs)()

and save a zillion pointer hops.

If instead there are different flavours of cleancache_ops then making
this pointer a system-wide singleton seems an odd decision.

>
> ...
>
> +int __cleancache_get_page(struct page *page)
> +{
> + int ret = 0;
> + int pool_id = page->mapping->host->i_sb->cleancache_poolid;
> +
> + if (pool_id >= 0) {
> + ret = (*cleancache_ops->get_page)(pool_id,
> + page->mapping->host->i_ino,
> + page->index,
> + page);
> + if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
> + succ_gets++;
> + else
> + failed_gets++;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(__cleancache_get_page);

All these undocumented functions would appear to be racy and buggy if
the passed-in page isn't locked. But because they're undocumented, I
don't know if "the page must be locked" was an API requirement and I
ain't going to go and review all callers.

> +#ifdef CONFIG_SYSFS
> +
> +#define CLEANCACHE_ATTR_RO(_name) \
> + static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +static ssize_t succ_gets_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", succ_gets);
> +}
> +CLEANCACHE_ATTR_RO(succ_gets);
> +
> +static ssize_t failed_gets_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", failed_gets);
> +}
> +CLEANCACHE_ATTR_RO(failed_gets);
> +
> +static ssize_t puts_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", puts);
> +}
> +CLEANCACHE_ATTR_RO(puts);
> +
> +static ssize_t flushes_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", flushes);
> +}
> +CLEANCACHE_ATTR_RO(flushes);
> +
> +static struct attribute *cleancache_attrs[] = {
> + &succ_gets_attr.attr,
> + &failed_gets_attr.attr,
> + &puts_attr.attr,
> + &flushes_attr.attr,
> + NULL,
> +};
> +
> +static struct attribute_group cleancache_attr_group = {
> + .attrs = cleancache_attrs,
> + .name = "cleancache",
> +};
> +
> +#endif /* CONFIG_SYSFS */

Please completely document the sysfs API, preferably in the changelogs.
It's the first thing reviewers should look at, because it's one thing
we can never change. And Documentation/ABI/ is a place for permanent
documentation.


> --- linux-2.6.34/mm/Kconfig 2010-05-16 15:17:36.000000000 -0600
> +++ linux-2.6.34-cleancache/mm/Kconfig 2010-05-24 12:14:44.000000000 -0600
> @@ -287,3 +287,25 @@ config NOMMU_INITIAL_TRIM_EXCESS
> of 1 says that all excess pages should be trimmed.
>
> See Documentation/nommu-mmap.txt for more information.
> +
> +config CLEANCACHE
> + bool "Enable cleancache pseudo-RAM driver to cache clean pages"
> + default y
> + help
> + Cleancache can be thought of as a page-granularity victim cache
> + for clean pages that the kernel's pageframe replacement algorithm
> + (PFRA) would like to keep around, but can't since there isn't enough
> + memory. So when the PFRA "evicts" a page, it first attempts to put
> + it into a synchronous concurrency-safe page-oriented pseudo-RAM
> + device (such as Xen's Transcendent Memory, aka "tmem") which is not
> + directly accessible or addressable by the kernel and is of unknown
> + (and possibly time-varying) size. And when a cleancache-enabled
> + filesystem wishes to access a page in a file on disk, it first
> + checks cleancache to see if it already contains it; if it does,
> + the page is copied into the kernel and a disk access is avoided.
> + When a pseudo-RAM device is available, a significant I/O reduction
> + may be achieved. When none is available, all cleancache calls
> + are reduced to a single pointer-compare-against-NULL resulting
> + in a negligible performance hit.
> +
> + If unsure, say Y to enable cleancache


I'm a bit surprised that cleancache and frontswap have their sticky
fingers so deep inside swap and filesystems and the VFS.

I'd have thought that the places where pages are added to the caches
would be highly concentrated in the page-reclaim page eviction code,
and that for reads the place where pages are retrieved would be at the
pagecache/swapcache <-> I/O boundary. Those transition points are
reasonably narrow and seem to be the obvious site at which to interpose
a cache, but it wasn't done that way.

In core MM there's been effort to treat swap-backed and file-backed pages
in the same manner (indeed in a common manner) and that effort has been
partially successful. These changes are going in the other direction.


There have been any number of compressed-swap and compressed-file
projects (if not compressed-pagecache). Where do cleancache/frontswap
overlap those and which is superior?


And the big vague general issue: where's the value? What does all this
code buy us? Why would we want to include it in Linux? When Aunt
Tillie unwraps her shiny new kernel, what would she notice was
different?


2010-06-03 00:06:37

by Dan Magenheimer

Subject: RE: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

> From: Andrew Morton [mailto:[email protected]]

Thanks very much for taking the time for feedback! I hope
I can answer all of your questions... bear with me if some
of the answers are a bit long.

> > +extern struct cleancache_ops *cleancache_ops;
>
> Why does this exist? If there's only ever one cleancache_ops
> system-wide then we'd be better off doing
>
> (*cleancache_ops.init_fs)()
>
> and save a zillion pointer hops.
>
> If instead there are different flavours of cleancache_ops then making
> this pointer a system-wide singleton seems an odd decision.

It is intended that there be different flavours but only
one can be used in any running kernel. A driver file/module
claims the cleancache_ops pointer (and should check to ensure
it is not already claimed). And if nobody claims cleancache_ops,
the hooks should be as non-intrusive as possible.

Also note that the operations occur on the order of the number
of I/O's, so definitely a lot, but "zillion" may be a bit high. :-)

If you think this is a showstopper, it could be changed
to be bound only at compile-time, but then (I think) the claimer
could never be a dynamically-loadable module.
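
In code terms, a backend would claim the hooks with something like this
(sketch only; the mybackend_* functions and the -EBUSY policy are made-up,
not part of this patch):

#include <linux/module.h>
#include <linux/cleancache.h>

static struct cleancache_ops mybackend_ops = {
	.init_fs	= mybackend_init_fs,
	.init_shared_fs	= mybackend_init_shared_fs,
	.get_page	= mybackend_get_page,
	.put_page	= mybackend_put_page,
	.flush_page	= mybackend_flush_page,
	.flush_inode	= mybackend_flush_inode,
	.flush_fs	= mybackend_flush_fs,
};

static int __init mybackend_init(void)
{
	if (cleancache_ops != NULL)
		return -EBUSY;	/* another backend already claimed the hooks */
	cleancache_ops = &mybackend_ops;
	return 0;
}
module_init(mybackend_init);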

> All these undocumented functions would appear to be racy and buggy if
> the passed-in page isn't locked. But because they're undocumented, I
> don't know if "the page must be locked" was an API requirement and I
> ain't going to go and review all callers.

True. The passed-in pages are assumed to be locked and,
I believe, they are at all call sites. I'm not sure if
this is possible/easy, but maybe I can put a BUG_ON(!PageLocked(page))
in the routines to enforce and document this (and also
document it in prose elsewhere).
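
E.g. something like this at the top of each hook (sketch, not tested):

	VM_BUG_ON(!PageLocked(page));	/* callers must hold the page lock */

Using VM_BUG_ON() rather than BUG_ON() would keep it free when
CONFIG_DEBUG_VM is off, and the requirement would also be spelled out
in the block comment at the top of mm/cleancache.c.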

> Please completely document the sysfs API, preferably in the changelogs.
> It's the first thing reviewers should look at, because it's one thing
> we can never change. And Documentation/ABI/ is a place for permanent
> documentation.

OK, will do.
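
For example, one entry might look like this (wording is just a sketch):

What:		/sys/kernel/mm/cleancache/succ_gets
Date:		June 2010
Description:
		Number of successful cleancache "gets", i.e. the number of
		times a page was found in cleancache and copied back into
		the page cache, avoiding a disk read.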

> I'm a bit surprised that cleancache and frontswap have their sticky
> fingers so deep inside swap and filesystems and the VFS.
>
> I'd have thought that the places where pages are added to the caches
> would be highly concentrated in the page-reclaim page eviction code,
> and that for reads the place where pages are retrieved would be at the
> pagecache/swapcache <-> I/O boundary. Those transition points are
> reasonably narrow and seem to be the obvious site at which to interpose
> a cache, but it wasn't done that way.

Hmmm... I think those transition points are exactly where the
get/put/flush hooks are placed and I don't see how they can
be reduced.

The filesystem init hooks are almost entirely to allow different fs's
to "opt in" to cleancache, with one exception in btrfs since btrfs
goes around VFS in one case. And frontswap has one init call
per swap type (all in one place).

The core hooks for frontswap are also very few and brief. The
lengthy part of the patch is because the pages in frontswap are
persistent and must be managed with a (one-bit-per-page) data
structure.

Plus, there's a lot of patch-bulk due to the sysfs calls
and a lot of comments.

> In core MM there's been effort to treat swap-backed and file-backed pages
> in the same manner (indeed in a common manner) and that effort has been
> partially successful. These changes are going in the other direction.

But IMHO they are going the other direction for a very good
reason. Much of the value of cleancache comes from cleanly
separating clean pagecache pages from dirty pages. Frontswap
is always dealing with dirty pages.

But in any case, all the hooks are still very brief and if
swap_writepage/swap_readpage ever got merged into file-backed
MM code, there would need to be some test to differentiate
swap-backed pages from file-backed pages and the slightly
different frontswap-vs-cleancache calls would be in different
parts of the if/else, but I don't think it would otherwise
interfere with attempts to "treat [them] in the same manner".

> There have been any number of compressed-swap and compressed-file
> projects (if not compressed-pagecache). Where do cleancache/frontswap
> overlap those and which is superior?

The primary target of cleancache/frontswap isn't compression
(see below), though that is a nice feature that is provided
by the Xen Transcendent Memory implementation.

For kernel-only use, Nitin Gupta's position is that the cleancache
interface will work nicely for in-kernel compressed-pagecache.
He feels differently for frontswap though.

> And the big vague general issue: where's the value? What does all this
> code buy us? Why would we want to include it in Linux? When Aunt
> Tillie unwraps her shiny new kernel, what would she notice was
> different?

A fair question, so let me provide an honest answer. Like many
recent KVM changes, there is a very compelling reason to do this
for virtualized Linux and a less-compelling-but-still-possibly-
useful reason for non-virtualized Linux.

First non-virtualized (since I know you are less interested in
the virtualized case and I hope to keep your attention for a
bit longer): Cleancache/frontswap provide interfaces for
a new pseudo-RAM memory type that conceptually lies between
fast kernel-directly-addressable RAM and slower DMA/asynchronous
devices. Disallowing direct kernel or userland reads/writes
to this pseudo-RAM is ideal when data is transformed to a different
form and size (such as with compression) or secretly moved
(as might be useful for write-balancing for some RAM-like devices).
Evicted page-cache pages and swap pages are a great use for
this kind of slower-than-RAM-but-much-faster-than-disk
pseudo-RAM and the cleancache/frontswap "page-object-oriented"
specification provides a nice way to read and write and
indirectly identify the pages. There may be other uses too.

In the virtual case, the whole point of virtualization is to
statistically multiplex physical resources across the varying
demands of multiple virtual machines. This is really hard to
do with RAM and efforts to do it well with no kernel changes
have essentially failed (except in some well-publicized
special-case workloads). Cleancache and frontswap, with a
fairly small impact on the kernel, provide a huge amount of
flexibility for more dynamic, flexible RAM multiplexing.
(Think IBM's Collaborative Memory Management but much simpler.)

If you are interested in understanding this better, I can
go on with a lot more information, but that's it in a nutshell.

Thanks again!
Dan


2010-06-03 00:21:05

by Jeremy Fitzhardinge

Subject: Re: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

On 06/02/2010 05:06 PM, Dan Magenheimer wrote:
> It is intended that there be different flavours but only
> one can be used in any running kernel. A driver file/module
> claims the cleancache_ops pointer (and should check to ensure
> it is not already claimed). And if nobody claims cleancache_ops,
> the hooks should be as non-intrusive as possible.
>
> Also note that the operations occur on the order of the number
> of I/O's, so definitely a lot, but "zillion" may be a bit high. :-)
>
> If you think this is a showstopper, it could be changed
> to be bound only at compile-time, but then (I think) the claimer
> could never be a dynamically-loadable module.
>

Andrew is suggesting that rather than making cleancache_ops a pointer to
a structure, just make it a structure, so that calling a function is a
matter of cleancache_ops.func rather than cleancache_ops->func, thereby
avoiding a pointer dereference.
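
I.e., roughly (sketch of the suggestion, not the posted code):

struct cleancache_ops cleancache_ops;	/* all-NULL until a backend fills it in */
EXPORT_SYMBOL(cleancache_ops);

static inline int cleancache_get_page(struct page *page)
{
	int ret = 0;

	/* still one test, but on the function pointer itself */
	if (cleancache_ops.get_page)
		ret = __cleancache_get_page(page);
	return ret;
}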

J


2010-06-03 02:47:14

by Dan Magenheimer

Subject: RE: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

> > It is intended that there be different flavours but only
> > one can be used in any running kernel. A driver file/module
> > claims the cleancache_ops pointer (and should check to ensure
> > it is not already claimed). And if nobody claims cleancache_ops,
> > the hooks should be as non-intrusive as possible.
> >
> > Also note that the operations occur on the order of the number
> > of I/O's, so definitely a lot, but "zillion" may be a bit high. :-)
> >
> > If you think this is a showstopper, it could be changed
> > to be bound only at compile-time, but then (I think) the claimer
> > could never be a dynamically-loadable module.
>
> Andrew is suggesting that rather than making cleancache_ops a pointer to
> a structure, just make it a structure, so that calling a function is a
> matter of cleancache_ops.func rather than cleancache_ops->func, thereby
> avoiding a pointer dereference.

OK, I see. So the claimer of the cleancache_ops structure
just fills in all of the func fields individually? That
would work too. IIUC it wouldn't save any instructions
when cleancache_ops is unclaimed because it is still necessary
to check a func pointer against NULL, but would save an extra
pointer indirection and possible cache miss for every use
of any func when it is claimed.

I'll change that for next rev.

Thanks and sorry I misunderstood!
Dan


2010-06-10 01:41:51

by Nitin Gupta

Subject: Re: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

Hi,

On 05/28/2010 11:05 PM, Dan Magenheimer wrote:
> [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

I just finished a rough (but working) implementation of in-kernel
page cache compression backend (called zcache). During this work,
I found some issues with cleancache, mostly related to (lack of)
comments/documentation:


> +
> +static inline int cleancache_init_fs(size_t pagesize)
> +

- It is not very obvious that this function is called when
an instance of a cleancache-supported filesystem is *mounted*.
Initially, I thought it was called when any such filesystem
module is loaded.

- It seems that returning a pool_id of 0 is considered an error
condition (as it appears from the deactivate_locked_super() changes).
This seems weird; I think only a negative pool_id should be considered
an error. Anyway, please add function comments for these.

> +int __cleancache_get_page(struct page *page)
> +{
> + int ret = 0;
> + int pool_id = page->mapping->host->i_sb->cleancache_poolid;
> +
> + if (pool_id >= 0) {
> + ret = (*cleancache_ops->get_page)(pool_id,
> + page->mapping->host->i_ino,
> + page->index,
> + page);
> + if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
> + succ_gets++;
> + else
> + failed_gets++;
> + }
> + return ret;
> +}

It seems "non-standard" to use '1' as success code. You could simply use
0 for success and negative error code as failure. Then you can also get
rid of CLEANCACHE_GET_PAGE_SUCCESS.
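
i.e., something like (sketch of the suggested convention, not the posted
code):

int __cleancache_get_page(struct page *page)
{
	int ret = -ENOENT;	/* not in cleancache (or no backend) */
	int pool_id = page->mapping->host->i_sb->cleancache_poolid;

	if (pool_id >= 0) {
		ret = (*cleancache_ops->get_page)(pool_id,
						  page->mapping->host->i_ino,
						  page->index, page);
		if (ret == 0)	/* 0 == success, no magic constant */
			succ_gets++;
		else
			failed_gets++;
	}
	return ret;
}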

> +
> +int __cleancache_put_page(struct page *page)

What return value stands for a successful put? 1? Anyway, the same
convention, 0 for success and negative codes for errors, seems better.

> +
> +int __cleancache_flush_page(struct address_space *mapping, struct page *page)

> +int __cleancache_flush_inode(struct address_space *mapping)

The return values of all the flush functions are ignored everywhere, so
why not make them return void instead?
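
i.e. (sketch):

void __cleancache_flush_page(struct address_space *mapping, struct page *page);
void __cleancache_flush_inode(struct address_space *mapping);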

> +static inline void cleancache_flush_fs(int pool_id)

Like init_fs, please document that it is called when a
cleancache-aware filesystem is unmounted (or in other cases too?).



Page cache compression was a long-pending project. I'm glad it's
coming into shape with the help of cleancache :)

Thanks,
Nitin


2010-06-10 03:28:21

by Dan Magenheimer

Subject: RE: [PATCH V2 2/7] Cleancache (was Transcendent Memory): core files

> I just finished a rough (but working) implementation of in-kernel
> page cache compression backend (called zcache). During this work,
> I found some issues with cleancache, mostly related to (lack of)
> comments/documentation:

Great to hear! And excellent feedback on the missing
documentation... I am working on this right now so your
feedback is very timely.

(documentation and function-return-value comments deleted,
as I will fix all of them)

> > +
> > +static inline int cleancache_init_fs(size_t pagesize)
> > +
>
> - It seems that returning pool_id of 0 is considered as error
> condition (as it appears from deactivate_locked_super() changes).
> This seems weird; I think only negative pool_id should considered
> as error. Anyway, please add function comments for these.

Hmmm... this is a bug. 0 is a valid pool_id. I'll fix it
for the next rev.

> Page cache compression was a long-pending project. I'm glad it's
> coming into shape with the help of cleancache :)

Thanks!
Dan
