2010-07-16 12:37:32

by Nitin Gupta

Subject: [PATCH 0/8] zcache: page cache compression support

Frequently accessed filesystem data is cached in memory to reduce accesses to
(much) slower backing disks. Under memory pressure, these pages are freed and,
when needed again, have to be read back from disk. When the combined working
set of all running applications exceeds the amount of physical RAM, we get an
extreme slowdown, since reading a page from disk takes on the order of
milliseconds.

Memory compression increases the effective memory size and allows more pages to
stay in RAM. Since de/compressing memory pages is several orders of magnitude
faster than disk I/O, this can provide significant performance gains for many
workloads. Also, with multi-core systems becoming common, the benefit of reduced
disk I/O should easily outweigh the cost of increased CPU usage.

zcache is implemented as a "backend" for cleancache_ops [1], which provides
callbacks for events such as a page being removed from the page cache and a
page being required again. We use these callbacks to implement a 'second
chance' cache for evicted page cache pages by compressing them and storing
them in memory itself.

We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
stored using the xvmalloc memory allocator, which is already used by the zram
driver for the same purpose. Zero-filled pages are detected and no memory is
allocated for them.
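
For reference, the zero-fill check is just a word-by-word scan of the mapped
page. A minimal sketch of the idea (names are illustrative; the series adds
this as zcache_page_zero_filled() in the zero-page patch):

	static int page_is_zero_filled(const void *ptr)
	{
		const unsigned long *mem = ptr;
		unsigned int pos;

		/* Any non-zero word disqualifies the page. */
		for (pos = 0; pos < PAGE_SIZE / sizeof(*mem); pos++)
			if (mem[pos])
				return 0;
		return 1;
	}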

A separate "pool" is created for each mount instance of a cleancache-aware
filesystem. Each incoming page is identified by the tuple <pool_id, inode_no,
index>, where inode_no identifies a file within the filesystem corresponding to
pool_id and index is the offset of the page within this inode. Within a pool,
inodes are maintained in an rb-tree and each of its nodes points to a separate
radix tree which maintains the list of pages within that inode.
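
For reference, a simplified sketch of how a lookup resolves this tuple
(locking and refcounting are omitted, and the real 'get' path actually
removes the entry since this acts as an exclusive cache; zcache_find_inode()
is the helper used in the series, while the wrapper name below is only
illustrative):

	static struct page *zcache_lookup_page(int pool_id, ino_t inode_no,
					       pgoff_t index)
	{
		struct zcache_pool *zpool = zcache->pools[pool_id];
		struct zcache_inode_rb *znode;

		/* pool_id -> pool, inode_no -> rb-tree node, index -> radix slot */
		znode = zcache_find_inode(zpool, inode_no);
		if (!znode)
			return NULL;

		return radix_tree_lookup(&znode->page_tree, index);
	}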

While compression reduces disk I/O, it also reduces the space available for the
normal (uncompressed) page cache. This can result in more frequent page cache
reclaim and thus higher CPU overhead. It is therefore important to maintain a
good hit rate for the compressed cache, or the increased CPU overhead can
nullify any other benefits. This requires adaptive (compressed) cache resizing
and page replacement policies that can maintain an optimal cache size and
quickly reclaim unused compressed chunks. This work is yet to be done. In its
current state, however, the cache can be resized manually using the (per-pool)
sysfs node 'memlimit', which in turn frees any excess pages *sigh* randomly.

Finally, it uses percpu stats and compression buffers for better performance on
multi-core systems. Still, there are known bottlenecks, such as the single
xvmalloc mempool per zcache pool, and a few others. I will work on these when I
start profiling.

* Performance numbers:
- Tested using iozone filesystem benchmark
- 4 CPUs, 1G RAM
- Read performance gain: ~2.5X
- Random read performance gain: ~3X
- In general, performance gains for every kind of I/O

Test details with graphs can be found here:
http://code.google.com/p/compcache/wiki/zcacheIOzone

If I can get some help with testing, it would be interesting to find its
effect on more real-life workloads. In particular, I'm interested in finding
out its effect in the KVM virtualization case, where it can potentially allow
running more VMs per host for a given amount of RAM. With zcache enabled, VMs
can be assigned a much smaller amount of memory, since the host can now hold
the bulk of the page-cache pages, allowing VMs to maintain a similar level of
performance while a greater number of them can be hosted.

* How to test:
All patches are against 2.6.35-rc5:

- First, apply all prerequisite patches here:
http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches

- Then apply this patch series; also uploaded here:
http://compcache.googlecode.com/hg/sub-projects/zcache_patches


Nitin Gupta (8):
Allow sharing xvmalloc for zram and zcache
Basic zcache functionality
Create sysfs nodes and export basic statistics
Shrink zcache based on memlimit
Eliminate zero-filled pages
Compress pages using LZO
Use xvmalloc to store compressed chunks
Document sysfs entries

Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
drivers/staging/Makefile | 2 +
drivers/staging/zram/Kconfig | 22 +
drivers/staging/zram/Makefile | 5 +-
drivers/staging/zram/xvmalloc.c | 8 +
drivers/staging/zram/zcache_drv.c | 1312 ++++++++++++++++++++++
drivers/staging/zram/zcache_drv.h | 90 ++
7 files changed, 1491 insertions(+), 1 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
create mode 100644 drivers/staging/zram/zcache_drv.c
create mode 100644 drivers/staging/zram/zcache_drv.h


2010-07-16 12:37:49

by Nitin Gupta

Subject: [PATCH 2/8] Basic zcache functionality

Implements callback functions for cleancache_ops [1] to provide
page cache compression support. Along with other functionality,
cleancache provides callbacks for events such as when a page is
to be removed from the page cache and when it is required again.
We use them to implement a 'second chance' cache for these evicted
page cache pages by compressing them and storing them in memory itself.

zcache needs the xvmalloc allocator for storing variable-sized
compressed chunks, so it is placed in the same location as
the existing zram driver. This can become a problem later when
zram is moved over to the drivers/block/ area, since zcache itself
is not a block driver. I think a better solution would be to
move xvmalloc to lib/ and zcache to mm/?

This particular patch implements basic functionality only:
- No compression is done. Incoming pages are simply memcpy()'d
to newly allocated pages.
- Per-pool memory limit is hard-coded to 10% of RAM size.
- No statistics are exported.

[1] cleancache: http://lwn.net/Articles/393013/

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/Makefile | 1 +
drivers/staging/zram/Kconfig | 17 +
drivers/staging/zram/Makefile | 2 +
drivers/staging/zram/zcache_drv.c | 731 +++++++++++++++++++++++++++++++++++++
drivers/staging/zram/zcache_drv.h | 63 ++++
5 files changed, 814 insertions(+), 0 deletions(-)
create mode 100644 drivers/staging/zram/zcache_drv.c
create mode 100644 drivers/staging/zram/zcache_drv.h

diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 6de8564..3c1d91b 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_MRST_RAR_HANDLER) += memrar/
obj-$(CONFIG_DX_SEP) += sep/
obj-$(CONFIG_IIO) += iio/
obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_ZCACHE) += zram/
obj-$(CONFIG_XVMALLOC) += zram/
obj-$(CONFIG_WLAGS49_H2) += wlags49_h2/
obj-$(CONFIG_WLAGS49_H25) += wlags49_h25/
diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig
index 9bf26ce..f9c8224 100644
--- a/drivers/staging/zram/Kconfig
+++ b/drivers/staging/zram/Kconfig
@@ -32,3 +32,20 @@ config ZRAM_STATS

If unsure, say Y.

+config ZCACHE
+ bool "Page cache compression support"
+ depends on CLEANCACHE
+ select XVMALLOC
+ select LZO_COMPRESS
+ select LZO_DECOMPRESS
+ default n
+ help
+ Compresses relatively unused page cache pages and stores them in
+ memory itself. This increases effective memory size and can help
+ reduce access to backing store device(s) which is typically much
+ slower than access to main memory.
+
+ Statistics are exported through the sysfs interface:
+ /sys/kernel/mm/zcache/
+
+ Project home: http://compcache.googlecode.com/
diff --git a/drivers/staging/zram/Makefile b/drivers/staging/zram/Makefile
index 9900f8b..ef05ee5 100644
--- a/drivers/staging/zram/Makefile
+++ b/drivers/staging/zram/Makefile
@@ -1,4 +1,6 @@
zram-objs := zram_drv.o
+zcache-objs := zcache_drv.o

obj-$(CONFIG_ZRAM) += zram.o
+obj-$(CONFIG_ZCACHE) += zcache.o
obj-$(CONFIG_XVMALLOC) += xvmalloc.o
diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
new file mode 100644
index 0000000..160c172
--- /dev/null
+++ b/drivers/staging/zram/zcache_drv.c
@@ -0,0 +1,731 @@
+/*
+ * Page cache compression support
+ *
+ * Copyright (C) 2010 Nitin Gupta
+ * Released under the terms of GNU General Public License Version 2.0
+ *
+ * Project home: http://compcache.googlecode.com/
+ *
+ * Design:
+ * zcache is implemented using 'cleancache' which provides callbacks
+ * (cleancache_ops) for events such as when a page is evicted from the
+ * (uncompressed) page cache (cleancache_ops.put_page), when it is
+ * needed again (cleancache_ops.get_page), when it needs to be freed
+ * (cleancache_ops.flush_page) and so on.
+ *
+ * A page in zcache is identified with tuple <pool_id, inode_no, index>
+ * - pool_id: For every cleancache aware filesystem mounted, zcache
+ * creates a separate pool (struct zcache_pool). This is container for
+ * all metadata/data structures needed to locate pages cached for this
+ * instance of mounted filesystem.
+ * - inode_no: This represents a file/inode within this filesystem. An
+ * inode is represented using struct zcache_inode_rb within zcache which
+ * are arranged in a red-black tree indexed using inode_no.
+ * - index: This represents page/index within an inode. Pages within an
+ * inode are arranged in a radix tree. So, each node of red-black tree
+ * above refers to a separate radix tree.
+ *
+ * Locking order:
+ * 1. zcache_inode_rb->tree_lock (spin_lock)
+ * 2. zcache_pool->tree_lock (rwlock_t: write)
+ *
+ * Nodes in an inode tree are reference counted and are freed when they
+ * do not hold any pages, i.e., when the corresponding radix tree, which
+ * maintains the list of pages associated with this inode, is empty.
+ */
+
+#define KMSG_COMPONENT "zcache"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/cleancache.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/u64_stats_sync.h>
+
+#include "zcache_drv.h"
+
+struct zcache *zcache;
+
+/*
+ * Individual percpu values can go negative but the sum across all CPUs
+ * must always be positive (we store various counts). So, return sum as
+ * unsigned value.
+ */
+static u64 zcache_get_stat(struct zcache_pool *zpool,
+ enum zcache_pool_stats_index idx)
+{
+ int cpu;
+ s64 val = 0;
+
+ for_each_possible_cpu(cpu) {
+ unsigned int start;
+ struct zcache_pool_stats_cpu *stats;
+
+ stats = per_cpu_ptr(zpool->stats, cpu);
+ do {
+ start = u64_stats_fetch_begin(&stats->syncp);
+ val += stats->count[idx];
+ } while (u64_stats_fetch_retry(&stats->syncp, start));
+ }
+
+ BUG_ON(val < 0);
+ return val;
+}
+
+static void zcache_add_stat(struct zcache_pool *zpool,
+ enum zcache_pool_stats_index idx, s64 val)
+{
+ struct zcache_pool_stats_cpu *stats;
+
+ preempt_disable();
+ stats = __this_cpu_ptr(zpool->stats);
+ u64_stats_update_begin(&stats->syncp);
+ stats->count[idx] += val;
+ u64_stats_update_end(&stats->syncp);
+ preempt_enable();
+
+}
+
+static void zcache_inc_stat(struct zcache_pool *zpool,
+ enum zcache_pool_stats_index idx)
+{
+ zcache_add_stat(zpool, idx, 1);
+}
+
+static void zcache_dec_stat(struct zcache_pool *zpool,
+ enum zcache_pool_stats_index idx)
+{
+ zcache_add_stat(zpool, idx, -1);
+}
+
+static u64 zcache_get_memlimit(struct zcache_pool *zpool)
+{
+ u64 memlimit;
+ unsigned int start;
+
+ do {
+ start = read_seqbegin(&zpool->memlimit_lock);
+ memlimit = zpool->memlimit;
+ } while (read_seqretry(&zpool->memlimit_lock, start));
+
+ return memlimit;
+}
+
+static void zcache_set_memlimit(struct zcache_pool *zpool, u64 memlimit)
+{
+ write_seqlock(&zpool->memlimit_lock);
+ zpool->memlimit = memlimit;
+ write_sequnlock(&zpool->memlimit_lock);
+}
+
+static void zcache_dump_stats(struct zcache_pool *zpool)
+{
+ enum zcache_pool_stats_index idx;
+
+ pr_debug("Dumping statistics for pool: %p\n", zpool);
+ pr_debug("%llu ", zpool->memlimit);
+ for (idx = 0; idx < ZPOOL_STAT_NSTATS; idx++)
+ pr_debug("%llu ", zcache_get_stat(zpool, idx));
+ pr_debug("\n");
+}
+
+static void zcache_destroy_pool(struct zcache_pool *zpool)
+{
+ int i;
+
+ if (!zpool)
+ return;
+
+ spin_lock(&zcache->pool_lock);
+ zcache->num_pools--;
+ for (i = 0; i < MAX_ZCACHE_POOLS; i++)
+ if (zcache->pools[i] == zpool)
+ break;
+ zcache->pools[i] = NULL;
+ spin_unlock(&zcache->pool_lock);
+
+ if (!RB_EMPTY_ROOT(&zpool->inode_tree)) {
+ pr_warn("Memory leak detected. Freeing non-empty pool!\n");
+ zcache_dump_stats(zpool);
+ }
+
+ free_percpu(zpool->stats);
+ kfree(zpool);
+}
+
+/*
+ * Allocate a new zcache pool and set default memlimit.
+ *
+ * Returns pool_id on success, negative error code otherwise.
+ */
+int zcache_create_pool(void)
+{
+ int ret;
+ u64 memlimit;
+ struct zcache_pool *zpool = NULL;
+
+ spin_lock(&zcache->pool_lock);
+ if (zcache->num_pools == MAX_ZCACHE_POOLS) {
+ spin_unlock(&zcache->pool_lock);
+ pr_info("Cannot create new pool (limit: %u)\n",
+ MAX_ZCACHE_POOLS);
+ ret = -EPERM;
+ goto out;
+ }
+ zcache->num_pools++;
+ spin_unlock(&zcache->pool_lock);
+
+ zpool = kzalloc(sizeof(*zpool), GFP_KERNEL);
+ if (!zpool) {
+ spin_lock(&zcache->pool_lock);
+ zcache->num_pools--;
+ spin_unlock(&zcache->pool_lock);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ zpool->stats = alloc_percpu(struct zcache_pool_stats_cpu);
+ if (!zpool->stats) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ rwlock_init(&zpool->tree_lock);
+ seqlock_init(&zpool->memlimit_lock);
+ zpool->inode_tree = RB_ROOT;
+
+ memlimit = zcache_pool_default_memlimit_perc_ram *
+ ((totalram_pages << PAGE_SHIFT) / 100);
+ memlimit &= PAGE_MASK;
+ zcache_set_memlimit(zpool, memlimit);
+
+ /* Add to pool list */
+ spin_lock(&zcache->pool_lock);
+ for (ret = 0; ret < MAX_ZCACHE_POOLS; ret++)
+ if (!zcache->pools[ret])
+ break;
+ zcache->pools[ret] = zpool;
+ spin_unlock(&zcache->pool_lock);
+
+out:
+ if (ret < 0)
+ zcache_destroy_pool(zpool);
+
+ return ret;
+}
+
+/*
+ * Allocate a new zcache node and insert it in given pool's
+ * inode_tree at location 'inode_no'.
+ *
+ * On success, returns newly allocated node and increments
+ * its refcount for caller. Returns NULL on failure.
+ */
+static struct zcache_inode_rb *zcache_inode_create(int pool_id,
+ ino_t inode_no)
+{
+ unsigned long flags;
+ struct rb_node *parent, **link;
+ struct zcache_inode_rb *new_znode;
+ struct zcache_pool *zpool = zcache->pools[pool_id];
+
+ /*
+ * We can end up allocating multiple nodes due to racing
+ * zcache_put_page(). But only one will be added to zpool
+ * inode tree and the rest will be freed.
+ *
+ * To avoid this possibility of redundant allocation, we
+ * could do it inside zpool tree lock. However, that seems
+ * more wasteful.
+ */
+ new_znode = kzalloc(sizeof(*new_znode), GFP_NOWAIT);
+ if (unlikely(!new_znode))
+ return NULL;
+
+ INIT_RADIX_TREE(&new_znode->page_tree, GFP_NOWAIT);
+ spin_lock_init(&new_znode->tree_lock);
+ kref_init(&new_znode->refcount);
+ RB_CLEAR_NODE(&new_znode->rb_node);
+ new_znode->inode_no = inode_no;
+ new_znode->pool = zpool;
+
+ parent = NULL;
+ write_lock_irqsave(&zpool->tree_lock, flags);
+ link = &zpool->inode_tree.rb_node;
+ while (*link) {
+ struct zcache_inode_rb *znode;
+
+ znode = rb_entry(*link, struct zcache_inode_rb, rb_node);
+ parent = *link;
+
+ if (znode->inode_no > inode_no)
+ link = &(*link)->rb_left;
+ else if (znode->inode_no < inode_no)
+ link = &(*link)->rb_right;
+ else {
+ /*
+ * New node added by racing zcache_put_page(). Free
+ * this newly allocated node and use the existing one.
+ */
+ kfree(new_znode);
+ new_znode = znode;
+ goto out;
+ }
+ }
+
+ rb_link_node(&new_znode->rb_node, parent, link);
+ rb_insert_color(&new_znode->rb_node, &zpool->inode_tree);
+
+out:
+ kref_get(&new_znode->refcount);
+ write_unlock_irqrestore(&zpool->tree_lock, flags);
+
+ return new_znode;
+}
+
+/*
+ * Called under zcache_inode_rb->tree_lock
+ */
+static int zcache_inode_is_empty(struct zcache_inode_rb *znode)
+{
+ return znode->page_tree.rnode == NULL;
+}
+
+/*
+ * kref_put callback for zcache node.
+ *
+ * The node must have been isolated already.
+ */
+static void zcache_inode_release(struct kref *kref)
+{
+ struct zcache_inode_rb *znode;
+
+ znode = container_of(kref, struct zcache_inode_rb, refcount);
+ BUG_ON(!zcache_inode_is_empty(znode));
+ kfree(znode);
+}
+
+/*
+ * Removes the given node from its inode_tree and drops the corresponding
+ * refcount. It is called when someone removes the last page from a
+ * zcache node.
+ *
+ * Called under zcache_inode_rb->tree_lock
+ */
+static void zcache_inode_isolate(struct zcache_inode_rb *znode)
+{
+ unsigned long flags;
+ struct zcache_pool *zpool = znode->pool;
+
+ write_lock_irqsave(&zpool->tree_lock, flags);
+ /*
+ * Someone can get a reference to this node before we could
+ * acquire the write lock above. We want to remove it from its
+ * inode_tree only when the caller and the corresponding inode_tree
+ * hold references to it. This ensures that a racing zcache
+ * put will not end up adding a page to an isolated node and
+ * thereby losing that memory.
+ *
+ */
+ if (atomic_read(&znode->refcount.refcount) == 2) {
+ rb_erase(&znode->rb_node, &znode->pool->inode_tree);
+ RB_CLEAR_NODE(&znode->rb_node);
+ kref_put(&znode->refcount, zcache_inode_release);
+ }
+ write_unlock_irqrestore(&zpool->tree_lock, flags);
+}
+
+/*
+ * Find inode in the given pool at location 'inode_no'.
+ *
+ * If found, return the node pointer and increment its reference
+ * count; NULL otherwise.
+ */
+static struct zcache_inode_rb *zcache_find_inode(struct zcache_pool *zpool,
+ ino_t inode_no)
+{
+ unsigned long flags;
+ struct rb_node *node;
+
+ read_lock_irqsave(&zpool->tree_lock, flags);
+ node = zpool->inode_tree.rb_node;
+ while (node) {
+ struct zcache_inode_rb *znode;
+
+ znode = rb_entry(node, struct zcache_inode_rb, rb_node);
+ if (znode->inode_no > inode_no)
+ node = node->rb_left;
+ else if (znode->inode_no < inode_no)
+ node = node->rb_right;
+ else {
+ /* found */
+ kref_get(&znode->refcount);
+ read_unlock_irqrestore(&zpool->tree_lock, flags);
+ return znode;
+ }
+ }
+ read_unlock_irqrestore(&zpool->tree_lock, flags);
+
+ /* not found */
+ return NULL;
+}
+
+/*
+ * Allocate memory for storing the given page and insert
+ * it in the given node's page tree at location 'index'.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int zcache_store_page(struct zcache_inode_rb *znode,
+ pgoff_t index, struct page *page)
+{
+ int ret;
+ unsigned long flags;
+ struct page *zpage;
+ void *src_data, *dest_data;
+
+ zpage = alloc_page(GFP_NOWAIT);
+ if (!zpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ zpage->index = index;
+
+ src_data = kmap_atomic(page, KM_USER0);
+ dest_data = kmap_atomic(zpage, KM_USER1);
+ memcpy(dest_data, src_data, PAGE_SIZE);
+ kunmap_atomic(src_data, KM_USER0);
+ kunmap_atomic(dest_data, KM_USER1);
+
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ ret = radix_tree_insert(&znode->page_tree, index, zpage);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+ if (unlikely(ret))
+ __free_page(zpage);
+
+out:
+ return ret;
+}
+
+/*
+ * Frees all pages associated with the given zcache node.
+ * Code adapted from brd.c
+ *
+ * Called under zcache_inode_rb->tree_lock
+ */
+#define FREE_BATCH 16
+static void zcache_free_inode_pages(struct zcache_inode_rb *znode)
+{
+ int count;
+ unsigned long index = 0;
+ struct zcache_pool *zpool = znode->pool;
+
+ do {
+ int i;
+ struct page *pages[FREE_BATCH];
+
+ count = radix_tree_gang_lookup(&znode->page_tree,
+ (void **)pages, index, FREE_BATCH);
+
+ for (i = 0; i < count; i++) {
+ index = pages[i]->index;
+ radix_tree_delete(&znode->page_tree, index);
+ __free_page(pages[i]);
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ }
+
+ index++;
+ } while (count == FREE_BATCH);
+}
+
+/*
+ * cleancache_ops.init_fs
+ *
+ * Called whenever a cleancache aware filesystem is mounted.
+ * Creates and initializes a new zcache pool and inserts it
+ * in zcache pool list.
+ *
+ * Returns pool id on success, negative error code on failure.
+ */
+static int zcache_init_fs(size_t pagesize)
+{
+ int ret;
+
+ /*
+ * pagesize parameter probably makes sense only for Xen's
+ * cleancache_ops provider which runs inside guests, passing
+ * pages to the host. Since a guest might have a different
+ * page size than that of host (really?), they need to pass
+ * around this value.
+ *
+ * However, zcache runs on the host (or natively), so there
+ * is no point in supporting different page sizes.
+ */
+ if (pagesize != PAGE_SIZE) {
+ pr_info("Unsupported page size: %zu", pagesize);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = zcache_create_pool();
+ if (ret < 0) {
+ pr_info("Failed to create new pool\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+out:
+ return ret;
+}
+
+/*
+ * cleancache_ops.init_shared_fs
+ *
+ * Called whenever a cleancache aware clustered filesystem is mounted.
+ */
+static int zcache_init_shared_fs(char *uuid, size_t pagesize)
+{
+ /*
+ * In Xen's implementation, cleancache_ops provider runs in each
+ * guest, sending/receiving pages to/from the host. For each guest
+ * that participates in ocfs2 like cluster, the client passes the
+ * same uuid to the host. This allows the host to create a single
+ * "shared pool" for all such guests to allow for feature like
+ * data de-duplication among these guests.
+ *
+ * Whereas zcache runs directly on the host (or natively). So, for
+ * any shared resource like an ocfs2 mount point on the host, it
+ * implicitly creates a single pool. Thus, we can simply ignore this
+ * 'uuid' and treat it like a usual filesystem.
+ */
+ return zcache_init_fs(pagesize);
+}
+
+/*
+ * cleancache_ops.get_page
+ *
+ * Locates stored zcache page using <pool_id, inode_no, index>.
+ * If found, copies it to the given output page 'page' and frees
+ * zcache copy of the same.
+ *
+ * Returns 0 if requested page found, -1 otherwise.
+ */
+static int zcache_get_page(int pool_id, ino_t inode_no,
+ pgoff_t index, struct page *page)
+{
+ int ret = -1;
+ unsigned long flags;
+ struct page *src_page;
+ void *src_data, *dest_data;
+ struct zcache_inode_rb *znode;
+ struct zcache_pool *zpool = zcache->pools[pool_id];
+
+ znode = zcache_find_inode(zpool, inode_no);
+ if (!znode)
+ goto out;
+
+ BUG_ON(znode->inode_no != inode_no);
+
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ src_page = radix_tree_delete(&znode->page_tree, index);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+
+ kref_put(&znode->refcount, zcache_inode_release);
+
+ if (!src_page)
+ goto out;
+
+ src_data = kmap_atomic(src_page, KM_USER0);
+ dest_data = kmap_atomic(page, KM_USER1);
+ memcpy(dest_data, src_data, PAGE_SIZE);
+ kunmap_atomic(src_data, KM_USER0);
+ kunmap_atomic(dest_data, KM_USER1);
+
+ flush_dcache_page(page);
+
+ __free_page(src_page);
+
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ ret = 0; /* success */
+
+out:
+ return ret;
+}
+
+/*
+ * cleancache_ops.put_page
+ *
+ * Copies given input page 'page' to a newly allocated page.
+ * If allocation is successful, inserts it at zcache location
+ * <pool_id, inode_no, index>.
+ */
+static void zcache_put_page(int pool_id, ino_t inode_no,
+ pgoff_t index, struct page *page)
+{
+ int ret;
+ unsigned long flags;
+ struct page *zpage;
+ struct zcache_inode_rb *znode;
+ struct zcache_pool *zpool = zcache->pools[pool_id];
+
+ /*
+ * Incrementing local pages_stored before summing it from
+ * all CPUs makes sure we do not end up storing pages in
+ * excess of memlimit. In case of failure, we revert back
+ * this local increment.
+ */
+ zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+
+ if (zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED) >
+ zcache_get_memlimit(zpool) >> PAGE_SHIFT) {
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ return;
+ }
+
+ znode = zcache_find_inode(zpool, inode_no);
+ if (!znode) {
+ znode = zcache_inode_create(pool_id, inode_no);
+ if (!znode) {
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ return;
+ }
+ }
+
+ /* Free page that might already be present at this index */
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ zpage = radix_tree_delete(&znode->page_tree, index);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+ if (zpage) {
+ __free_page(zpage);
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ }
+
+ ret = zcache_store_page(znode, index, page);
+ if (ret) { /* failure */
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+
+ /*
+ * Its possible that racing zcache get/flush could not
+ * isolate this node since we held a reference to it.
+ */
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+ }
+
+ kref_put(&znode->refcount, zcache_inode_release);
+}
+
+/*
+ * cleancache_ops.flush_page
+ *
+ * Locates and frees the page at zcache location <pool_id, inode_no, index>.
+ */
+static void zcache_flush_page(int pool_id, ino_t inode_no, pgoff_t index)
+{
+ unsigned long flags;
+ struct page *page;
+ struct zcache_inode_rb *znode;
+ struct zcache_pool *zpool = zcache->pools[pool_id];
+
+ znode = zcache_find_inode(zpool, inode_no);
+ if (!znode)
+ return;
+
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ page = radix_tree_delete(&znode->page_tree, index);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+
+ kref_put(&znode->refcount, zcache_inode_release);
+ if (!page)
+ return;
+
+ __free_page(page);
+
+ zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+}
+
+/*
+ * cleancache_ops.flush_inode
+ *
+ * Free all pages associated with the given inode.
+ */
+static void zcache_flush_inode(int pool_id, ino_t inode_no)
+{
+ unsigned long flags;
+ struct zcache_pool *zpool;
+ struct zcache_inode_rb *znode;
+
+ zpool = zcache->pools[pool_id];
+
+ znode = zcache_find_inode(zpool, inode_no);
+ if (!znode)
+ return;
+
+ spin_lock_irqsave(&znode->tree_lock, flags);
+ zcache_free_inode_pages(znode);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+
+ kref_put(&znode->refcount, zcache_inode_release);
+}
+
+/*
+ * cleancache_ops.flush_fs
+ *
+ * Called whenever a cleancache aware filesystem is unmounted.
+ * Frees all metadata and data pages in corresponding zcache pool.
+ */
+static void zcache_flush_fs(int pool_id)
+{
+ struct rb_node *node;
+ struct zcache_inode_rb *znode;
+ struct zcache_pool *zpool = zcache->pools[pool_id];
+
+ /*
+ * At this point, there is no active I/O on this filesystem.
+ * So we can free all its pages without holding any locks.
+ */
+ node = rb_first(&zpool->inode_tree);
+ while (node) {
+ znode = rb_entry(node, struct zcache_inode_rb, rb_node);
+ node = rb_next(node);
+ zcache_free_inode_pages(znode);
+ rb_erase(&znode->rb_node, &zpool->inode_tree);
+ kfree(znode);
+ }
+
+ zcache_destroy_pool(zpool);
+}
+
+static int __init zcache_init(void)
+{
+ struct cleancache_ops ops = {
+ .init_fs = zcache_init_fs,
+ .init_shared_fs = zcache_init_shared_fs,
+ .get_page = zcache_get_page,
+ .put_page = zcache_put_page,
+ .flush_page = zcache_flush_page,
+ .flush_inode = zcache_flush_inode,
+ .flush_fs = zcache_flush_fs,
+ };
+
+ zcache = kzalloc(sizeof(*zcache), GFP_KERNEL);
+ if (!zcache)
+ return -ENOMEM;
+
+ spin_lock_init(&zcache->pool_lock);
+ cleancache_ops = ops;
+
+ return 0;
+}
+
+module_init(zcache_init);
diff --git a/drivers/staging/zram/zcache_drv.h b/drivers/staging/zram/zcache_drv.h
new file mode 100644
index 0000000..bfba5d7
--- /dev/null
+++ b/drivers/staging/zram/zcache_drv.h
@@ -0,0 +1,63 @@
+/*
+ * Page cache compression support
+ *
+ * Copyright (C) 2010 Nitin Gupta
+ * Released under the terms of GNU General Public License Version 2.0
+ *
+ * Project home: http://compcache.googlecode.com/
+ */
+
+#ifndef _ZCACHE_DRV_H_
+#define _ZCACHE_DRV_H_
+
+#include <linux/kref.h>
+#include <linux/radix-tree.h>
+#include <linux/rbtree.h>
+#include <linux/rwlock.h>
+#include <linux/spinlock.h>
+#include <linux/seqlock.h>
+#include <linux/types.h>
+
+#define MAX_ZCACHE_POOLS 32 /* arbitrary */
+
+enum zcache_pool_stats_index {
+ ZPOOL_STAT_PAGES_STORED,
+ ZPOOL_STAT_NSTATS,
+};
+
+/* Default zcache per-pool memlimit: 10% of total RAM */
+static const unsigned zcache_pool_default_memlimit_perc_ram = 10;
+
+/* Red-Black tree node. Maps inode to its page-tree */
+struct zcache_inode_rb {
+ struct radix_tree_root page_tree; /* maps inode index to page */
+ spinlock_t tree_lock; /* protects page_tree */
+ struct kref refcount;
+ struct rb_node rb_node;
+ ino_t inode_no;
+ struct zcache_pool *pool; /* back-reference to parent pool */
+};
+
+struct zcache_pool_stats_cpu {
+ s64 count[ZPOOL_STAT_NSTATS];
+ struct u64_stats_sync syncp;
+};
+
+/* One zcache pool per (cleancache aware) filesystem mount instance */
+struct zcache_pool {
+ struct rb_root inode_tree; /* maps inode number to page tree */
+ rwlock_t tree_lock; /* protects inode_tree */
+
+ seqlock_t memlimit_lock; /* protects memlimit */
+ u64 memlimit; /* bytes */
+ struct zcache_pool_stats_cpu *stats; /* percpu stats */
+};
+
+/* Manage all zcache pools */
+struct zcache {
+ struct zcache_pool *pools[MAX_ZCACHE_POOLS];
+ u32 num_pools; /* current no. of zcache pools */
+ spinlock_t pool_lock; /* protects pools[] and num_pools */
+};
+
+#endif
--
1.7.1.1

2010-07-16 12:37:57

by Nitin Gupta

Subject: [PATCH 4/8] Shrink zcache based on memlimit

User can change (per-pool) memlimit using sysfs node:
/sys/kernel/mm/zcache/pool<id>/memlimit

When memlimit is set to a value smaller than the current
number of pages allocated for that pool, excess pages are
now freed immediately instead of waiting for gets/flushes
on those pages.

Currently, victim page selection is essentially random.
Automatic cache resizing and better page replacement
policies will be implemented later.

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/zram/zcache_drv.c | 115 ++++++++++++++++++++++++++++++++++---
1 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
index f680f19..c5de65d 100644
--- a/drivers/staging/zram/zcache_drv.c
+++ b/drivers/staging/zram/zcache_drv.c
@@ -41,6 +41,7 @@
#include <linux/kernel.h>
#include <linux/cleancache.h>
#include <linux/highmem.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/u64_stats_sync.h>

@@ -416,7 +417,8 @@ out:
* Called under zcache_inode_rb->tree_lock
*/
#define FREE_BATCH 16
-static void zcache_free_inode_pages(struct zcache_inode_rb *znode)
+static void zcache_free_inode_pages(struct zcache_inode_rb *znode,
+ u32 pages_to_free)
{
int count;
unsigned long index = 0;
@@ -428,6 +430,8 @@ static void zcache_free_inode_pages(struct zcache_inode_rb *znode)

count = radix_tree_gang_lookup(&znode->page_tree,
(void **)pages, index, FREE_BATCH);
+ if (count > pages_to_free)
+ count = pages_to_free;

for (i = 0; i < count; i++) {
index = pages[i]->index;
@@ -437,7 +441,98 @@ static void zcache_free_inode_pages(struct zcache_inode_rb *znode)
}

index++;
- } while (count == FREE_BATCH);
+ pages_to_free -= count;
+ } while (pages_to_free && (count == FREE_BATCH));
+}
+
+/*
+ * Returns number of pages stored in excess of currently
+ * set memlimit for the given pool.
+ */
+static u32 zcache_count_excess_pages(struct zcache_pool *zpool)
+{
+ u32 excess_pages, memlimit_pages, pages_stored;
+
+ memlimit_pages = zcache_get_memlimit(zpool) >> PAGE_SHIFT;
+ pages_stored = zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ excess_pages = pages_stored > memlimit_pages ?
+ pages_stored - memlimit_pages : 0;
+
+ return excess_pages;
+}
+
+/*
+ * Free pages from this pool till we come within its memlimit.
+ *
+ * Currently, it is called only when the user sets memlimit lower than
+ * the number of pages currently stored in that pool. We select nodes in
+ * order of increasing inode number. This, in general, has no correlation
+ * with the order in which they were added, so it is essentially a random
+ * selection of nodes. Pages within a victim node are freed in order
+ * of increasing index number.
+ *
+ * Automatic cache resizing and better page replacement policies will
+ * be implemented later.
+ */
+static void zcache_shrink_pool(struct zcache_pool *zpool)
+{
+ struct rb_node *node;
+ struct zcache_inode_rb *znode;
+
+ read_lock(&zpool->tree_lock);
+ node = rb_first(&zpool->inode_tree);
+ if (unlikely(!node)) {
+ read_unlock(&zpool->tree_lock);
+ return;
+ }
+ znode = rb_entry(node, struct zcache_inode_rb, rb_node);
+ kref_get(&znode->refcount);
+ read_unlock(&zpool->tree_lock);
+
+ do {
+ u32 pages_to_free;
+ struct rb_node *next_node;
+ struct zcache_inode_rb *next_znode;
+
+ pages_to_free = zcache_count_excess_pages(zpool);
+ if (!pages_to_free) {
+ spin_lock(&znode->tree_lock);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock(&znode->tree_lock);
+
+ kref_put(&znode->refcount, zcache_inode_release);
+ break;
+ }
+
+ /*
+ * Get the next victim node before we (possibly) isolate
+ * the current node.
+ */
+ read_lock(&zpool->tree_lock);
+ next_node = rb_next(node);
+ next_znode = NULL;
+ if (next_node) {
+ next_znode = rb_entry(next_node,
+ struct zcache_inode_rb, rb_node);
+ kref_get(&next_znode->refcount);
+ }
+ read_unlock(&zpool->tree_lock);
+
+ spin_lock(&znode->tree_lock);
+ zcache_free_inode_pages(znode, pages_to_free);
+ if (zcache_inode_is_empty(znode))
+ zcache_inode_isolate(znode);
+ spin_unlock(&znode->tree_lock);
+
+ kref_put(&znode->refcount, zcache_inode_release);
+
+ /* Avoid busy-looping */
+ cond_resched();
+
+ node = next_node;
+ znode = next_znode;
+ } while (znode);
}

#ifdef CONFIG_SYSFS
@@ -476,10 +571,13 @@ static void memlimit_sysfs_common(struct kobject *kobj, u64 *value, int store)
{
struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);

- if (store)
+ if (store) {
zcache_set_memlimit(zpool, *value);
- else
+ if (zcache_count_excess_pages(zpool))
+ zcache_shrink_pool(zpool);
+ } else {
*value = zcache_get_memlimit(zpool);
+ }
}

static ssize_t memlimit_store(struct kobject *kobj,
@@ -687,9 +785,8 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
/*
* memlimit can be changed any time by user using sysfs. If
* it is set to a value smaller than current number of pages
- * stored, then excess pages are not freed immediately but
- * further puts are blocked till sufficient number of pages
- * are flushed/freed.
+ * stored, then excess pages are freed synchronously when this
+ * sysfs event occurs.
*/
if (zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED) >
zcache_get_memlimit(zpool) >> PAGE_SHIFT) {
@@ -781,7 +878,7 @@ static void zcache_flush_inode(int pool_id, ino_t inode_no)
return;

spin_lock_irqsave(&znode->tree_lock, flags);
- zcache_free_inode_pages(znode);
+ zcache_free_inode_pages(znode, UINT_MAX);
if (zcache_inode_is_empty(znode))
zcache_inode_isolate(znode);
spin_unlock_irqrestore(&znode->tree_lock, flags);
@@ -815,7 +912,7 @@ static void zcache_flush_fs(int pool_id)
while (node) {
znode = rb_entry(node, struct zcache_inode_rb, rb_node);
node = rb_next(node);
- zcache_free_inode_pages(znode);
+ zcache_free_inode_pages(znode, UINT_MAX);
rb_erase(&znode->rb_node, &zpool->inode_tree);
kfree(znode);
}
--
1.7.1.1

2010-07-16 12:38:04

by Nitin Gupta

Subject: [PATCH 5/8] Eliminate zero-filled pages

Checks if an incoming page is completely filled with zeroes.
For such pages, no additional memory is allocated except
for an entry in the corresponding radix tree. The number of
zero pages found (per pool) is exported through sysfs:
/sys/kernel/mm/zcache/pool<id>/zero_pages

When shrinking a pool -- for example, when memlimit is set
to a value less than the current number of pages stored --
only non-zero-filled pages are freed to bring the pool's memory
usage within memlimit. Since no memory is allocated for zero-filled
pages, they are not accounted towards the pool's memory usage.

To quickly identify non-zero pages while shrinking a pool, a
radix tree tag (ZCACHE_TAG_NONZERO_PAGE) is used.

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/zram/zcache_drv.c | 218 ++++++++++++++++++++++++++++++++----
drivers/staging/zram/zcache_drv.h | 10 ++
2 files changed, 203 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
index c5de65d..3ea45a6 100644
--- a/drivers/staging/zram/zcache_drv.c
+++ b/drivers/staging/zram/zcache_drv.c
@@ -47,6 +47,16 @@

#include "zcache_drv.h"

+/*
+ * For zero-filled pages, we directly insert the 'index' value
+ * in the corresponding radix node. These defines make sure we
+ * do not try to store a NULL value in the radix node (index can
+ * be 0) and that we do not use the lowest bit (which the radix
+ * tree uses for its own purposes).
+ */
+#define ZCACHE_ZERO_PAGE_INDEX_SHIFT 2
+#define ZCACHE_ZERO_PAGE_MARK_BIT (1 << 1)
+
struct zcache *zcache;

/*
@@ -101,6 +111,18 @@ static void zcache_dec_stat(struct zcache_pool *zpool,
zcache_add_stat(zpool, idx, -1);
}

+static void zcache_dec_pages(struct zcache_pool *zpool, int is_zero)
+{
+ enum zcache_pool_stats_index idx;
+
+ if (is_zero)
+ idx = ZPOOL_STAT_PAGES_ZERO;
+ else
+ idx = ZPOOL_STAT_PAGES_STORED;
+
+ zcache_dec_stat(zpool, idx);
+}
+
static u64 zcache_get_memlimit(struct zcache_pool *zpool)
{
u64 memlimit;
@@ -373,20 +395,94 @@ static struct zcache_inode_rb *zcache_find_inode(struct zcache_pool *zpool,
return NULL;
}

+static void zcache_handle_zero_page(struct page *page)
+{
+ void *user_mem;
+
+ user_mem = kmap_atomic(page, KM_USER0);
+ memset(user_mem, 0, PAGE_SIZE);
+ kunmap_atomic(user_mem, KM_USER0);
+
+ flush_dcache_page(page);
+}
+
+static int zcache_page_zero_filled(void *ptr)
+{
+ unsigned int pos;
+ unsigned long *page;
+
+ page = (unsigned long *)ptr;
+
+ for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) {
+ if (page[pos])
+ return 0;
+ }
+
+ return 1;
+}
+
+
+/*
+ * Identifies if the given radix node pointer actually refers
+ * to a zero-filled page.
+ */
+static int zcache_is_zero_page(void *ptr)
+{
+ return (unsigned long)(ptr) & ZCACHE_ZERO_PAGE_MARK_BIT;
+}
+
+/*
+ * Returns "pointer" value to be stored in radix node for
+ * zero-filled page at the given index.
+ */
+static void *zcache_index_to_ptr(unsigned long index)
+{
+ return (void *)((index << ZCACHE_ZERO_PAGE_INDEX_SHIFT) |
+ ZCACHE_ZERO_PAGE_MARK_BIT);
+}
+
+/*
+ * Returns index value encoded in the given radix node pointer.
+ */
+static unsigned long zcache_ptr_to_index(void *ptr)
+{
+ return (unsigned long)(ptr) >> ZCACHE_ZERO_PAGE_INDEX_SHIFT;
+}
+
+void zcache_free_page(struct zcache_pool *zpool, struct page *page)
+{
+ int is_zero;
+
+ if (unlikely(!page))
+ return;
+
+ is_zero = zcache_is_zero_page(page);
+ if (!is_zero)
+ __free_page(page);
+
+ zcache_dec_pages(zpool, is_zero);
+}
+
/*
* Allocate memory for storing the given page and insert
* it in the given node's page tree at location 'index'.
+ * Parameter 'is_zero' specifies if the page is zero-filled.
*
* Returns 0 on success, negative error code on failure.
*/
static int zcache_store_page(struct zcache_inode_rb *znode,
- pgoff_t index, struct page *page)
+ pgoff_t index, struct page *page, int is_zero)
{
int ret;
unsigned long flags;
struct page *zpage;
void *src_data, *dest_data;

+ if (is_zero) {
+ zpage = zcache_index_to_ptr(index);
+ goto out_store;
+ }
+
zpage = alloc_page(GFP_NOWAIT);
if (!zpage) {
ret = -ENOMEM;
@@ -400,21 +496,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
kunmap_atomic(src_data, KM_USER0);
kunmap_atomic(dest_data, KM_USER1);

+out_store:
spin_lock_irqsave(&znode->tree_lock, flags);
ret = radix_tree_insert(&znode->page_tree, index, zpage);
+ if (unlikely(ret)) {
+ spin_unlock_irqrestore(&znode->tree_lock, flags);
+ if (!is_zero)
+ __free_page(zpage);
+ goto out;
+ }
+ if (!is_zero)
+ radix_tree_tag_set(&znode->page_tree, index,
+ ZCACHE_TAG_NONZERO_PAGE);
spin_unlock_irqrestore(&znode->tree_lock, flags);
- if (unlikely(ret))
- __free_page(zpage);

out:
return ret;
}

/*
- * Frees all pages associated with the given zcache node.
- * Code adapted from brd.c
+ * Free 'pages_to_free' number of pages associated with the
+ * given zcache node. Actual number of pages freed might be
+ * less than this since the node might not contain enough
+ * pages.
*
* Called under zcache_inode_rb->tree_lock
+ * (Code adapted from brd.c)
*/
#define FREE_BATCH 16
static void zcache_free_inode_pages(struct zcache_inode_rb *znode,
@@ -434,10 +541,45 @@ static void zcache_free_inode_pages(struct zcache_inode_rb *znode,
count = pages_to_free;

for (i = 0; i < count; i++) {
+ if (zcache_is_zero_page(pages[i]))
+ index = zcache_ptr_to_index(pages[i]);
+ else
+ index = pages[i]->index;
+ radix_tree_delete(&znode->page_tree, index);
+ zcache_free_page(zpool, pages[i]);
+ }
+
+ index++;
+ pages_to_free -= count;
+ } while (pages_to_free && (count == FREE_BATCH));
+}
+
+/*
+ * Same as the previous function except that we only look for
+ * pages with the given tag set.
+ *
+ * Called under zcache_inode_rb->tree_lock
+ */
+static void zcache_free_inode_pages_tag(struct zcache_inode_rb *znode,
+ u32 pages_to_free, enum zcache_tag tag)
+{
+ int count;
+ unsigned long index = 0;
+ struct zcache_pool *zpool = znode->pool;
+
+ do {
+ int i;
+ struct page *pages[FREE_BATCH];
+
+ count = radix_tree_gang_lookup_tag(&znode->page_tree,
+ (void **)pages, index, FREE_BATCH, tag);
+ if (count > pages_to_free)
+ count = pages_to_free;
+
+ for (i = 0; i < count; i++) {
index = pages[i]->index;
+ zcache_free_page(zpool, pages[i]);
radix_tree_delete(&znode->page_tree, index);
- __free_page(pages[i]);
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
}

index++;
@@ -520,7 +662,9 @@ static void zcache_shrink_pool(struct zcache_pool *zpool)
read_unlock(&zpool->tree_lock);

spin_lock(&znode->tree_lock);
- zcache_free_inode_pages(znode, pages_to_free);
+ /* Free 'pages_to_free' non-zero pages in the current node */
+ zcache_free_inode_pages_tag(znode, pages_to_free,
+ ZCACHE_TAG_NONZERO_PAGE);
if (zcache_inode_is_empty(znode))
zcache_inode_isolate(znode);
spin_unlock(&znode->tree_lock);
@@ -557,6 +701,16 @@ static struct zcache_pool *zcache_kobj_to_pool(struct kobject *kobj)
return zcache->pools[i];
}

+static ssize_t zero_pages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
+
+ return sprintf(buf, "%llu\n", zcache_get_stat(
+ zpool, ZPOOL_STAT_PAGES_ZERO));
+}
+ZCACHE_POOL_ATTR_RO(zero_pages);
+
static ssize_t orig_data_size_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -607,6 +761,7 @@ static ssize_t memlimit_show(struct kobject *kobj,
ZCACHE_POOL_ATTR(memlimit);

static struct attribute *zcache_pool_attrs[] = {
+ &zero_pages_attr.attr,
&orig_data_size_attr.attr,
&memlimit_attr.attr,
NULL,
@@ -741,6 +896,11 @@ static int zcache_get_page(int pool_id, ino_t inode_no,
if (!src_page)
goto out;

+ if (zcache_is_zero_page(src_page)) {
+ zcache_handle_zero_page(page);
+ goto out_free;
+ }
+
src_data = kmap_atomic(src_page, KM_USER0);
dest_data = kmap_atomic(page, KM_USER1);
memcpy(dest_data, src_data, PAGE_SIZE);
@@ -749,9 +909,8 @@ static int zcache_get_page(int pool_id, ino_t inode_no,

flush_dcache_page(page);

- __free_page(src_page);
-
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+out_free:
+ zcache_free_page(zpool, src_page);
ret = 0; /* success */

out:
@@ -768,13 +927,27 @@ out:
static void zcache_put_page(int pool_id, ino_t inode_no,
pgoff_t index, struct page *page)
{
- int ret;
+ int ret, is_zero;
unsigned long flags;
struct page *zpage;
struct zcache_inode_rb *znode;
struct zcache_pool *zpool = zcache->pools[pool_id];

/*
+ * Check if the page is zero-filled. We do not allocate any
+ * memory for such pages and hence they do not contribute
+ * towards pool's memory usage. So, we can keep accepting
+ * such pages even after we have reached memlimit.
+ */
+ void *src_data = kmap_atomic(page, KM_USER0);
+ is_zero = zcache_page_zero_filled(src_data);
+ kunmap_atomic(src_data, KM_USER0);
+ if (is_zero) {
+ zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_ZERO);
+ goto out_find_store;
+ }
+
+ /*
* Incrementing local pages_stored before summing it from
* all CPUs makes sure we do not end up storing pages in
* excess of memlimit. In case of failure, we revert back
@@ -794,11 +967,12 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
return;
}

+out_find_store:
znode = zcache_find_inode(zpool, inode_no);
if (!znode) {
znode = zcache_inode_create(pool_id, inode_no);
- if (!znode) {
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ if (unlikely(!znode)) {
+ zcache_dec_pages(zpool, is_zero);
return;
}
}
@@ -807,14 +981,12 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
spin_lock_irqsave(&znode->tree_lock, flags);
zpage = radix_tree_delete(&znode->page_tree, index);
spin_unlock_irqrestore(&znode->tree_lock, flags);
- if (zpage) {
- __free_page(zpage);
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
- }
+ if (zpage)
+ zcache_free_page(zpool, zpage);

- ret = zcache_store_page(znode, index, page);
+ ret = zcache_store_page(znode, index, page, is_zero);
if (ret) { /* failure */
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ zcache_dec_pages(zpool, is_zero);

/*
* Its possible that racing zcache get/flush could not
@@ -852,12 +1024,8 @@ static void zcache_flush_page(int pool_id, ino_t inode_no, pgoff_t index)
spin_unlock_irqrestore(&znode->tree_lock, flags);

kref_put(&znode->refcount, zcache_inode_release);
- if (!page)
- return;
-
- __free_page(page);

- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ zcache_free_page(zpool, page);
}

/*
diff --git a/drivers/staging/zram/zcache_drv.h b/drivers/staging/zram/zcache_drv.h
index 808cfb2..1e8c931 100644
--- a/drivers/staging/zram/zcache_drv.h
+++ b/drivers/staging/zram/zcache_drv.h
@@ -22,13 +22,23 @@
#define MAX_ZPOOL_NAME_LEN 8 /* "pool"+id (shown in sysfs) */

enum zcache_pool_stats_index {
+ ZPOOL_STAT_PAGES_ZERO,
ZPOOL_STAT_PAGES_STORED,
ZPOOL_STAT_NSTATS,
};

+/* Radix-tree tags */
+enum zcache_tag {
+ ZCACHE_TAG_NONZERO_PAGE,
+ ZCACHE_TAG_UNUSED
+};
+
/* Default zcache per-pool memlimit: 10% of total RAM */
static const unsigned zcache_pool_default_memlimit_perc_ram = 10;

+ /* We only keep pages that compress to less than this size */
+static const unsigned zcache_max_page_size = PAGE_SIZE / 2;
+
/* Red-Black tree node. Maps inode to its page-tree */
struct zcache_inode_rb {
struct radix_tree_root page_tree; /* maps inode index to page */
--
1.7.1.1

2010-07-16 12:38:10

by Nitin Gupta

Subject: [PATCH 6/8] Compress pages using LZO

Pages are now compressed using LZO compression algorithm
and a new statistic is exported through sysfs:

/sys/kernel/mm/zcache/pool<id>/compr_data_size

This gives the compressed size of the pages stored, so we can
derive the compression ratio from this and the orig_data_size
statistic which is already exported.
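
For example, with hypothetical values of orig_data_size=400MB and
compr_data_size=160MB, the effective compression ratio would be 2.5:1.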

We only keep pages which compress to PAGE_SIZE/2 or less.
However, we still allocate full pages to store the
compressed chunks.
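
For example, on a system with 4K pages, a page that compresses to ~1.5K still
occupies a full page at this stage; the later patch in this series ("Use
xvmalloc to store compressed chunks") is what reclaims that slack.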

Another change is to enforce memlimit against the compressed
size instead of pages_stored (uncompressed size).

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/zram/zcache_drv.c | 254 +++++++++++++++++++++++++-----------
drivers/staging/zram/zcache_drv.h | 7 +-
2 files changed, 181 insertions(+), 80 deletions(-)

diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
index 3ea45a6..2a02606 100644
--- a/drivers/staging/zram/zcache_drv.c
+++ b/drivers/staging/zram/zcache_drv.c
@@ -40,13 +40,18 @@
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/cleancache.h>
+#include <linux/cpu.h>
#include <linux/highmem.h>
+#include <linux/lzo.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/u64_stats_sync.h>

#include "zcache_drv.h"

+static DEFINE_PER_CPU(unsigned char *, compress_buffer);
+static DEFINE_PER_CPU(unsigned char *, compress_workmem);
+
/*
* For zero-filled pages, we directly insert 'index' value
* in corresponding radix node. These defines make sure we
@@ -96,7 +101,6 @@ static void zcache_add_stat(struct zcache_pool *zpool,
stats->count[idx] += val;
u64_stats_update_end(&stats->syncp);
preempt_enable();
-
}

static void zcache_inc_stat(struct zcache_pool *zpool,
@@ -442,11 +446,20 @@ static void *zcache_index_to_ptr(unsigned long index)
}

/*
- * Returns index value encoded in the given radix node pointer.
+ * Radix node contains "pointer" value which encode <page, offset>
+ * pair, locating the compressed object. Header of the object then
+ * contains corresponding 'index' value.
*/
-static unsigned long zcache_ptr_to_index(void *ptr)
+static unsigned long zcache_ptr_to_index(struct page *page)
{
- return (unsigned long)(ptr) >> ZCACHE_ZERO_PAGE_INDEX_SHIFT;
+ unsigned long index;
+
+ if (zcache_is_zero_page(page))
+ index = (unsigned long)(page) >> ZCACHE_ZERO_PAGE_INDEX_SHIFT;
+ else
+ index = page->index;
+
+ return index;
}

void zcache_free_page(struct zcache_pool *zpool, struct page *page)
@@ -457,8 +470,12 @@ void zcache_free_page(struct zcache_pool *zpool, struct page *page)
return;

is_zero = zcache_is_zero_page(page);
- if (!is_zero)
+ if (!is_zero) {
+ int clen = page->private;
+
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE, -clen);
__free_page(page);
+ }

zcache_dec_pages(zpool, is_zero);
}
@@ -474,9 +491,12 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
pgoff_t index, struct page *page, int is_zero)
{
int ret;
+ size_t clen;
unsigned long flags;
struct page *zpage;
- void *src_data, *dest_data;
+ unsigned char *zbuffer, *zworkmem;
+ unsigned char *src_data, *dest_data;
+ struct zcache_pool *zpool = znode->pool;

if (is_zero) {
zpage = zcache_index_to_ptr(index);
@@ -488,13 +508,33 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
ret = -ENOMEM;
goto out;
}
- zpage->index = index;
+
+ preempt_disable();
+ zbuffer = __get_cpu_var(compress_buffer);
+ zworkmem = __get_cpu_var(compress_workmem);
+ if (unlikely(!zbuffer || !zworkmem)) {
+ ret = -EFAULT;
+ preempt_enable();
+ goto out;
+ }

src_data = kmap_atomic(page, KM_USER0);
- dest_data = kmap_atomic(zpage, KM_USER1);
- memcpy(dest_data, src_data, PAGE_SIZE);
+ ret = lzo1x_1_compress(src_data, PAGE_SIZE, zbuffer, &clen, zworkmem);
kunmap_atomic(src_data, KM_USER0);
- kunmap_atomic(dest_data, KM_USER1);
+
+ if (unlikely(ret != LZO_E_OK) || clen > zcache_max_page_size) {
+ ret = -EINVAL;
+ preempt_enable();
+ goto out;
+ }
+
+ dest_data = kmap_atomic(zpage, KM_USER0);
+ memcpy(dest_data, zbuffer, clen);
+ kunmap_atomic(dest_data, KM_USER0);
+ preempt_enable();
+
+ zpage->index = index;
+ zpage->private = clen;

out_store:
spin_lock_irqsave(&znode->tree_lock, flags);
@@ -505,11 +545,19 @@ out_store:
__free_page(zpage);
goto out;
}
- if (!is_zero)
+ if (is_zero) {
+ zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_ZERO);
+ } else {
+ int delta = zcache_max_page_size - clen;
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE, -delta);
+ zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_STORED);
radix_tree_tag_set(&znode->page_tree, index,
ZCACHE_TAG_NONZERO_PAGE);
+ }
spin_unlock_irqrestore(&znode->tree_lock, flags);

+ ret = 0; /* success */
+
out:
return ret;
}
@@ -525,42 +573,6 @@ out:
*/
#define FREE_BATCH 16
static void zcache_free_inode_pages(struct zcache_inode_rb *znode,
- u32 pages_to_free)
-{
- int count;
- unsigned long index = 0;
- struct zcache_pool *zpool = znode->pool;
-
- do {
- int i;
- struct page *pages[FREE_BATCH];
-
- count = radix_tree_gang_lookup(&znode->page_tree,
- (void **)pages, index, FREE_BATCH);
- if (count > pages_to_free)
- count = pages_to_free;
-
- for (i = 0; i < count; i++) {
- if (zcache_is_zero_page(pages[i]))
- index = zcache_ptr_to_index(pages[i]);
- else
- index = pages[i]->index;
- radix_tree_delete(&znode->page_tree, index);
- zcache_free_page(zpool, pages[i]);
- }
-
- index++;
- pages_to_free -= count;
- } while (pages_to_free && (count == FREE_BATCH));
-}
-
-/*
- * Same as the previous function except that we only look for
- * pages with the given tag set.
- *
- * Called under zcache_inode_rb->tree_lock
- */
-static void zcache_free_inode_pages_tag(struct zcache_inode_rb *znode,
u32 pages_to_free, enum zcache_tag tag)
{
int count;
@@ -569,17 +581,26 @@ static void zcache_free_inode_pages_tag(struct zcache_inode_rb *znode,

do {
int i;
- struct page *pages[FREE_BATCH];
+ void *objs[FREE_BATCH];
+
+ if (tag == ZCACHE_TAG_INVALID)
+ count = radix_tree_gang_lookup(&znode->page_tree,
+ objs, index, FREE_BATCH);
+ else
+ count = radix_tree_gang_lookup_tag(&znode->page_tree,
+ objs, index, FREE_BATCH, tag);

- count = radix_tree_gang_lookup_tag(&znode->page_tree,
- (void **)pages, index, FREE_BATCH, tag);
if (count > pages_to_free)
count = pages_to_free;

for (i = 0; i < count; i++) {
- index = pages[i]->index;
- zcache_free_page(zpool, pages[i]);
- radix_tree_delete(&znode->page_tree, index);
+ void *obj;
+ unsigned long index;
+
+ index = zcache_ptr_to_index(objs[i]);
+ obj = radix_tree_delete(&znode->page_tree, index);
+ BUG_ON(obj != objs[i]);
+ zcache_free_page(zpool, obj);
}

index++;
@@ -663,7 +684,7 @@ static void zcache_shrink_pool(struct zcache_pool *zpool)

spin_lock(&znode->tree_lock);
/* Free 'pages_to_free' non-zero pages in the current node */
- zcache_free_inode_pages_tag(znode, pages_to_free,
+ zcache_free_inode_pages(znode, pages_to_free,
ZCACHE_TAG_NONZERO_PAGE);
if (zcache_inode_is_empty(znode))
zcache_inode_isolate(znode);
@@ -721,6 +742,16 @@ static ssize_t orig_data_size_show(struct kobject *kobj,
}
ZCACHE_POOL_ATTR_RO(orig_data_size);

+static ssize_t compr_data_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
+
+ return sprintf(buf, "%llu\n", zcache_get_stat(
+ zpool, ZPOOL_STAT_COMPR_SIZE));
+}
+ZCACHE_POOL_ATTR_RO(compr_data_size);
+
static void memlimit_sysfs_common(struct kobject *kobj, u64 *value, int store)
{
struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
@@ -763,6 +794,7 @@ ZCACHE_POOL_ATTR(memlimit);
static struct attribute *zcache_pool_attrs[] = {
&zero_pages_attr.attr,
&orig_data_size_attr.attr,
+ &compr_data_size_attr.attr,
&memlimit_attr.attr,
NULL,
};
@@ -867,21 +899,25 @@ static int zcache_init_shared_fs(char *uuid, size_t pagesize)
* If found, copies it to the given output page 'page' and frees
* zcache copy of the same.
*
- * Returns 0 if requested page found, -1 otherwise.
+ * Returns 0 on success, negative error code on failure.
*/
static int zcache_get_page(int pool_id, ino_t inode_no,
pgoff_t index, struct page *page)
{
int ret = -1;
+ size_t clen;
unsigned long flags;
struct page *src_page;
- void *src_data, *dest_data;
+ unsigned char *src_data, *dest_data;
+
struct zcache_inode_rb *znode;
struct zcache_pool *zpool = zcache->pools[pool_id];

znode = zcache_find_inode(zpool, inode_no);
- if (!znode)
+ if (!znode) {
+ ret = -EFAULT;
goto out;
+ }

BUG_ON(znode->inode_no != inode_no);

@@ -893,20 +929,30 @@ static int zcache_get_page(int pool_id, ino_t inode_no,

kref_put(&znode->refcount, zcache_inode_release);

- if (!src_page)
+ if (!src_page) {
+ ret = -EFAULT;
goto out;
+ }

if (zcache_is_zero_page(src_page)) {
zcache_handle_zero_page(page);
goto out_free;
}

+ clen = PAGE_SIZE;
src_data = kmap_atomic(src_page, KM_USER0);
dest_data = kmap_atomic(page, KM_USER1);
- memcpy(dest_data, src_data, PAGE_SIZE);
+
+ ret = lzo1x_decompress_safe(src_data, src_page->private,
+ dest_data, &clen);
+
kunmap_atomic(src_data, KM_USER0);
kunmap_atomic(dest_data, KM_USER1);

+ /* Failure here means bug in LZO! */
+ if (unlikely(ret != LZO_E_OK))
+ goto out_free;
+
flush_dcache_page(page);

out_free:
@@ -942,18 +988,16 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
void *src_data = kmap_atomic(page, KM_USER0);
is_zero = zcache_page_zero_filled(src_data);
kunmap_atomic(src_data, KM_USER0);
- if (is_zero) {
- zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_ZERO);
+ if (is_zero)
goto out_find_store;
- }

/*
- * Incrementing local pages_stored before summing it from
+ * Incrementing local compr_size before summing it from
* all CPUs makes sure we do not end up storing pages in
* excess of memlimit. In case of failure, we revert back
* this local increment.
*/
- zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE, zcache_max_page_size);

/*
* memlimit can be changed any time by user using sysfs. If
@@ -961,9 +1005,10 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
* stored, then excess pages are freed synchronously when this
* sysfs event occurs.
*/
- if (zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED) >
- zcache_get_memlimit(zpool) >> PAGE_SHIFT) {
- zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
+ if (zcache_get_stat(zpool, ZPOOL_STAT_COMPR_SIZE) >
+ zcache_get_memlimit(zpool)) {
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE,
+ -zcache_max_page_size);
return;
}

@@ -972,7 +1017,8 @@ out_find_store:
if (!znode) {
znode = zcache_inode_create(pool_id, inode_no);
if (unlikely(!znode)) {
- zcache_dec_pages(zpool, is_zero);
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE,
+ -zcache_max_page_size);
return;
}
}
@@ -985,9 +1031,9 @@ out_find_store:
zcache_free_page(zpool, zpage);

ret = zcache_store_page(znode, index, page, is_zero);
- if (ret) { /* failure */
- zcache_dec_pages(zpool, is_zero);
-
+ if (unlikely(ret)) {
+ zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE,
+ -zcache_max_page_size);
/*
* Its possible that racing zcache get/flush could not
* isolate this node since we held a reference to it.
@@ -1046,7 +1092,7 @@ static void zcache_flush_inode(int pool_id, ino_t inode_no)
return;

spin_lock_irqsave(&znode->tree_lock, flags);
- zcache_free_inode_pages(znode, UINT_MAX);
+ zcache_free_inode_pages(znode, UINT_MAX, ZCACHE_TAG_INVALID);
if (zcache_inode_is_empty(znode))
zcache_inode_isolate(znode);
spin_unlock_irqrestore(&znode->tree_lock, flags);
@@ -1080,7 +1126,7 @@ static void zcache_flush_fs(int pool_id)
while (node) {
znode = rb_entry(node, struct zcache_inode_rb, rb_node);
node = rb_next(node);
- zcache_free_inode_pages(znode, UINT_MAX);
+ zcache_free_inode_pages(znode, UINT_MAX, ZCACHE_TAG_INVALID);
rb_erase(&znode->rb_node, &zpool->inode_tree);
kfree(znode);
}
@@ -1088,8 +1134,47 @@ static void zcache_flush_fs(int pool_id)
zcache_destroy_pool(zpool);
}

+/*
+ * Callback for CPU hotplug events. Allocates percpu compression buffers.
+ */
+static int zcache_cpu_notify(struct notifier_block *nb, unsigned long action,
+ void *pcpu)
+{
+ int cpu = (long)pcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ per_cpu(compress_buffer, cpu) = (void *)__get_free_pages(
+ GFP_KERNEL | __GFP_ZERO, 1);
+ per_cpu(compress_workmem, cpu) = kzalloc(
+ LZO1X_MEM_COMPRESS, GFP_KERNEL);
+
+ break;
+ case CPU_DEAD:
+ case CPU_UP_CANCELED:
+ free_pages((unsigned long)(per_cpu(compress_buffer, cpu)), 1);
+ per_cpu(compress_buffer, cpu) = NULL;
+
+ kfree(per_cpu(compress_workmem, cpu));
+ per_cpu(compress_workmem, cpu) = NULL;
+
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block zcache_cpu_nb = {
+ .notifier_call = zcache_cpu_notify
+};
+
static int __init zcache_init(void)
{
+ int ret = -ENOMEM;
+ unsigned int cpu;
+
struct cleancache_ops ops = {
.init_fs = zcache_init_fs,
.init_shared_fs = zcache_init_shared_fs,
@@ -1102,20 +1187,33 @@ static int __init zcache_init(void)

zcache = kzalloc(sizeof(*zcache), GFP_KERNEL);
if (!zcache)
- return -ENOMEM;
+ goto out;
+
+ ret = register_cpu_notifier(&zcache_cpu_nb);
+ if (ret)
+ goto out;
+
+ for_each_online_cpu(cpu) {
+ void *pcpu = (void *)(long)cpu;
+ zcache_cpu_notify(&zcache_cpu_nb, CPU_UP_PREPARE, pcpu);
+ }

#ifdef CONFIG_SYSFS
/* Create /sys/kernel/mm/zcache/ */
zcache->kobj = kobject_create_and_add("zcache", mm_kobj);
- if (!zcache->kobj) {
- kfree(zcache);
- return -ENOMEM;
- }
+ if (!zcache->kobj)
+ goto out;
#endif

spin_lock_init(&zcache->pool_lock);
cleancache_ops = ops;

+ ret = 0; /* success */
+
+out:
+ if (ret)
+ kfree(zcache);
+
- return 0;
+ return ret;
}

diff --git a/drivers/staging/zram/zcache_drv.h b/drivers/staging/zram/zcache_drv.h
index 1e8c931..9ce97da 100644
--- a/drivers/staging/zram/zcache_drv.h
+++ b/drivers/staging/zram/zcache_drv.h
@@ -24,20 +24,22 @@
enum zcache_pool_stats_index {
ZPOOL_STAT_PAGES_ZERO,
ZPOOL_STAT_PAGES_STORED,
+ ZPOOL_STAT_COMPR_SIZE,
ZPOOL_STAT_NSTATS,
};

/* Radix-tree tags */
enum zcache_tag {
ZCACHE_TAG_NONZERO_PAGE,
- ZCACHE_TAG_UNUSED
+ ZCACHE_TAG_UNUSED,
+ ZCACHE_TAG_INVALID
};

/* Default zcache per-pool memlimit: 10% of total RAM */
static const unsigned zcache_pool_default_memlimit_perc_ram = 10;

/* We only keep pages that compress to less than this size */
-static const unsigned zcache_max_page_size = PAGE_SIZE / 2;
+static const int zcache_max_page_size = PAGE_SIZE / 2;

/* Red-Black tree node. Maps inode to its page-tree */
struct zcache_inode_rb {
@@ -61,6 +63,7 @@ struct zcache_pool {

seqlock_t memlimit_lock; /* protects memlimit */
u64 memlimit; /* bytes */
+
struct zcache_pool_stats_cpu *stats; /* percpu stats */
#ifdef CONFIG_SYSFS
unsigned char name[MAX_ZPOOL_NAME_LEN];
--
1.7.1.1

2010-07-16 12:38:18

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 7/8] Use xvmalloc to store compressed chunks

xvmalloc is an O(1) memory allocator designed specifically
for storing variable-sized compressed chunks. It is already
used by the zram driver for the same purpose.

A new statistic is also exported:
/sys/kernel/mm/zcache/pool<id>/mem_used_total

This gives pool's total memory usage, including allocator
fragmentation and metadata overhead.

Currently, we use just one xvmalloc pool per zcache pool.
If this proves to be a performance bottleneck, they will
also be created per-cpu.

xvmalloc details, performance numbers and its comparison
with kmalloc (SLUB):

http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance
http://code.google.com/p/compcache/wiki/AllocatorsComparison
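
For readers skimming the diff, here is a condensed sketch of the store
path this patch introduces (not a drop-in function: names follow the
patch below, and locking, error handling and the per-cpu compression
buffer setup are omitted):

	/* 'clen' bytes of LZO output are in the per-cpu buffer 'zbuffer' */
	ret = xv_malloc(zpool->xv_pool, clen + sizeof(*zheader),
			&zpage, &zoffset, GFP_NOWAIT);
	if (ret)
		return -ENOMEM;

	dest_data = kmap_atomic(zpage, KM_USER0) + zoffset;
	zheader = (struct zcache_objheader *)dest_data;
	zheader->index = index;		/* object header records the page index */
	memcpy(dest_data + sizeof(*zheader), zbuffer, clen);
	kunmap_atomic(dest_data, KM_USER0);

	/* pack <page, offset> into the single value kept in the radix tree */
	nodeptr = zcache_xv_location_to_ptr(zpage, zoffset);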

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/zram/zcache_drv.c | 150 +++++++++++++++++++++++++++++-------
drivers/staging/zram/zcache_drv.h | 6 ++
2 files changed, 127 insertions(+), 29 deletions(-)

diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
index 2a02606..71ca48a 100644
--- a/drivers/staging/zram/zcache_drv.c
+++ b/drivers/staging/zram/zcache_drv.c
@@ -47,6 +47,7 @@
#include <linux/slab.h>
#include <linux/u64_stats_sync.h>

+#include "xvmalloc.h"
#include "zcache_drv.h"

static DEFINE_PER_CPU(unsigned char *, compress_buffer);
@@ -179,6 +180,7 @@ static void zcache_destroy_pool(struct zcache_pool *zpool)
}

free_percpu(zpool->stats);
+ xv_destroy_pool(zpool->xv_pool);
kfree(zpool);
}

@@ -219,6 +221,12 @@ int zcache_create_pool(void)
goto out;
}

+ zpool->xv_pool = xv_create_pool();
+ if (!zpool->xv_pool) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
rwlock_init(&zpool->tree_lock);
seqlock_init(&zpool->memlimit_lock);
zpool->inode_tree = RB_ROOT;
@@ -446,35 +454,81 @@ static void *zcache_index_to_ptr(unsigned long index)
}

/*
+ * Encode <page, offset> as a single "pointer" value which is stored
+ * in corresponding radix node.
+ */
+static void *zcache_xv_location_to_ptr(struct page *page, u32 offset)
+{
+ unsigned long ptrval;
+
+ ptrval = page_to_pfn(page) << PAGE_SHIFT;
+ ptrval |= (offset & ~PAGE_MASK);
+
+ return (void *)ptrval;
+}
+
+/*
+ * Decode <page, offset> pair from "pointer" value returned from
+ * radix tree lookup.
+ */
+static void zcache_ptr_to_xv_location(void *ptr, struct page **page,
+ u32 *offset)
+{
+ unsigned long ptrval = (unsigned long)ptr;
+
+ *page = pfn_to_page(ptrval >> PAGE_SHIFT);
+ *offset = ptrval & ~PAGE_MASK;
+}
+
+/*
* Radix node contains "pointer" value which encode <page, offset>
* pair, locating the compressed object. Header of the object then
* contains corresponding 'index' value.
*/
-static unsigned long zcache_ptr_to_index(struct page *page)
+static unsigned long zcache_ptr_to_index(void *ptr)
{
+ u32 offset;
+ struct page *page;
unsigned long index;
+ struct zcache_objheader *zheader;

- if (zcache_is_zero_page(page))
- index = (unsigned long)(page) >> ZCACHE_ZERO_PAGE_INDEX_SHIFT;
- else
- index = page->index;
+ if (zcache_is_zero_page(ptr))
+ return (unsigned long)(ptr) >> ZCACHE_ZERO_PAGE_INDEX_SHIFT;
+
+ zcache_ptr_to_xv_location(ptr, &page, &offset);
+
+ zheader = kmap_atomic(page, KM_USER0) + offset;
+ index = zheader->index;
+ kunmap_atomic(zheader, KM_USER0);

return index;
}

-void zcache_free_page(struct zcache_pool *zpool, struct page *page)
+void zcache_free_page(struct zcache_pool *zpool, void *ptr)
{
int is_zero;
+ unsigned long flags;

- if (unlikely(!page))
+ if (unlikely(!ptr))
return;

- is_zero = zcache_is_zero_page(page);
+ is_zero = zcache_is_zero_page(ptr);
if (!is_zero) {
- int clen = page->private;
+ int clen;
+ void *obj;
+ u32 offset;
+ struct page *page;
+
+ zcache_ptr_to_xv_location(ptr, &page, &offset);
+ obj = kmap_atomic(page, KM_USER0) + offset;
+ clen = xv_get_object_size(obj) -
+ sizeof(struct zcache_objheader);
+ kunmap_atomic(obj, KM_USER0);

zcache_add_stat(zpool, ZPOOL_STAT_COMPR_SIZE, -clen);
- __free_page(page);
+ local_irq_save(flags);
+ xv_free(zpool->xv_pool, page, offset);
+ local_irq_restore(flags);
}

zcache_dec_pages(zpool, is_zero);
@@ -491,24 +545,23 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
pgoff_t index, struct page *page, int is_zero)
{
int ret;
+ void *nodeptr;
size_t clen;
unsigned long flags;
+
+ u32 zoffset;
struct page *zpage;
unsigned char *zbuffer, *zworkmem;
unsigned char *src_data, *dest_data;
+
+ struct zcache_objheader *zheader;
struct zcache_pool *zpool = znode->pool;

if (is_zero) {
- zpage = zcache_index_to_ptr(index);
+ nodeptr = zcache_index_to_ptr(index);
goto out_store;
}

- zpage = alloc_page(GFP_NOWAIT);
- if (!zpage) {
- ret = -ENOMEM;
- goto out;
- }
-
preempt_disable();
zbuffer = __get_cpu_var(compress_buffer);
zworkmem = __get_cpu_var(compress_workmem);
@@ -528,17 +581,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
goto out;
}

- dest_data = kmap_atomic(zpage, KM_USER0);
+ local_irq_save(flags);
+ ret = xv_malloc(zpool->xv_pool, clen + sizeof(*zheader),
+ &zpage, &zoffset, GFP_NOWAIT);
+ local_irq_restore(flags);
+ if (unlikely(ret)) {
+ ret = -ENOMEM;
+ preempt_enable();
+ goto out;
+ }
+
+ dest_data = kmap_atomic(zpage, KM_USER0) + zoffset;
+
+ /* Store index value in header */
+ zheader = (struct zcache_objheader *)dest_data;
+ zheader->index = index;
+ dest_data += sizeof(*zheader);
+
memcpy(dest_data, zbuffer, clen);
kunmap_atomic(dest_data, KM_USER0);
preempt_enable();

- zpage->index = index;
- zpage->private = clen;
+ nodeptr = zcache_xv_location_to_ptr(zpage, zoffset);

out_store:
spin_lock_irqsave(&znode->tree_lock, flags);
- ret = radix_tree_insert(&znode->page_tree, index, zpage);
+ ret = radix_tree_insert(&znode->page_tree, index, nodeptr);
if (unlikely(ret)) {
spin_unlock_irqrestore(&znode->tree_lock, flags);
if (!is_zero)
@@ -752,6 +820,19 @@ static ssize_t compr_data_size_show(struct kobject *kobj,
}
ZCACHE_POOL_ATTR_RO(compr_data_size);

+/*
+ * Total memory used by this pool, including allocator fragmentation
+ * and metadata overhead.
+ */
+static ssize_t mem_used_total_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
+
+ return sprintf(buf, "%llu\n", xv_get_total_size_bytes(zpool->xv_pool));
+}
+ZCACHE_POOL_ATTR_RO(mem_used_total);
+
static void memlimit_sysfs_common(struct kobject *kobj, u64 *value, int store)
{
struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
@@ -795,6 +876,7 @@ static struct attribute *zcache_pool_attrs[] = {
&zero_pages_attr.attr,
&orig_data_size_attr.attr,
&compr_data_size_attr.attr,
+ &mem_used_total_attr.attr,
&memlimit_attr.attr,
NULL,
};
@@ -904,13 +986,17 @@ static int zcache_init_shared_fs(char *uuid, size_t pagesize)
static int zcache_get_page(int pool_id, ino_t inode_no,
pgoff_t index, struct page *page)
{
- int ret = -1;
+ int ret;
+ void *nodeptr;
size_t clen;
unsigned long flags;
+
+ u32 offset;
struct page *src_page;
unsigned char *src_data, *dest_data;

struct zcache_inode_rb *znode;
+ struct zcache_objheader *zheader;
struct zcache_pool *zpool = zcache->pools[pool_id];

znode = zcache_find_inode(zpool, inode_no);
@@ -922,29 +1008,35 @@ static int zcache_get_page(int pool_id, ino_t inode_no,
BUG_ON(znode->inode_no != inode_no);

spin_lock_irqsave(&znode->tree_lock, flags);
- src_page = radix_tree_delete(&znode->page_tree, index);
+ nodeptr = radix_tree_delete(&znode->page_tree, index);
if (zcache_inode_is_empty(znode))
zcache_inode_isolate(znode);
spin_unlock_irqrestore(&znode->tree_lock, flags);

kref_put(&znode->refcount, zcache_inode_release);

- if (!src_page) {
+ if (!nodeptr) {
ret = -EFAULT;
goto out;
}

- if (zcache_is_zero_page(src_page)) {
+ if (zcache_is_zero_page(nodeptr)) {
zcache_handle_zero_page(page);
goto out_free;
}

clen = PAGE_SIZE;
- src_data = kmap_atomic(src_page, KM_USER0);
+ zcache_ptr_to_xv_location(nodeptr, &src_page, &offset);
+
+ src_data = kmap_atomic(src_page, KM_USER0) + offset;
+ zheader = (struct zcache_objheader *)src_data;
+ BUG_ON(zheader->index != index);
+
dest_data = kmap_atomic(page, KM_USER1);

- ret = lzo1x_decompress_safe(src_data, src_page->private,
- dest_data, &clen);
+ ret = lzo1x_decompress_safe(src_data + sizeof(*zheader),
+ xv_get_object_size(src_data) - sizeof(*zheader),
+ dest_data, &clen);

kunmap_atomic(src_data, KM_USER0);
kunmap_atomic(dest_data, KM_USER1);
@@ -956,7 +1048,7 @@ static int zcache_get_page(int pool_id, ino_t inode_no,
flush_dcache_page(page);

out_free:
- zcache_free_page(zpool, src_page);
+ zcache_free_page(zpool, nodeptr);
ret = 0; /* success */

out:
diff --git a/drivers/staging/zram/zcache_drv.h b/drivers/staging/zram/zcache_drv.h
index 9ce97da..7283116 100644
--- a/drivers/staging/zram/zcache_drv.h
+++ b/drivers/staging/zram/zcache_drv.h
@@ -41,6 +41,11 @@ static const unsigned zcache_pool_default_memlimit_perc_ram = 10;
/* We only keep pages that compress to less than this size */
static const int zcache_max_page_size = PAGE_SIZE / 2;

+/* Stored in the beginning of each compressed object */
+struct zcache_objheader {
+ unsigned long index;
+};
+
/* Red-Black tree node. Maps inode to its page-tree */
struct zcache_inode_rb {
struct radix_tree_root page_tree; /* maps inode index to page */
@@ -64,6 +69,7 @@ struct zcache_pool {
seqlock_t memlimit_lock; /* protects memlimit */
u64 memlimit; /* bytes */

+ struct xv_pool *xv_pool; /* xvmalloc pool */
struct zcache_pool_stats_cpu *stats; /* percpu stats */
#ifdef CONFIG_SYSFS
unsigned char name[MAX_ZPOOL_NAME_LEN];
--
1.7.1.1

2010-07-16 12:38:22

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 8/8] Document sysfs entries

Signed-off-by: Nitin Gupta <[email protected]>
---
Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 ++++++++++++++++++++++
1 files changed, 53 insertions(+), 0 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-zcache b/Documentation/ABI/testing/sysfs-kernel-mm-zcache
new file mode 100644
index 0000000..7ee3f31
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-zcache
@@ -0,0 +1,53 @@
+What: /sys/kernel/mm/zcache
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ /sys/kernel/mm/zcache directory contains compressed cache
+ statistics for each pool. A separate pool is created for
+ every mount instance of cleancache-aware filesystems.
+
+What: /sys/kernel/mm/zcache/pool<id>/zero_pages
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ The zero_pages file is read-only and specifies the number of
+ zero-filled pages found in this pool.
+
+What: /sys/kernel/mm/zcache/pool<id>/orig_data_size
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ The orig_data_size file is read-only and specifies the uncompressed
+ size of data stored in this pool. This excludes zero-filled
+ pages (zero_pages) since no memory is allocated for them.
+ Unit: bytes
+
+What: /sys/kernel/mm/zcache/pool<id>/compr_data_size
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ The compr_data_size file is read-only and specifies the compressed
+ size of data stored in this pool. So, the compression ratio can be
+ calculated using orig_data_size and this statistic.
+ Unit: bytes
+
+What: /sys/kernel/mm/zcache/pool<id>/mem_used_total
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ The mem_used_total file is read-only and specifies the amount
+ of memory, including allocator fragmentation and metadata
+ overhead, allocated for this pool. So, allocator space
+ efficiency can be calculated using compr_data_size and this
+ statistic.
+ Unit: bytes
+
+What: /sys/kernel/mm/zcache/pool<id>/memlimit
+Date: July 2010
+Contact: Nitin Gupta <[email protected]>
+Description:
+ The memlimit file is read-write and specifies the upper bound on
+ the compressed data size (compr_data_size) stored in this pool.
+ The value written is rounded down to the nearest multiple of
+ PAGE_SIZE.
+ Unit: bytes
\ No newline at end of file
--
1.7.1.1

2010-07-16 12:37:47

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 1/8] Allow sharing xvmalloc for zram and zcache

Both zram and zcache use the xvmalloc allocator. If xvmalloc
is compiled separately for both of them, we get a linker
error when both are selected as "built-in".

So, we now compile xvmalloc separately and export its symbols,
which are then used by both zram and zcache.

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/Makefile | 1 +
drivers/staging/zram/Kconfig | 5 +++++
drivers/staging/zram/Makefile | 3 ++-
drivers/staging/zram/xvmalloc.c | 8 ++++++++
4 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 63baeee..6de8564 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_MRST_RAR_HANDLER) += memrar/
obj-$(CONFIG_DX_SEP) += sep/
obj-$(CONFIG_IIO) += iio/
obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_XVMALLOC) += zram/
obj-$(CONFIG_WLAGS49_H2) += wlags49_h2/
obj-$(CONFIG_WLAGS49_H25) += wlags49_h25/
obj-$(CONFIG_BATMAN_ADV) += batman-adv/
diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig
index 4654ae2..9bf26ce 100644
--- a/drivers/staging/zram/Kconfig
+++ b/drivers/staging/zram/Kconfig
@@ -1,6 +1,11 @@
+config XVMALLOC
+ bool
+ default n
+
config ZRAM
tristate "Compressed RAM block device support"
depends on BLOCK
+ select XVMALLOC
select LZO_COMPRESS
select LZO_DECOMPRESS
default n
diff --git a/drivers/staging/zram/Makefile b/drivers/staging/zram/Makefile
index b2c087a..9900f8b 100644
--- a/drivers/staging/zram/Makefile
+++ b/drivers/staging/zram/Makefile
@@ -1,3 +1,4 @@
-zram-objs := zram_drv.o xvmalloc.o
+zram-objs := zram_drv.o

obj-$(CONFIG_ZRAM) += zram.o
+obj-$(CONFIG_XVMALLOC) += xvmalloc.o
diff --git a/drivers/staging/zram/xvmalloc.c b/drivers/staging/zram/xvmalloc.c
index 3fdbb8a..3f94ef5 100644
--- a/drivers/staging/zram/xvmalloc.c
+++ b/drivers/staging/zram/xvmalloc.c
@@ -10,6 +10,8 @@
* Released under the terms of GNU General Public License Version 2.0
*/

+#include <linux/module.h>
+#include <linux/kernel.h>
#include <linux/bitops.h>
#include <linux/errno.h>
#include <linux/highmem.h>
@@ -320,11 +322,13 @@ struct xv_pool *xv_create_pool(void)

return pool;
}
+EXPORT_SYMBOL_GPL(xv_create_pool);

void xv_destroy_pool(struct xv_pool *pool)
{
kfree(pool);
}
+EXPORT_SYMBOL_GPL(xv_destroy_pool);

/**
* xv_malloc - Allocate block of given size from pool.
@@ -413,6 +417,7 @@ int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,

return 0;
}
+EXPORT_SYMBOL_GPL(xv_malloc);

/*
* Free block identified with <page, offset>
@@ -489,6 +494,7 @@ void xv_free(struct xv_pool *pool, struct page *page, u32 offset)
put_ptr_atomic(page_start, KM_USER0);
spin_unlock(&pool->lock);
}
+EXPORT_SYMBOL_GPL(xv_free);

u32 xv_get_object_size(void *obj)
{
@@ -497,6 +503,7 @@ u32 xv_get_object_size(void *obj)
blk = (struct block_header *)((char *)(obj) - XV_ALIGN);
return blk->size;
}
+EXPORT_SYMBOL_GPL(xv_get_object_size);

/*
* Returns total memory used by allocator (userdata + metadata)
@@ -505,3 +512,4 @@ u64 xv_get_total_size_bytes(struct xv_pool *pool)
{
return pool->total_pages << PAGE_SHIFT;
}
+EXPORT_SYMBOL_GPL(xv_get_total_size_bytes);
--
1.7.1.1

2010-07-16 12:39:26

by Nitin Gupta

[permalink] [raw]
Subject: [PATCH 3/8] Create sysfs nodes and export basic statistics

Creates per-pool sysfs nodes: /sys/kernel/mm/zcache/pool<id>/
(<id> = 0, 1, 2, ...) to export the following statistics:
- orig_data_size: Uncompressed size of data stored in the pool.
- memlimit: Memory limit of the pool. This also allows changing
it at runtime (default: 10% of RAM).

If memlimit is set to a value smaller than the current number
of pages stored, excess pages are not freed immediately; further
puts are blocked until a sufficient number of pages have been
flushed/freed.
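
As a rough illustration (condensed from the diff below; per-cpu stat
internals and locking omitted), the check added on the put path looks
like this:

	zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_STORED);

	/* Over memlimit? Undo the increment and drop the page. */
	if (zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED) >
			zcache_get_memlimit(zpool) >> PAGE_SHIFT) {
		zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
		return;
	}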

Signed-off-by: Nitin Gupta <[email protected]>
---
drivers/staging/zram/zcache_drv.c | 132 ++++++++++++++++++++++++++++++++++++-
drivers/staging/zram/zcache_drv.h | 8 ++
2 files changed, 137 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/zram/zcache_drv.c b/drivers/staging/zram/zcache_drv.c
index 160c172..f680f19 100644
--- a/drivers/staging/zram/zcache_drv.c
+++ b/drivers/staging/zram/zcache_drv.c
@@ -440,6 +440,85 @@ static void zcache_free_inode_pages(struct zcache_inode_rb *znode)
} while (count == FREE_BATCH);
}

+#ifdef CONFIG_SYSFS
+
+#define ZCACHE_POOL_ATTR_RO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+#define ZCACHE_POOL_ATTR(_name) \
+ static struct kobj_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct zcache_pool *zcache_kobj_to_pool(struct kobject *kobj)
+{
+ int i;
+
+ spin_lock(&zcache->pool_lock);
+ for (i = 0; i < MAX_ZCACHE_POOLS; i++)
+ if (zcache->pools[i]->kobj == kobj)
+ break;
+ spin_unlock(&zcache->pool_lock);
+
+ return zcache->pools[i];
+}
+
+static ssize_t orig_data_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
+
+ return sprintf(buf, "%llu\n", zcache_get_stat(
+ zpool, ZPOOL_STAT_PAGES_STORED) << PAGE_SHIFT);
+}
+ZCACHE_POOL_ATTR_RO(orig_data_size);
+
+static void memlimit_sysfs_common(struct kobject *kobj, u64 *value, int store)
+{
+ struct zcache_pool *zpool = zcache_kobj_to_pool(kobj);
+
+ if (store)
+ zcache_set_memlimit(zpool, *value);
+ else
+ *value = zcache_get_memlimit(zpool);
+}
+
+static ssize_t memlimit_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t len)
+{
+ int ret;
+ u64 memlimit;
+
+ ret = strict_strtoull(buf, 10, &memlimit);
+ if (ret)
+ return ret;
+
+ memlimit &= PAGE_MASK;
+ memlimit_sysfs_common(kobj, &memlimit, 1);
+
+ return len;
+}
+
+static ssize_t memlimit_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ u64 memlimit;
+
+ memlimit_sysfs_common(kobj, &memlimit, 0);
+ return sprintf(buf, "%llu\n", memlimit);
+}
+ZCACHE_POOL_ATTR(memlimit);
+
+static struct attribute *zcache_pool_attrs[] = {
+ &orig_data_size_attr.attr,
+ &memlimit_attr.attr,
+ NULL,
+};
+
+static struct attribute_group zcache_pool_attr_group = {
+ .attrs = zcache_pool_attrs,
+};
+#endif /* CONFIG_SYSFS */
+
/*
* cleancache_ops.init_fs
*
@@ -451,7 +530,8 @@ static void zcache_free_inode_pages(struct zcache_inode_rb *znode)
*/
static int zcache_init_fs(size_t pagesize)
{
- int ret;
+ int ret, pool_id;
+ struct zcache_pool *zpool = NULL;

/*
* pagesize parameter probably makes sense only for Xen's
@@ -469,14 +549,38 @@ static int zcache_init_fs(size_t pagesize)
goto out;
}

- ret = zcache_create_pool();
- if (ret < 0) {
+ pool_id = zcache_create_pool();
+ if (pool_id < 0) {
pr_info("Failed to create new pool\n");
ret = -ENOMEM;
goto out;
}
+ zpool = zcache->pools[pool_id];
+
+#ifdef CONFIG_SYSFS
+ snprintf(zpool->name, MAX_ZPOOL_NAME_LEN, "pool%d", pool_id);
+
+ /* Create /sys/kernel/mm/zcache/pool<id> (<id> = 0, 1, ...) */
+ zpool->kobj = kobject_create_and_add(zpool->name, zcache->kobj);
+ if (!zpool->kobj) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Create various nodes under /sys/.../pool<id>/ */
+ ret = sysfs_create_group(zpool->kobj, &zcache_pool_attr_group);
+ if (ret) {
+ kobject_put(zpool->kobj);
+ goto out;
+ }
+#endif
+
+ ret = pool_id; /* success */

out:
+ if (ret < 0) /* failure */
+ zcache_destroy_pool(zpool);
+
return ret;
}

@@ -580,6 +684,13 @@ static void zcache_put_page(int pool_id, ino_t inode_no,
*/
zcache_inc_stat(zpool, ZPOOL_STAT_PAGES_STORED);

+ /*
+ * memlimit can be changed any time by user using sysfs. If
+ * it is set to a value smaller than current number of pages
+ * stored, then excess pages are not freed immediately but
+ * further puts are blocked till sufficient number of pages
+ * are flushed/freed.
+ */
if (zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED) >
zcache_get_memlimit(zpool) >> PAGE_SHIFT) {
zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
@@ -690,6 +801,12 @@ static void zcache_flush_fs(int pool_id)
struct zcache_inode_rb *znode;
struct zcache_pool *zpool = zcache->pools[pool_id];

+#ifdef CONFIG_SYSFS
+ /* Remove per-pool sysfs entries */
+ sysfs_remove_group(zpool->kobj, &zcache_pool_attr_group);
+ kobject_put(zpool->kobj);
+#endif
+
/*
* At this point, there is no active I/O on this filesystem.
* So we can free all its pages without holding any locks.
@@ -722,6 +839,15 @@ static int __init zcache_init(void)
if (!zcache)
return -ENOMEM;

+#ifdef CONFIG_SYSFS
+ /* Create /sys/kernel/mm/zcache/ */
+ zcache->kobj = kobject_create_and_add("zcache", mm_kobj);
+ if (!zcache->kobj) {
+ kfree(zcache);
+ return -ENOMEM;
+ }
+#endif
+
spin_lock_init(&zcache->pool_lock);
cleancache_ops = ops;

diff --git a/drivers/staging/zram/zcache_drv.h b/drivers/staging/zram/zcache_drv.h
index bfba5d7..808cfb2 100644
--- a/drivers/staging/zram/zcache_drv.h
+++ b/drivers/staging/zram/zcache_drv.h
@@ -19,6 +19,7 @@
#include <linux/types.h>

#define MAX_ZCACHE_POOLS 32 /* arbitrary */
+#define MAX_ZPOOL_NAME_LEN 8 /* "pool"+id (shown in sysfs) */

enum zcache_pool_stats_index {
ZPOOL_STAT_PAGES_STORED,
@@ -51,6 +52,10 @@ struct zcache_pool {
seqlock_t memlimit_lock; /* protects memlimit */
u64 memlimit; /* bytes */
struct zcache_pool_stats_cpu *stats; /* percpu stats */
+#ifdef CONFIG_SYSFS
+ unsigned char name[MAX_ZPOOL_NAME_LEN];
+ struct kobject *kobj; /* sysfs */
+#endif
};

/* Manage all zcache pools */
@@ -58,6 +63,9 @@ struct zcache {
struct zcache_pool *pools[MAX_ZCACHE_POOLS];
u32 num_pools; /* current no. of zcache pools */
spinlock_t pool_lock; /* protects pools[] and num_pools */
+#ifdef CONFIG_SYSFS
+ struct kobject *kobj; /* sysfs */
+#endif
};

#endif
--
1.7.1.1

2010-07-17 18:10:36

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/8] Allow sharing xvmalloc for zram and zcache

On 07/16/2010 08:37 AM, Nitin Gupta wrote:
> Both zram and zcache use xvmalloc allocator. If xvmalloc
> is compiled separately for both of them, we will get linker
> error if they are both selected as "built-in".
>
> So, we now compile xvmalloc separately and export its symbols
> which are then used by both of zram and zcache.
>
> Signed-off-by: Nitin Gupta<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-07-17 21:13:42

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

Nitin,

Would you have all this in a git tree somewhere?

Considering that getting this working requires 24 patches, it would really help with testing.

TIA
Ed Tomlinson

On Friday 16 July 2010 08:37:42 you wrote:
> Frequently accessed filesystem data is stored in memory to reduce access to
> (much) slower backing disks. Under memory pressure, these pages are freed and
> when needed again, they have to be read from disks again. When combined working
> set of all running application exceeds amount of physical RAM, we get extereme
> slowdown as reading a page from disk can take time in order of milliseconds.
>
> Memory compression increases effective memory size and allows more pages to
> stay in RAM. Since de/compressing memory pages is several orders of magnitude
> faster than disk I/O, this can provide signifant performance gains for many
> workloads. Also, with multi-cores becoming common, benefits of reduced disk I/O
> should easily outweigh the problem of increased CPU usage.
>
> It is implemented as a "backend" for cleancache_ops [1] which provides
> callbacks for events such as when a page is to be removed from the page cache
> and when it is required again. We use them to implement a 'second chance' cache
> for these evicted page cache pages by compressing and storing them in memory
> itself.
>
> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
> stored using xvmalloc memory allocator which is already being used by zram
> driver for the same purpose. Zero-filled pages are checked and no memory is
> allocated for them.
>
> A separate "pool" is created for each mount instance for a cleancache-aware
> filesystem. Each incoming page is identified with <pool_id, inode_no, index>
> where inode_no identifies file within the filesystem corresponding to pool_id
> and index is offset of the page within this inode. Within a pool, inodes are
> maintained in an rb-tree and each of its nodes points to a separate radix-tree
> which maintains list of pages within that inode.
>
> While compression reduces disk I/O, it also reduces the space available for
> normal (uncompressed) page cache. This can result in more frequent page cache
> reclaim and thus higher CPU overhead. Thus, it's important to maintain good hit
> rate for compressed cache or increased CPU overhead can nullify any other
> benefits. This requires adaptive (compressed) cache resizing and page
> replacement policies that can maintain optimal cache size and quickly reclaim
> unused compressed chunks. This work is yet to be done. However, in the current
> state, it allows manually resizing cache size using (per-pool) sysfs node
> 'memlimit' which in turn frees any excess pages *sigh* randomly.
>
> Finally, it uses percpu stats and compression buffers to allow better
> performance on multi-cores. Still, there are known bottlenecks like a single
> xvmalloc mempool per zcache pool and few others. I will work on this when I
> start with profiling.
>
> * Performance numbers:
> - Tested using iozone filesystem benchmark
> - 4 CPUs, 1G RAM
> - Read performance gain: ~2.5X
> - Random read performance gain: ~3X
> - In general, performance gains for every kind of I/O
>
> Test details with graphs can be found here:
> http://code.google.com/p/compcache/wiki/zcacheIOzone
>
> If I can get some help with testing, it would be intersting to find its
> effect in more real-life workloads. In particular, I'm intersted in finding
> out its effect in KVM virtualization case where it can potentially allow
> running more number of VMs per-host for a given amount of RAM. With zcache
> enabled, VMs can be assigned much smaller amount of memory since host can now
> hold bulk of page-cache pages, allowing VMs to maintain similar level of
> performance while a greater number of them can be hosted.
>
> * How to test:
> All patches are against 2.6.35-rc5:
>
> - First, apply all prerequisite patches here:
> http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
>
> - Then apply this patch series; also uploaded here:
> http://compcache.googlecode.com/hg/sub-projects/zcache_patches
>
>
> Nitin Gupta (8):
> Allow sharing xvmalloc for zram and zcache
> Basic zcache functionality
> Create sysfs nodes and export basic statistics
> Shrink zcache based on memlimit
> Eliminate zero-filled pages
> Compress pages using LZO
> Use xvmalloc to store compressed chunks
> Document sysfs entries
>
> Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
> drivers/staging/Makefile | 2 +
> drivers/staging/zram/Kconfig | 22 +
> drivers/staging/zram/Makefile | 5 +-
> drivers/staging/zram/xvmalloc.c | 8 +
> drivers/staging/zram/zcache_drv.c | 1312 ++++++++++++++++++++++
> drivers/staging/zram/zcache_drv.h | 90 ++
> 7 files changed, 1491 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
> create mode 100644 drivers/staging/zram/zcache_drv.c
> create mode 100644 drivers/staging/zram/zcache_drv.h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>

2010-07-18 02:23:01

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

Hi Ed,

On 07/18/2010 02:43 AM, Ed Tomlinson wrote:
>
> Would you have all this in a git tree somewhere?
>
> Considering getting this working requires 24 patches it would really help with testing.
>

Unfortunately, there is no git tree hosted anywhere for this.

Anyway, I just uploaded a monolithic zcache patch containing all its dependencies:
http://compcache.googlecode.com/hg/sub-projects/mainline/zcache_v1_2.6.35-rc5.patch

It applies on top of 2.6.35-rc5

Thanks for trying it out.
Nitin


> On Friday 16 July 2010 08:37:42 you wrote:
>> Frequently accessed filesystem data is stored in memory to reduce access to
>> (much) slower backing disks. Under memory pressure, these pages are freed and
>> when needed again, they have to be read from disks again. When combined working
>> set of all running application exceeds amount of physical RAM, we get extereme
>> slowdown as reading a page from disk can take time in order of milliseconds.
>>
>> Memory compression increases effective memory size and allows more pages to
>> stay in RAM. Since de/compressing memory pages is several orders of magnitude
>> faster than disk I/O, this can provide signifant performance gains for many
>> workloads. Also, with multi-cores becoming common, benefits of reduced disk I/O
>> should easily outweigh the problem of increased CPU usage.
>>
>> It is implemented as a "backend" for cleancache_ops [1] which provides
>> callbacks for events such as when a page is to be removed from the page cache
>> and when it is required again. We use them to implement a 'second chance' cache
>> for these evicted page cache pages by compressing and storing them in memory
>> itself.
>>
>> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
>> stored using xvmalloc memory allocator which is already being used by zram
>> driver for the same purpose. Zero-filled pages are checked and no memory is
>> allocated for them.
>>
>> A separate "pool" is created for each mount instance for a cleancache-aware
>> filesystem. Each incoming page is identified with <pool_id, inode_no, index>
>> where inode_no identifies file within the filesystem corresponding to pool_id
>> and index is offset of the page within this inode. Within a pool, inodes are
>> maintained in an rb-tree and each of its nodes points to a separate radix-tree
>> which maintains list of pages within that inode.
>>
>> While compression reduces disk I/O, it also reduces the space available for
>> normal (uncompressed) page cache. This can result in more frequent page cache
>> reclaim and thus higher CPU overhead. Thus, it's important to maintain good hit
>> rate for compressed cache or increased CPU overhead can nullify any other
>> benefits. This requires adaptive (compressed) cache resizing and page
>> replacement policies that can maintain optimal cache size and quickly reclaim
>> unused compressed chunks. This work is yet to be done. However, in the current
>> state, it allows manually resizing cache size using (per-pool) sysfs node
>> 'memlimit' which in turn frees any excess pages *sigh* randomly.
>>
>> Finally, it uses percpu stats and compression buffers to allow better
>> performance on multi-cores. Still, there are known bottlenecks like a single
>> xvmalloc mempool per zcache pool and few others. I will work on this when I
>> start with profiling.
>>
>> * Performance numbers:
>> - Tested using iozone filesystem benchmark
>> - 4 CPUs, 1G RAM
>> - Read performance gain: ~2.5X
>> - Random read performance gain: ~3X
>> - In general, performance gains for every kind of I/O
>>
>> Test details with graphs can be found here:
>> http://code.google.com/p/compcache/wiki/zcacheIOzone
>>
>> If I can get some help with testing, it would be intersting to find its
>> effect in more real-life workloads. In particular, I'm intersted in finding
>> out its effect in KVM virtualization case where it can potentially allow
>> running more number of VMs per-host for a given amount of RAM. With zcache
>> enabled, VMs can be assigned much smaller amount of memory since host can now
>> hold bulk of page-cache pages, allowing VMs to maintain similar level of
>> performance while a greater number of them can be hosted.
>>
>> * How to test:
>> All patches are against 2.6.35-rc5:
>>
>> - First, apply all prerequisite patches here:
>> http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
>>
>> - Then apply this patch series; also uploaded here:
>> http://compcache.googlecode.com/hg/sub-projects/zcache_patches
>>
>>
>> Nitin Gupta (8):
>> Allow sharing xvmalloc for zram and zcache
>> Basic zcache functionality
>> Create sysfs nodes and export basic statistics
>> Shrink zcache based on memlimit
>> Eliminate zero-filled pages
>> Compress pages using LZO
>> Use xvmalloc to store compressed chunks
>> Document sysfs entries
>>
>> Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
>> drivers/staging/Makefile | 2 +
>> drivers/staging/zram/Kconfig | 22 +
>> drivers/staging/zram/Makefile | 5 +-
>> drivers/staging/zram/xvmalloc.c | 8 +
>> drivers/staging/zram/zcache_drv.c | 1312 ++++++++++++++++++++++
>> drivers/staging/zram/zcache_drv.h | 90 ++
>> 7 files changed, 1491 insertions(+), 1 deletions(-)
>> create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
>> create mode 100644 drivers/staging/zram/zcache_drv.c
>> create mode 100644 drivers/staging/zram/zcache_drv.h
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>>
>

2010-07-18 07:50:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

Nitin Gupta wrote:
> Frequently accessed filesystem data is stored in memory to reduce access to
> (much) slower backing disks. Under memory pressure, these pages are freed and
> when needed again, they have to be read from disks again. When combined working
> set of all running application exceeds amount of physical RAM, we get extereme
> slowdown as reading a page from disk can take time in order of milliseconds.
>
> Memory compression increases effective memory size and allows more pages to
> stay in RAM. Since de/compressing memory pages is several orders of magnitude
> faster than disk I/O, this can provide signifant performance gains for many
> workloads. Also, with multi-cores becoming common, benefits of reduced disk I/O
> should easily outweigh the problem of increased CPU usage.
>
> It is implemented as a "backend" for cleancache_ops [1] which provides
> callbacks for events such as when a page is to be removed from the page cache
> and when it is required again. We use them to implement a 'second chance' cache
> for these evicted page cache pages by compressing and storing them in memory
> itself.
>
> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
> stored using xvmalloc memory allocator which is already being used by zram
> driver for the same purpose. Zero-filled pages are checked and no memory is
> allocated for them.
>
> A separate "pool" is created for each mount instance for a cleancache-aware
> filesystem. Each incoming page is identified with <pool_id, inode_no, index>
> where inode_no identifies file within the filesystem corresponding to pool_id
> and index is offset of the page within this inode. Within a pool, inodes are
> maintained in an rb-tree and each of its nodes points to a separate radix-tree
> which maintains list of pages within that inode.
>
> While compression reduces disk I/O, it also reduces the space available for
> normal (uncompressed) page cache. This can result in more frequent page cache
> reclaim and thus higher CPU overhead. Thus, it's important to maintain good hit
> rate for compressed cache or increased CPU overhead can nullify any other
> benefits. This requires adaptive (compressed) cache resizing and page
> replacement policies that can maintain optimal cache size and quickly reclaim
> unused compressed chunks. This work is yet to be done. However, in the current
> state, it allows manually resizing cache size using (per-pool) sysfs node
> 'memlimit' which in turn frees any excess pages *sigh* randomly.
>
> Finally, it uses percpu stats and compression buffers to allow better
> performance on multi-cores. Still, there are known bottlenecks like a single
> xvmalloc mempool per zcache pool and few others. I will work on this when I
> start with profiling.
>
> * Performance numbers:
> - Tested using iozone filesystem benchmark
> - 4 CPUs, 1G RAM
> - Read performance gain: ~2.5X
> - Random read performance gain: ~3X
> - In general, performance gains for every kind of I/O
>
> Test details with graphs can be found here:
> http://code.google.com/p/compcache/wiki/zcacheIOzone
>
> If I can get some help with testing, it would be intersting to find its
> effect in more real-life workloads. In particular, I'm intersted in finding
> out its effect in KVM virtualization case where it can potentially allow
> running more number of VMs per-host for a given amount of RAM. With zcache
> enabled, VMs can be assigned much smaller amount of memory since host can now
> hold bulk of page-cache pages, allowing VMs to maintain similar level of
> performance while a greater number of them can be hosted.

So why would someone want to use zram if they have transparent page
cache compression with zcache? That is, why is this not a replacement
for zram?

Pekka

2010-07-18 07:53:21

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 7/8] Use xvmalloc to store compressed chunks

Nitin Gupta wrote:
> @@ -528,17 +581,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
> goto out;
> }
>
> - dest_data = kmap_atomic(zpage, KM_USER0);
> + local_irq_save(flags);

Does xv_malloc() require interrupts to be disabled? If so, why doesn't
the function do it by itself?

> + ret = xv_malloc(zpool->xv_pool, clen + sizeof(*zheader),
> + &zpage, &zoffset, GFP_NOWAIT);
> + local_irq_restore(flags);
> + if (unlikely(ret)) {
> + ret = -ENOMEM;
> + preempt_enable();
> + goto out;
> + }

2010-07-18 08:12:29

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On 07/18/2010 01:20 PM, Pekka Enberg wrote:
> Nitin Gupta wrote:
>> Frequently accessed filesystem data is stored in memory to reduce access to
>> (much) slower backing disks. Under memory pressure, these pages are freed and
>> when needed again, they have to be read from disks again. When combined working
>> set of all running application exceeds amount of physical RAM, we get extereme
>> slowdown as reading a page from disk can take time in order of milliseconds.
>>
>> Memory compression increases effective memory size and allows more pages to
>> stay in RAM. Since de/compressing memory pages is several orders of magnitude
>> faster than disk I/O, this can provide signifant performance gains for many
>> workloads. Also, with multi-cores becoming common, benefits of reduced disk I/O
>> should easily outweigh the problem of increased CPU usage.
>>
>> It is implemented as a "backend" for cleancache_ops [1] which provides
>> callbacks for events such as when a page is to be removed from the page cache
>> and when it is required again. We use them to implement a 'second chance' cache
>> for these evicted page cache pages by compressing and storing them in memory
>> itself.
>>
<snip>

>
> So why would someone want to use zram if they have transparent page cache compression with zcache? That is, why is this not a replacement for zram?
>

zcache complements zram; it's not a replacement:

- zram compresses anonymous pages while zcache is for page cache compression.
So, workloads which depend heavily on "heap memory" usage will tend to prefer
zram, and those which are I/O intensive will prefer zcache. Though I have not
yet experimented much, most workloads may benefit from a mix of both.

- zram is not just for swap. /dev/zram<id> devices are generic in-memory compressed
block devices which can be used for temporary storage such as /tmp, /var/... etc.

- /dev/zram<id>, being generic block devices, can also be used as raw disks by other
OSes (under virtualization). For example:
http://www.vflare.org/2010/05/compressed-ram-disk-for-windows-virtual.html

Thanks,
Nitin

2010-07-18 08:14:51

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/8] Basic zcache functionality

Nitin Gupta wrote:
> +/*
> + * Individual percpu values can go negative but the sum across all CPUs
> + * must always be positive (we store various counts). So, return sum as
> + * unsigned value.
> + */
> +static u64 zcache_get_stat(struct zcache_pool *zpool,
> + enum zcache_pool_stats_index idx)
> +{
> + int cpu;
> + s64 val = 0;
> +
> + for_each_possible_cpu(cpu) {
> + unsigned int start;
> + struct zcache_pool_stats_cpu *stats;
> +
> + stats = per_cpu_ptr(zpool->stats, cpu);
> + do {
> + start = u64_stats_fetch_begin(&stats->syncp);
> + val += stats->count[idx];
> + } while (u64_stats_fetch_retry(&stats->syncp, start));

Can we use 'struct percpu_counter' for this? OTOH, the warning on top of
include/linux/percpu_counter.h makes me think not.

> + }
> +
> + BUG_ON(val < 0);

BUG_ON() seems overly aggressive. How about

if (WARN_ON(val < 0))
return 0;

> + return val;
> +}
> +
> +static void zcache_add_stat(struct zcache_pool *zpool,
> + enum zcache_pool_stats_index idx, s64 val)
> +{
> + struct zcache_pool_stats_cpu *stats;
> +
> + preempt_disable();
> + stats = __this_cpu_ptr(zpool->stats);
> + u64_stats_update_begin(&stats->syncp);
> + stats->count[idx] += val;
> + u64_stats_update_end(&stats->syncp);
> + preempt_enable();

What is the preempt_disable/preempt_enable trying to do here?

> +static void zcache_destroy_pool(struct zcache_pool *zpool)
> +{
> + int i;
> +
> + if (!zpool)
> + return;
> +
> + spin_lock(&zcache->pool_lock);
> + zcache->num_pools--;
> + for (i = 0; i < MAX_ZCACHE_POOLS; i++)
> + if (zcache->pools[i] == zpool)
> + break;
> + zcache->pools[i] = NULL;
> + spin_unlock(&zcache->pool_lock);
> +
> + if (!RB_EMPTY_ROOT(&zpool->inode_tree)) {

Use WARN_ON here to get a stack trace?

> + pr_warn("Memory leak detected. Freeing non-empty pool!\n");
> + zcache_dump_stats(zpool);
> + }
> +
> + free_percpu(zpool->stats);
> + kfree(zpool);
> +}
> +
> +/*
> + * Allocate a new zcache pool and set default memlimit.
> + *
> + * Returns pool_id on success, negative error code otherwise.
> + */
> +int zcache_create_pool(void)
> +{
> + int ret;
> + u64 memlimit;
> + struct zcache_pool *zpool = NULL;
> +
> + spin_lock(&zcache->pool_lock);
> + if (zcache->num_pools == MAX_ZCACHE_POOLS) {
> + spin_unlock(&zcache->pool_lock);
> + pr_info("Cannot create new pool (limit: %u)\n",
> + MAX_ZCACHE_POOLS);
> + ret = -EPERM;
> + goto out;
> + }
> + zcache->num_pools++;
> + spin_unlock(&zcache->pool_lock);
> +
> + zpool = kzalloc(sizeof(*zpool), GFP_KERNEL);
> + if (!zpool) {
> + spin_lock(&zcache->pool_lock);
> + zcache->num_pools--;
> + spin_unlock(&zcache->pool_lock);
> + ret = -ENOMEM;
> + goto out;
> + }

Why not kmalloc() a new struct zcache_pool object first and then take
zcache->pool_lock() and check for MAX_ZCACHE_POOLS? That should make the
locking a little less confusing here.

> +
> + zpool->stats = alloc_percpu(struct zcache_pool_stats_cpu);
> + if (!zpool->stats) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + rwlock_init(&zpool->tree_lock);
> + seqlock_init(&zpool->memlimit_lock);
> + zpool->inode_tree = RB_ROOT;
> +
> + memlimit = zcache_pool_default_memlimit_perc_ram *
> + ((totalram_pages << PAGE_SHIFT) / 100);
> + memlimit &= PAGE_MASK;
> + zcache_set_memlimit(zpool, memlimit);
> +
> + /* Add to pool list */
> + spin_lock(&zcache->pool_lock);
> + for (ret = 0; ret < MAX_ZCACHE_POOLS; ret++)
> + if (!zcache->pools[ret])
> + break;
> + zcache->pools[ret] = zpool;
> + spin_unlock(&zcache->pool_lock);
> +
> +out:
> + if (ret < 0)
> + zcache_destroy_pool(zpool);
> +
> + return ret;
> +}

> +/*
> + * Allocate memory for storing the given page and insert
> + * it in the given node's page tree at location 'index'.
> + *
> + * Returns 0 on success, negative error code on failure.
> + */
> +static int zcache_store_page(struct zcache_inode_rb *znode,
> + pgoff_t index, struct page *page)
> +{
> + int ret;
> + unsigned long flags;
> + struct page *zpage;
> + void *src_data, *dest_data;
> +
> + zpage = alloc_page(GFP_NOWAIT);
> + if (!zpage) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + zpage->index = index;
> +
> + src_data = kmap_atomic(page, KM_USER0);
> + dest_data = kmap_atomic(zpage, KM_USER1);
> + memcpy(dest_data, src_data, PAGE_SIZE);
> + kunmap_atomic(src_data, KM_USER0);
> + kunmap_atomic(dest_data, KM_USER1);

copy_highpage()

> +
> + spin_lock_irqsave(&znode->tree_lock, flags);
> + ret = radix_tree_insert(&znode->page_tree, index, zpage);
> + spin_unlock_irqrestore(&znode->tree_lock, flags);
> + if (unlikely(ret))
> + __free_page(zpage);
> +
> +out:
> + return ret;
> +}

> +/*
> + * cleancache_ops.get_page
> + *
> + * Locates stored zcache page using <pool_id, inode_no, index>.
> + * If found, copies it to the given output page 'page' and frees
> + * zcache copy of the same.
> + *
> + * Returns 0 if requested page found, -1 otherwise.
> + */
> +static int zcache_get_page(int pool_id, ino_t inode_no,
> + pgoff_t index, struct page *page)
> +{
> + int ret = -1;
> + unsigned long flags;
> + struct page *src_page;
> + void *src_data, *dest_data;
> + struct zcache_inode_rb *znode;
> + struct zcache_pool *zpool = zcache->pools[pool_id];
> +
> + znode = zcache_find_inode(zpool, inode_no);
> + if (!znode)
> + goto out;
> +
> + BUG_ON(znode->inode_no != inode_no);

Maybe use WARN_ON here and return -1?

> +
> + spin_lock_irqsave(&znode->tree_lock, flags);
> + src_page = radix_tree_delete(&znode->page_tree, index);
> + if (zcache_inode_is_empty(znode))
> + zcache_inode_isolate(znode);
> + spin_unlock_irqrestore(&znode->tree_lock, flags);
> +
> + kref_put(&znode->refcount, zcache_inode_release);
> +
> + if (!src_page)
> + goto out;
> +
> + src_data = kmap_atomic(src_page, KM_USER0);
> + dest_data = kmap_atomic(page, KM_USER1);
> + memcpy(dest_data, src_data, PAGE_SIZE);
> + kunmap_atomic(src_data, KM_USER0);
> + kunmap_atomic(dest_data, KM_USER1);

The above sequence can be replaced with copy_highpage().

> +
> + flush_dcache_page(page);
> +
> + __free_page(src_page);
> +
> + zcache_dec_stat(zpool, ZPOOL_STAT_PAGES_STORED);
> + ret = 0; /* success */
> +
> +out:
> + return ret;
> +}

2010-07-18 08:21:10

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 7/8] Use xvmalloc to store compressed chunks

On 07/18/2010 01:23 PM, Pekka Enberg wrote:
> Nitin Gupta wrote:
>> @@ -528,17 +581,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
>> goto out;
>> }
>>
>> - dest_data = kmap_atomic(zpage, KM_USER0);
>> + local_irq_save(flags);
>
> Does xv_malloc() require interrupts to be disabled? If so, why doesn't the function do it by itself?
>


xvmalloc itself doesn't require disabling interrupts, but zcache needs that since
otherwise we can have a deadlock between the xvmalloc pool lock and mapping->tree_lock,
which is held when zcache_put_page() is called. OTOH, zram does not require this
disabling of interrupts, so interrupts are disabled separately for the zcache case.

Thanks,
Nitin

2010-07-18 08:27:31

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 2/8] Basic zcache functionality

Nitin Gupta wrote:
> +static void zcache_add_stat(struct zcache_pool *zpool,
> + enum zcache_pool_stats_index idx, s64 val)
> +{
> + struct zcache_pool_stats_cpu *stats;
> +
> + preempt_disable();
> + stats = __this_cpu_ptr(zpool->stats);
> + u64_stats_update_begin(&stats->syncp);
> + stats->count[idx] += val;
> + u64_stats_update_end(&stats->syncp);
> + preempt_enable();
> +
> +}

You should probably use this_cpu_inc() here.

2010-07-18 08:44:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 2/8] Basic zcache functionality

On Friday 16 July 2010 at 18:07 +0530, Nitin Gupta wrote:

> This particular patch implemets basic functionality only:
> +static u64 zcache_get_stat(struct zcache_pool *zpool,
> + enum zcache_pool_stats_index idx)
> +{
> + int cpu;
> + s64 val = 0;
> +
> + for_each_possible_cpu(cpu) {
> + unsigned int start;
> + struct zcache_pool_stats_cpu *stats;
> +
> + stats = per_cpu_ptr(zpool->stats, cpu);
> + do {
> + start = u64_stats_fetch_begin(&stats->syncp);
> + val += stats->count[idx];
> + } while (u64_stats_fetch_retry(&stats->syncp, start));
> + }
> +
> + BUG_ON(val < 0);
> + return val;
> +}

Sorry this is wrong.

Inside the fetch/retry block you should not do the addition to val, only
read the value into a temporary variable, since the block might be executed
several times.

You want something like :

static u64 zcache_get_stat(struct zcache_pool *zpool,
enum zcache_pool_stats_index idx)
{
int cpu;
s64 temp, val = 0;

for_each_possible_cpu(cpu) {
unsigned int start;
struct zcache_pool_stats_cpu *stats;

stats = per_cpu_ptr(zpool->stats, cpu);
do {
start = u64_stats_fetch_begin(&stats->syncp);
temp = stats->count[idx];
} while (u64_stats_fetch_retry(&stats->syncp, start));
val += temp;
}

...
}


2010-07-18 09:45:29

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/8] Basic zcache functionality


On 07/18/2010 01:44 PM, Pekka Enberg wrote:
> Nitin Gupta wrote:
>> +/*
>> + * Individual percpu values can go negative but the sum across all CPUs
>> + * must always be positive (we store various counts). So, return sum as
>> + * unsigned value.
>> + */
>> +static u64 zcache_get_stat(struct zcache_pool *zpool,
>> + enum zcache_pool_stats_index idx)
>> +{
>> + int cpu;
>> + s64 val = 0;
>> +
>> + for_each_possible_cpu(cpu) {
>> + unsigned int start;
>> + struct zcache_pool_stats_cpu *stats;
>> +
>> + stats = per_cpu_ptr(zpool->stats, cpu);
>> + do {
>> + start = u64_stats_fetch_begin(&stats->syncp);
>> + val += stats->count[idx];
>> + } while (u64_stats_fetch_retry(&stats->syncp, start));
>
> Can we use 'struct percpu_counter' for this? OTOH, the warning on top of include/linux/percpu_counter.h makes me think not.
>

Yes, that warning only scared me :)


>> + }
>> +
>> + BUG_ON(val < 0);
>
> BUG_ON() seems overly aggressive. How about
>
> if (WARN_ON(val < 0))
> return 0;
>

Yes, this sounds better. I will change it.


>> + return val;
>> +}
>> +
>> +static void zcache_add_stat(struct zcache_pool *zpool,
>> + enum zcache_pool_stats_index idx, s64 val)
>> +{
>> + struct zcache_pool_stats_cpu *stats;
>> +
>> + preempt_disable();
>> + stats = __this_cpu_ptr(zpool->stats);
>> + u64_stats_update_begin(&stats->syncp);
>> + stats->count[idx] += val;
>> + u64_stats_update_end(&stats->syncp);
>> + preempt_enable();
>
> What is the preempt_disable/preempt_enable trying to do here?
>

On 32-bit there will be no seqlock to protect this value. So, if we
get preempted after __this_cpu_ptr(), two CPUs can end up writing to
the same variable concurrently. I think that is also why this_cpu_add()
does the increment with preemption disabled.

Also, I think we shouldn't use this_cpu_add() (as you suggested in
another mail) since we have to do this_cpu_ptr() first anyway to get
access to the seqlock (stats->syncp). So, a simple increment through
the per-cpu pointer obtained that way should be okay.
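
(For illustration only, the window being described is roughly:

	stats = __this_cpu_ptr(zpool->stats);	/* say, CPU0's stats */
	/* <preempted here, task migrated to CPU1> */
	u64_stats_update_begin(&stats->syncp);	/* still CPU0's syncp */
	stats->count[idx] += val;		/* races with CPU0's owner */
	u64_stats_update_end(&stats->syncp);

so the update must run with preemption disabled.)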


>> +static void zcache_destroy_pool(struct zcache_pool *zpool)
>> +{
>> + int i;
>> +
>> + if (!zpool)
>> + return;
>> +
>> + spin_lock(&zcache->pool_lock);
>> + zcache->num_pools--;
>> + for (i = 0; i < MAX_ZCACHE_POOLS; i++)
>> + if (zcache->pools[i] == zpool)
>> + break;
>> + zcache->pools[i] = NULL;
>> + spin_unlock(&zcache->pool_lock);
>> +
>> + if (!RB_EMPTY_ROOT(&zpool->inode_tree)) {
>
> Use WARN_ON here to get a stack trace?
>

This sounds better, will change it.


>> + pr_warn("Memory leak detected. Freeing non-empty pool!\n");
>> + zcache_dump_stats(zpool);
>> + }
>> +
>> + free_percpu(zpool->stats);
>> + kfree(zpool);
>> +}
>> +
>> +/*
>> + * Allocate a new zcache pool and set default memlimit.
>> + *
>> + * Returns pool_id on success, negative error code otherwise.
>> + */
>> +int zcache_create_pool(void)
>> +{
>> + int ret;
>> + u64 memlimit;
>> + struct zcache_pool *zpool = NULL;
>> +
>> + spin_lock(&zcache->pool_lock);
>> + if (zcache->num_pools == MAX_ZCACHE_POOLS) {
>> + spin_unlock(&zcache->pool_lock);
>> + pr_info("Cannot create new pool (limit: %u)\n",
>> + MAX_ZCACHE_POOLS);
>> + ret = -EPERM;
>> + goto out;
>> + }
>> + zcache->num_pools++;
>> + spin_unlock(&zcache->pool_lock);
>> +
>> + zpool = kzalloc(sizeof(*zpool), GFP_KERNEL);
>> + if (!zpool) {
>> + spin_lock(&zcache->pool_lock);
>> + zcache->num_pools--;
>> + spin_unlock(&zcache->pool_lock);
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>
> Why not kmalloc() an new struct zcache_pool object first and then take zcache->pool_lock() and check for MAX_ZCACHE_POOLS? That should make the locking little less confusing here.
>

Yes, doing the kmalloc() before this check is better. It also avoids the
unnecessary num_pools decrement later if the allocation fails.
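
As a rough sketch of the reordering (error codes, locking and the rest of
the initialization as in the original patch):

int zcache_create_pool(void)
{
        int ret;
        struct zcache_pool *zpool;

        /* Allocate outside the lock; nothing to undo if this fails. */
        zpool = kzalloc(sizeof(*zpool), GFP_KERNEL);
        if (!zpool) {
                ret = -ENOMEM;
                goto out;
        }

        spin_lock(&zcache->pool_lock);
        if (zcache->num_pools == MAX_ZCACHE_POOLS) {
                spin_unlock(&zcache->pool_lock);
                pr_info("Cannot create new pool (limit: %u)\n",
                        MAX_ZCACHE_POOLS);
                kfree(zpool);
                ret = -EPERM;
                goto out;
        }
        zcache->num_pools++;
        spin_unlock(&zcache->pool_lock);

        /*
         * ... remainder as in the original patch: allocate percpu stats,
         * set the default memlimit, publish the pool in zcache->pools[]
         * and set ret to the new pool_id ...
         */
out:
        return ret;
}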


>> +
>> + src_data = kmap_atomic(page, KM_USER0);
>> + dest_data = kmap_atomic(zpage, KM_USER1);
>> + memcpy(dest_data, src_data, PAGE_SIZE);
>> + kunmap_atomic(src_data, KM_USER0);
>> + kunmap_atomic(dest_data, KM_USER1);
>
> copy_highpage()
>

Ok. But we will have to open-code this memcpy() again once we start using
xvmalloc (patch 7/8). The same applies to the other instance you pointed out.
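
For the plain page-to-page case here, the suggested substitution is just
(destination page first):

        /* replaces the kmap_atomic()/memcpy()/kunmap_atomic() sequence above */
        copy_highpage(zpage, page);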


>> +static int zcache_get_page(int pool_id, ino_t inode_no,
>> + pgoff_t index, struct page *page)
>> +{
>> + int ret = -1;
>> + unsigned long flags;
>> + struct page *src_page;
>> + void *src_data, *dest_data;
>> + struct zcache_inode_rb *znode;
>> + struct zcache_pool *zpool = zcache->pools[pool_id];
>> +
>> + znode = zcache_find_inode(zpool, inode_no);
>> + if (!znode)
>> + goto out;
>> +
>> + BUG_ON(znode->inode_no != inode_no);
>
> Maybe use WARN_ON here and return -1?
>

okay.
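
In zcache_get_page() that would amount to something like the following
(a sketch, mirroring Pekka's earlier suggestion; ret is already
initialized to -1 in the patch):

        znode = zcache_find_inode(zpool, inode_no);
        if (!znode)
                goto out;

        /* inconsistent tree: warn and fail the get instead of crashing */
        if (WARN_ON(znode->inode_no != inode_no))
                goto out;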


Thanks for the review.
Nitin

2010-07-18 09:50:58

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 2/8] Basic zcache functionality

On 07/18/2010 02:14 PM, Eric Dumazet wrote:
> Le vendredi 16 juillet 2010 à 18:07 +0530, Nitin Gupta a écrit :
>
>> This particular patch implements basic functionality only:
>> +static u64 zcache_get_stat(struct zcache_pool *zpool,
>> + enum zcache_pool_stats_index idx)
>> +{
>> + int cpu;
>> + s64 val = 0;
>> +
>> + for_each_possible_cpu(cpu) {
>> + unsigned int start;
>> + struct zcache_pool_stats_cpu *stats;
>> +
>> + stats = per_cpu_ptr(zpool->stats, cpu);
>> + do {
>> + start = u64_stats_fetch_begin(&stats->syncp);
>> + val += stats->count[idx];
>> + } while (u64_stats_fetch_retry(&stats->syncp, start));
>> + }
>> +
>> + BUG_ON(val < 0);
>> + return val;
>> +}
>
> Sorry this is wrong.
>
> Inside the fetch/retry block you should not do the addition to val, only
> a read of value to a temporary variable, since this might be done
> several times.
>
> You want something like :
>
> static u64 zcache_get_stat(struct zcache_pool *zpool,
> enum zcache_pool_stats_index idx)
> {
> int cpu;
> s64 temp, val = 0;
>
> for_each_possible_cpu(cpu) {
> unsigned int start;
> struct zcache_pool_stats_cpu *stats;
>
> stats = per_cpu_ptr(zpool->stats, cpu);
> do {
> start = u64_stats_fetch_begin(&stats->syncp);
> temp = stats->count[idx];
> } while (u64_stats_fetch_retry(&stats->syncp, start));
> val += temp;
> }
>
> ...
> }
>
>

Oh, my bad. Thanks for the fix!

On a side note: u64_stats_* should probably be renamed to stats64_* since
they are equally applicable for s64.


Thanks,
Nitin

2010-07-19 04:36:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 7/8] Use xvmalloc to store compressed chunks

Hi Nitin,

On Sun, Jul 18, 2010 at 5:21 PM, Nitin Gupta <[email protected]> wrote:
> On 07/18/2010 01:23 PM, Pekka Enberg wrote:
>> Nitin Gupta wrote:
>>> @@ -528,17 +581,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
>>>          goto out;
>>>      }
>>>
>>> -    dest_data = kmap_atomic(zpage, KM_USER0);
>>> +    local_irq_save(flags);
>>
>> Does xv_malloc() require interrupts to be disabled? If so, why doesn't the function do it by itself?
>>
>
>
> xvmalloc itself doesn't require disabling interrupts, but zcache needs that since
> otherwise we can have a deadlock between the xvmalloc pool lock and mapping->tree_lock,
> under which zcache_put_page() is called. OTOH, zram does not require this disabling of
> interrupts, so interrupts are disabled separately in the zcache case.

cleancache_put_page() is always called with spin_lock_irq held.
Couldn't we replace spin_lock_irqsave with a plain spin_lock?

--
Kind regards,
Minchan Kim

2010-07-19 06:48:08

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 7/8] Use xvmalloc to store compressed chunks

On 07/19/2010 10:06 AM, Minchan Kim wrote:
> Hi Nitin,
>
> On Sun, Jul 18, 2010 at 5:21 PM, Nitin Gupta <[email protected]> wrote:
>> On 07/18/2010 01:23 PM, Pekka Enberg wrote:
>>> Nitin Gupta wrote:
>>>> @@ -528,17 +581,32 @@ static int zcache_store_page(struct zcache_inode_rb *znode,
>>>> goto out;
>>>> }
>>>>
>>>> - dest_data = kmap_atomic(zpage, KM_USER0);
>>>> + local_irq_save(flags);
>>>
>>> Does xv_malloc() require interrupts to be disabled? If so, why doesn't the function do it by itself?
>>>
>>
>>
>> xvmalloc itself doesn't require disabling interrupts, but zcache needs that since
>> otherwise we can have a deadlock between the xvmalloc pool lock and mapping->tree_lock,
>> under which zcache_put_page() is called. OTOH, zram does not require this disabling of
>> interrupts, so interrupts are disabled separately in the zcache case.
>
> cleancache_put_page() is always called with spin_lock_irq held.
> Couldn't we replace spin_lock_irqsave with a plain spin_lock?
>

I was missing this point regarding cleancache_put(). So, we can now:
 - take a plain (non-irq) spin_lock in zcache_put_page()
 - take a non-irq rwlock in zcache_inode_create(), which is called only by
   zcache_put_page()
 - do the same in zcache_store_page(), which also lets us get rid of the
   unnecessary preempt_disable()/enable() in that function.

I will add a comment to each of these functions and make these changes.
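
The comment could read roughly like this (a sketch of the locking rule,
not the final wording):

/*
 * Reached only via cleancache_put_page(), which runs with
 * mapping->tree_lock held and interrupts disabled. Plain (non-irq)
 * spin_lock/rwlock is therefore sufficient in this path, and no
 * explicit preempt_disable()/enable() is needed.
 */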

Thanks,
Nitin

2010-07-19 19:59:44

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [PATCH 0/8] zcache: page cache compression support

> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> chunks are
> stored using xvmalloc memory allocator which is already being used by
> zram
> driver for the same purpose. Zero-filled pages are checked and no
> memory is
> allocated for them.

I'm curious about this policy choice. I can see why one
would want to ensure that the average page is compressed
to less than PAGE_SIZE/2, and preferably PAGE_SIZE/2
minus the overhead of the data structures necessary to
track the page. And I see that this makes no difference
when the reclamation algorithm is random (as it is for
now). But once there is some better reclamation logic,
I'd hope that this compression factor restriction would
be lifted and replaced with something much higher. IIRC,
compression is much more expensive than decompression
so there's no CPU-overhead argument here either,
correct?

Thanks,
Dan

2010-07-20 13:50:37

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On 07/20/2010 01:27 AM, Dan Magenheimer wrote:
>> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
>> chunks are
>> stored using xvmalloc memory allocator which is already being used by
>> zram
>> driver for the same purpose. Zero-filled pages are checked and no
>> memory is
>> allocated for them.
>
> I'm curious about this policy choice. I can see why one
> would want to ensure that the average page is compressed
> to less than PAGE_SIZE/2, and preferably PAGE_SIZE/2
> minus the overhead of the data structures necessary to
> track the page. And I see that this makes no difference
> when the reclamation algorithm is random (as it is for
> now). But once there is some better reclamation logic,
> I'd hope that this compression factor restriction would
> be lifted and replaced with something much higher. IIRC,
> compression is much more expensive than decompression
> so there's no CPU-overhead argument here either,
> correct?
>
>

It's true that we waste CPU cycles on every incompressible page we
encounter, but we still can't keep such pages in RAM: this is exactly the
memory the host wanted to reclaim, and we can't help if compression
failed. Compressed caching makes sense only when we keep highly
compressible pages in RAM, regardless of the reclaim scheme.

Keeping (nearly) incompressible pages in RAM probably makes sense for
Xen's case, where the cleancache provider runs *inside* a VM, sending
pages to the host. So, if a VM is limited to say 512M while the host has
64G RAM, caching guest pages, with or without compression, will help.

Thanks,
Nitin

2010-07-20 14:30:23

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [PATCH 0/8] zcache: page cache compression support

> On 07/20/2010 01:27 AM, Dan Magenheimer wrote:
> >> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> >> chunks are
> >> stored using xvmalloc memory allocator which is already being used
> by
> >> zram
> >> driver for the same purpose. Zero-filled pages are checked and no
> >> memory is
> >> allocated for them.
> >
> > I'm curious about this policy choice. I can see why one
> > would want to ensure that the average page is compressed
> > to less than PAGE_SIZE/2, and preferably PAGE_SIZE/2
> > minus the overhead of the data structures necessary to
> > track the page. And I see that this makes no difference
> > when the reclamation algorithm is random (as it is for
> > now). But once there is some better reclamation logic,
> > I'd hope that this compression factor restriction would
> > be lifted and replaced with something much higher. IIRC,
> > compression is much more expensive than decompression
> > so there's no CPU-overhead argument here either,
> > correct?
>
> Its true that we waste CPU cycles for every incompressible page
> encountered but still we can't keep such pages in RAM since this
> is what host wanted to reclaim and we can't help since compression
> failed. Compressed caching makes sense only when we keep highly
> compressible pages in RAM, regardless of reclaim scheme.
>
> Keeping (nearly) incompressible pages in RAM probably makes sense
> for Xen's case where cleancache provider runs *inside* a VM, sending
> pages to host. So, if VM is limited to say 512M and host has 64G RAM,
> caching guest pages, with or without compression, will help.

I agree that the use model is a bit different, but PAGE_SIZE/2
still seems like an unnecessarily strict threshold. For
example, saving 3000 clean pages in 2000*PAGE_SIZE of RAM
still seems like a considerable space savings. And as
long as the _average_ is less than some threshold, saving
a few slightly-less-than-ideally-compressible pages doesn't
seem like it would be a problem. For example, IMHO, saving two
pages when one compresses to 2047 bytes and the other compresses
to 2049 bytes seems just as reasonable as saving two pages that
both compress to 2048 bytes.

Maybe the best solution is to make the threshold a sysfs
settable? Or maybe BOTH the single-page threshold and
the average threshold as two different sysfs settables?
E.g. throw away a put page if either it compresses poorly
or adding it to the pool would push the average over.

2010-07-20 23:03:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 4/8] Shrink zcache based on memlimit

Hi,

On Fri, Jul 16, 2010 at 9:37 PM, Nitin Gupta <[email protected]> wrote:
> User can change (per-pool) memlimit using sysfs node:
> /sys/kernel/mm/zcache/pool<id>/memlimit
>
> When memlimit is set to a value smaller than current
> number of pages allocated for that pool, excess pages
> are now freed immediately instead of waiting for get/
> flush for these pages.
>
> Currently, victim page selection is essentially random.
> Automatic cache resizing and better page replacement
> policies will be implemented later.

Okay. I know this isn't final; I just want to raise a concern before you get there.
I don't know how you plan to implement the reclaim policy.
In the current implementation, you use memlimit to decide when reclaim happens.
But I think we should also follow the VM's global reclaim policy:
even if memlimit hasn't been reached, we should shrink zcache when the
system is having trouble reclaiming memory.
AFAIK, cleancache doesn't give any hint for that, so we would have to
implement it in zcache itself.
At first glance, we could use shrink_slab or an oom_notifier, but neither
gives any zone information, although global reclaim works per-zone.
AFAIK, Nick is trying to implement a zone-aware shrink_slab. If needed,
we could also make the oom_notifier zone-aware. Right now nobody seems
to use oom_notifier, so I am not sure how useful it is.

It's just my opinion.
Thanks for your effort on a good feature, Nitin.
--
Kind regards,
Minchan Kim

2010-07-21 04:27:02

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On 07/20/2010 07:58 PM, Dan Magenheimer wrote:
>> On 07/20/2010 01:27 AM, Dan Magenheimer wrote:
>>>> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
>>>> chunks are
>>>> stored using xvmalloc memory allocator which is already being used
>> by
>>>> zram
>>>> driver for the same purpose. Zero-filled pages are checked and no
>>>> memory is
>>>> allocated for them.
>>>
>>> I'm curious about this policy choice. I can see why one
>>> would want to ensure that the average page is compressed
>>> to less than PAGE_SIZE/2, and preferably PAGE_SIZE/2
>>> minus the overhead of the data structures necessary to
>>> track the page. And I see that this makes no difference
>>> when the reclamation algorithm is random (as it is for
>>> now). But once there is some better reclamation logic,
>>> I'd hope that this compression factor restriction would
>>> be lifted and replaced with something much higher. IIRC,
>>> compression is much more expensive than decompression
>>> so there's no CPU-overhead argument here either,
>>> correct?
>>
>> Its true that we waste CPU cycles for every incompressible page
>> encountered but still we can't keep such pages in RAM since this
>> is what host wanted to reclaim and we can't help since compression
>> failed. Compressed caching makes sense only when we keep highly
>> compressible pages in RAM, regardless of reclaim scheme.
>>
>> Keeping (nearly) incompressible pages in RAM probably makes sense
>> for Xen's case where cleancache provider runs *inside* a VM, sending
>> pages to host. So, if VM is limited to say 512M and host has 64G RAM,
>> caching guest pages, with or without compression, will help.
>
> I agree that the use model is a bit different, but PAGE_SIZE/2
> still seems like an unnecessarily strict threshold. For
> example, saving 3000 clean pages in 2000*PAGE_SIZE of RAM
> still seems like a considerable space savings. And as
> long as the _average_ is less than some threshold, saving
> a few slightly-less-than-ideally-compressible pages doesn't
> seem like it would be a problem. For example, IMHO, saving two
> pages when one compresses to 2047 bytes and the other compresses
> to 2049 bytes seems just as reasonable as saving two pages that
> both compress to 2048 bytes.
>
> Maybe the best solution is to make the threshold a sysfs
> settable? Or maybe BOTH the single-page threshold and
> the average threshold as two different sysfs settables?
> E.g. throw away a put page if either it compresses poorly
> or adding it to the pool would push the average over.
>

Considering the overall compression average instead of worrying about
individual page compressibility is a good point. Still, I think
storing completely incompressible pages isn't desirable.

So, I agree with the idea of separate sysfs tunables for the average and
single-page compression thresholds, with defaults conservatively set to 50%
and PAGE_SIZE/2 respectively. I will include these in the "v2" patches.
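
As a rough sketch of what the accept test could look like with both
tunables (the tunable and stat names used here, zcache_compress_threshold,
zcache_mean_threshold, ZPOOL_STAT_PAGES_STORED and ZPOOL_STAT_COMPR_SIZE,
are hypothetical and not from the patches):

/* Accept a compressed page only if it passes both tunable thresholds. */
static bool zcache_accept_page(struct zcache_pool *zpool, size_t clen)
{
        u64 pages = zcache_get_stat(zpool, ZPOOL_STAT_PAGES_STORED);
        u64 bytes = zcache_get_stat(zpool, ZPOOL_STAT_COMPR_SIZE);

        /* single-page threshold, e.g. PAGE_SIZE/2 by default */
        if (clen > zcache_compress_threshold)
                return false;

        /* average threshold: reject if adding this page pushes the mean over */
        if ((bytes + clen) > zcache_mean_threshold * (pages + 1))
                return false;

        return true;
}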

Thanks,
Nitin

2010-07-21 05:21:57

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 4/8] Shrink zcache based on memlimit

On 07/21/2010 04:33 AM, Minchan Kim wrote:
> On Fri, Jul 16, 2010 at 9:37 PM, Nitin Gupta <[email protected]> wrote:
>> User can change (per-pool) memlimit using sysfs node:
>> /sys/kernel/mm/zcache/pool<id>/memlimit
>>
>> When memlimit is set to a value smaller than current
>> number of pages allocated for that pool, excess pages
>> are now freed immediately instead of waiting for get/
>> flush for these pages.
>>
>> Currently, victim page selection is essentially random.
>> Automatic cache resizing and better page replacement
>> policies will be implemented later.
>
> Okay. I know this isn't end. I just want to give a concern before you end up.
> I don't know how you implement reclaim policy.
> In current implementation, you use memlimit for determining when reclaim happen.
> But i think we also should follow global reclaim policy of VM.
> I means although memlimit doen't meet, we should reclaim zcache if
> system has a trouble to reclaim memory.

Yes, we should have a way to reclaim both in response to system memory pressure
and when the user explicitly asks for it, i.e. when memlimit is lowered manually.

> AFAIK, cleancache doesn't give any hint for that. so we should
> implement it in zcache itself.

I think cleancache should be kept minimal, so yes, all reclaim policies should
live in the zcache layer only.

> At first glance, we can use shrink_slab or oom_notifier. But both
> doesn't give any information of zone although global reclaim do it by
> per-zone.
> AFAIK, Nick try to implement zone-aware shrink slab. Also if we need
> it, we can change oom_notifier with zone-aware oom_notifier. Now it
> seems anyone doesn't use oom_notifier so I am not sure it's useful.
>

I don't think we need these notifiers, as we can simply create a thread
to monitor the cache hit rate, system memory pressure etc. and shrink or
expand the cache accordingly.


Thanks for your comments.
Nitin

2010-07-21 11:32:15

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [PATCH 4/8] Shrink zcache based on memlimit

On Wednesday 21 July 2010 00:52:40 Nitin Gupta wrote:
> On 07/21/2010 04:33 AM, Minchan Kim wrote:
> > On Fri, Jul 16, 2010 at 9:37 PM, Nitin Gupta <[email protected]> wrote:
> >> User can change (per-pool) memlimit using sysfs node:
> >> /sys/kernel/mm/zcache/pool<id>/memlimit
> >>
> >> When memlimit is set to a value smaller than current
> >> number of pages allocated for that pool, excess pages
> >> are now freed immediately instead of waiting for get/
> >> flush for these pages.
> >>
> >> Currently, victim page selection is essentially random.
> >> Automatic cache resizing and better page replacement
> >> policies will be implemented later.
> >
> > Okay. I know this isn't end. I just want to give a concern before you end up.
> > I don't know how you implement reclaim policy.
> > In current implementation, you use memlimit for determining when reclaim happen.
> > But i think we also should follow global reclaim policy of VM.
> > I means although memlimit doen't meet, we should reclaim zcache if
> > system has a trouble to reclaim memory.
>
> Yes, we should have a way to do reclaim depending on system memory pressure
> and also when user explicitly wants so i.e. when memlimit is lowered manually.
>
> > AFAIK, cleancache doesn't give any hint for that. so we should
> > implement it in zcache itself.
>
> I think cleancache should be kept minimal so yes, all reclaim policies should
> go in zcache layer only.
>
> > At first glance, we can use shrink_slab or oom_notifier. But both
> > doesn't give any information of zone although global reclaim do it by
> > per-zone.
> > AFAIK, Nick try to implement zone-aware shrink slab. Also if we need
> > it, we can change oom_notifier with zone-aware oom_notifier. Now it
> > seems anyone doesn't use oom_notifier so I am not sure it's useful.
> >
>
> I don't think we need these notifiers as we can simply create a thread
> to monitor cache hit rate, system memory pressure etc. and shrink/expand
> the cache accordingly.

Nitin,

Based on experience gained when adding the shrinker callbacks, I would
strongly recommend you use them. I tried several hacks along the lines of
what you are proposing before settling on the callbacks. They
are effective and make sure that memory is released when it's required.
With the other methods, memory would either not be released at all or
would be released when it was not needed.
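
For reference, hooking into the shrinker machinery looks roughly like this
(a minimal sketch against the 2.6.35-era interface; the callback prototype
differs across kernel versions, and zcache_total_pages()/zcache_evict_pages()
are hypothetical helpers standing in for the zcache-specific logic):

static int zcache_shrink(struct shrinker *s, int nr_to_scan, gfp_t gfp_mask)
{
        /* nr_to_scan == 0 is a query: report how many objects we could free */
        if (!nr_to_scan)
                return zcache_total_pages();

        /* otherwise free up to nr_to_scan compressed pages */
        zcache_evict_pages(nr_to_scan);
        return zcache_total_pages();
}

static struct shrinker zcache_shrinker = {
        .shrink = zcache_shrink,
        .seeks = DEFAULT_SEEKS,
};

/* register_shrinker(&zcache_shrinker) at init,
 * unregister_shrinker(&zcache_shrinker) at exit */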

Thanks
Ed Tomlinson.

2010-07-21 17:39:23

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [PATCH 0/8] zcache: page cache compression support

> > Maybe the best solution is to make the threshold a sysfs
> > settable? Or maybe BOTH the single-page threshold and
> > the average threshold as two different sysfs settables?
> > E.g. throw away a put page if either it compresses poorly
> > or adding it to the pool would push the average over.
>
> Considering overall compression average instead of bothering about
> individual page compressibility seems like a good point. Still, I think
> storing completely incompressible pages isn't desirable.
>
> So, I agree with the idea of separate sysfs tunables for average and
> single-page
> compression thresholds with defaults conservatively set to 50% and
> PAGE_SIZE/2
> respectively. I will include these in "v2" patches.

Unless the single-page compression threshold is higher than the
average, the average is useless. IMHO I'd suggest at least
5*PAGE_SIZE/8 as the single-page threshold, possibly higher.

2010-07-22 19:15:36

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On Fri, Jul 16, 2010 at 06:07:42PM +0530, Nitin Gupta wrote:
> Frequently accessed filesystem data is stored in memory to reduce access to
> (much) slower backing disks. Under memory pressure, these pages are freed and
> when needed again, they have to be read from disks again. When combined working
> set of all running application exceeds amount of physical RAM, we get extereme
> slowdown as reading a page from disk can take time in order of milliseconds.

<snip>

Given that there were a lot of comments and changes for this series, can
you resend them with your updates so I can then apply them if they are
acceptable to everyone?

thanks,

greg k-h

2010-07-22 19:57:15

by Dan Magenheimer

[permalink] [raw]
Subject: RE: [PATCH 0/8] zcache: page cache compression support

> From: Greg KH [mailto:[email protected]]
> Sent: Thursday, July 22, 2010 1:15 PM
> To: Nitin Gupta
> Cc: Pekka Enberg; Hugh Dickins; Andrew Morton; Dan Magenheimer; Rik van
> Riel; Avi Kivity; Christoph Hellwig; Minchan Kim; Konrad Rzeszutek
> Wilk; linux-mm; linux-kernel
> Subject: Re: [PATCH 0/8] zcache: page cache compression support
>
> On Fri, Jul 16, 2010 at 06:07:42PM +0530, Nitin Gupta wrote:
> > Frequently accessed filesystem data is stored in memory to reduce
> access to
> > (much) slower backing disks. Under memory pressure, these pages are
> freed and
> > when needed again, they have to be read from disks again. When
> combined working
> > set of all running application exceeds amount of physical RAM, we get
> extereme
> > slowdown as reading a page from disk can take time in order of
> milliseconds.
>
> <snip>
>
> Given that there were a lot of comments and changes for this series,
> can
> you resend them with your updates so I can then apply them if they are
> acceptable to everyone?
>
> thanks,
> greg k-h

Hi Greg --

Nitin's zcache code is dependent on the cleancache series:
http://lkml.org/lkml/2010/6/21/411

The cleancache series has not changed since V3 (other than
fixing a couple of documentation typos) and didn't receive any
comments other than Christoph's concern that there weren't
any users... which I think has since been addressed with the
posting of the Xen tmem driver code and Nitin's zcache.

If you are ready to apply the cleancache series, great!
If not, please let me know next steps so cleancache isn't
an impediment for applying the zcache series.

Thanks,
Dan

2010-07-22 21:16:43

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On Thu, Jul 22, 2010 at 12:54:57PM -0700, Dan Magenheimer wrote:
> > From: Greg KH [mailto:[email protected]]
> > Sent: Thursday, July 22, 2010 1:15 PM
> > To: Nitin Gupta
> > Cc: Pekka Enberg; Hugh Dickins; Andrew Morton; Dan Magenheimer; Rik van
> > Riel; Avi Kivity; Christoph Hellwig; Minchan Kim; Konrad Rzeszutek
> > Wilk; linux-mm; linux-kernel
> > Subject: Re: [PATCH 0/8] zcache: page cache compression support
> >
> > On Fri, Jul 16, 2010 at 06:07:42PM +0530, Nitin Gupta wrote:
> > > Frequently accessed filesystem data is stored in memory to reduce
> > access to
> > > (much) slower backing disks. Under memory pressure, these pages are
> > freed and
> > > when needed again, they have to be read from disks again. When
> > combined working
> > > set of all running application exceeds amount of physical RAM, we get
> > extereme
> > > slowdown as reading a page from disk can take time in order of
> > milliseconds.
> >
> > <snip>
> >
> > Given that there were a lot of comments and changes for this series,
> > can
> > you resend them with your updates so I can then apply them if they are
> > acceptable to everyone?
> >
> > thanks,
> > greg k-h
>
> Hi Greg --
>
> Nitin's zcache code is dependent on the cleancache series:
> http://lkml.org/lkml/2010/6/21/411

Ah, I didn't realize that. Hm, that makes it something that I can't
take until that code is upstream, sorry.

> The cleancache series has not changed since V3 (other than
> fixing a couple of documentation typos) and didn't receive any
> comments other than Christoph's concern that there weren't
> any users... which I think has been since addressed with the
> posting of the Xen tmem driver code and Nitin's zcache.
>
> If you are ready to apply the cleancache series, great!
> If not, please let me know next steps so cleancache isn't
> an impediment for applying the zcache series.

I don't know; work with the kernel developers to resolve the issues they
pointed out in the cleancache code, and once it goes in, I can take
these patches.

good luck,

greg k-h

2010-07-23 17:37:24

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support


----- "Nitin Gupta" <[email protected]> wrote:

> Frequently accessed filesystem data is stored in memory to reduce
> access to
> (much) slower backing disks. Under memory pressure, these pages are
> freed and
> when needed again, they have to be read from disks again. When
> combined working
> set of all running application exceeds amount of physical RAM, we get
> extereme
> slowdown as reading a page from disk can take time in order of
> milliseconds.
>
> Memory compression increases effective memory size and allows more
> pages to
> stay in RAM. Since de/compressing memory pages is several orders of
> magnitude
> faster than disk I/O, this can provide signifant performance gains for
> many
> workloads. Also, with multi-cores becoming common, benefits of reduced
> disk I/O
> should easily outweigh the problem of increased CPU usage.
>
> It is implemented as a "backend" for cleancache_ops [1] which
> provides
> callbacks for events such as when a page is to be removed from the
> page cache
> and when it is required again. We use them to implement a 'second
> chance' cache
> for these evicted page cache pages by compressing and storing them in
> memory
> itself.
>
> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> chunks are
> stored using xvmalloc memory allocator which is already being used by
> zram
> driver for the same purpose. Zero-filled pages are checked and no
> memory is
> allocated for them.
>
> A separate "pool" is created for each mount instance for a
> cleancache-aware
> filesystem. Each incoming page is identified with <pool_id, inode_no,
> index>
> where inode_no identifies file within the filesystem corresponding to
> pool_id
> and index is offset of the page within this inode. Within a pool,
> inodes are
> maintained in an rb-tree and each of its nodes points to a separate
> radix-tree
> which maintains list of pages within that inode.
>
> While compression reduces disk I/O, it also reduces the space
> available for
> normal (uncompressed) page cache. This can result in more frequent
> page cache
> reclaim and thus higher CPU overhead. Thus, it's important to maintain
> good hit
> rate for compressed cache or increased CPU overhead can nullify any
> other
> benefits. This requires adaptive (compressed) cache resizing and page
> replacement policies that can maintain optimal cache size and quickly
> reclaim
> unused compressed chunks. This work is yet to be done. However, in the
> current
> state, it allows manually resizing cache size using (per-pool) sysfs
> node
> 'memlimit' which in turn frees any excess pages *sigh* randomly.
>
> Finally, it uses percpu stats and compression buffers to allow better
> performance on multi-cores. Still, there are known bottlenecks like a
> single
> xvmalloc mempool per zcache pool and few others. I will work on this
> when I
> start with profiling.
>
> * Performance numbers:
> - Tested using iozone filesystem benchmark
> - 4 CPUs, 1G RAM
> - Read performance gain: ~2.5X
> - Random read performance gain: ~3X
> - In general, performance gains for every kind of I/O
>
> Test details with graphs can be found here:
> http://code.google.com/p/compcache/wiki/zcacheIOzone
>
> If I can get some help with testing, it would be intersting to find
> its
> effect in more real-life workloads. In particular, I'm intersted in
> finding
> out its effect in KVM virtualization case where it can potentially
> allow
> running more number of VMs per-host for a given amount of RAM. With
> zcache
> enabled, VMs can be assigned much smaller amount of memory since host
> can now
> hold bulk of page-cache pages, allowing VMs to maintain similar level
> of
> performance while a greater number of them can be hosted.
>
> * How to test:
> All patches are against 2.6.35-rc5:
>
> - First, apply all prerequisite patches here:
> http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
>
> - Then apply this patch series; also uploaded here:
> http://compcache.googlecode.com/hg/sub-projects/zcache_patches
>
>
> Nitin Gupta (8):
> Allow sharing xvmalloc for zram and zcache
> Basic zcache functionality
> Create sysfs nodes and export basic statistics
> Shrink zcache based on memlimit
> Eliminate zero-filled pages
> Compress pages using LZO
> Use xvmalloc to store compressed chunks
> Document sysfs entries
>
> Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
> drivers/staging/Makefile | 2 +
> drivers/staging/zram/Kconfig | 22 +
> drivers/staging/zram/Makefile | 5 +-
> drivers/staging/zram/xvmalloc.c | 8 +
> drivers/staging/zram/zcache_drv.c | 1312
> ++++++++++++++++++++++
> drivers/staging/zram/zcache_drv.h | 90 ++
> 7 files changed, 1491 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
> create mode 100644 drivers/staging/zram/zcache_drv.c
> create mode 100644 drivers/staging/zram/zcache_drv.h
I tested those patches on top of the Linus tree at commit d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, and the OOM killer was triggered even though there still appeared to be plenty of free swap.

# free -m
                   total       used       free     shared    buffers     cached
Mem:                 852        379        473          0          3         15
-/+ buffers/cache:              359        492
Swap:               2015         14       2001

# ./usemem 1024
0: Mallocing 32 megabytes
1: Mallocing 32 megabytes
2: Mallocing 32 megabytes
3: Mallocing 32 megabytes
4: Mallocing 32 megabytes
5: Mallocing 32 megabytes
6: Mallocing 32 megabytes
7: Mallocing 32 megabytes
8: Mallocing 32 megabytes
9: Mallocing 32 megabytes
10: Mallocing 32 megabytes
11: Mallocing 32 megabytes
12: Mallocing 32 megabytes
13: Mallocing 32 megabytes
14: Mallocing 32 megabytes
15: Mallocing 32 megabytes
Connection to 192.168.122.193 closed.

usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
usemem cpuset=/ mems_allowed=0
Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff81140a69>] alloc_page_vma+0x89/0x140
[<ffffffff81125f76>] handle_mm_fault+0x6d6/0x990
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81121afd>] ? follow_page+0x19d/0x350
[<ffffffff8112639c>] __get_user_pages+0x16c/0x480
[<ffffffff810127c9>] ? sched_clock+0x9/0x10
[<ffffffff811276ef>] __mlock_vma_pages_range+0xef/0x1f0
[<ffffffff81127f01>] mlock_vma_pages_range+0x91/0xa0
[<ffffffff8112ad57>] mmap_region+0x307/0x5b0
[<ffffffff8112b354>] do_mmap_pgoff+0x354/0x3a0
[<ffffffff8112b3fc>] ? sys_mmap_pgoff+0x5c/0x200
[<ffffffff8112b41a>] sys_mmap_pgoff+0x7a/0x200
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100fa09>] sys_mmap+0x29/0x30
[<ffffffff8100b032>] system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 140
CPU 1: hi: 186, btch: 31 usd: 47
active_anon:128 inactive_anon:140 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:125 unstable:0
free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:560kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
215 pages in swap cache
Swap cache stats: add 19633, delete 19418, find 941/1333
Free swap = 2051080kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4832 pages shared
155665 pages non-shared
Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a child
Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB, file-rss:600kB
console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff8114522e>] kmem_getpages+0x6e/0x180
[<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
[<ffffffff81147602>] ? cache_grow+0x4b2/0x520
[<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
[<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
[<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
[<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
[<ffffffff81063a4b>] do_exit+0x12b/0x890
[<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
[<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
[<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff8106420e>] do_group_exit+0x5e/0xd0
[<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
[<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
[<ffffffff8100a2e5>] do_signal+0x75/0x7b0
[<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
[<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
[<ffffffff8100aa85>] do_notify_resume+0x65/0x80
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100b381>] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 151
CPU 1: hi: 186, btch: 31 usd: 61
active_anon:128 inactive_anon:165 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:25 unstable:0
free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:660kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
1146 total pagecache pages
230 pages in swap cache
Swap cache stats: add 19649, delete 19419, find 942/1336
Free swap = 2051084kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4818 pages shared
155685 pages non-shared
Out of memory: kill process 1806 (sshd) score 9474 or a child
Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB
console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff8114522e>] kmem_getpages+0x6e/0x180
[<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
[<ffffffff81147602>] ? cache_grow+0x4b2/0x520
[<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
[<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
[<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
[<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
[<ffffffff81063a4b>] do_exit+0x12b/0x890
[<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
[<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
[<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff8106420e>] do_group_exit+0x5e/0xd0
[<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
[<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
[<ffffffff8100a2e5>] do_signal+0x75/0x7b0
[<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
[<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
[<ffffffff8100aa85>] do_notify_resume+0x65/0x80
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100b381>] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 119
CPU 1: hi: 186, btch: 31 usd: 73
active_anon:50 inactive_anon:175 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:25 unstable:0
free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:200kB inactive_anon:700kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
221 pages in swap cache
Swap cache stats: add 19848, delete 19627, find 970/1386
Free swap = 2051428kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4669 pages shared
155659 pages non-shared
Out of memory: kill process 1829 (usemem) score 8253 or a child
Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB, file-rss:376kB

# cat usemem.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#define CHUNKS 32

int
main(int argc, char *argv[])
{
        mlockall(MCL_FUTURE);

        unsigned long mb;
        char *buf[CHUNKS];
        int i;

        if (argc < 2) {
                fprintf(stderr, "usage: usemem megabytes\n");
                exit(1);
        }
        mb = strtoul(argv[1], NULL, 0);

        for (i = 0; i < CHUNKS; i++) {
                fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
                buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
                if (!buf[i]) {
                        fprintf(stderr, "malloc failure\n");
                        exit(1);
                }
        }

        for (i = 0; i < CHUNKS; i++) {
                fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
                        i, mb/CHUNKS, buf[i]);
                memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
        }

        exit(0);
}

2010-07-23 17:42:27

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support


----- [email protected] wrote:

> ----- "Nitin Gupta" <[email protected]> wrote:
>
> > Frequently accessed filesystem data is stored in memory to reduce
> > access to
> > (much) slower backing disks. Under memory pressure, these pages are
> > freed and
> > when needed again, they have to be read from disks again. When
> > combined working
> > set of all running application exceeds amount of physical RAM, we
> get
> > extereme
> > slowdown as reading a page from disk can take time in order of
> > milliseconds.
> >
> > Memory compression increases effective memory size and allows more
> > pages to
> > stay in RAM. Since de/compressing memory pages is several orders of
> > magnitude
> > faster than disk I/O, this can provide signifant performance gains
> for
> > many
> > workloads. Also, with multi-cores becoming common, benefits of
> reduced
> > disk I/O
> > should easily outweigh the problem of increased CPU usage.
> >
> > It is implemented as a "backend" for cleancache_ops [1] which
> > provides
> > callbacks for events such as when a page is to be removed from the
> > page cache
> > and when it is required again. We use them to implement a 'second
> > chance' cache
> > for these evicted page cache pages by compressing and storing them
> in
> > memory
> > itself.
> >
> > We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> > chunks are
> > stored using xvmalloc memory allocator which is already being used
> by
> > zram
> > driver for the same purpose. Zero-filled pages are checked and no
> > memory is
> > allocated for them.
> >
> > A separate "pool" is created for each mount instance for a
> > cleancache-aware
> > filesystem. Each incoming page is identified with <pool_id,
> inode_no,
> > index>
> > where inode_no identifies file within the filesystem corresponding
> to
> > pool_id
> > and index is offset of the page within this inode. Within a pool,
> > inodes are
> > maintained in an rb-tree and each of its nodes points to a separate
> > radix-tree
> > which maintains list of pages within that inode.
> >
> > While compression reduces disk I/O, it also reduces the space
> > available for
> > normal (uncompressed) page cache. This can result in more frequent
> > page cache
> > reclaim and thus higher CPU overhead. Thus, it's important to
> maintain
> > good hit
> > rate for compressed cache or increased CPU overhead can nullify any
> > other
> > benefits. This requires adaptive (compressed) cache resizing and
> page
> > replacement policies that can maintain optimal cache size and
> quickly
> > reclaim
> > unused compressed chunks. This work is yet to be done. However, in
> the
> > current
> > state, it allows manually resizing cache size using (per-pool)
> sysfs
> > node
> > 'memlimit' which in turn frees any excess pages *sigh* randomly.
> >
> > Finally, it uses percpu stats and compression buffers to allow
> better
> > performance on multi-cores. Still, there are known bottlenecks like
> a
> > single
> > xvmalloc mempool per zcache pool and few others. I will work on
> this
> > when I
> > start with profiling.
> >
> > * Performance numbers:
> > - Tested using iozone filesystem benchmark
> > - 4 CPUs, 1G RAM
> > - Read performance gain: ~2.5X
> > - Random read performance gain: ~3X
> > - In general, performance gains for every kind of I/O
> >
> > Test details with graphs can be found here:
> > http://code.google.com/p/compcache/wiki/zcacheIOzone
> >
> > If I can get some help with testing, it would be intersting to find
> > its
> > effect in more real-life workloads. In particular, I'm intersted in
> > finding
> > out its effect in KVM virtualization case where it can potentially
> > allow
> > running more number of VMs per-host for a given amount of RAM. With
> > zcache
> > enabled, VMs can be assigned much smaller amount of memory since
> host
> > can now
> > hold bulk of page-cache pages, allowing VMs to maintain similar
> level
> > of
> > performance while a greater number of them can be hosted.
> >
> > * How to test:
> > All patches are against 2.6.35-rc5:
> >
> > - First, apply all prerequisite patches here:
> > http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
> >
> > - Then apply this patch series; also uploaded here:
> > http://compcache.googlecode.com/hg/sub-projects/zcache_patches
> >
> >
> > Nitin Gupta (8):
> > Allow sharing xvmalloc for zram and zcache
> > Basic zcache functionality
> > Create sysfs nodes and export basic statistics
> > Shrink zcache based on memlimit
> > Eliminate zero-filled pages
> > Compress pages using LZO
> > Use xvmalloc to store compressed chunks
> > Document sysfs entries
> >
> > Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
> > drivers/staging/Makefile | 2 +
> > drivers/staging/zram/Kconfig | 22 +
> > drivers/staging/zram/Makefile | 5 +-
> > drivers/staging/zram/xvmalloc.c | 8 +
> > drivers/staging/zram/zcache_drv.c | 1312
> > ++++++++++++++++++++++
> > drivers/staging/zram/zcache_drv.h | 90 ++
> > 7 files changed, 1491 insertions(+), 1 deletions(-)
> > create mode 100644
> Documentation/ABI/testing/sysfs-kernel-mm-zcache
> > create mode 100644 drivers/staging/zram/zcache_drv.c
> > create mode 100644 drivers/staging/zram/zcache_drv.h
> By tested those patches on the top of the linus tree at this commit
> d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, the OOM was trigger even
> though there looked like still lots of swap.
>
> # free -m
> total used free shared buffers
> cached
> Mem: 852 379 473 0 3
> 15
> -/+ buffers/cache: 359 492
> Swap: 2015 14 2001
>
> # ./usemem 1024
> 0: Mallocing 32 megabytes
> 1: Mallocing 32 megabytes
> 2: Mallocing 32 megabytes
> 3: Mallocing 32 megabytes
> 4: Mallocing 32 megabytes
> 5: Mallocing 32 megabytes
> 6: Mallocing 32 megabytes
> 7: Mallocing 32 megabytes
> 8: Mallocing 32 megabytes
> 9: Mallocing 32 megabytes
> 10: Mallocing 32 megabytes
> 11: Mallocing 32 megabytes
> 12: Mallocing 32 megabytes
> 13: Mallocing 32 megabytes
> 14: Mallocing 32 megabytes
> 15: Mallocing 32 megabytes
> Connection to 192.168.122.193 closed.
>
> usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
> usemem cpuset=/ mems_allowed=0
> Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> [<ffffffff81108520>] dump_header+0x70/0x190
> [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> [<ffffffff81140a69>] alloc_page_vma+0x89/0x140
> [<ffffffff81125f76>] handle_mm_fault+0x6d6/0x990
> [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> [<ffffffff81121afd>] ? follow_page+0x19d/0x350
> [<ffffffff8112639c>] __get_user_pages+0x16c/0x480
> [<ffffffff810127c9>] ? sched_clock+0x9/0x10
> [<ffffffff811276ef>] __mlock_vma_pages_range+0xef/0x1f0
> [<ffffffff81127f01>] mlock_vma_pages_range+0x91/0xa0
> [<ffffffff8112ad57>] mmap_region+0x307/0x5b0
> [<ffffffff8112b354>] do_mmap_pgoff+0x354/0x3a0
> [<ffffffff8112b3fc>] ? sys_mmap_pgoff+0x5c/0x200
> [<ffffffff8112b41a>] sys_mmap_pgoff+0x7a/0x200
> [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff8100fa09>] sys_mmap+0x29/0x30
> [<ffffffff8100b032>] system_call_fastpath+0x16/0x1b
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 140
> CPU 1: hi: 186, btch: 31 usd: 47
> active_anon:128 inactive_anon:140 isolated_anon:0
> active_file:0 inactive_file:9 isolated_file:0
> unevictable:126855 dirty:0 writeback:125 unstable:0
> free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
> mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
> pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> active_anon:512kB inactive_anon:560kB active_file:0kB
> inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> 1146 total pagecache pages
> 215 pages in swap cache
> Swap cache stats: add 19633, delete 19418, find 941/1333
> Free swap = 2051080kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4832 pages shared
> 155665 pages non-shared
> Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a
> child
> Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB,
> file-rss:600kB
> console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> console-kit-dae cpuset=/ mems_allowed=0
> Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> [<ffffffff81108520>] dump_header+0x70/0x190
> [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> [<ffffffff8114522e>] kmem_getpages+0x6e/0x180
> [<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
> [<ffffffff81147602>] ? cache_grow+0x4b2/0x520
> [<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
> [<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
> [<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
> [<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
> [<ffffffff81063a4b>] do_exit+0x12b/0x890
> [<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
> [<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
> [<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106420e>] do_group_exit+0x5e/0xd0
> [<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
> [<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
> [<ffffffff8100a2e5>] do_signal+0x75/0x7b0
> [<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
> [<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
> [<ffffffff8100aa85>] do_notify_resume+0x65/0x80
> [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff8100b381>] int_signal+0x12/0x17
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 151
> CPU 1: hi: 186, btch: 31 usd: 61
> active_anon:128 inactive_anon:165 isolated_anon:0
> active_file:0 inactive_file:9 isolated_file:0
> unevictable:126855 dirty:0 writeback:25 unstable:0
> free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
> mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
> pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB
> active_anon:512kB inactive_anon:660kB active_file:0kB
> inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
> 1146 total pagecache pages
> 230 pages in swap cache
> Swap cache stats: add 19649, delete 19419, find 942/1336
> Free swap = 2051084kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4818 pages shared
> 155685 pages non-shared
> Out of memory: kill process 1806 (sshd) score 9474 or a child
> Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB
> console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> console-kit-dae cpuset=/ mems_allowed=0
> Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> [<ffffffff81108520>] dump_header+0x70/0x190
> [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> [<ffffffff8114522e>] kmem_getpages+0x6e/0x180
> [<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
> [<ffffffff81147602>] ? cache_grow+0x4b2/0x520
> [<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
> [<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
> [<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
> [<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
> [<ffffffff81063a4b>] do_exit+0x12b/0x890
> [<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
> [<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
> [<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8106420e>] do_group_exit+0x5e/0xd0
> [<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
> [<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
> [<ffffffff8100a2e5>] do_signal+0x75/0x7b0
> [<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
> [<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
> [<ffffffff8100aa85>] do_notify_resume+0x65/0x80
> [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff8100b381>] int_signal+0x12/0x17
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 119
> CPU 1: hi: 186, btch: 31 usd: 73
> active_anon:50 inactive_anon:175 isolated_anon:0
> active_file:0 inactive_file:9 isolated_file:0
> unevictable:126855 dirty:0 writeback:25 unstable:0
> free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
> mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
> pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> active_anon:200kB inactive_anon:700kB active_file:0kB
> inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536
> all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> 1146 total pagecache pages
> 221 pages in swap cache
> Swap cache stats: add 19848, delete 19627, find 970/1386
> Free swap = 2051428kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4669 pages shared
> 155659 pages non-shared
> Out of memory: kill process 1829 (usemem) score 8253 or a child
> Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB,
> file-rss:376kB
>
> # cat usemem.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #define CHUNKS 32
>
> int
> main(int argc, char *argv[])
> {
> mlockall(MCL_FUTURE);
>
> unsigned long mb;
> char *buf[CHUNKS];
> int i;
>
> if (argc < 2) {
> fprintf(stderr, "usage: usemem megabytes\n");
> exit(1);
> }
> mb = strtoul(argv[1], NULL, 0);
>
> for (i = 0; i < CHUNKS; i++) {
> fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
> buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
> if (!buf[i]) {
> fprintf(stderr, "malloc failure\n");
> exit(1);
> }
> }
>
> for (i = 0; i < CHUNKS; i++) {
> fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
> i, mb/CHUNKS, buf[i]);
> memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
> }
>
>
> exit(0);
> }
>
In case it is ever relevant: this was tested inside a KVM guest. The host was RHEL6 with THP enabled.
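
For context on the dump above: usemem calls mlockall(MCL_FUTURE) before allocating, so every page it later touches is locked into RAM and counted as unevictable. That lines up with the roughly 495 MB of mlocked/unevictable memory in the OOM report and explains why the killer fires even with plenty of free swap. The stand-alone snippet below is only an illustration of that behaviour, not part of the original thread:

/*
 * Stand-alone illustration (not from the thread): with MCL_FUTURE set,
 * every mapping created afterwards is locked into RAM, so its pages
 * become unevictable and can never be swapped out.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	if (mlockall(MCL_FUTURE) != 0)
		perror("mlockall");	/* needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK */

	size_t sz = 32UL << 20;		/* 32 MiB, same chunk size as usemem */
	char *buf = malloc(sz);		/* mapped after mlockall ... */

	if (!buf)
		return 1;
	memset(buf, 0, sz);		/* ... so these pages are now mlocked */

	/* VmLck in /proc/self/status reflects the locked size. */
	printf("touched %zu MiB of locked memory\n", sz >> 20);
	return 0;
}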

2010-07-23 18:03:18

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

Ignore me. The test case should not be using mlockall()!

----- "CAI Qian" <[email protected]> wrote:

> ----- [email protected] wrote:
>
> > ----- "Nitin Gupta" <[email protected]> wrote:
> >
> > > Frequently accessed filesystem data is stored in memory to reduce
> > > access to
> > > (much) slower backing disks. Under memory pressure, these pages
> are
> > > freed and
> > > when needed again, they have to be read from disks again. When
> > > combined working
> > > set of all running application exceeds amount of physical RAM, we
> > get
> > > extereme
> > > slowdown as reading a page from disk can take time in order of
> > > milliseconds.
> > >
> > > Memory compression increases effective memory size and allows
> more
> > > pages to
> > > stay in RAM. Since de/compressing memory pages is several orders
> of
> > > magnitude
> > > faster than disk I/O, this can provide signifant performance
> gains
> > for
> > > many
> > > workloads. Also, with multi-cores becoming common, benefits of
> > reduced
> > > disk I/O
> > > should easily outweigh the problem of increased CPU usage.
> > >
> > > It is implemented as a "backend" for cleancache_ops [1] which
> > > provides
> > > callbacks for events such as when a page is to be removed from
> the
> > > page cache
> > > and when it is required again. We use them to implement a 'second
> > > chance' cache
> > > for these evicted page cache pages by compressing and storing
> them
> > in
> > > memory
> > > itself.
> > >
> > > We only keep pages that compress to PAGE_SIZE/2 or less.
> Compressed
> > > chunks are
> > > stored using xvmalloc memory allocator which is already being
> used
> > by
> > > zram
> > > driver for the same purpose. Zero-filled pages are checked and no
> > > memory is
> > > allocated for them.
> > >
> > > A separate "pool" is created for each mount instance for a
> > > cleancache-aware
> > > filesystem. Each incoming page is identified with <pool_id,
> > inode_no,
> > > index>
> > > where inode_no identifies file within the filesystem
> corresponding
> > to
> > > pool_id
> > > and index is offset of the page within this inode. Within a pool,
> > > inodes are
> > > maintained in an rb-tree and each of its nodes points to a
> separate
> > > radix-tree
> > > which maintains list of pages within that inode.
> > >
> > > While compression reduces disk I/O, it also reduces the space
> > > available for
> > > normal (uncompressed) page cache. This can result in more
> frequent
> > > page cache
> > > reclaim and thus higher CPU overhead. Thus, it's important to
> > maintain
> > > good hit
> > > rate for compressed cache or increased CPU overhead can nullify
> any
> > > other
> > > benefits. This requires adaptive (compressed) cache resizing and
> > page
> > > replacement policies that can maintain optimal cache size and
> > quickly
> > > reclaim
> > > unused compressed chunks. This work is yet to be done. However,
> in
> > the
> > > current
> > > state, it allows manually resizing cache size using (per-pool)
> > sysfs
> > > node
> > > 'memlimit' which in turn frees any excess pages *sigh* randomly.
> > >
> > > Finally, it uses percpu stats and compression buffers to allow
> > better
> > > performance on multi-cores. Still, there are known bottlenecks
> like
> > a
> > > single
> > > xvmalloc mempool per zcache pool and few others. I will work on
> > this
> > > when I
> > > start with profiling.
> > >
> > > * Performance numbers:
> > > - Tested using iozone filesystem benchmark
> > > - 4 CPUs, 1G RAM
> > > - Read performance gain: ~2.5X
> > > - Random read performance gain: ~3X
> > > - In general, performance gains for every kind of I/O
> > >
> > > Test details with graphs can be found here:
> > > http://code.google.com/p/compcache/wiki/zcacheIOzone
> > >
> > > If I can get some help with testing, it would be intersting to
> find
> > > its
> > > effect in more real-life workloads. In particular, I'm intersted
> in
> > > finding
> > > out its effect in KVM virtualization case where it can
> potentially
> > > allow
> > > running more number of VMs per-host for a given amount of RAM.
> With
> > > zcache
> > > enabled, VMs can be assigned much smaller amount of memory since
> > host
> > > can now
> > > hold bulk of page-cache pages, allowing VMs to maintain similar
> > level
> > > of
> > > performance while a greater number of them can be hosted.
> > >
> > > * How to test:
> > > All patches are against 2.6.35-rc5:
> > >
> > > - First, apply all prerequisite patches here:
> > >
> http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
> > >
> > > - Then apply this patch series; also uploaded here:
> > > http://compcache.googlecode.com/hg/sub-projects/zcache_patches
> > >
> > >
> > > Nitin Gupta (8):
> > > Allow sharing xvmalloc for zram and zcache
> > > Basic zcache functionality
> > > Create sysfs nodes and export basic statistics
> > > Shrink zcache based on memlimit
> > > Eliminate zero-filled pages
> > > Compress pages using LZO
> > > Use xvmalloc to store compressed chunks
> > > Document sysfs entries
> > >
> > > Documentation/ABI/testing/sysfs-kernel-mm-zcache | 53 +
> > > drivers/staging/Makefile | 2 +
> > > drivers/staging/zram/Kconfig | 22 +
> > > drivers/staging/zram/Makefile | 5 +-
> > > drivers/staging/zram/xvmalloc.c | 8 +
> > > drivers/staging/zram/zcache_drv.c | 1312
> > > ++++++++++++++++++++++
> > > drivers/staging/zram/zcache_drv.h | 90 ++
> > > 7 files changed, 1491 insertions(+), 1 deletions(-)
> > > create mode 100644
> > Documentation/ABI/testing/sysfs-kernel-mm-zcache
> > > create mode 100644 drivers/staging/zram/zcache_drv.c
> > > create mode 100644 drivers/staging/zram/zcache_drv.h
> > By tested those patches on the top of the linus tree at this commit
> > d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, the OOM was trigger even
> > though there looked like still lots of swap.
> >
> > # free -m
> > total used free shared buffers
> > cached
> > Mem: 852 379 473 0 3
>
> > 15
> > -/+ buffers/cache: 359 492
> > Swap: 2015 14 2001
> >
> > # ./usemem 1024
> > 0: Mallocing 32 megabytes
> > 1: Mallocing 32 megabytes
> > 2: Mallocing 32 megabytes
> > 3: Mallocing 32 megabytes
> > 4: Mallocing 32 megabytes
> > 5: Mallocing 32 megabytes
> > 6: Mallocing 32 megabytes
> > 7: Mallocing 32 megabytes
> > 8: Mallocing 32 megabytes
> > 9: Mallocing 32 megabytes
> > 10: Mallocing 32 megabytes
> > 11: Mallocing 32 megabytes
> > 12: Mallocing 32 megabytes
> > 13: Mallocing 32 megabytes
> > 14: Mallocing 32 megabytes
> > 15: Mallocing 32 megabytes
> > Connection to 192.168.122.193 closed.
> >
> > usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
> > usemem cpuset=/ mems_allowed=0
> > Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> > [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> > [<ffffffff81108520>] dump_header+0x70/0x190
> > [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> > [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> > [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> > [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> > [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> > [<ffffffff81140a69>] alloc_page_vma+0x89/0x140
> > [<ffffffff81125f76>] handle_mm_fault+0x6d6/0x990
> > [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> > [<ffffffff81121afd>] ? follow_page+0x19d/0x350
> > [<ffffffff8112639c>] __get_user_pages+0x16c/0x480
> > [<ffffffff810127c9>] ? sched_clock+0x9/0x10
> > [<ffffffff811276ef>] __mlock_vma_pages_range+0xef/0x1f0
> > [<ffffffff81127f01>] mlock_vma_pages_range+0x91/0xa0
> > [<ffffffff8112ad57>] mmap_region+0x307/0x5b0
> > [<ffffffff8112b354>] do_mmap_pgoff+0x354/0x3a0
> > [<ffffffff8112b3fc>] ? sys_mmap_pgoff+0x5c/0x200
> > [<ffffffff8112b41a>] sys_mmap_pgoff+0x7a/0x200
> > [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [<ffffffff8100fa09>] sys_mmap+0x29/0x30
> > [<ffffffff8100b032>] system_call_fastpath+0x16/0x1b
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 140
> > CPU 1: hi: 186, btch: 31 usd: 47
> > active_anon:128 inactive_anon:140 isolated_anon:0
> > active_file:0 inactive_file:9 isolated_file:0
> > unevictable:126855 dirty:0 writeback:125 unstable:0
> > free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
> > mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> > inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> > present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> > shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> kernel_stack:0kB
> > pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> > pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> > active_anon:512kB inactive_anon:560kB active_file:0kB
> > inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> > isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> > writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> > slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> > all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB
> 0*512kB
> > 1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> > 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> > 1146 total pagecache pages
> > 215 pages in swap cache
> > Swap cache stats: add 19633, delete 19418, find 941/1333
> > Free swap = 2051080kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4832 pages shared
> > 155665 pages non-shared
> > Out of memory: kill process 1727 (console-kit-dae) score 1027939 or
> a
> > child
> > Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB,
> > file-rss:600kB
> > console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0,
> oom_adj=0
> > console-kit-dae cpuset=/ mems_allowed=0
> > Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> > [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> > [<ffffffff81108520>] dump_header+0x70/0x190
> > [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> > [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> > [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> > [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> > [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> > [<ffffffff8114522e>] kmem_getpages+0x6e/0x180
> > [<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
> > [<ffffffff81147602>] ? cache_grow+0x4b2/0x520
> > [<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
> > [<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
> > [<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
> > [<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
> > [<ffffffff81063a4b>] do_exit+0x12b/0x890
> > [<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
> > [<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
> > [<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
> > [<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
> > [<ffffffff8106420e>] do_group_exit+0x5e/0xd0
> > [<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
> > [<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
> > [<ffffffff8100a2e5>] do_signal+0x75/0x7b0
> > [<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
> > [<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
> > [<ffffffff8100aa85>] do_notify_resume+0x65/0x80
> > [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [<ffffffff8100b381>] int_signal+0x12/0x17
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 151
> > CPU 1: hi: 186, btch: 31 usd: 61
> > active_anon:128 inactive_anon:165 isolated_anon:0
> > active_file:0 inactive_file:9 isolated_file:0
> > unevictable:126855 dirty:0 writeback:25 unstable:0
> > free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
> > mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> > inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> > present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> > shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> kernel_stack:0kB
> > pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> > pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB
> > active_anon:512kB inactive_anon:660kB active_file:0kB
> > inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> > isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> > writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> > slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> > all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB
> 0*512kB
> > 1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> > 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
> > 1146 total pagecache pages
> > 230 pages in swap cache
> > Swap cache stats: add 19649, delete 19419, find 942/1336
> > Free swap = 2051084kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4818 pages shared
> > 155685 pages non-shared
> > Out of memory: kill process 1806 (sshd) score 9474 or a child
> > Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB,
> file-rss:656kB
> > console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0,
> oom_adj=0
> > console-kit-dae cpuset=/ mems_allowed=0
> > Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> > [<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
> > [<ffffffff81108520>] dump_header+0x70/0x190
> > [<ffffffff811086c1>] oom_kill_process+0x81/0x180
> > [<ffffffff81108c08>] __out_of_memory+0x58/0xd0
> > [<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
> > [<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
> > [<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
> > [<ffffffff8114522e>] kmem_getpages+0x6e/0x180
> > [<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
> > [<ffffffff81147602>] ? cache_grow+0x4b2/0x520
> > [<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
> > [<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
> > [<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
> > [<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
> > [<ffffffff81063a4b>] do_exit+0x12b/0x890
> > [<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
> > [<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
> > [<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
> > [<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
> > [<ffffffff8106420e>] do_group_exit+0x5e/0xd0
> > [<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
> > [<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
> > [<ffffffff8100a2e5>] do_signal+0x75/0x7b0
> > [<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
> > [<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
> > [<ffffffff8100aa85>] do_notify_resume+0x65/0x80
> > [<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [<ffffffff8100b381>] int_signal+0x12/0x17
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 119
> > CPU 1: hi: 186, btch: 31 usd: 73
> > active_anon:50 inactive_anon:175 isolated_anon:0
> > active_file:0 inactive_file:9 isolated_file:0
> > unevictable:126855 dirty:0 writeback:25 unstable:0
> > free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
> > mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> > inactive_anon:0kB active_file:0kB inactive_file:0kB
> > unevictable:11896kB isolated(anon):0kB isolated(file):0kB
> > present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
> > shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> kernel_stack:0kB
> > pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
> > pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> > active_anon:200kB inactive_anon:700kB active_file:0kB
> > inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> > isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> > writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> > slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB
> > unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536
> > all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB
> 0*512kB
> > 1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> > 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> > 1146 total pagecache pages
> > 221 pages in swap cache
> > Swap cache stats: add 19848, delete 19627, find 970/1386
> > Free swap = 2051428kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4669 pages shared
> > 155659 pages non-shared
> > Out of memory: kill process 1829 (usemem) score 8253 or a child
> > Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB,
> > file-rss:376kB
> >
> > # cat usemem.c
> > # cat usemem.c
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #define CHUNKS 32
> >
> > int
> > main(int argc, char *argv[])
> > {
> > mlockall(MCL_FUTURE);
> >
> > unsigned long mb;
> > char *buf[CHUNKS];
> > int i;
> >
> > if (argc < 2) {
> > fprintf(stderr, "usage: usemem megabytes\n");
> > exit(1);
> > }
> > mb = strtoul(argv[1], NULL, 0);
> >
> > for (i = 0; i < CHUNKS; i++) {
> > fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
> > buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
> > if (!buf[i]) {
> > fprintf(stderr, "malloc failure\n");
> > exit(1);
> > }
> > }
> >
> > for (i = 0; i < CHUNKS; i++) {
> > fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
> > i, mb/CHUNKS, buf[i]);
> > memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
> > }
> >
> >
> > exit(0);
> > }
> >
> If this ever be relevant, this was tested inside the kvm guest. The
> host was a RHEL6 with THP enabled.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-07-23 19:22:55

by Nitin Gupta

[permalink] [raw]
Subject: Re: [PATCH 4/8] Shrink zcache based on memlimit

On 07/21/2010 05:02 PM, Ed Tomlinson wrote:
> On Wednesday 21 July 2010 00:52:40 Nitin Gupta wrote:
>> On 07/21/2010 04:33 AM, Minchan Kim wrote:
>>> On Fri, Jul 16, 2010 at 9:37 PM, Nitin Gupta <[email protected]> wrote:
>>>> User can change (per-pool) memlimit using sysfs node:
>>>> /sys/kernel/mm/zcache/pool<id>/memlimit
>>>>
>>>> When memlimit is set to a value smaller than current
>>>> number of pages allocated for that pool, excess pages
>>>> are now freed immediately instead of waiting for get/
>>>> flush for these pages.
>>>>
>>>> Currently, victim page selection is essentially random.
>>>> Automatic cache resizing and better page replacement
>>>> policies will be implemented later.
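
As a usage sketch only (an editorial addition, not from the patch: the pool id "0" and the byte unit of memlimit are assumptions; the patches define the authoritative semantics), lowering the limit for a pool by writing the sysfs node named above could look like this:

/*
 * Sketch: shrink zcache pool 0 by lowering its memlimit sysfs node.
 * The pool id and the byte unit are assumptions for illustration.
 */
#include <stdio.h>

int main(void)
{
	const char *node = "/sys/kernel/mm/zcache/pool0/memlimit";
	FILE *f = fopen(node, "w");

	if (!f) {
		perror(node);
		return 1;
	}
	/* Request a ~64 MiB cap; excess compressed pages are freed at once. */
	fprintf(f, "%lu\n", 64UL * 1024 * 1024);
	fclose(f);
	return 0;
}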
>>>
>>> Okay. I know this isn't the end; I just want to raise a concern before you finish.
>>> I don't know how you plan to implement the reclaim policy.
>>> In the current implementation, you use memlimit to determine when reclaim happens.
>>> But I think we should also follow the VM's global reclaim policy.
>>> I mean, even if memlimit isn't reached, we should reclaim zcache when the
>>> system has trouble reclaiming memory.
>>
>> Yes, we should have a way to reclaim depending on system memory pressure
>> and also when the user explicitly requests it, i.e. when memlimit is lowered manually.
>>
>>> AFAIK, cleancache doesn't give any hint for that, so we should
>>> implement it in zcache itself.
>>
>> I think cleancache should be kept minimal, so yes, all reclaim policies should
>> go in the zcache layer only.
>>
>>> At first glance, we could use shrink_slab or an oom_notifier. But neither
>>> gives any per-zone information, even though global reclaim works
>>> per-zone.
>>> AFAIK, Nick is trying to implement a zone-aware shrink_slab. Also, if we
>>> need it, we could replace oom_notifier with a zone-aware one. Right now
>>> nobody seems to use oom_notifier, so I am not sure it's useful.
>>>
>>
>> I don't think we need these notifiers as we can simply create a thread
>> to monitor cache hit rate, system memory pressure etc. and shrink/expand
>> the cache accordingly.
>
> Nitin,
>
> Based on experience gained when adding the shrinker callbacks, I would
> strongly recommend you use them. I tried several hacks along the lines of
> what you are proposing before settling on the callbacks. They
> are effective and make sure that memory is released when it's required.
> With the other methods, memory would either not be released or would be
> released when it was not needed.
>


I had a similar experience with the "swap notify callback" -- yes, things
don't seem to work without a proper callback. I will check whether a suitable
callback already exists for OOM-like conditions or whether a new one can
be added easily.

Thanks,
Nitin
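
For reference, the shrinker callbacks Ed recommends are registered with register_shrinker(). The sketch below is only an illustration of that route: zcache_pool_pages() and zcache_evict_pages() are invented placeholders rather than real zcache symbols, and the callback signature follows roughly the 2.6.35-era struct shrinker (it has changed in later kernels):

/*
 * Rough sketch of wiring zcache into memory-pressure reclaim through a
 * shrinker.  zcache_pool_pages() and zcache_evict_pages() are made-up
 * placeholders, not real zcache symbols.
 */
#include <linux/mm.h>
#include <linux/module.h>

static int zcache_pool_pages(void)
{
	return 0;		/* would return pages held by the compressed pool */
}

static void zcache_evict_pages(int nr)
{
	/* would free up to nr compressed pages, coldest first */
}

static int zcache_shrink(struct shrinker *s, int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan)
		zcache_evict_pages(nr_to_scan);

	/* Always report the current pool size back to the VM. */
	return zcache_pool_pages();
}

static struct shrinker zcache_shrinker = {
	.shrink	= zcache_shrink,
	.seeks	= DEFAULT_SEEKS,
};

static int __init zcache_shrink_init(void)
{
	register_shrinker(&zcache_shrinker);
	return 0;
}

static void __exit zcache_shrink_exit(void)
{
	unregister_shrinker(&zcache_shrinker);
}

module_init(zcache_shrink_init);
module_exit(zcache_shrink_exit);
MODULE_LICENSE("GPL");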

2010-07-24 14:43:22

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH 0/8] zcache: page cache compression support

On Fri, 23 Jul 2010 14:02:16 EDT, CAI Qian said:
> Ignore me. The test case should not be using mlockall()!

I'm confused. I don't see any mlockall() call in the usemem.c you posted? Or
was what you posted not what you actually ran?

