From: Joonsoo Kim <[email protected]>
This patchset implements a deduplication feature in zram. The motivation
is to reduce the amount of memory used by zram. A detailed description of
the motivation and the experimental results can be found in patch 3, so
please refer to it.
Thanks.
Joonsoo Kim (4):
mm/zsmalloc: always set movable/highmem flag to the zspage
zram: introduce zram_entry to prepare dedup functionality
zram: implement deduplication in zram
zram: make deduplication feature optional
drivers/block/zram/zram_drv.c | 338 +++++++++++++++++++++++++++++++++++++-----
drivers/block/zram/zram_drv.h | 25 +++-
mm/zsmalloc.c | 10 +-
3 files changed, 330 insertions(+), 43 deletions(-)
--
1.9.1
From: Joonsoo Kim <[email protected]>
A zspage is always movable and is used through the zs_map_object()
function, which returns a directly accessible pointer to the zspage's
contents. This is independent of the user's allocation flags.
Therefore, it's better to always set the movable/highmem flags on the zspage.
After that, we don't need to clear __GFP_MOVABLE/__GFP_HIGHMEM in
cache_alloc_handle()/cache_alloc_zspage(), since no zs_malloc()
caller specifies __GFP_MOVABLE/__GFP_HIGHMEM.
Signed-off-by: Joonsoo Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 9 ++-------
mm/zsmalloc.c | 10 ++++------
2 files changed, 6 insertions(+), 13 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0194441..f65dcd1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -684,19 +684,14 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
*/
if (!handle)
handle = zs_malloc(meta->mem_pool, clen,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
if (!handle) {
zcomp_stream_put(zram->comp);
zstrm = NULL;
atomic64_inc(&zram->stats.writestall);
- handle = zs_malloc(meta->mem_pool, clen,
- GFP_NOIO | __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO);
if (handle)
goto compress_again;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b7ee9c3..fada232 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -347,8 +347,7 @@ static void destroy_cache(struct zs_pool *pool)
static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
{
- return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
- gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+ return (unsigned long)kmem_cache_alloc(pool->handle_cachep, gfp);
}
static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -358,9 +357,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
{
- return kmem_cache_alloc(pool->zspage_cachep,
- flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
-}
+ return kmem_cache_alloc(pool->zspage_cachep, flags);
+}
static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
{
@@ -1120,7 +1118,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
for (i = 0; i < class->pages_per_zspage; i++) {
struct page *page;
- page = alloc_page(gfp);
+ page = alloc_page(gfp | __GFP_MOVABLE | __GFP_HIGHMEM);
if (!page) {
while (--i >= 0) {
dec_zone_page_state(pages[i], NR_ZSPAGES);
--
1.9.1
From: Joonsoo Kim <[email protected]>
This patch implements a deduplication feature in zram. The purpose
of this work is, naturally, to reduce the amount of memory used by zram.
Android is one of the biggest users of zram as swap, and reducing
memory usage there is really important. There is a paper reporting
that the duplication ratio of Android's memory content is rather
high [1]. And there is similar work on zswap, which reports that
experiments have shown that around 10-15% of pages stored in zswap
are duplicates and that deduplicating them provides some benefit [2].
Also, there is a different kind of workload that uses zram as a block
device and stores build output on it to reduce the wear-out problem of
a real block device. In this workload, the deduplication hit rate is
very high, although I don't know the exact reason for it.
Anyway, if we can detect duplicated content and avoid storing the same
content in separate memory, we can save memory. This patch tries to
do that.
The implementation is fairly simple and intuitive, but I should note
one thing about an implementation detail.
To check for duplication, this patch uses a checksum of the page, and
collisions of this checksum are possible. There are many ways to handle
this situation, but this patch chooses to allow entries with a duplicated
checksum to be added to the hash while not comparing all entries with the
same checksum when checking for duplication. I guess a checksum collision
is quite a rare event, so we don't need to pay special attention to such
a case. Therefore, I chose the simplest way to implement the feature.
If there is a different opinion, I can accept it and go that way.
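To illustrate that policy, here is a simplified sketch of the lookup
(the locking, refcounting and stat updates of the real zram_entry_get()
in the diff below are left out, and the function name is made up for the
example). On a checksum hit only that one entry's content is verified;
on a content mismatch, i.e. a real collision, we just fall back to
compressing and storing the page as usual.

static struct zram_entry *dedup_lookup_sketch(struct zram *zram,
				unsigned char *mem, u32 checksum)
{
	struct zram_meta *meta = zram->meta;
	struct zram_hash *hash = &meta->hash[checksum % meta->hash_size];
	struct rb_node *node = hash->rb_root.rb_node;

	while (node) {
		struct zram_entry *entry =
			rb_entry(node, struct zram_entry, rb_node);

		if (checksum == entry->checksum) {
			/* compare content against this entry only */
			if (zram_entry_match(zram, entry, mem))
				return entry;
			return NULL;	/* collision: store as usual */
		}

		if (checksum < entry->checksum)
			node = node->rb_left;
		else
			node = node->rb_right;
	}

	return NULL;	/* no candidate with this checksum */
}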
Following is the result of this patch.
Test result #1 (Swap):
Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
orig_data_size: 145297408
compr_data_size: 32408125
mem_used_total: 32276480
dup_data_size: 3188134
meta_data_size: 1444272
The last two metrics added to mm_stat are related to this work.
The first, dup_data_size, is the amount of memory saved by avoiding
storing duplicated pages. The second, meta_data_size, is the amount of
memory consumed by the data structures that support deduplication.
If dup > meta, we can judge that the patch improves memory usage.
In Android, we can save 5% of memory usage with this work.
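(For reference, with the Test result #1 numbers above: the net saving is
dup_data_size - meta_data_size = 3188134 - 1444272 = 1743862 bytes, which
is roughly 5% of mem_used_total (32276480); presumably this is where the
5% figure comes from.)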
Test result #2 (Blockdev):
build the kernel and store output to ext4 FS on zram
<no-dedup>
Elapsed time: 249 s
mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
<dedup>
Elapsed time: 250 s
mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
There is no performance degradation, and it saves 23% of memory.
Test result #3 (Blockdev):
copy android build output dir(out/host) to ext4 FS on zram
<no-dedup>
Elapsed time: out/host: 88 s
mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
<dedup>
Elapsed time: out/host: 100 s
mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
It shows a performance degradation of roughly 13% and saves 24% of memory.
The slowdown is probably due to the overhead of checksum calculation and
comparison.
Test result #4 (Blockdev):
copy android build output dir(out/target/common) to ext4 FS on zram
<no-dedup>
Elapsed time: out/host: 203 s
mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
<dedup>
Elapsed time: out/host: 201 s
mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
Memory is reduced by 42% and performance is the same. Even though there
is overhead from checksum calculation and comparison, the large hit ratio
compensates for it, since a hit avoids a compression attempt.
Anyway, the benefit seems to be largely dependent on the workload, so the
following patch will make this feature optional. However, this feature
can help some use cases, so it deserves to be merged.
[1]: MemScope: Analyzing Memory Duplication on Android Systems,
dl.acm.org/citation.cfm?id=2797023
[2]: zswap: Optimize compressed pool memory utilization,
lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
Signed-off-by: Joonsoo Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 212 ++++++++++++++++++++++++++++++++++++++----
drivers/block/zram/zram_drv.h | 18 ++++
2 files changed, 214 insertions(+), 16 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a3d9cbca..012425f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -32,6 +32,7 @@
#include <linux/idr.h>
#include <linux/sysfs.h>
#include <linux/cpuhotplug.h>
+#include <linux/jhash.h>
#include "zram_drv.h"
@@ -385,14 +386,16 @@ static ssize_t mm_stat_show(struct device *dev,
max_used = atomic_long_read(&zram->stats.max_used_pages);
ret = scnprintf(buf, PAGE_SIZE,
- "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
+ "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
orig_size << PAGE_SHIFT,
(u64)atomic64_read(&zram->stats.compr_data_size),
mem_used << PAGE_SHIFT,
zram->limit_pages << PAGE_SHIFT,
max_used << PAGE_SHIFT,
(u64)atomic64_read(&zram->stats.same_pages),
- pool_stats.pages_compacted);
+ pool_stats.pages_compacted,
+ (u64)atomic64_read(&zram->stats.dup_data_size),
+ (u64)atomic64_read(&zram->stats.meta_data_size));
up_read(&zram->init_lock);
return ret;
@@ -419,29 +422,165 @@ static ssize_t debug_stat_show(struct device *dev,
static DEVICE_ATTR_RO(mm_stat);
static DEVICE_ATTR_RO(debug_stat);
-static struct zram_entry *zram_entry_alloc(struct zram_meta *meta,
+static u32 zram_calc_checksum(unsigned char *mem)
+{
+ return jhash(mem, PAGE_SIZE, 0);
+}
+
+static struct zram_entry *zram_entry_alloc(struct zram *zram,
unsigned int len, gfp_t flags)
{
+ struct zram_meta *meta = zram->meta;
struct zram_entry *entry;
+ unsigned long handle;
- entry = kzalloc(sizeof(*entry), flags);
- if (!entry)
+ handle = zs_malloc(meta->mem_pool, len, flags);
+ if (!handle)
return NULL;
- entry->handle = zs_malloc(meta->mem_pool, len, flags);
- if (!entry->handle) {
- kfree(entry);
+ entry = kzalloc(sizeof(*entry), flags);
+ if (!entry) {
+ zs_free(meta->mem_pool, handle);
return NULL;
}
+ entry->handle = handle;
+ RB_CLEAR_NODE(&entry->rb_node);
+ entry->refcount = 1;
+ entry->len = len;
+ atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
+
return entry;
}
-static inline void zram_entry_free(struct zram_meta *meta,
- struct zram_entry *entry)
+static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
+ u32 checksum)
+{
+ struct zram_meta *meta = zram->meta;
+ struct zram_hash *hash;
+ struct rb_root *rb_root;
+ struct rb_node **rb_node, *parent = NULL;
+ struct zram_entry *entry;
+
+ new->checksum = checksum;
+ hash = &meta->hash[checksum % meta->hash_size];
+ rb_root = &hash->rb_root;
+
+ spin_lock(&hash->lock);
+ rb_node = &rb_root->rb_node;
+ while (*rb_node) {
+ parent = *rb_node;
+ entry = rb_entry(parent, struct zram_entry, rb_node);
+ if (checksum < entry->checksum)
+ rb_node = &parent->rb_left;
+ else if (checksum > entry->checksum)
+ rb_node = &parent->rb_right;
+ else
+ rb_node = &parent->rb_left;
+ }
+
+ rb_link_node(&new->rb_node, parent, rb_node);
+ rb_insert_color(&new->rb_node, rb_root);
+ spin_unlock(&hash->lock);
+}
+
+static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
+ unsigned char *mem)
+{
+ bool match = false;
+ unsigned char *cmem;
+ struct zram_meta *meta = zram->meta;
+ struct zcomp_strm *zstrm;
+
+ cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
+ if (entry->len == PAGE_SIZE) {
+ match = !memcmp(mem, cmem, PAGE_SIZE);
+ } else {
+ zstrm = zcomp_stream_get(zram->comp);
+ if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
+ match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
+ zcomp_stream_put(zram->comp);
+ }
+ zs_unmap_object(meta->mem_pool, entry->handle);
+
+ return match;
+}
+
+static inline void zram_entry_free(struct zram *zram, struct zram_meta *meta,
+ struct zram_entry *entry)
{
zs_free(meta->mem_pool, entry->handle);
kfree(entry);
+ if (zram)
+ atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
+}
+
+static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
+ struct zram_entry *entry, bool populated)
+{
+ struct zram_hash *hash;
+ u32 checksum;
+
+ if (!populated)
+ goto free;
+
+ checksum = entry->checksum;
+ hash = &meta->hash[checksum % meta->hash_size];
+
+ spin_lock(&hash->lock);
+ entry->refcount--;
+ if (!entry->refcount) {
+ populated = false;
+ rb_erase(&entry->rb_node, &hash->rb_root);
+ RB_CLEAR_NODE(&entry->rb_node);
+ }
+ spin_unlock(&hash->lock);
+
+free:
+ if (!populated)
+ zram_entry_free(zram, meta, entry);
+
+ return populated;
+}
+
+static struct zram_entry *zram_entry_get(struct zram *zram,
+ unsigned char *mem, u32 checksum)
+{
+ struct zram_meta *meta = zram->meta;
+ struct zram_hash *hash;
+ struct zram_entry *entry;
+ struct rb_node *rb_node;
+
+ hash = &meta->hash[checksum % meta->hash_size];
+
+ spin_lock(&hash->lock);
+ rb_node = hash->rb_root.rb_node;
+ while (rb_node) {
+ entry = rb_entry(rb_node, struct zram_entry, rb_node);
+ if (checksum == entry->checksum) {
+ entry->refcount++;
+ atomic64_add(entry->len, &zram->stats.dup_data_size);
+ spin_unlock(&hash->lock);
+
+ if (zram_entry_match(zram, entry, mem))
+ return entry;
+
+ if (zram_entry_put(zram, meta, entry, true)) {
+ atomic64_sub(entry->len,
+ &zram->stats.dup_data_size);
+ }
+
+ return NULL;
+ }
+
+ if (checksum < entry->checksum)
+ rb_node = rb_node->rb_left;
+ else
+ rb_node = rb_node->rb_right;
+ }
+ spin_unlock(&hash->lock);
+
+ return NULL;
}
static void zram_meta_free(struct zram_meta *meta, u64 disksize)
@@ -459,18 +598,31 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize)
if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
continue;
- zram_entry_free(meta, entry);
+ zram_entry_put(NULL, meta, entry, true);
}
zs_destroy_pool(meta->mem_pool);
+ vfree(meta->hash);
vfree(meta->table);
kfree(meta);
}
+static void zram_meta_init(struct zram_meta *meta)
+{
+ int i;
+ struct zram_hash *hash;
+
+ for (i = 0; i < meta->hash_size; i++) {
+ hash = &meta->hash[i];
+ spin_lock_init(&hash->lock);
+ hash->rb_root = RB_ROOT;
+ }
+}
+
static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
{
size_t num_pages;
- struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL);
+ struct zram_meta *meta = kzalloc(sizeof(*meta), GFP_KERNEL);
if (!meta)
return NULL;
@@ -482,15 +634,27 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
goto out_error;
}
+ meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
+ meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
+ meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
+ meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
+ if (!meta->hash) {
+ pr_err("Error allocating zram entry hash\n");
+ goto out_error;
+ }
+
meta->mem_pool = zs_create_pool(pool_name);
if (!meta->mem_pool) {
pr_err("Error creating memory pool\n");
goto out_error;
}
+ zram_meta_init(meta);
+
return meta;
out_error:
+ vfree(meta->hash);
vfree(meta->table);
kfree(meta);
return NULL;
@@ -520,7 +684,8 @@ static void zram_free_page(struct zram *zram, size_t index)
if (!entry)
return;
- zram_entry_free(meta, entry);
+ if (zram_entry_put(zram, meta, entry, true))
+ atomic64_sub(entry->len, &zram->stats.dup_data_size);
atomic64_sub(zram_get_obj_size(meta, index),
&zram->stats.compr_data_size);
@@ -631,6 +796,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
struct zcomp_strm *zstrm = NULL;
unsigned long alloced_pages;
unsigned long element;
+ u32 checksum;
page = bvec->bv_page;
if (is_partial_io(bvec)) {
@@ -674,6 +840,18 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
goto out;
}
+ checksum = zram_calc_checksum(uncmem);
+ if (!entry) {
+ entry = zram_entry_get(zram, uncmem, checksum);
+ if (entry) {
+ if (!is_partial_io(bvec))
+ kunmap_atomic(user_mem);
+
+ clen = entry->len;
+ goto found_duplication;
+ }
+ }
+
zstrm = zcomp_stream_get(zram->comp);
ret = zcomp_compress(zstrm, uncmem, &clen);
if (!is_partial_io(bvec)) {
@@ -708,7 +886,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
* from the slow path and entry has already been allocated.
*/
if (!entry) {
- entry = zram_entry_alloc(meta, clen,
+ entry = zram_entry_alloc(zram, clen,
__GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
}
@@ -718,7 +896,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
atomic64_inc(&zram->stats.writestall);
- entry = zram_entry_alloc(meta, clen, GFP_NOIO);
+ entry = zram_entry_alloc(zram, clen, GFP_NOIO);
if (entry)
goto compress_again;
@@ -732,7 +910,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
update_used_max(zram, alloced_pages);
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
- zram_entry_free(meta, entry);
+ zram_entry_put(zram, meta, entry, false);
ret = -ENOMEM;
goto out;
}
@@ -750,7 +928,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
zcomp_stream_put(zram->comp);
zstrm = NULL;
zs_unmap_object(meta->mem_pool, entry->handle);
+ zram_entry_insert(zram, entry, checksum);
+found_duplication:
/*
* Free memory associated with this sector
* before overwriting unused sectors.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index a7ae46c..07d1f8d 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -18,6 +18,7 @@
#include <linux/rwsem.h>
#include <linux/zsmalloc.h>
#include <linux/crypto.h>
+#include <linux/spinlock.h>
#include "zcomp.h"
@@ -45,6 +46,10 @@
#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
+/* One slot will contain 128 pages theoretically */
+#define ZRAM_HASH_SHIFT 7
+#define ZRAM_HASH_SIZE_MIN (1 << 10)
+#define ZRAM_HASH_SIZE_MAX (1 << 31)
/*
* The lower ZRAM_FLAG_SHIFT bits of table.value is for
@@ -70,6 +75,10 @@ enum zram_pageflags {
/*-- Data structures */
struct zram_entry {
+ struct rb_node rb_node;
+ u32 len;
+ u32 checksum;
+ unsigned long refcount;
unsigned long handle;
};
@@ -94,11 +103,20 @@ struct zram_stats {
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
atomic64_t writestall; /* no. of write slow paths */
+ atomic64_t dup_data_size; /* compressed size of pages duplicated */
+ atomic64_t meta_data_size; /* size of zram_entries */
+};
+
+struct zram_hash {
+ spinlock_t lock;
+ struct rb_root rb_root;
};
struct zram_meta {
struct zram_table_entry *table;
struct zs_pool *mem_pool;
+ struct zram_hash *hash;
+ size_t hash_size;
};
struct zram {
--
1.9.1
From: Joonsoo Kim <[email protected]>
The following patch will implement deduplication functionality
in zram, and it requires an indirection layer to manage
the life cycle of the zsmalloc handle. To prepare for that, this patch
introduces zram_entry, which can be used to manage the life cycle
of a zsmalloc handle. Many lines are changed due to the rename, but the
core change is just the simple introduction of the new data structure.
Signed-off-by: Joonsoo Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 81 ++++++++++++++++++++++++++++---------------
drivers/block/zram/zram_drv.h | 6 +++-
2 files changed, 59 insertions(+), 28 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f65dcd1..a3d9cbca 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -419,6 +419,31 @@ static ssize_t debug_stat_show(struct device *dev,
static DEVICE_ATTR_RO(mm_stat);
static DEVICE_ATTR_RO(debug_stat);
+static struct zram_entry *zram_entry_alloc(struct zram_meta *meta,
+ unsigned int len, gfp_t flags)
+{
+ struct zram_entry *entry;
+
+ entry = kzalloc(sizeof(*entry), flags);
+ if (!entry)
+ return NULL;
+
+ entry->handle = zs_malloc(meta->mem_pool, len, flags);
+ if (!entry->handle) {
+ kfree(entry);
+ return NULL;
+ }
+
+ return entry;
+}
+
+static inline void zram_entry_free(struct zram_meta *meta,
+ struct zram_entry *entry)
+{
+ zs_free(meta->mem_pool, entry->handle);
+ kfree(entry);
+}
+
static void zram_meta_free(struct zram_meta *meta, u64 disksize)
{
size_t num_pages = disksize >> PAGE_SHIFT;
@@ -426,15 +451,15 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize)
/* Free all pages that are still in this zram device */
for (index = 0; index < num_pages; index++) {
- unsigned long handle = meta->table[index].handle;
+ struct zram_entry *entry = meta->table[index].entry;
/*
* No memory is allocated for same element filled pages.
* Simply clear same page flag.
*/
- if (!handle || zram_test_flag(meta, index, ZRAM_SAME))
+ if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
continue;
- zs_free(meta->mem_pool, handle);
+ zram_entry_free(meta, entry);
}
zs_destroy_pool(meta->mem_pool);
@@ -479,7 +504,7 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
static void zram_free_page(struct zram *zram, size_t index)
{
struct zram_meta *meta = zram->meta;
- unsigned long handle = meta->table[index].handle;
+ struct zram_entry *entry = meta->table[index].entry;
/*
* No memory is allocated for same element filled pages.
@@ -492,16 +517,16 @@ static void zram_free_page(struct zram *zram, size_t index)
return;
}
- if (!handle)
+ if (!entry)
return;
- zs_free(meta->mem_pool, handle);
+ zram_entry_free(meta, entry);
atomic64_sub(zram_get_obj_size(meta, index),
&zram->stats.compr_data_size);
atomic64_dec(&zram->stats.pages_stored);
- meta->table[index].handle = 0;
+ meta->table[index].entry = NULL;
zram_set_obj_size(meta, index, 0);
}
@@ -510,20 +535,20 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
int ret = 0;
unsigned char *cmem;
struct zram_meta *meta = zram->meta;
- unsigned long handle;
+ struct zram_entry *entry;
unsigned int size;
bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
- handle = meta->table[index].handle;
+ entry = meta->table[index].entry;
size = zram_get_obj_size(meta, index);
- if (!handle || zram_test_flag(meta, index, ZRAM_SAME)) {
+ if (!entry || zram_test_flag(meta, index, ZRAM_SAME)) {
bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
zram_fill_page(mem, PAGE_SIZE, meta->table[index].element);
return 0;
}
- cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_RO);
+ cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
if (size == PAGE_SIZE) {
copy_page(mem, cmem);
} else {
@@ -532,7 +557,7 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
ret = zcomp_decompress(zstrm, cmem, size, mem);
zcomp_stream_put(zram->comp);
}
- zs_unmap_object(meta->mem_pool, handle);
+ zs_unmap_object(meta->mem_pool, entry->handle);
bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
/* Should NEVER happen. Return bio error if it does. */
@@ -554,7 +579,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
page = bvec->bv_page;
bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
- if (unlikely(!meta->table[index].handle) ||
+ if (unlikely(!meta->table[index].entry) ||
zram_test_flag(meta, index, ZRAM_SAME)) {
bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
handle_same_page(bvec, meta->table[index].element);
@@ -599,7 +624,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
{
int ret = 0;
unsigned int clen;
- unsigned long handle = 0;
+ struct zram_entry *entry = NULL;
struct page *page;
unsigned char *user_mem, *cmem, *src, *uncmem = NULL;
struct zram_meta *meta = zram->meta;
@@ -670,29 +695,31 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
}
/*
- * handle allocation has 2 paths:
+ * entry allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
* per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
* since we can't sleep;
* b) slow path enables preemption and attempts to allocate
* the page with __GFP_DIRECT_RECLAIM bit set. we have to
* put per-cpu compression stream and, thus, to re-do
- * the compression once handle is allocated.
+ * the compression once entry is allocated.
*
- * if we have a 'non-null' handle here then we are coming
- * from the slow path and handle has already been allocated.
+ * if we have a 'non-null' entry here then we are coming
+ * from the slow path and entry has already been allocated.
*/
- if (!handle)
- handle = zs_malloc(meta->mem_pool, clen,
+ if (!entry) {
+ entry = zram_entry_alloc(meta, clen,
__GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
- if (!handle) {
+ }
+
+ if (!entry) {
zcomp_stream_put(zram->comp);
zstrm = NULL;
atomic64_inc(&zram->stats.writestall);
- handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO);
- if (handle)
+ entry = zram_entry_alloc(meta, clen, GFP_NOIO);
+ if (entry)
goto compress_again;
pr_err("Error allocating memory for compressed page: %u, size=%u\n",
@@ -705,12 +732,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
update_used_max(zram, alloced_pages);
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
- zs_free(meta->mem_pool, handle);
+ zram_entry_free(meta, entry);
ret = -ENOMEM;
goto out;
}
- cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
+ cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_WO);
if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) {
src = kmap_atomic(page);
@@ -722,7 +749,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
zcomp_stream_put(zram->comp);
zstrm = NULL;
- zs_unmap_object(meta->mem_pool, handle);
+ zs_unmap_object(meta->mem_pool, entry->handle);
/*
* Free memory associated with this sector
@@ -731,7 +758,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
zram_free_page(zram, index);
- meta->table[index].handle = handle;
+ meta->table[index].entry = entry;
zram_set_obj_size(meta, index, clen);
bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index caeff51..a7ae46c 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -69,10 +69,14 @@ enum zram_pageflags {
/*-- Data structures */
+struct zram_entry {
+ unsigned long handle;
+};
+
/* Allocated for each disk page */
struct zram_table_entry {
union {
- unsigned long handle;
+ struct zram_entry *entry;
unsigned long element;
};
unsigned long value;
--
1.9.1
From: Joonsoo Kim <[email protected]>
The benefit of deduplication depends on the workload, so it's not
desirable to always enable it. Therefore, make it optional.
Signed-off-by: Joonsoo Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 80 ++++++++++++++++++++++++++++++++++++++-----
drivers/block/zram/zram_drv.h | 1 +
2 files changed, 73 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 012425f..e45aa9f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -328,6 +328,39 @@ static ssize_t comp_algorithm_store(struct device *dev,
return len;
}
+static ssize_t use_dedup_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int val;
+ struct zram *zram = dev_to_zram(dev);
+
+ down_read(&zram->init_lock);
+ val = zram->use_dedup;
+ up_read(&zram->init_lock);
+
+ return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t use_dedup_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ int val;
+ struct zram *zram = dev_to_zram(dev);
+
+ if (kstrtoint(buf, 10, &val) || (val != 0 && val != 1))
+ return -EINVAL;
+
+ down_write(&zram->init_lock);
+ if (init_done(zram)) {
+ up_write(&zram->init_lock);
+ pr_info("Can't change dedup usage for initialized device\n");
+ return -EBUSY;
+ }
+ zram->use_dedup = val;
+ up_write(&zram->init_lock);
+ return len;
+}
+
static ssize_t compact_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
@@ -422,11 +455,23 @@ static ssize_t debug_stat_show(struct device *dev,
static DEVICE_ATTR_RO(mm_stat);
static DEVICE_ATTR_RO(debug_stat);
-static u32 zram_calc_checksum(unsigned char *mem)
+static u32 zram_calc_checksum(struct zram *zram, unsigned char *mem)
{
+ if (!zram->use_dedup)
+ return 0;
+
return jhash(mem, PAGE_SIZE, 0);
}
+static unsigned long zram_entry_handle(struct zram *zram,
+ struct zram_entry *entry)
+{
+ if (!zram->use_dedup)
+ return (unsigned long)entry;
+
+ return entry->handle;
+}
+
static struct zram_entry *zram_entry_alloc(struct zram *zram,
unsigned int len, gfp_t flags)
{
@@ -438,6 +483,9 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram,
if (!handle)
return NULL;
+ if (!zram->use_dedup)
+ return (struct zram_entry *)handle;
+
entry = kzalloc(sizeof(*entry), flags);
if (!entry) {
zs_free(meta->mem_pool, handle);
@@ -462,6 +510,9 @@ static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
struct rb_node **rb_node, *parent = NULL;
struct zram_entry *entry;
+ if (!zram->use_dedup)
+ return;
+
new->checksum = checksum;
hash = &meta->hash[checksum % meta->hash_size];
rb_root = &hash->rb_root;
@@ -492,7 +543,8 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
struct zram_meta *meta = zram->meta;
struct zcomp_strm *zstrm;
- cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
+ cmem = zs_map_object(meta->mem_pool,
+ zram_entry_handle(zram, entry), ZS_MM_RO);
if (entry->len == PAGE_SIZE) {
match = !memcmp(mem, cmem, PAGE_SIZE);
} else {
@@ -501,7 +553,7 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
zcomp_stream_put(zram->comp);
}
- zs_unmap_object(meta->mem_pool, entry->handle);
+ zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
return match;
}
@@ -521,6 +573,11 @@ static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
struct zram_hash *hash;
u32 checksum;
+ if (!zram->use_dedup) {
+ zs_free(meta->mem_pool, zram_entry_handle(zram, entry));
+ return false;
+ }
+
if (!populated)
goto free;
@@ -551,6 +608,9 @@ static struct zram_entry *zram_entry_get(struct zram *zram,
struct zram_entry *entry;
struct rb_node *rb_node;
+ if (!zram->use_dedup)
+ return NULL;
+
hash = &meta->hash[checksum % meta->hash_size];
spin_lock(&hash->lock);
@@ -713,7 +773,8 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
return 0;
}
- cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
+ cmem = zs_map_object(meta->mem_pool,
+ zram_entry_handle(zram, entry), ZS_MM_RO);
if (size == PAGE_SIZE) {
copy_page(mem, cmem);
} else {
@@ -722,7 +783,7 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
ret = zcomp_decompress(zstrm, cmem, size, mem);
zcomp_stream_put(zram->comp);
}
- zs_unmap_object(meta->mem_pool, entry->handle);
+ zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
/* Should NEVER happen. Return bio error if it does. */
@@ -840,7 +901,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
goto out;
}
- checksum = zram_calc_checksum(uncmem);
+ checksum = zram_calc_checksum(zram, uncmem);
if (!entry) {
entry = zram_entry_get(zram, uncmem, checksum);
if (entry) {
@@ -915,7 +976,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
goto out;
}
- cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_WO);
+ cmem = zs_map_object(meta->mem_pool,
+ zram_entry_handle(zram, entry), ZS_MM_WO);
if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) {
src = kmap_atomic(page);
@@ -927,7 +989,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
zcomp_stream_put(zram->comp);
zstrm = NULL;
- zs_unmap_object(meta->mem_pool, entry->handle);
+ zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
zram_entry_insert(zram, entry, checksum);
found_duplication:
@@ -1310,6 +1372,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
static DEVICE_ATTR_WO(mem_used_max);
static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
+static DEVICE_ATTR_RW(use_dedup);
static struct attribute *zram_disk_attrs[] = {
&dev_attr_disksize.attr,
@@ -1320,6 +1383,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
&dev_attr_mem_used_max.attr,
&dev_attr_max_comp_streams.attr,
&dev_attr_comp_algorithm.attr,
+ &dev_attr_use_dedup.attr,
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
&dev_attr_debug_stat.attr,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 07d1f8d..b823555 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -141,5 +141,6 @@ struct zram {
* zram is claimed so open request will be failed
*/
bool claim; /* Protected by bdev->bd_mutex */
+ int use_dedup;
};
#endif
--
1.9.1
Hi Joonsoo,
On Thu, Mar 16, 2017 at 11:46:35AM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> A zspage is always movable and is used through the zs_map_object()
> function, which returns a directly accessible pointer to the zspage's
> contents. This is independent of the user's allocation flags.
> Therefore, it's better to always set the movable/highmem flags on the zspage.
> After that, we don't need to clear __GFP_MOVABLE/__GFP_HIGHMEM in
> cache_alloc_handle()/cache_alloc_zspage(), since no zs_malloc()
> caller specifies __GFP_MOVABLE/__GFP_HIGHMEM.
Hmm, I wanted this when you first pointed it out to me, but when I think
about it again, I don't see it as an improvement. Sorry for that.
zs_malloc is an exported symbol and it has a gfp_t argument, so the user
can do whatever he wants with any zone modifier flags. IOW, if some user
wants to allocate pages from the {normal|dma} zone for whatever reason,
he can omit __GFP_HIGHMEM from the gfp flags to fulfill that goal.
Hi Joonsoo,
On Thu, Mar 16, 2017 at 11:46:37AM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> This patch implements deduplication feature in zram. The purpose
> of this work is naturally to save amount of memory usage by zram.
>
> Android is one of the biggest users to use zram as swap and it's
> really important to save amount of memory usage. There is a paper
> that reports that duplication ratio of Android's memory content is
> rather high [1]. And, there is a similar work on zswap that also
> reports that experiments have shown that around 10-15% of pages
> stored in zswap are duplicates and deduplicating them provides some
> benefits [2].
>
> Also, there is a different kind of workload that uses zram as blockdev
> and store build outputs into it to reduce wear-out problem of real
> blockdev. In this workload, deduplication hit is very high
> although I don't know exact reason about it.
Hmm, isn't it because a binary is composed from object files, so that
much of the content is shared between the binary and the objects?
I'd like to clear this up because, unlike the papers you mentioned, dedup
was not useful for the swap workload in my Android testing.
Of course, it depends on the workload, so someone might find it hugely
useful for his swap workload. However, I want to focus on a clear
winning scenario rather than "might be better".
That's why I want to know the clear reason for the saving. If my
assumption is right (i.e., object file vs. binary file), it would be
enough justification to merge this feature, because users could turn on
the feature with a reasonable expectation.
>
> Anyway, if we can detect duplicated content and avoid to store duplicated
> content at different memory space, we can save memory. This patch
> tries to do that.
>
> Implementation is almost simple and intuitive but I should note
> one thing about implementation detail.
>
> To check duplication, this patch uses checksum of the page and
> collision of this checksum could be possible. There would be
> many choices to handle this situation but this patch chooses
> to allow entry with duplicated checksum to be added to the hash,
> but, not to compare all entries with duplicated checksum
> when checking duplication. I guess that checksum collision is quite
Hmm, if there are many entries with the same checksum, which specific
entry is compared among them?
> rare event and we don't need to pay any attention to such a case.
If it's a rare event, why can't we iterate over all of the entries?
> Therefore, I decided the most simplest way to implement the feature.
> If there is a different opinion, I can accept and go that way.
>
> Following is the result of this patch.
>
> Test result #1 (Swap):
> Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
>
> orig_data_size: 145297408
> compr_data_size: 32408125
> mem_used_total: 32276480
> dup_data_size: 3188134
> meta_data_size: 1444272
>
> Last two metrics added to mm_stat are related to this work.
> First one, dup_data_size, is amount of saved memory by avoiding
> to store duplicated page. Later one, meta_data_size, is the amount of
> data structure to support deduplication. If dup > meta, we can judge
> that the patch improves memory usage.
>
> In Android, we can save 5% of memory usage with this work.
>
> Test result #2 (Blockdev):
> build the kernel and store output to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: 249 s
> mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
>
> <dedup>
> Elapsed time: 250 s
> mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
>
> There is no performance degradation and save 23% memory.
>
> Test result #3 (Blockdev):
> copy android build output dir(out/host) to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: out/host: 88 s
> mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
>
> <dedup>
> Elapsed time: out/host: 100 s
> mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
>
> It shows performance degradation roughly 13% and save 24% memory. Maybe,
> it is due to overhead of calculating checksum and comparison.
>
> Test result #4 (Blockdev):
> copy android build output dir(out/target/common) to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: out/host: 203 s
> mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
>
> <dedup>
> Elapsed time: out/host: 201 s
> mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
>
> Memory is saved by 42% and performance is the same. Even if there is overhead
> of calculating checksum and comparison, large hit ratio compensate it since
> hit leads to less compression attempt.
Cool! It would help a lot for users who use zram for a build output directory.
>
> Anyway, benefit seems to be largely dependent on the workload so
> following patch will make this feature optional. However, this feature
> can help some usecases so is deserved to be merged.
>
> [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> dl.acm.org/citation.cfm?id=2797023
> [2]: zswap: Optimize compressed pool memory utilization,
> lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
Overall, it looks good. Thanks!
However, since it's a non-trivial feature that I will have to maintain,
I want to change the function naming/structure to my preferred style.
So, please look below.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 212 ++++++++++++++++++++++++++++++++++++++----
> drivers/block/zram/zram_drv.h | 18 ++++
> 2 files changed, 214 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a3d9cbca..012425f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -32,6 +32,7 @@
> #include <linux/idr.h>
> #include <linux/sysfs.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/jhash.h>
>
> #include "zram_drv.h"
>
> @@ -385,14 +386,16 @@ static ssize_t mm_stat_show(struct device *dev,
> max_used = atomic_long_read(&zram->stats.max_used_pages);
>
> ret = scnprintf(buf, PAGE_SIZE,
> - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> orig_size << PAGE_SHIFT,
> (u64)atomic64_read(&zram->stats.compr_data_size),
> mem_used << PAGE_SHIFT,
> zram->limit_pages << PAGE_SHIFT,
> max_used << PAGE_SHIFT,
> (u64)atomic64_read(&zram->stats.same_pages),
> - pool_stats.pages_compacted);
> + pool_stats.pages_compacted,
> + (u64)atomic64_read(&zram->stats.dup_data_size),
zram_dedup_stat_read?
The reason to use a wrapper function is that I really want to compile out
any dedup-related code unless the user turns the feature on in Kconfig.
A wrapper function would be cleaner than an ifdef.
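Something along these lines, just to illustrate (CONFIG_ZRAM_DEDUP and
the helper name below are made up, not part of this patch):

#ifdef CONFIG_ZRAM_DEDUP
static u64 zram_dedup_dup_size(struct zram *zram)
{
	return (u64)atomic64_read(&zram->stats.dup_data_size);
}
#else
static u64 zram_dedup_dup_size(struct zram *zram)
{
	return 0;
}
#endif

Then mm_stat_show() just prints zram_dedup_dup_size(zram) and needs no
ifdef at the call site.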
> + (u64)atomic64_read(&zram->stats.meta_data_size));
> up_read(&zram->init_lock);
>
> return ret;
> @@ -419,29 +422,165 @@ static ssize_t debug_stat_show(struct device *dev,
> static DEVICE_ATTR_RO(mm_stat);
> static DEVICE_ATTR_RO(debug_stat);
>
> -static struct zram_entry *zram_entry_alloc(struct zram_meta *meta,
> +static u32 zram_calc_checksum(unsigned char *mem)
zram_dedup_checksum.
I want to use the 'dedup' term for all dedup-related functions.
> +{
> + return jhash(mem, PAGE_SIZE, 0);
> +}
> +
> +static struct zram_entry *zram_entry_alloc(struct zram *zram,
> unsigned int len, gfp_t flags)
> {
> + struct zram_meta *meta = zram->meta;
> struct zram_entry *entry;
> + unsigned long handle;
>
> - entry = kzalloc(sizeof(*entry), flags);
> - if (!entry)
> + handle = zs_malloc(meta->mem_pool, len, flags);
> + if (!handle)
> return NULL;
Separate zram_entry allocation and dedup setup.
IOW,
struct zram_entry *entry = zram_entry_alloc(zram, xxx);
if (!zram_dedup_init(zram, entry, xxx))
zram_entry_free(entry);
>
> - entry->handle = zs_malloc(meta->mem_pool, len, flags);
> - if (!entry->handle) {
> - kfree(entry);
> + entry = kzalloc(sizeof(*entry), flags);
> + if (!entry) {
> + zs_free(meta->mem_pool, handle);
> return NULL;
> }
>
> + entry->handle = handle;
> + RB_CLEAR_NODE(&entry->rb_node);
> + entry->refcount = 1;
> + entry->len = len;
> + atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
> +
> return entry;
> }
>
> -static inline void zram_entry_free(struct zram_meta *meta,
> - struct zram_entry *entry)
> +static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
zram_dedup_insert or add?
> + u32 checksum)
> +{
> + struct zram_meta *meta = zram->meta;
> + struct zram_hash *hash;
> + struct rb_root *rb_root;
> + struct rb_node **rb_node, *parent = NULL;
> + struct zram_entry *entry;
> +
> + new->checksum = checksum;
> + hash = &meta->hash[checksum % meta->hash_size];
> + rb_root = &hash->rb_root;
> +
> + spin_lock(&hash->lock);
> + rb_node = &rb_root->rb_node;
> + while (*rb_node) {
> + parent = *rb_node;
> + entry = rb_entry(parent, struct zram_entry, rb_node);
> + if (checksum < entry->checksum)
> + rb_node = &parent->rb_left;
> + else if (checksum > entry->checksum)
> + rb_node = &parent->rb_right;
> + else
> + rb_node = &parent->rb_left;
> + }
> +
> + rb_link_node(&new->rb_node, parent, rb_node);
> + rb_insert_color(&new->rb_node, rb_root);
> + spin_unlock(&hash->lock);
> +}
> +
> +static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
Ditto.
zram_dedup_match?
> + unsigned char *mem)
> +{
> + bool match = false;
> + unsigned char *cmem;
> + struct zram_meta *meta = zram->meta;
> + struct zcomp_strm *zstrm;
> +
> + cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> + if (entry->len == PAGE_SIZE) {
> + match = !memcmp(mem, cmem, PAGE_SIZE);
> + } else {
> + zstrm = zcomp_stream_get(zram->comp);
> + if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
> + match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> + zcomp_stream_put(zram->comp);
> + }
> + zs_unmap_object(meta->mem_pool, entry->handle);
> +
> + return match;
> +}
> +
> +static inline void zram_entry_free(struct zram *zram, struct zram_meta *meta,
> + struct zram_entry *entry)
> {
> zs_free(meta->mem_pool, entry->handle);
> kfree(entry);
> + if (zram)
> + atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
> +}
> +
> +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> + struct zram_entry *entry, bool populated)
Please separate the entry and dedup parts, as with zram_entry_alloc.
Then, can't we remove 'populated'?
> +{
> + struct zram_hash *hash;
> + u32 checksum;
> +
> + if (!populated)
> + goto free;
> +
> + checksum = entry->checksum;
> + hash = &meta->hash[checksum % meta->hash_size];
> +
> + spin_lock(&hash->lock);
> + entry->refcount--;
> + if (!entry->refcount) {
> + populated = false;
> + rb_erase(&entry->rb_node, &hash->rb_root);
> + RB_CLEAR_NODE(&entry->rb_node);
> + }
> + spin_unlock(&hash->lock);
> +
> +free:
> + if (!populated)
> + zram_entry_free(zram, meta, entry);
> +
> + return populated;
> +}
> +
> +static struct zram_entry *zram_entry_get(struct zram *zram,
We can rename it to zram_entry_match.
> + unsigned char *mem, u32 checksum)
> +{
> + struct zram_meta *meta = zram->meta;
> + struct zram_hash *hash;
> + struct zram_entry *entry;
> + struct rb_node *rb_node;
> +
> + hash = &meta->hash[checksum % meta->hash_size];
> +
> + spin_lock(&hash->lock);
> + rb_node = hash->rb_root.rb_node;
> + while (rb_node) {
> + entry = rb_entry(rb_node, struct zram_entry, rb_node);
> + if (checksum == entry->checksum) {
> + entry->refcount++;
> + atomic64_add(entry->len, &zram->stats.dup_data_size);
> + spin_unlock(&hash->lock);
> +
> + if (zram_entry_match(zram, entry, mem))
__zram_entry_match? Or just open-code.
> + return entry;
> +
> + if (zram_entry_put(zram, meta, entry, true)) {
> + atomic64_sub(entry->len,
> + &zram->stats.dup_data_size);
> + }
> +
> + return NULL;
> + }
> +
> + if (checksum < entry->checksum)
> + rb_node = rb_node->rb_left;
> + else
> + rb_node = rb_node->rb_right;
> + }
> + spin_unlock(&hash->lock);
> +
> + return NULL;
> }
>
> static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> @@ -459,18 +598,31 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
> continue;
>
> - zram_entry_free(meta, entry);
> + zram_entry_put(NULL, meta, entry, true);
> }
>
> zs_destroy_pool(meta->mem_pool);
> + vfree(meta->hash);
> vfree(meta->table);
> kfree(meta);
> }
>
> +static void zram_meta_init(struct zram_meta *meta)
> +{
> + int i;
> + struct zram_hash *hash;
> +
> + for (i = 0; i < meta->hash_size; i++) {
> + hash = &meta->hash[i];
> + spin_lock_init(&hash->lock);
> + hash->rb_root = RB_ROOT;
> + }
> +}
> +
> static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> {
> size_t num_pages;
> - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL);
> + struct zram_meta *meta = kzalloc(sizeof(*meta), GFP_KERNEL);
>
> if (!meta)
> return NULL;
> @@ -482,15 +634,27 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> goto out_error;
> }
>
> + meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
> + meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
> + meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
> + meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
> + if (!meta->hash) {
> + pr_err("Error allocating zram entry hash\n");
> + goto out_error;
> + }
> +
> meta->mem_pool = zs_create_pool(pool_name);
> if (!meta->mem_pool) {
> pr_err("Error creating memory pool\n");
> goto out_error;
> }
>
> + zram_meta_init(meta);
zram_dedup_init?
Can't we put the above hash initialization routines into zram_meta_init?
> +
> return meta;
>
> out_error:
> + vfree(meta->hash);
> vfree(meta->table);
> kfree(meta);
> return NULL;
> @@ -520,7 +684,8 @@ static void zram_free_page(struct zram *zram, size_t index)
> if (!entry)
> return;
>
> - zram_entry_free(meta, entry);
> + if (zram_entry_put(zram, meta, entry, true))
> + atomic64_sub(entry->len, &zram->stats.dup_data_size);
Hmm,
I want to keep the dedup stat logic inside the dedup functions, without
exposing it to higher-level functions like zram_free_page.
So,
zram_dedup_put:
...
...
atomic64_sub(entry->len, &zram->stats.dup_data_size);
zram_free_page:
...
if (zram_dedup_put())
zram_entry_free
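Roughly like below (just a sketch of the direction, not tested; it keeps
the existing accounting, i.e. dup_data_size only counts the extra
references beyond the first copy):

static bool zram_dedup_put(struct zram *zram, struct zram_entry *entry)
{
	struct zram_meta *meta = zram->meta;
	struct zram_hash *hash;
	bool last = false;

	hash = &meta->hash[entry->checksum % meta->hash_size];

	spin_lock(&hash->lock);
	if (!--entry->refcount) {
		rb_erase(&entry->rb_node, &hash->rb_root);
		last = true;
	} else {
		/* dropped a duplicate reference, not the last copy */
		atomic64_sub(entry->len, &zram->stats.dup_data_size);
	}
	spin_unlock(&hash->lock);

	return last;
}

so that zram_free_page() ends up with just

	if (zram_dedup_put(zram, entry))
		zram_entry_free(zram, meta, entry);

and never touches dup_data_size itself.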
>
> atomic64_sub(zram_get_obj_size(meta, index),
> &zram->stats.compr_data_size);
> @@ -631,6 +796,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> struct zcomp_strm *zstrm = NULL;
> unsigned long alloced_pages;
> unsigned long element;
> + u32 checksum;
>
> page = bvec->bv_page;
> if (is_partial_io(bvec)) {
> @@ -674,6 +840,18 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> goto out;
> }
>
> + checksum = zram_calc_checksum(uncmem);
> + if (!entry) {
> + entry = zram_entry_get(zram, uncmem, checksum);
entry = zram_dedup_get
> + if (entry) {
> + if (!is_partial_io(bvec))
> + kunmap_atomic(user_mem);
> +
> + clen = entry->len;
> + goto found_duplication;
> + }
> + }
> +
> zstrm = zcomp_stream_get(zram->comp);
> ret = zcomp_compress(zstrm, uncmem, &clen);
> if (!is_partial_io(bvec)) {
> @@ -708,7 +886,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> * from the slow path and entry has already been allocated.
> */
> if (!entry) {
> - entry = zram_entry_alloc(meta, clen,
> + entry = zram_entry_alloc(zram, clen,
> __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
> }
>
> @@ -718,7 +896,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>
> atomic64_inc(&zram->stats.writestall);
>
> - entry = zram_entry_alloc(meta, clen, GFP_NOIO);
> + entry = zram_entry_alloc(zram, clen, GFP_NOIO);
> if (entry)
> goto compress_again;
>
> @@ -732,7 +910,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> update_used_max(zram, alloced_pages);
>
> if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> - zram_entry_free(meta, entry);
> + zram_entry_put(zram, meta, entry, false);
> ret = -ENOMEM;
> goto out;
> }
> @@ -750,7 +928,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> zcomp_stream_put(zram->comp);
> zstrm = NULL;
> zs_unmap_object(meta->mem_pool, entry->handle);
> + zram_entry_insert(zram, entry, checksum);
>
> +found_duplication:
Nit:
I prefer found_dup.
> /*
> * Free memory associated with this sector
> * before overwriting unused sectors.
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index a7ae46c..07d1f8d 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -18,6 +18,7 @@
> #include <linux/rwsem.h>
> #include <linux/zsmalloc.h>
> #include <linux/crypto.h>
> +#include <linux/spinlock.h>
>
> #include "zcomp.h"
>
> @@ -45,6 +46,10 @@
> #define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
> (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
>
> +/* One slot will contain 128 pages theoretically */
> +#define ZRAM_HASH_SHIFT 7
> +#define ZRAM_HASH_SIZE_MIN (1 << 10)
> +#define ZRAM_HASH_SIZE_MAX (1 << 31)
>
> /*
> * The lower ZRAM_FLAG_SHIFT bits of table.value is for
> @@ -70,6 +75,10 @@ enum zram_pageflags {
> /*-- Data structures */
>
> struct zram_entry {
> + struct rb_node rb_node;
> + u32 len;
> + u32 checksum;
> + unsigned long refcount;
It should be compiled out if the user doesn't turn on the feature in Kconfig.
I hope you make it a Kconfig option in [4/4].
On Thu, Mar 16, 2017 at 11:46:38AM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> The benefit of deduplication depends on the workload, so it's not
> desirable to always enable it. Therefore, make it optional.
Please make it a Kconfig option, too. And write the description so it
emphasizes that it "helps a lot for users who use zram for a build
output directory".
And the feature should be disabled by default.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 80 ++++++++++++++++++++++++++++++++++++++-----
> drivers/block/zram/zram_drv.h | 1 +
> 2 files changed, 73 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 012425f..e45aa9f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -328,6 +328,39 @@ static ssize_t comp_algorithm_store(struct device *dev,
> return len;
> }
>
> +static ssize_t use_dedup_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int val;
> + struct zram *zram = dev_to_zram(dev);
> +
> + down_read(&zram->init_lock);
> + val = zram->use_dedup;
> + up_read(&zram->init_lock);
> +
> + return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> +}
> +
> +static ssize_t use_dedup_store(struct device *dev,
> + struct device_attribute *attr, const char *buf, size_t len)
> +{
> + int val;
> + struct zram *zram = dev_to_zram(dev);
> +
> + if (kstrtoint(buf, 10, &val) || (val != 0 && val != 1))
> + return -EINVAL;
> +
> + down_write(&zram->init_lock);
> + if (init_done(zram)) {
> + up_write(&zram->init_lock);
> + pr_info("Can't change dedup usage for initialized device\n");
> + return -EBUSY;
> + }
> + zram->use_dedup = val;
> + up_write(&zram->init_lock);
> + return len;
> +}
> +
> static ssize_t compact_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
> {
> @@ -422,11 +455,23 @@ static ssize_t debug_stat_show(struct device *dev,
> static DEVICE_ATTR_RO(mm_stat);
> static DEVICE_ATTR_RO(debug_stat);
>
> -static u32 zram_calc_checksum(unsigned char *mem)
> +static u32 zram_calc_checksum(struct zram *zram, unsigned char *mem)
> {
> + if (!zram->use_dedup)
> + return 0;
> +
Hmm, I don't like this style where every dedup function has the
"use_dedup" check.
Can't we abstract it like this?
I want to find a simpler way so that we don't need to add the check
whenever a new dedup function pops up.
bool zram_dedup_check
if (!zram->dedup)
return false;
zram_dedup_checksum
entry = zram_dedup_get
if (!entry) {
zram_dedup_add
return false;
}
..
return true;
zram_bvec_write:
...
...
if (zram_dedup_match())
goto found_dup;
> return jhash(mem, PAGE_SIZE, 0);
> }
>
> +static unsigned long zram_entry_handle(struct zram *zram,
> + struct zram_entry *entry)
> +{
> + if (!zram->use_dedup)
> + return (unsigned long)entry;
> +
> + return entry->handle;
> +}
> +
> static struct zram_entry *zram_entry_alloc(struct zram *zram,
> unsigned int len, gfp_t flags)
> {
> @@ -438,6 +483,9 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram,
> if (!handle)
> return NULL;
>
> + if (!zram->use_dedup)
> + return (struct zram_entry *)handle;
> +
> entry = kzalloc(sizeof(*entry), flags);
> if (!entry) {
> zs_free(meta->mem_pool, handle);
> @@ -462,6 +510,9 @@ static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
> struct rb_node **rb_node, *parent = NULL;
> struct zram_entry *entry;
>
> + if (!zram->use_dedup)
> + return;
> +
> new->checksum = checksum;
> hash = &meta->hash[checksum % meta->hash_size];
> rb_root = &hash->rb_root;
> @@ -492,7 +543,8 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
> struct zram_meta *meta = zram->meta;
> struct zcomp_strm *zstrm;
>
> - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> + cmem = zs_map_object(meta->mem_pool,
> + zram_entry_handle(zram, entry), ZS_MM_RO);
> if (entry->len == PAGE_SIZE) {
> match = !memcmp(mem, cmem, PAGE_SIZE);
> } else {
> @@ -501,7 +553,7 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
> match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> zcomp_stream_put(zram->comp);
> }
> - zs_unmap_object(meta->mem_pool, entry->handle);
> + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
>
> return match;
> }
> @@ -521,6 +573,11 @@ static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> struct zram_hash *hash;
> u32 checksum;
>
> + if (!zram->use_dedup) {
> + zs_free(meta->mem_pool, zram_entry_handle(zram, entry));
> + return false;
> + }
> +
> if (!populated)
> goto free;
>
> @@ -551,6 +608,9 @@ static struct zram_entry *zram_entry_get(struct zram *zram,
> struct zram_entry *entry;
> struct rb_node *rb_node;
>
> + if (!zram->use_dedup)
> + return NULL;
> +
> hash = &meta->hash[checksum % meta->hash_size];
>
> spin_lock(&hash->lock);
> @@ -713,7 +773,8 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> return 0;
> }
>
> - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> + cmem = zs_map_object(meta->mem_pool,
> + zram_entry_handle(zram, entry), ZS_MM_RO);
> if (size == PAGE_SIZE) {
> copy_page(mem, cmem);
> } else {
> @@ -722,7 +783,7 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> ret = zcomp_decompress(zstrm, cmem, size, mem);
> zcomp_stream_put(zram->comp);
> }
> - zs_unmap_object(meta->mem_pool, entry->handle);
> + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
> bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>
> /* Should NEVER happen. Return bio error if it does. */
> @@ -840,7 +901,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> goto out;
> }
>
> - checksum = zram_calc_checksum(uncmem);
> + checksum = zram_calc_checksum(zram, uncmem);
> if (!entry) {
> entry = zram_entry_get(zram, uncmem, checksum);
> if (entry) {
> @@ -915,7 +976,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> goto out;
> }
>
> - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_WO);
> + cmem = zs_map_object(meta->mem_pool,
> + zram_entry_handle(zram, entry), ZS_MM_WO);
>
> if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) {
> src = kmap_atomic(page);
> @@ -927,7 +989,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>
> zcomp_stream_put(zram->comp);
> zstrm = NULL;
> - zs_unmap_object(meta->mem_pool, entry->handle);
> + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
> zram_entry_insert(zram, entry, checksum);
>
> found_duplication:
> @@ -1310,6 +1372,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
> static DEVICE_ATTR_WO(mem_used_max);
> static DEVICE_ATTR_RW(max_comp_streams);
> static DEVICE_ATTR_RW(comp_algorithm);
> +static DEVICE_ATTR_RW(use_dedup);
>
> static struct attribute *zram_disk_attrs[] = {
> &dev_attr_disksize.attr,
> @@ -1320,6 +1383,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
> &dev_attr_mem_used_max.attr,
> &dev_attr_max_comp_streams.attr,
> &dev_attr_comp_algorithm.attr,
> + &dev_attr_use_dedup.attr,
> &dev_attr_io_stat.attr,
> &dev_attr_mm_stat.attr,
> &dev_attr_debug_stat.attr,
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 07d1f8d..b823555 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -141,5 +141,6 @@ struct zram {
> * zram is claimed so open request will be failed
> */
> bool claim; /* Protected by bdev->bd_mutex */
> + int use_dedup;
For a binary result, I want to use 'bool'.
> };
> #endif
> --
> 1.9.1
>
On Tue, Mar 21, 2017 at 08:10:05PM +0900, Minchan Kim wrote:
> Hi Joonsoo,
>
> On Thu, Mar 16, 2017 at 11:46:35AM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Zspage is always movable and is used through zs_map_object() function
> > which returns directly accessible pointer that contains content of
> > zspage. It is independent on the user's allocation flag.
> > Therefore, it's better to always set movable/highmem flag to the zspage.
> > After that, we don't need __GFP_MOVABLE/__GFP_HIGHMEM clearing in
> > cache_alloc_handle()/cache_alloc_zspage() since there is no zs_malloc
> > caller who specifies __GFP_MOVABLE/__GFP_HIGHMEM.
>
> Hmm, I wanted this when you pointed out to me firstly but when I think
> again, I don't see it's improvement. Sorry for that.
> The zs_malloc is exported symbol and it has gfp_t argument so user can
> do whatever he want with any zone modifiers flags. IOW, if someuser want
> to allocate pages from {normal|dma} zone by whatever reason, he can
> omit __GFP_HIGHMEM from the gfp flag to fullfill the goal.
Hello,
I don't think that such flexibility makes things better. A user cannot
fully understand which flags are best, since that highly depends on
implementation details. For example, __GFP_MOVABLE is needed to
reduce memory fragmentation, but a user cannot know that and there is no
reason a user should need to know it. __GFP_HIGHMEM is a similar case:
a user cannot know that __GFP_HIGHMEM may be passed, or what its impact
would be here, without knowing the implementation details.
So, I think that adding these flags inside zsmalloc can be justified.
Anyway, this patch isn't so important for this series, so if you don't
like it, I will drop it.
Thanks.
On Wed, Mar 22, 2017 at 08:41:21AM +0900, Minchan Kim wrote:
> Hi Joonsoo,
>
> On Thu, Mar 16, 2017 at 11:46:37AM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > This patch implements deduplication feature in zram. The purpose
> > of this work is naturally to save amount of memory usage by zram.
> >
> > Android is one of the biggest users to use zram as swap and it's
> > really important to save amount of memory usage. There is a paper
> > that reports that duplication ratio of Android's memory content is
> > rather high [1]. And, there is a similar work on zswap that also
> > reports that experiments has shown that around 10-15% of pages
> > stored in zswp are duplicates and deduplicate them provides some
> > benefits [2].
> >
> > Also, there is a different kind of workload that uses zram as blockdev
> > and store build outputs into it to reduce wear-out problem of real
> > blockdev. In this workload, deduplication hit is very high
> > although I don't know exact reason about it.
>
> Hmm, Isn't it due to binary composed by obj files so that many of
> part between binary and object are sharable?
>
> I'd like to clear it out because dedup was not useful for swap workload
> for the testing in android unlike papers you mentioned.
> Of course, it depends on the workload so someone might find it's
> huge useful for his swap workload. However, I want to focus clear
> winning case scenario rather than "might be better".
>
> That's why I want to know clear reason the saving. If my assumption
> is right(i.e., object file vs. binary file), it would be enough
> justfication to merge this feature because user can turn on the feature
> with reasonable expectation.
Okay. I looked into where the benefit comes from for the kernel build case.
There are a few cases where deduplication happens.
1) *.cmd
Build commands are usually similar within one directory, so the content of
these files is very similar. Compare fs/ext4/.namei.o.cmd and
fs/ext4/.inode.o.cmd, for example. On my system, more than 789 lines are
identical out of the 944 and 939 lines in those files, respectively.
2) object files
built-in.o and the temporary object file have similar content. More
than 50% of fs/ext4/ext4.o is identical to fs/ext4/built-in.o. I
think we could optimize this case: ext4.o is a temporary object
file and isn't strictly needed, so the following fs/ext4/Makefile could be
changed to the optimized form.
orig>
obj-$(CONFIG_EXT4_FS) += ext4.o
ext4-y := XXX YYY ZZZ
optimized>
obj-$(CONFIG_EXT4_FS) += XXX YYY ZZZ
3) vmlinux
.tmp_vmlinux1, .tmp_vmlinux2 and arch/x86/boot/compressed/vmlinux.bin
have similar content.
I did a similar check for Android and it shows comparable cases:
some object files (.class and .so) are similar to other object
files (e.g. ./host/linux-x86/lib/libartd.so and
./host/linux-x86/lib/libartd-compiler.so).
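For anyone who wants to make this kind of measurement themselves, a rough
userspace sketch like the one below (a hypothetical helper, not part of the
patchset, and not necessarily how the numbers above were produced) estimates
page-level duplication between two files; it hashes fixed 4KiB chunks, which
only approximates what zram actually sees, and uses FNV-1a rather than jhash:

/* dupcheck.c - rough page-level duplication estimate between two files,
 * e.g. fs/ext4/built-in.o vs fs/ext4/ext4.o.  Hypothetical helper.
 *
 *   gcc -O2 -o dupcheck dupcheck.c
 *   ./dupcheck fs/ext4/built-in.o fs/ext4/ext4.o
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define PAGE 4096

static uint64_t hash_page(const unsigned char *p)
{
	uint64_t h = 1469598103934665603ULL;	/* FNV-1a, not jhash */
	size_t i;

	for (i = 0; i < PAGE; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : x > y;
}

/* Return one hash per zero-padded 4KiB page of the file. */
static uint64_t *hash_file(const char *name, size_t *nr)
{
	unsigned char buf[PAGE];
	uint64_t *hashes = NULL;
	size_t n = 0, got;
	FILE *f = fopen(name, "rb");

	if (!f) { perror(name); exit(1); }
	while ((got = fread(buf, 1, PAGE, f)) > 0) {
		memset(buf + got, 0, PAGE - got);
		hashes = realloc(hashes, (n + 1) * sizeof(*hashes));
		if (!hashes) { perror("realloc"); exit(1); }
		hashes[n++] = hash_page(buf);
	}
	fclose(f);
	*nr = n;
	return hashes;
}

int main(int argc, char **argv)
{
	size_t na, nb, i, dup = 0;
	uint64_t *a, *b;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <reference> <candidate>\n", argv[0]);
		return 1;
	}
	a = hash_file(argv[1], &na);
	b = hash_file(argv[2], &nb);
	qsort(a, na, sizeof(*a), cmp_u64);

	/* Count candidate pages whose hash also occurs in the reference. */
	for (i = 0; i < nb; i++)
		if (bsearch(&b[i], a, na, sizeof(*a), cmp_u64))
			dup++;

	printf("%zu of %zu pages (%.1f%%) of %s also appear in %s\n",
	       dup, nb, nb ? 100.0 * dup / nb : 0.0, argv[2], argv[1]);
	return 0;
}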
>
> >
> > Anyway, if we can detect duplicated content and avoid to store duplicated
> > content at different memory space, we can save memory. This patch
> > tries to do that.
> >
> > Implementation is almost simple and intuitive but I should note
> > one thing about implementation detail.
> >
> > To check duplication, this patch uses checksum of the page and
> > collision of this checksum could be possible. There would be
> > many choices to handle this situation but this patch chooses
> > to allow entry with duplicated checksum to be added to the hash,
> > but, not to compare all entries with duplicated checksum
> > when checking duplication. I guess that checksum collision is quite
>
> Hmm, if there are many duplicated checksum, what a specific checksum
> is compared among them?
I implemented it to check just the first one.
> > rare event and we don't need to pay any attention to such a case.
>
> If it's rare event, why can't we iterate all of entries?
It's possible. If you prefer it, I can do it.
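Something along these lines is what I have in mind (a rough sketch only, with
a hypothetical function name and the locking/refcount/stat handling elided):

/*
 * Rough sketch only: compare the content of every entry that shares the
 * checksum instead of only the first hit.  Entries with equal checksums
 * are neighbours in the in-order walk, so after one hit we rewind with
 * rb_prev() and then scan forward with rb_next().
 */
static struct zram_entry *zram_dedup_search(struct zram *zram,
				struct zram_hash *hash,
				unsigned char *mem, u32 checksum)
{
	struct rb_node *node = hash->rb_root.rb_node;
	struct rb_node *prev;
	struct zram_entry *entry;

	/* Usual rbtree descent to any entry with a matching checksum. */
	while (node) {
		entry = rb_entry(node, struct zram_entry, rb_node);
		if (checksum < entry->checksum)
			node = node->rb_left;
		else if (checksum > entry->checksum)
			node = node->rb_right;
		else
			break;
	}
	if (!node)
		return NULL;

	/* Rewind to the leftmost entry with this checksum... */
	for (prev = rb_prev(node); prev; prev = rb_prev(node)) {
		if (rb_entry(prev, struct zram_entry,
			     rb_node)->checksum != checksum)
			break;
		node = prev;
	}

	/* ...then compare the content of every entry that shares it. */
	for (; node; node = rb_next(node)) {
		entry = rb_entry(node, struct zram_entry, rb_node);
		if (entry->checksum != checksum)
			break;
		if (zram_entry_match(zram, entry, mem))
			return entry;
	}

	return NULL;
}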
> > Therefore, I decided the most simplest way to implement the feature.
> > If there is a different opinion, I can accept and go that way.
> >
> > Following is the result of this patch.
> >
> > Test result #1 (Swap):
> > Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
> >
> > orig_data_size: 145297408
> > compr_data_size: 32408125
> > mem_used_total: 32276480
> > dup_data_size: 3188134
> > meta_data_size: 1444272
> >
> > Last two metrics added to mm_stat are related to this work.
> > First one, dup_data_size, is amount of saved memory by avoiding
> > to store duplicated page. Later one, meta_data_size, is the amount of
> > data structure to support deduplication. If dup > meta, we can judge
> > that the patch improves memory usage.
> >
> > In Adnroid, we can save 5% of memory usage by this work.
> >
> > Test result #2 (Blockdev):
> > build the kernel and store output to ext4 FS on zram
> >
> > <no-dedup>
> > Elapsed time: 249 s
> > mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
> >
> > <dedup>
> > Elapsed time: 250 s
> > mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
> >
> > There is no performance degradation and save 23% memory.
> >
> > Test result #3 (Blockdev):
> > copy android build output dir(out/host) to ext4 FS on zram
> >
> > <no-dedup>
> > Elapsed time: out/host: 88 s
> > mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
> >
> > <dedup>
> > Elapsed time: out/host: 100 s
> > mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
> >
> > It shows performance degradation roughly 13% and save 24% memory. Maybe,
> > it is due to overhead of calculating checksum and comparison.
> >
> > Test result #4 (Blockdev):
> > copy android build output dir(out/target/common) to ext4 FS on zram
> >
> > <no-dedup>
> > Elapsed time: out/host: 203 s
> > mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
> >
> > <dedup>
> > Elapsed time: out/host: 201 s
> > mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
> >
> > Memory is saved by 42% and performance is the same. Even if there is overhead
> > of calculating checksum and comparison, large hit ratio compensate it since
> > hit leads to less compression attempt.
>
> Cool! It would help a lot for users have used zram to build output directory.
>
> >
> > Anyway, benefit seems to be largely dependent on the workload so
> > following patch will make this feature optional. However, this feature
> > can help some usecases so is deserved to be merged.
> >
> > [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> > dl.acm.org/citation.cfm?id=2797023
> > [2]: zswap: Optimize compressed pool memory utilization,
> > lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
>
> Overall, it looks good. Thanks!
> However, I want to change function naming/structuring into my preference style
> to maintain because it's non-trival feature. So, please look below.
Okay. I understand all of your comments. I will fix them and won't reply to each one individually.
Thanks.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > drivers/block/zram/zram_drv.c | 212 ++++++++++++++++++++++++++++++++++++++----
> > drivers/block/zram/zram_drv.h | 18 ++++
> > 2 files changed, 214 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index a3d9cbca..012425f 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -32,6 +32,7 @@
> > #include <linux/idr.h>
> > #include <linux/sysfs.h>
> > #include <linux/cpuhotplug.h>
> > +#include <linux/jhash.h>
> >
> > #include "zram_drv.h"
> >
> > @@ -385,14 +386,16 @@ static ssize_t mm_stat_show(struct device *dev,
> > max_used = atomic_long_read(&zram->stats.max_used_pages);
> >
> > ret = scnprintf(buf, PAGE_SIZE,
> > - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> > + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> > orig_size << PAGE_SHIFT,
> > (u64)atomic64_read(&zram->stats.compr_data_size),
> > mem_used << PAGE_SHIFT,
> > zram->limit_pages << PAGE_SHIFT,
> > max_used << PAGE_SHIFT,
> > (u64)atomic64_read(&zram->stats.same_pages),
> > - pool_stats.pages_compacted);
> > + pool_stats.pages_compacted,
> > + (u64)atomic64_read(&zram->stats.dup_data_size),
>
> zram_dedeup_stat_read?
>
> The reason to use wrapper function is I reallly want to remove any dedup
> related code unless users turns on the feature in Kconfig.
> Wrapper function would be more clean rather than ifdef.
>
> > + (u64)atomic64_read(&zram->stats.meta_data_size));
>
>
> > up_read(&zram->init_lock);
> >
> > return ret;
> > @@ -419,29 +422,165 @@ static ssize_t debug_stat_show(struct device *dev,
> > static DEVICE_ATTR_RO(mm_stat);
> > static DEVICE_ATTR_RO(debug_stat);
> >
> > -static struct zram_entry *zram_entry_alloc(struct zram_meta *meta,
> > +static u32 zram_calc_checksum(unsigned char *mem)
>
> zram_dedup_checksum.
>
> I want to use 'dedup' term all of dedup related to functions.
>
>
> > +{
> > + return jhash(mem, PAGE_SIZE, 0);
> > +}
> > +
> > +static struct zram_entry *zram_entry_alloc(struct zram *zram,
> > unsigned int len, gfp_t flags)
> > {
> > + struct zram_meta *meta = zram->meta;
> > struct zram_entry *entry;
> > + unsigned long handle;
> >
> > - entry = kzalloc(sizeof(*entry), flags);
> > - if (!entry)
> > + handle = zs_malloc(meta->mem_pool, len, flags);
> > + if (!handle)
> > return NULL;
>
> Separate allocate zram_entry and dedeup.
> IOW,
>
> struct zram_entry *entry = zram_entry_alloc(zram, xxx);
> if (!zram_dedup_init(zram, entry, xxx))
> zram_entry_free(entry);
>
> >
> > - entry->handle = zs_malloc(meta->mem_pool, len, flags);
> > - if (!entry->handle) {
> > - kfree(entry);
> > + entry = kzalloc(sizeof(*entry), flags);
> > + if (!entry) {
> > + zs_free(meta->mem_pool, handle);
> > return NULL;
> > }
> >
> > + entry->handle = handle;
> > + RB_CLEAR_NODE(&entry->rb_node);
> > + entry->refcount = 1;
> > + entry->len = len;
> > + atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
> > +
> > return entry;
> > }
> >
> > -static inline void zram_entry_free(struct zram_meta *meta,
> > - struct zram_entry *entry)
> > +static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
>
> zram_dedup_insert or add?
>
> > + u32 checksum)
> > +{
> > + struct zram_meta *meta = zram->meta;
> > + struct zram_hash *hash;
> > + struct rb_root *rb_root;
> > + struct rb_node **rb_node, *parent = NULL;
> > + struct zram_entry *entry;
> > +
> > + new->checksum = checksum;
> > + hash = &meta->hash[checksum % meta->hash_size];
> > + rb_root = &hash->rb_root;
> > +
> > + spin_lock(&hash->lock);
> > + rb_node = &rb_root->rb_node;
> > + while (*rb_node) {
> > + parent = *rb_node;
> > + entry = rb_entry(parent, struct zram_entry, rb_node);
> > + if (checksum < entry->checksum)
> > + rb_node = &parent->rb_left;
> > + else if (checksum > entry->checksum)
> > + rb_node = &parent->rb_right;
> > + else
> > + rb_node = &parent->rb_left;
> > + }
> > +
> > + rb_link_node(&new->rb_node, parent, rb_node);
> > + rb_insert_color(&new->rb_node, rb_root);
> > + spin_unlock(&hash->lock);
> > +}
> > +
> > +static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
>
> Ditto.
>
> zram_dedup_match?
>
> > + unsigned char *mem)
> > +{
> > + bool match = false;
> > + unsigned char *cmem;
> > + struct zram_meta *meta = zram->meta;
> > + struct zcomp_strm *zstrm;
> > +
> > + cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> > + if (entry->len == PAGE_SIZE) {
> > + match = !memcmp(mem, cmem, PAGE_SIZE);
> > + } else {
> > + zstrm = zcomp_stream_get(zram->comp);
> > + if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
> > + match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> > + zcomp_stream_put(zram->comp);
> > + }
> > + zs_unmap_object(meta->mem_pool, entry->handle);
> > +
> > + return match;
> > +}
> > +
> > +static inline void zram_entry_free(struct zram *zram, struct zram_meta *meta,
> > + struct zram_entry *entry)
> > {
> > zs_free(meta->mem_pool, entry->handle);
> > kfree(entry);
> > + if (zram)
> > + atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
> > +}
> > +
> > +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> > + struct zram_entry *entry, bool populated)
>
> Please, separate entry and dedup part like zram_entry_alloc.
> Then, Can't we remove 'populated'?
>
> > +{
> > + struct zram_hash *hash;
> > + u32 checksum;
> > +
> > + if (!populated)
> > + goto free;
> > +
> > + checksum = entry->checksum;
> > + hash = &meta->hash[checksum % meta->hash_size];
> > +
> > + spin_lock(&hash->lock);
> > + entry->refcount--;
> > + if (!entry->refcount) {
> > + populated = false;
> > + rb_erase(&entry->rb_node, &hash->rb_root);
> > + RB_CLEAR_NODE(&entry->rb_node);
> > + }
> > + spin_unlock(&hash->lock);
> > +
> > +free:
> > + if (!populated)
> > + zram_entry_free(zram, meta, entry);
> > +
> > + return populated;
> > +}
> > +
> > +static struct zram_entry *zram_entry_get(struct zram *zram,
>
> We can rename it to zram_entry_match.
>
> > + unsigned char *mem, u32 checksum)
> > +{
> > + struct zram_meta *meta = zram->meta;
> > + struct zram_hash *hash;
> > + struct zram_entry *entry;
> > + struct rb_node *rb_node;
> > +
> > + hash = &meta->hash[checksum % meta->hash_size];
> > +
> > + spin_lock(&hash->lock);
> > + rb_node = hash->rb_root.rb_node;
> > + while (rb_node) {
> > + entry = rb_entry(rb_node, struct zram_entry, rb_node);
> > + if (checksum == entry->checksum) {
> > + entry->refcount++;
> > + atomic64_add(entry->len, &zram->stats.dup_data_size);
> > + spin_unlock(&hash->lock);
> > +
> > + if (zram_entry_match(zram, entry, mem))
>
> __zram_entry_match? Or just open-code.
>
> > + return entry;
> > +
> > + if (zram_entry_put(zram, meta, entry, true)) {
> > + atomic64_sub(entry->len,
> > + &zram->stats.dup_data_size);
> > + }
> > +
> > + return NULL;
> > + }
> > +
> > + if (checksum < entry->checksum)
> > + rb_node = rb_node->rb_left;
> > + else
> > + rb_node = rb_node->rb_right;
> > + }
> > + spin_unlock(&hash->lock);
> > +
> > + return NULL;
> > }
> >
> > static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> > @@ -459,18 +598,31 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> > if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
> > continue;
> >
> > - zram_entry_free(meta, entry);
> > + zram_entry_put(NULL, meta, entry, true);
> > }
> >
> > zs_destroy_pool(meta->mem_pool);
> > + vfree(meta->hash);
> > vfree(meta->table);
> > kfree(meta);
> > }
> >
> > +static void zram_meta_init(struct zram_meta *meta)
> > +{
> > + int i;
> > + struct zram_hash *hash;
> > +
> > + for (i = 0; i < meta->hash_size; i++) {
> > + hash = &meta->hash[i];
> > + spin_lock_init(&hash->lock);
> > + hash->rb_root = RB_ROOT;
> > + }
> > +}
> > +
> > static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> > {
> > size_t num_pages;
> > - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL);
> > + struct zram_meta *meta = kzalloc(sizeof(*meta), GFP_KERNEL);
> >
> > if (!meta)
> > return NULL;
> > @@ -482,15 +634,27 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> > goto out_error;
> > }
> >
> > + meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
> > + meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
> > + meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
> > + meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
> > + if (!meta->hash) {
> > + pr_err("Error allocating zram entry hash\n");
> > + goto out_error;
> > + }
> > +
> > meta->mem_pool = zs_create_pool(pool_name);
> > if (!meta->mem_pool) {
> > pr_err("Error creating memory pool\n");
> > goto out_error;
> > }
> >
> > + zram_meta_init(meta);
>
> zram_dedup_init?
>
> Can't we put above hash initialization routines to zram_meta_init?
>
> > +
> > return meta;
> >
> > out_error:
> > + vfree(meta->hash);
> > vfree(meta->table);
> > kfree(meta);
> > return NULL;
> > @@ -520,7 +684,8 @@ static void zram_free_page(struct zram *zram, size_t index)
> > if (!entry)
> > return;
> >
> > - zram_entry_free(meta, entry);
> > + if (zram_entry_put(zram, meta, entry, true))
> > + atomic64_sub(entry->len, &zram->stats.dup_data_size);
>
> Hmm,
>
> I want to put dedup stat logic in dedup functions without exporting
> higher level functions like zram_free_page.
>
> So,
>
> zram_dedup_put:
> ...
> ...
> atomic64_sub(entry->len, &zram->stats.dup_data_size);
>
> zram_free_page:
> ...
> if (zram_dedup_put())
> zram_entry_free
>
> >
> > atomic64_sub(zram_get_obj_size(meta, index),
> > &zram->stats.compr_data_size);
> > @@ -631,6 +796,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > struct zcomp_strm *zstrm = NULL;
> > unsigned long alloced_pages;
> > unsigned long element;
> > + u32 checksum;
> >
> > page = bvec->bv_page;
> > if (is_partial_io(bvec)) {
> > @@ -674,6 +840,18 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > goto out;
> > }
> >
> > + checksum = zram_calc_checksum(uncmem);
> > + if (!entry) {
> > + entry = zram_entry_get(zram, uncmem, checksum);
>
> entry = zram_dedup_get
>
> > + if (entry) {
> > + if (!is_partial_io(bvec))
> > + kunmap_atomic(user_mem);
> > +
> > + clen = entry->len;
> > + goto found_duplication;
> > + }
> > + }
> > +
> > zstrm = zcomp_stream_get(zram->comp);
> > ret = zcomp_compress(zstrm, uncmem, &clen);
> > if (!is_partial_io(bvec)) {
> > @@ -708,7 +886,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > * from the slow path and entry has already been allocated.
> > */
> > if (!entry) {
> > - entry = zram_entry_alloc(meta, clen,
> > + entry = zram_entry_alloc(zram, clen,
> > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
> > }
> >
> > @@ -718,7 +896,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >
> > atomic64_inc(&zram->stats.writestall);
> >
> > - entry = zram_entry_alloc(meta, clen, GFP_NOIO);
> > + entry = zram_entry_alloc(zram, clen, GFP_NOIO);
> > if (entry)
> > goto compress_again;
> >
> > @@ -732,7 +910,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > update_used_max(zram, alloced_pages);
> >
> > if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > - zram_entry_free(meta, entry);
> > + zram_entry_put(zram, meta, entry, false);
> > ret = -ENOMEM;
> > goto out;
> > }
> > @@ -750,7 +928,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > zcomp_stream_put(zram->comp);
> > zstrm = NULL;
> > zs_unmap_object(meta->mem_pool, entry->handle);
> > + zram_entry_insert(zram, entry, checksum);
> >
> > +found_duplication:
>
> Nit:
>
> I prefer found_dup.
>
> > /*
> > * Free memory associated with this sector
> > * before overwriting unused sectors.
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index a7ae46c..07d1f8d 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -18,6 +18,7 @@
> > #include <linux/rwsem.h>
> > #include <linux/zsmalloc.h>
> > #include <linux/crypto.h>
> > +#include <linux/spinlock.h>
> >
> > #include "zcomp.h"
> >
> > @@ -45,6 +46,10 @@
> > #define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
> > (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
> >
> > +/* One slot will contain 128 pages theoretically */
> > +#define ZRAM_HASH_SHIFT 7
> > +#define ZRAM_HASH_SIZE_MIN (1 << 10)
> > +#define ZRAM_HASH_SIZE_MAX (1 << 31)
> >
> > /*
> > * The lower ZRAM_FLAG_SHIFT bits of table.value is for
> > @@ -70,6 +75,10 @@ enum zram_pageflags {
> > /*-- Data structures */
> >
> > struct zram_entry {
> > + struct rb_node rb_node;
> > + u32 len;
> > + u32 checksum;
> > + unsigned long refcount;
>
> It should be compiled out if users doesn't turn on the feature in Kconfig.
> I hope you makes Kconfig option to [4/4].
On Wed, Mar 22, 2017 at 09:00:59AM +0900, Minchan Kim wrote:
> On Thu, Mar 16, 2017 at 11:46:38AM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Benefit of deduplication is dependent on the workload so it's not
> > preferable to always enable. Therefore, make it optional.
>
> Please make it to Kconfig, too. And write down the description to impress
> "help a lot for users who uses zram to build output directory"
> And the feature should be disabled as default.
Okay.
>
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > drivers/block/zram/zram_drv.c | 80 ++++++++++++++++++++++++++++++++++++++-----
> > drivers/block/zram/zram_drv.h | 1 +
> > 2 files changed, 73 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 012425f..e45aa9f 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -328,6 +328,39 @@ static ssize_t comp_algorithm_store(struct device *dev,
> > return len;
> > }
> >
> > +static ssize_t use_dedup_show(struct device *dev,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + int val;
> > + struct zram *zram = dev_to_zram(dev);
> > +
> > + down_read(&zram->init_lock);
> > + val = zram->use_dedup;
> > + up_read(&zram->init_lock);
> > +
> > + return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> > +}
> > +
> > +static ssize_t use_dedup_store(struct device *dev,
> > + struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > + int val;
> > + struct zram *zram = dev_to_zram(dev);
> > +
> > + if (kstrtoint(buf, 10, &val) || (val != 0 && val != 1))
> > + return -EINVAL;
> > +
> > + down_write(&zram->init_lock);
> > + if (init_done(zram)) {
> > + up_write(&zram->init_lock);
> > + pr_info("Can't change dedup usage for initialized device\n");
> > + return -EBUSY;
> > + }
> > + zram->use_dedup = val;
> > + up_write(&zram->init_lock);
> > + return len;
> > +}
> > +
> > static ssize_t compact_store(struct device *dev,
> > struct device_attribute *attr, const char *buf, size_t len)
> > {
> > @@ -422,11 +455,23 @@ static ssize_t debug_stat_show(struct device *dev,
> > static DEVICE_ATTR_RO(mm_stat);
> > static DEVICE_ATTR_RO(debug_stat);
> >
> > -static u32 zram_calc_checksum(unsigned char *mem)
> > +static u32 zram_calc_checksum(struct zram *zram, unsigned char *mem)
> > {
> > + if (!zram->use_dedup)
> > + return 0;
> > +
>
> Hmm, I don't like this style which every dedup functions have the check
> "use_dedup".
>
> Can't we abstract like this?
I will try, but I'm not sure it can be done.
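For reference, the kind of wrapper I would try first looks roughly like the
sketch below (hypothetical names following the zram_dedup_* convention you
asked for; it mirrors the structure you outline in the quoted pseudocode
below):

/*
 * Rough sketch only (hypothetical names, not the final v2 interface):
 * keep the !use_dedup test inside the dedup helpers so that callers
 * such as zram_bvec_write() never check the flag themselves.
 */
static struct zram_entry *zram_dedup_find(struct zram *zram,
				unsigned char *mem, u32 *checksum)
{
	*checksum = 0;

	if (!zram->use_dedup)
		return NULL;

	*checksum = zram_dedup_checksum(mem);		/* was zram_calc_checksum() */

	return zram_dedup_get(zram, mem, *checksum);	/* was zram_entry_get() */
}

The write path would then reduce to something like
"entry = zram_dedup_find(zram, uncmem, &checksum); if (entry) goto found_dup;".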
>
> I want to find more simple way to no need to add the check when new dedup
> functions pop up.
>
> bool zram_dedup_check
> if (!zram->dedup)
> return false;
>
> zram_dedup_checksum
> entry = zram_dedup_get
> if (!entry) {
> zram_dedup_add
> return false;
> }
> ..
> return true;
>
>
> zram_bvec_write:
> ...
> ...
>
> if (zram_dedup_match())
> goto found_dup;
>
>
>
> > return jhash(mem, PAGE_SIZE, 0);
> > }
> >
> > +static unsigned long zram_entry_handle(struct zram *zram,
> > + struct zram_entry *entry)
> > +{
> > + if (!zram->use_dedup)
> > + return (unsigned long)entry;
> > +
> > + return entry->handle;
> > +}
> > +
> > static struct zram_entry *zram_entry_alloc(struct zram *zram,
> > unsigned int len, gfp_t flags)
> > {
> > @@ -438,6 +483,9 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram,
> > if (!handle)
> > return NULL;
> >
> > + if (!zram->use_dedup)
> > + return (struct zram_entry *)handle;
> > +
> > entry = kzalloc(sizeof(*entry), flags);
> > if (!entry) {
> > zs_free(meta->mem_pool, handle);
> > @@ -462,6 +510,9 @@ static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
> > struct rb_node **rb_node, *parent = NULL;
> > struct zram_entry *entry;
> >
> > + if (!zram->use_dedup)
> > + return;
> > +
> > new->checksum = checksum;
> > hash = &meta->hash[checksum % meta->hash_size];
> > rb_root = &hash->rb_root;
> > @@ -492,7 +543,8 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
> > struct zram_meta *meta = zram->meta;
> > struct zcomp_strm *zstrm;
> >
> > - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> > + cmem = zs_map_object(meta->mem_pool,
> > + zram_entry_handle(zram, entry), ZS_MM_RO);
> > if (entry->len == PAGE_SIZE) {
> > match = !memcmp(mem, cmem, PAGE_SIZE);
> > } else {
> > @@ -501,7 +553,7 @@ static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
> > match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> > zcomp_stream_put(zram->comp);
> > }
> > - zs_unmap_object(meta->mem_pool, entry->handle);
> > + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
> >
> > return match;
> > }
> > @@ -521,6 +573,11 @@ static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> > struct zram_hash *hash;
> > u32 checksum;
> >
> > + if (!zram->use_dedup) {
> > + zs_free(meta->mem_pool, zram_entry_handle(zram, entry));
> > + return false;
> > + }
> > +
> > if (!populated)
> > goto free;
> >
> > @@ -551,6 +608,9 @@ static struct zram_entry *zram_entry_get(struct zram *zram,
> > struct zram_entry *entry;
> > struct rb_node *rb_node;
> >
> > + if (!zram->use_dedup)
> > + return NULL;
> > +
> > hash = &meta->hash[checksum % meta->hash_size];
> >
> > spin_lock(&hash->lock);
> > @@ -713,7 +773,8 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> > return 0;
> > }
> >
> > - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> > + cmem = zs_map_object(meta->mem_pool,
> > + zram_entry_handle(zram, entry), ZS_MM_RO);
> > if (size == PAGE_SIZE) {
> > copy_page(mem, cmem);
> > } else {
> > @@ -722,7 +783,7 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> > ret = zcomp_decompress(zstrm, cmem, size, mem);
> > zcomp_stream_put(zram->comp);
> > }
> > - zs_unmap_object(meta->mem_pool, entry->handle);
> > + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
> > bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
> >
> > /* Should NEVER happen. Return bio error if it does. */
> > @@ -840,7 +901,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > goto out;
> > }
> >
> > - checksum = zram_calc_checksum(uncmem);
> > + checksum = zram_calc_checksum(zram, uncmem);
> > if (!entry) {
> > entry = zram_entry_get(zram, uncmem, checksum);
> > if (entry) {
> > @@ -915,7 +976,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > goto out;
> > }
> >
> > - cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_WO);
> > + cmem = zs_map_object(meta->mem_pool,
> > + zram_entry_handle(zram, entry), ZS_MM_WO);
> >
> > if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) {
> > src = kmap_atomic(page);
> > @@ -927,7 +989,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> >
> > zcomp_stream_put(zram->comp);
> > zstrm = NULL;
> > - zs_unmap_object(meta->mem_pool, entry->handle);
> > + zs_unmap_object(meta->mem_pool, zram_entry_handle(zram, entry));
> > zram_entry_insert(zram, entry, checksum);
> >
> > found_duplication:
> > @@ -1310,6 +1372,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
> > static DEVICE_ATTR_WO(mem_used_max);
> > static DEVICE_ATTR_RW(max_comp_streams);
> > static DEVICE_ATTR_RW(comp_algorithm);
> > +static DEVICE_ATTR_RW(use_dedup);
> >
> > static struct attribute *zram_disk_attrs[] = {
> > &dev_attr_disksize.attr,
> > @@ -1320,6 +1383,7 @@ static int zram_open(struct block_device *bdev, fmode_t mode)
> > &dev_attr_mem_used_max.attr,
> > &dev_attr_max_comp_streams.attr,
> > &dev_attr_comp_algorithm.attr,
> > + &dev_attr_use_dedup.attr,
> > &dev_attr_io_stat.attr,
> > &dev_attr_mm_stat.attr,
> > &dev_attr_debug_stat.attr,
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index 07d1f8d..b823555 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -141,5 +141,6 @@ struct zram {
> > * zram is claimed so open request will be failed
> > */
> > bool claim; /* Protected by bdev->bd_mutex */
> > + int use_dedup;
>
> For binary result, I want to use 'bool'
Okay.
Thanks.
Hello Joonsoo,
On (03/16/17 11:46), [email protected] wrote:
[..]
> +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> + struct zram_entry *entry, bool populated)
> +{
> + struct zram_hash *hash;
> + u32 checksum;
> +
> + if (!populated)
> + goto free;
> +
> + checksum = entry->checksum;
> + hash = &meta->hash[checksum % meta->hash_size];
> +
> + spin_lock(&hash->lock);
> + entry->refcount--;
> + if (!entry->refcount) {
> + populated = false;
> + rb_erase(&entry->rb_node, &hash->rb_root);
> + RB_CLEAR_NODE(&entry->rb_node);
> + }
> + spin_unlock(&hash->lock);
> +
> +free:
> + if (!populated)
> + zram_entry_free(zram, meta, entry);
> +
> + return populated;
> +}
[ 2935.830100] BUG: unable to handle kernel NULL pointer dereference at 0000000000000154
[ 2935.830106] IP: zram_entry_put+0x15/0xbb [zram]
[ 2935.830107] PGD 4063aa067
[ 2935.830108] P4D 4063aa067
[ 2935.830108] PUD 19d426067
[ 2935.830109] PMD 0
[ 2935.830110] Oops: 0000 [#1] PREEMPT SMP
[ 2935.830111] Modules linked in: lzo zram(-) zsmalloc mousedev nls_iso8859_1 nls_cp437 vfat fat psmouse serio_raw atkbd libps2 coretemp hwmon iwlmvm crc32c_intel i2c_i801 r8169 iwlwifi lpc_ich mii mfd_core ie31200_edac edac_core thermal i8042 serio wmi evdev acpi_cpufreq sd_mod
[ 2935.830127] CPU: 3 PID: 1599 Comm: rmmod Tainted: G W 4.11.0-rc3-next-20170323-dbg-00012-gdda44065c67c-dirty #155
[ 2935.830128] task: ffff8804061c14c0 task.stack: ffff8801f7908000
[ 2935.830130] RIP: 0010:zram_entry_put+0x15/0xbb [zram]
[ 2935.830131] RSP: 0018:ffff8801f790bdc0 EFLAGS: 00010246
[ 2935.830132] RAX: 0000000000000000 RBX: ffff88041cdbfce0 RCX: 0000000000000001
[ 2935.830132] RDX: ffff880405f4cdb8 RSI: ffff88041cdbfce0 RDI: 0000000000000000
[ 2935.830133] RBP: ffff8801f790bde8 R08: ffffffffffffff80 R09: 00000000000000ff
[ 2935.830134] R10: ffff8801f790bd90 R11: 0000000000000000 R12: 0000000000000000
[ 2935.830134] R13: ffff88041cdbfce0 R14: ffff88041cdbfd00 R15: ffff88041cdbfce0
[ 2935.830135] FS: 00007fb350b62b40(0000) GS:ffff88042fac0000(0000) knlGS:0000000000000000
[ 2935.830136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2935.830137] CR2: 0000000000000154 CR3: 00000002d3f53000 CR4: 00000000001406e0
[ 2935.830137] Call Trace:
[ 2935.830139] zram_meta_free+0x50/0x7e [zram]
[ 2935.830141] zram_reset_device+0xd2/0xe5 [zram]
[ 2935.830142] zram_remove+0x9f/0xf1 [zram]
[ 2935.830143] ? zram_remove+0xf1/0xf1 [zram]
[ 2935.830145] zram_remove_cb+0x11/0x15 [zram]
[ 2935.830148] idr_for_each+0x3b/0x87
[ 2935.830149] destroy_devices+0x2a/0x56 [zram]
[ 2935.830150] zram_exit+0x9/0x6f6 [zram]
[ 2935.830153] SyS_delete_module+0xf1/0x181
[ 2935.830156] entry_SYSCALL_64_fastpath+0x18/0xad
[ 2935.830157] RIP: 0033:0x7fb3502659b7
[ 2935.830158] RSP: 002b:00007fff90ca5468 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
[ 2935.830159] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fb3502659b7
[ 2935.830159] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000000002187208
[ 2935.830160] RBP: 0000000000000000 R08: 00007fff90ca43e1 R09: 000000000000000a
[ 2935.830161] R10: 0000000000000892 R11: 0000000000000202 R12: 00000000021871a0
[ 2935.830162] R13: 00007fff90ca4450 R14: 00000000021871a0 R15: 0000000002186010
[ 2935.830163] Code: 1e 6f fb ff 4c 89 e7 e8 9e 05 f5 e0 48 89 d8 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 53 <83> bf 54 01 00 00 00 48 89 d3 75 0e 48 8b 7e 08 48 89 d6 e8 0b
[ 2935.830184] RIP: zram_entry_put+0x15/0xbb [zram] RSP: ffff8801f790bdc0
[ 2935.830184] CR2: 0000000000000154
[ 2935.838309] ---[ end trace f7f95fd0ed6c72a0 ]---
-ss
Hi Joonsoo,
On Thu, Mar 23, 2017 at 11:10:23AM +0900, Joonsoo Kim wrote:
> On Tue, Mar 21, 2017 at 08:10:05PM +0900, Minchan Kim wrote:
> > Hi Joonsoo,
> >
> > On Thu, Mar 16, 2017 at 11:46:35AM +0900, [email protected] wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > Zspage is always movable and is used through zs_map_object() function
> > > which returns directly accessible pointer that contains content of
> > > zspage. It is independent on the user's allocation flag.
> > > Therefore, it's better to always set movable/highmem flag to the zspage.
> > > After that, we don't need __GFP_MOVABLE/__GFP_HIGHMEM clearing in
> > > cache_alloc_handle()/cache_alloc_zspage() since there is no zs_malloc
> > > caller who specifies __GFP_MOVABLE/__GFP_HIGHMEM.
> >
> > Hmm, I wanted this when you pointed out to me firstly but when I think
> > again, I don't see it's improvement. Sorry for that.
> > The zs_malloc is exported symbol and it has gfp_t argument so user can
> > do whatever he want with any zone modifiers flags. IOW, if someuser want
> > to allocate pages from {normal|dma} zone by whatever reason, he can
> > omit __GFP_HIGHMEM from the gfp flag to fullfill the goal.
>
> Hello,
>
> I don't think that such flexibility makes things better. User cannot
> fully understand what flags are the best since it highly depends on
> implementation detail. For example, __GFP_MOVABLE is needed to
Zone modifiers (GFP_DMA|DMA32|HIGHMEM|MOVABLE|potentially CMA):
a user can select one of the zones for his goal according to S/W|H/W constraints.
> optimize memory fragmentation and user cannot know it and there is no
> reason that user need to know it. __GFP_HIGHMEM is the similar case.
> He cannot know that he can pass __GFP_HIGHMEM without knowing the
> implementation detail and he cannot know the impact of __GFP_HIGHMEM
> here. So, I think that adding these flags in zsmalloc can be justified.
>
> Anyway, this patch isn't so important for this series so if you don't
> like it, I will drop it.
Yes, I don't feel strongly that we need it at this moment.
Thanks.
On Thu, Mar 23, 2017 at 12:04:23PM +0900, Joonsoo Kim wrote:
> On Wed, Mar 22, 2017 at 08:41:21AM +0900, Minchan Kim wrote:
> > Hi Joonsoo,
> >
> > On Thu, Mar 16, 2017 at 11:46:37AM +0900, [email protected] wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > This patch implements deduplication feature in zram. The purpose
> > > of this work is naturally to save amount of memory usage by zram.
> > >
> > > Android is one of the biggest users to use zram as swap and it's
> > > really important to save amount of memory usage. There is a paper
> > > that reports that duplication ratio of Android's memory content is
> > > rather high [1]. And, there is a similar work on zswap that also
> > > reports that experiments has shown that around 10-15% of pages
> > > stored in zswp are duplicates and deduplicate them provides some
> > > benefits [2].
> > >
> > > Also, there is a different kind of workload that uses zram as blockdev
> > > and store build outputs into it to reduce wear-out problem of real
> > > blockdev. In this workload, deduplication hit is very high
> > > although I don't know exact reason about it.
> >
> > Hmm, Isn't it due to binary composed by obj files so that many of
> > part between binary and object are sharable?
> >
> > I'd like to clear it out because dedup was not useful for swap workload
> > for the testing in android unlike papers you mentioned.
> > Of course, it depends on the workload so someone might find it's
> > huge useful for his swap workload. However, I want to focus clear
> > winning case scenario rather than "might be better".
> >
> > That's why I want to know clear reason the saving. If my assumption
> > is right(i.e., object file vs. binary file), it would be enough
> > justfication to merge this feature because user can turn on the feature
> > with reasonable expectation.
>
> Okay. I checked the reason of benefit on the kernel build now. There are
> some cases that deduplication happens.
>
> 1) *.cmd
> Build command is usually similar in one directory. So, content of
> these file are very similar. Please check fs/ext4/.namei.o.cmd and
> fs/ext4/.inode.o.cmd. In my system, more than 789 lines are the same
> in 944 and 939 line of the file, respectively.
>
> 2) object file
> built-in.o and temporal object file have the similar content. More
> than 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. I
> think that we can optimize in this case. ext4.o is temporal object
> file and it isn't necessarily needed. We can change following
> fs/ext4/Makefile to optimized one.
>
> orig>
> obj-$(CONFIG_EXT4_FS) += ext4.o
> ext4-y := XXX YYY ZZZ
>
> optimized>
> obj-$(CONFIG_EXT4_FS) += XXX YYY ZZZ
>
> 3) vmlinux
> .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boot/compressed/vmlinux.bin
> have the similar content.
>
> I did similar checking for Android and it also has similar case.
> Some of object files (.class and .so) are similar with another object
> files. (./host/linux-x86/lib/libartd.so and
> ./host/linux-x86/lib/libartd-compiler.so)
Great.
Please include your analysis in the patch description.
With that, we can recommend turning the dedup feature on to users who use
zram for a build output directory. Please mention it in the Kconfig help text.
One thing I forgot during the review: please add documentation about use_dedup
to zram.txt.
>
> >
> > >
> > > Anyway, if we can detect duplicated content and avoid to store duplicated
> > > content at different memory space, we can save memory. This patch
> > > tries to do that.
> > >
> > > Implementation is almost simple and intuitive but I should note
> > > one thing about implementation detail.
> > >
> > > To check duplication, this patch uses checksum of the page and
> > > collision of this checksum could be possible. There would be
> > > many choices to handle this situation but this patch chooses
> > > to allow entry with duplicated checksum to be added to the hash,
> > > but, not to compare all entries with duplicated checksum
> > > when checking duplication. I guess that checksum collision is quite
> >
> > Hmm, if there are many duplicated checksum, what a specific checksum
> > is compared among them?
>
> I implemented it to check just first one.
>
> > > rare event and we don't need to pay any attention to such a case.
> >
> > If it's rare event, why can't we iterate all of entries?
>
> It's possible. If you prefer it, I can do it.
Yes, please do. I want to give the user the expected behavior unless it
adds big overhead or makes the code messy.
>
> > > Therefore, I decided the most simplest way to implement the feature.
> > > If there is a different opinion, I can accept and go that way.
> > >
> > > Following is the result of this patch.
> > >
> > > Test result #1 (Swap):
> > > Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
> > >
> > > orig_data_size: 145297408
> > > compr_data_size: 32408125
> > > mem_used_total: 32276480
> > > dup_data_size: 3188134
> > > meta_data_size: 1444272
> > >
> > > Last two metrics added to mm_stat are related to this work.
> > > First one, dup_data_size, is amount of saved memory by avoiding
> > > to store duplicated page. Later one, meta_data_size, is the amount of
> > > data structure to support deduplication. If dup > meta, we can judge
> > > that the patch improves memory usage.
> > >
> > > In Adnroid, we can save 5% of memory usage by this work.
> > >
> > > Test result #2 (Blockdev):
> > > build the kernel and store output to ext4 FS on zram
> > >
> > > <no-dedup>
> > > Elapsed time: 249 s
> > > mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
> > >
> > > <dedup>
> > > Elapsed time: 250 s
> > > mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
> > >
> > > There is no performance degradation and save 23% memory.
> > >
> > > Test result #3 (Blockdev):
> > > copy android build output dir(out/host) to ext4 FS on zram
> > >
> > > <no-dedup>
> > > Elapsed time: out/host: 88 s
> > > mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
> > >
> > > <dedup>
> > > Elapsed time: out/host: 100 s
> > > mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
> > >
> > > It shows performance degradation roughly 13% and save 24% memory. Maybe,
> > > it is due to overhead of calculating checksum and comparison.
> > >
> > > Test result #4 (Blockdev):
> > > copy android build output dir(out/target/common) to ext4 FS on zram
> > >
> > > <no-dedup>
> > > Elapsed time: out/host: 203 s
> > > mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
> > >
> > > <dedup>
> > > Elapsed time: out/host: 201 s
> > > mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
> > >
> > > Memory is saved by 42% and performance is the same. Even if there is overhead
> > > of calculating checksum and comparison, large hit ratio compensate it since
> > > hit leads to less compression attempt.
> >
> > Cool! It would help a lot for users have used zram to build output directory.
> >
> > >
> > > Anyway, benefit seems to be largely dependent on the workload so
> > > following patch will make this feature optional. However, this feature
> > > can help some usecases so is deserved to be merged.
> > >
> > > [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> > > dl.acm.org/citation.cfm?id=2797023
> > > [2]: zswap: Optimize compressed pool memory utilization,
> > > lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
> >
> > Overall, it looks good. Thanks!
> > However, I want to change function naming/structuring into my preference style
> > to maintain because it's non-trival feature. So, please look below.
>
> Okay. I understand all you commented. I will fix them and I won't reply to each one.
Thanks for the nice work!
On (03/23/17 12:05), Joonsoo Kim wrote:
> On Wed, Mar 22, 2017 at 09:00:59AM +0900, Minchan Kim wrote:
> > On Thu, Mar 16, 2017 at 11:46:38AM +0900, [email protected] wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > Benefit of deduplication is dependent on the workload so it's not
> > > preferable to always enable. Therefore, make it optional.
> >
> > Please make it to Kconfig, too. And write down the description to impress
> > "help a lot for users who uses zram to build output directory"
> > And the feature should be disabled as default.
>
> Okay.
so I was thinking for a moment -- do we want to keep this
functionality in zram, or maybe it belongs in the allocator (zsmalloc)?
what do you think? just a question.
-ss
On Thu, Mar 23, 2017 at 10:40:59PM +0900, Sergey Senozhatsky wrote:
>
> Hello Joonsoo,
>
> On (03/16/17 11:46), [email protected] wrote:
> [..]
> > +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> > + struct zram_entry *entry, bool populated)
> > +{
> > + struct zram_hash *hash;
> > + u32 checksum;
> > +
> > + if (!populated)
> > + goto free;
> > +
> > + checksum = entry->checksum;
> > + hash = &meta->hash[checksum % meta->hash_size];
> > +
> > + spin_lock(&hash->lock);
> > + entry->refcount--;
> > + if (!entry->refcount) {
> > + populated = false;
> > + rb_erase(&entry->rb_node, &hash->rb_root);
> > + RB_CLEAR_NODE(&entry->rb_node);
> > + }
> > + spin_unlock(&hash->lock);
> > +
> > +free:
> > + if (!populated)
> > + zram_entry_free(zram, meta, entry);
> > +
> > + return populated;
> > +}
>
> [ 2935.830100] BUG: unable to handle kernel NULL pointer dereference at 0000000000000154
> [ 2935.830106] IP: zram_entry_put+0x15/0xbb [zram]
> [ 2935.830107] PGD 4063aa067
> [ 2935.830108] P4D 4063aa067
> [ 2935.830108] PUD 19d426067
> [ 2935.830109] PMD 0
>
> [ 2935.830110] Oops: 0000 [#1] PREEMPT SMP
> [ 2935.830111] Modules linked in: lzo zram(-) zsmalloc mousedev nls_iso8859_1 nls_cp437 vfat fat psmouse serio_raw atkbd libps2 coretemp hwmon iwlmvm crc32c_intel i2c_i801 r8169 iwlwifi lpc_ich mii mfd_core ie31200_edac edac_core thermal i8042 serio wmi evdev acpi_cpufreq sd_mod
> [ 2935.830127] CPU: 3 PID: 1599 Comm: rmmod Tainted: G W 4.11.0-rc3-next-20170323-dbg-00012-gdda44065c67c-dirty #155
> [ 2935.830128] task: ffff8804061c14c0 task.stack: ffff8801f7908000
> [ 2935.830130] RIP: 0010:zram_entry_put+0x15/0xbb [zram]
> [ 2935.830131] RSP: 0018:ffff8801f790bdc0 EFLAGS: 00010246
> [ 2935.830132] RAX: 0000000000000000 RBX: ffff88041cdbfce0 RCX: 0000000000000001
> [ 2935.830132] RDX: ffff880405f4cdb8 RSI: ffff88041cdbfce0 RDI: 0000000000000000
> [ 2935.830133] RBP: ffff8801f790bde8 R08: ffffffffffffff80 R09: 00000000000000ff
> [ 2935.830134] R10: ffff8801f790bd90 R11: 0000000000000000 R12: 0000000000000000
> [ 2935.830134] R13: ffff88041cdbfce0 R14: ffff88041cdbfd00 R15: ffff88041cdbfce0
> [ 2935.830135] FS: 00007fb350b62b40(0000) GS:ffff88042fac0000(0000) knlGS:0000000000000000
> [ 2935.830136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2935.830137] CR2: 0000000000000154 CR3: 00000002d3f53000 CR4: 00000000001406e0
> [ 2935.830137] Call Trace:
> [ 2935.830139] zram_meta_free+0x50/0x7e [zram]
> [ 2935.830141] zram_reset_device+0xd2/0xe5 [zram]
> [ 2935.830142] zram_remove+0x9f/0xf1 [zram]
> [ 2935.830143] ? zram_remove+0xf1/0xf1 [zram]
> [ 2935.830145] zram_remove_cb+0x11/0x15 [zram]
> [ 2935.830148] idr_for_each+0x3b/0x87
> [ 2935.830149] destroy_devices+0x2a/0x56 [zram]
> [ 2935.830150] zram_exit+0x9/0x6f6 [zram]
> [ 2935.830153] SyS_delete_module+0xf1/0x181
> [ 2935.830156] entry_SYSCALL_64_fastpath+0x18/0xad
> [ 2935.830157] RIP: 0033:0x7fb3502659b7
> [ 2935.830158] RSP: 002b:00007fff90ca5468 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
> [ 2935.830159] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fb3502659b7
> [ 2935.830159] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000000002187208
> [ 2935.830160] RBP: 0000000000000000 R08: 00007fff90ca43e1 R09: 000000000000000a
> [ 2935.830161] R10: 0000000000000892 R11: 0000000000000202 R12: 00000000021871a0
> [ 2935.830162] R13: 00007fff90ca4450 R14: 00000000021871a0 R15: 0000000002186010
> [ 2935.830163] Code: 1e 6f fb ff 4c 89 e7 e8 9e 05 f5 e0 48 89 d8 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 53 <83> bf 54 01 00 00 00 48 89 d3 75 0e 48 8b 7e 08 48 89 d6 e8 0b
> [ 2935.830184] RIP: zram_entry_put+0x15/0xbb [zram] RSP: ffff8801f790bdc0
> [ 2935.830184] CR2: 0000000000000154
> [ 2935.838309] ---[ end trace f7f95fd0ed6c72a0 ]---
Hello,
Thanks for reporting.
I have noticed this bug and fixed it in version 2.
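For reference, the oops comes from the reset path: zram_meta_free() calls
zram_entry_put(NULL, meta, entry, true), and with the use_dedup patch applied
the first thing zram_entry_put() does is dereference that NULL zram. One
direction for the fix, purely as an illustration (the actual v2 change may be
structured differently), is to thread the zram device through the teardown
path instead of passing NULL:

/*
 * Illustration only (not the actual v2 fix): pass the zram device down
 * so zram_entry_put() never sees a NULL zram.  The table field name
 * ("entry") assumes the zram_entry conversion from patch 2.
 */
static void zram_meta_free(struct zram *zram, struct zram_meta *meta,
			   u64 disksize)
{
	size_t num_pages = disksize >> PAGE_SHIFT;
	size_t index;

	/* Free all pages that are still allocated. */
	for (index = 0; index < num_pages; index++) {
		struct zram_entry *entry = meta->table[index].entry;

		if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
			continue;

		zram_entry_put(zram, meta, entry, true);	/* was NULL */
	}

	zs_destroy_pool(meta->mem_pool);
	vfree(meta->hash);
	vfree(meta->table);
	kfree(meta);
}

Callers such as zram_reset_device() already have the zram device at hand and
could simply pass it through.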
Thanks.
On Mon, Mar 27, 2017 at 05:11:05PM +0900, Sergey Senozhatsky wrote:
> On (03/23/17 12:05), Joonsoo Kim wrote:
> > On Wed, Mar 22, 2017 at 09:00:59AM +0900, Minchan Kim wrote:
> > > On Thu, Mar 16, 2017 at 11:46:38AM +0900, [email protected] wrote:
> > > > From: Joonsoo Kim <[email protected]>
> > > >
> > > > Benefit of deduplication is dependent on the workload so it's not
> > > > preferable to always enable. Therefore, make it optional.
> > >
> > > Please make it to Kconfig, too. And write down the description to impress
> > > "help a lot for users who uses zram to build output directory"
> > > And the feature should be disabled as default.
> >
> > Okay.
>
> so I was thinking for a moment -- do we want to keep this
> functionality in zram or may be it belongs to allocator (zsmalloc)?
> what do you think? just a question.
I think that zram is the more appropriate layer to implement this feature.
I may be wrong, so please let me know if I'm missing something.
First, I'd like to keep the allocator just an allocator. If it is aware of the
contents, further improvement would be restricted. For example, we
would have to keep the map/unmap semantics for storing contents, since without
them we can't know when the content has changed and when a deduplication check
should be done. I know that zsmalloc is already implemented that
way, but I guess a similar issue could happen in the future.
Second, if it were implemented in zsmalloc, we would always need to compress
the page before checking for duplication, since we store compressed pages in
zsmalloc. I guess that fewer compressions would be better performance-wise.
Third, in the case of zsmalloc dedup, we would always need to allocate zs
memory before checking for duplication and free it again if a duplicate is
found. That's also undesirable.
If you are okay with the above arguments, I will send v2 soon.
Thanks.
Cc Seth and Dan, just in case
Hello Joonsoo,
On (03/28/17 10:02), Joonsoo Kim wrote:
[..]
> > so I was thinking for a moment -- do we want to keep this
> > functionality in zram or may be it belongs to allocator (zsmalloc)?
> > what do you think? just a question.
>
> I think that zram is more appropriate layer to implement this feature.
> I may be wrong so please let me know if I'm missing something.
>
> First, I'd like to leave allocator to just allocator. If it awares the
> contents, further improvement would be restricted. For example, we
> should use map/unmap semantic to store contents, since, without them,
> we can't know when the content is changed and when deduplication check
> should be done. I know that zsmalloc is already implemented by that
> way but I guess that similar issue could happen in the future.
>
> Second, we always need to compress the page to check duplication
> if it is implemented in zsmalloc since we store compressed page to
> zsmalloc. I guess that less compression would be better in performance
> wise.
>
> Third, in case of zsmalloc dedup, we always need to allocate zs memory
> before checking duplication and need to free it if duplication is
> found. It's also undesirable.
>
> If you are okay with above arguments, I will send v2 soon.
thanks.
I'm OK with your arguments.
to explain my point a bit further (zsmalloc was a bad call,
I guess I meant zpool):
the reason I asked was that both zram and zswap are sort of trying to
have the same optimizations - zero-filled page handling, for example.
zram is a bit ahead now (to the best of my knowledge), because of
the recent 'same element' filled pages. zswap will probably have
something like this as well some day, or maybe it won't - up to Seth
and Dan. de-duplication definitely can improve both zram and zswap,
which, once again, suggests that at some point zswap will have its
own implementation. well, or it won't.
so I thought that maybe we could have zero-filled page handling/same
element page handling/de-duplication somewhere in the "middle" layer,
like zpool for instance (zram does not support zpool as of now), so we
could unify things.
just an idea. no pressure.
-ss
Hi Sergey,
On Tue, Mar 28, 2017 at 11:22:45AM +0900, Sergey Senozhatsky wrote:
> Cc Seth and Dan, just in case
>
>
> Hello Joonsoo,
>
> On (03/28/17 10:02), Joonsoo Kim wrote:
> [..]
> > > so I was thinking for a moment -- do we want to keep this
> > > functionality in zram or may be it belongs to allocator (zsmalloc)?
> > > what do you think? just a question.
> >
> > I think that zram is more appropriate layer to implement this feature.
> > I may be wrong so please let me know if I'm missing something.
> >
> > First, I'd like to leave allocator to just allocator. If it awares the
> > contents, further improvement would be restricted. For example, we
> > should use map/unmap semantic to store contents, since, without them,
> > we can't know when the content is changed and when deduplication check
> > should be done. I know that zsmalloc is already implemented by that
> > way but I guess that similar issue could happen in the future.
> >
> > Second, we always need to compress the page to check duplication
> > if it is implemented in zsmalloc since we store compressed page to
> > zsmalloc. I guess that less compression would be better in performance
> > wise.
> >
> > Third, in case of zsmalloc dedup, we always need to allocate zs memory
> > before checking duplication and need to free it if duplication is
> > found. It's also undesirable.
> >
> > If you are okay with above arguments, I will send v2 soon.
>
> thanks.
> I'm OK with your arguments.
>
>
> to explain my point a bit further (zsmalloc was a bad call,
> I guess I meant zpool):
>
> the reason I asked was that both zram and zswap sort of trying to
> have same optimizations - zero filled pages handling, for example.
> zram is a bit ahead now (to the best of my knowledge), because of
> the recent 'same element' filled pages. zswap, probably, will have
> something like this as well some day. or may be it won't, up to Seth
> and Dan. de-duplication definitely can improve both zram and zswap,
> which, once again, suggests that at some point zswap will have its
> own implementation. well, or it won't.
As I pointed out, dedup showed no benefit for the swap case, at least.
I don't want to disrupt zsmalloc without any *proven* benefit.
Even though it *might* have a benefit, it shouldn't be in the allocator
layer unless it's a really huge benefit, like performance.
It would make it hard to change zram's allocator in the future.
And please consider that zswap was born for latency in server workloads,
while zram targets memory efficiency in the embedded world.
The dedup feature is a trade-off between those two goals, and zram is the
one it matches perfectly.
>
> so I though that may be we could have zero filled pages handling/same
> element pages handling/de-duplication somewhere in the "middle" layer.
> like zpool for instance (zram does not support zpool as of now) so we
> could unify things.
>
> just an idea. no pressure.
>
> -ss
Hello Minchan,
On (03/28/17 11:50), Minchan Kim wrote:
[..]
> > the reason I asked was that both zram and zswap sort of trying to
> > have same optimizations - zero filled pages handling, for example.
> > zram is a bit ahead now (to the best of my knowledge), because of
> > the recent 'same element' filled pages. zswap, probably, will have
> > something like this as well some day. or may be it won't, up to Seth
> > and Dan. de-duplication definitely can improve both zram and zswap,
> > which, once again, suggests that at some point zswap will have its
> > own implementation. well, or it won't.
>
> As I pointed out, at least, dedup was no benefit for the swap case.
> I don't want to disrupt zsmalloc without any *proved* benefit.
> Even though it *might* have benefit, it shouldn't be in allocator
> layer unless it's really huge benefit like performance.
sure.
zpool, I meant zpool. I mistakenly used the word 'allocator'.
I meant some intermediate layer between zram and the actual memory allocator,
a common layer which both zram and zswap can use and which can hold
common functionality. just an idea. haven't really thought about it yet.
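
something along these lines, purely hypothetical (the zcommon_* names below do
not exist anywhere, it's just to show the shape of the idea):

/*
 * Invented sketch of a "common layer" helper that both zram and zswap could
 * call before compressing or allocating anything. None of this exists today.
 */
static bool zcommon_same_filled(void *mem, unsigned long *element)
{
	unsigned long *word = mem;
	int i;

	/* zero-filled / same-element detection, done once in the common layer
	 * instead of separately in zram and (some day) zswap */
	for (i = 1; i < PAGE_SIZE / sizeof(*word); i++) {
		if (word[i] != word[0])
			return false;
	}

	*element = word[0];
	return true;
}

a dedup table lookup could sit in the same place, so the backends would only
ever see pages that really need to be compressed and stored.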
> It would make it hard to change zram's allocator in the future.
> And please consider that zswap was born for latency in server workloads,
> while zram is about memory efficiency in the embedded world.
maybe. I do suspect zswap is used in embedded setups as well [1]. there is even
a brand new allocator that 'reportedly' uses less memory than zsmalloc
and outperforms zsmalloc in embedded setups [1] (once again, reportedly;
I haven't tried it).
if z3fold is actually this good (I'm not saying it is not, I haven't
tested it), then it makes sense to switch to the zpool API in zram and let
zram users select the allocator that fits their setups better.
just saying.
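
(for the record, the allocation side of such a switch looks mostly mechanical;
zswap already drives its backend through the zpool API with calls roughly like
the untested sketch below -- error handling and the device wiring omitted:)

#include <linux/zpool.h>

/* "zsmalloc", "z3fold" or "zbud" could come from a module parameter or a
 * per-device sysfs knob */
static struct zpool *zram_zpool;

static int zram_zpool_init(const char *type)
{
	zram_zpool = zpool_create_pool(type, "zram0", GFP_NOIO, NULL);
	return zram_zpool ? 0 : -ENOMEM;
}

static int zram_zpool_store(const void *src, size_t clen, unsigned long *handle)
{
	void *dst;
	int ret;

	/* the actual allocator behind the handle is whatever was selected
	 * at pool creation time */
	ret = zpool_malloc(zram_zpool, clen, GFP_NOIO | __GFP_NOWARN, handle);
	if (ret)
		return ret;

	dst = zpool_map_handle(zram_zpool, *handle, ZPOOL_MM_WO);
	memcpy(dst, src, clen);
	zpool_unmap_handle(zram_zpool, *handle);
	return 0;
}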
[1] http://events.linuxfoundation.org/sites/events/files/slides/zram1.pdf
-ss
On Tue, Mar 28, 2017 at 02:12:04PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (03/28/17 11:50), Minchan Kim wrote:
> [..]
> > > the reason I asked was that both zram and zswap are sort of trying to
> > > have the same optimizations - zero-filled page handling, for example.
> > > zram is a bit ahead now (to the best of my knowledge), because of
> > > the recent 'same element' filled pages. zswap, probably, will have
> > > something like this as well some day. or maybe it won't, up to Seth
> > > and Dan. de-duplication definitely can improve both zram and zswap,
> > > which, once again, suggests that at some point zswap will have its
> > > own implementation. well, or it won't.
> >
> > As I pointed out, at least, dedup showed no benefit for the swap case.
> > I don't want to disrupt zsmalloc without any *proven* benefit.
> > Even if it *might* have a benefit, it shouldn't be in the allocator
> > layer unless it's a really huge benefit, like performance.
>
> sure.
>
> zpool, I meant zpool. I mistakenly used the word 'allocator'.
>
> I meant some intermediate layer between zram and the actual memory allocator,
> a common layer which both zram and zswap can use and which can hold
> common functionality. just an idea. haven't really thought about it yet.
>
> > It would make it hard to change zram's allocator in the future.
> > And please consider that zswap was born for latency in server workloads,
> > while zram is about memory efficiency in the embedded world.
>
> maybe. I do suspect zswap is used in embedded setups as well [1]. there is even
> a brand new allocator that 'reportedly' uses less memory than zsmalloc
> and outperforms zsmalloc in embedded setups [1] (once again, reportedly;
> I haven't tried it).
>
> if z3fold is actually this good (I'm not saying it is not, I haven't
> tested it), then it makes sense to switch to the zpool API in zram and let
> zram users select the allocator that fits their setups better.
>
> just saying.
>
>
> [1] http://events.linuxfoundation.org/sites/events/files/slides/zram1.pdf
I do not want to support multiple allocators in zram.
It's a real maintenance headache and it makes zram's goal drift.
If a new allocator *saves* much more memory than zsmalloc, it might be
a good candidate to replace zsmalloc.
If so, feel free to send patches with a test workload and without *any noise*.
Please do not just say "it's good" based on a simple test. What we need is
"why it's good", so that we can investigate the current problem and,
if it is caused by zsmalloc's design and is therefore hard to change,
then we might consider a new allocator seriously.
Anyway, this is off-topic for Joonsoo's patch.