Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932514AbdDQCXB (ORCPT ); Sun, 16 Apr 2017 22:23:01 -0400 Received: from mail-pg0-f67.google.com ([74.125.83.67]:34323 "EHLO mail-pg0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932320AbdDQCW7 (ORCPT ); Sun, 16 Apr 2017 22:22:59 -0400 Date: Mon, 17 Apr 2017 11:22:52 +0900 From: Joonsoo Kim To: Minchan Kim Cc: Andrew Morton , Sergey Senozhatsky , linux-kernel@vger.kernel.org, kernel-team@lge.com Subject: Re: [PATCH v2 2/4] zram: implement deduplication in zram Message-ID: <20170417022250.GC1351@js1304-desktop> References: <1490852289-11913-1-git-send-email-iamjoonsoo.kim@lge.com> <1490852289-11913-3-git-send-email-iamjoonsoo.kim@lge.com> <20170417013816.GA18215@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170417013816.GA18215@bbox> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 18498 Lines: 564 On Mon, Apr 17, 2017 at 10:38:16AM +0900, Minchan Kim wrote: > Hi Joonsoo, > > I reviewed this patch and overall, looks great! Thanks. Thanks! > However, as you know, recently, zram had lots of clean up so > this patchset should be rebased on it massively. > Sorry for the inconvenience. No problem. > > And there are some minor in below. I hope you handle them in > next submit, please. Okay. > On Thu, Mar 30, 2017 at 02:38:07PM +0900, js1304@gmail.com wrote: > > From: Joonsoo Kim > > > > This patch implements deduplication feature in zram. The purpose > > of this work is naturally to save amount of memory usage by zram. > > > > Android is one of the biggest users to use zram as swap and it's > > really important to save amount of memory usage. There is a paper > > that reports that duplication ratio of Android's memory content is > > rather high [1]. And, there is a similar work on zswap that also > > reports that experiments has shown that around 10-15% of pages > > stored in zswp are duplicates and deduplicate them provides some > > benefits [2]. > > > > Also, there is a different kind of workload that uses zram as blockdev > > and store build outputs into it to reduce wear-out problem of real > > blockdev. In this workload, deduplication hit is very high due to > > temporary files and intermediate object files. Detailed analysis is > > on the bottom of this description. > > > > Anyway, if we can detect duplicated content and avoid to store duplicated > > content at different memory space, we can save memory. This patch > > tries to do that. > > > > Implementation is almost simple and intuitive but I should note > > one thing about implementation detail. > > > > To check duplication, this patch uses checksum of the page and > > collision of this checksum could be possible. There would be > > many choices to handle this situation but this patch chooses > > to allow entry with duplicated checksum to be added to the hash, > > but, not to compare all entries with duplicated checksum > > when checking duplication. I guess that checksum collision is quite > > rare event and we don't need to pay any attention to such a case. > > Therefore, I decided the most simplest way to implement the feature. > > If there is a different opinion, I can accept and go that way. > > > > Following is the result of this patch. > > > > Test result #1 (Swap): > > Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18 > > > > orig_data_size: 145297408 > > compr_data_size: 32408125 > > mem_used_total: 32276480 > > dup_data_size: 3188134 > > meta_data_size: 1444272 > > > > Last two metrics added to mm_stat are related to this work. > > First one, dup_data_size, is amount of saved memory by avoiding > > to store duplicated page. Later one, meta_data_size, is the amount of > > data structure to support deduplication. If dup > meta, we can judge > > that the patch improves memory usage. > > > > In Adnroid, we can save 5% of memory usage by this work. > > > > Test result #2 (Blockdev): > > build the kernel and store output to ext4 FS on zram > > > > > > Elapsed time: 249 s > > mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0 > > > > > > Elapsed time: 250 s > > mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792 > > > > There is no performance degradation and save 23% memory. > > > > Test result #3 (Blockdev): > > copy android build output dir(out/host) to ext4 FS on zram > > > > > > Elapsed time: out/host: 88 s > > mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0 > > > > > > Elapsed time: out/host: 100 s > > mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336 > > > > It shows performance degradation roughly 13% and save 24% memory. Maybe, > > it is due to overhead of calculating checksum and comparison. > > > > Test result #4 (Blockdev): > > copy android build output dir(out/target/common) to ext4 FS on zram > > > > > > Elapsed time: out/host: 203 s > > mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0 > > > > > > Elapsed time: out/host: 201 s > > mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336 > > > > Memory is saved by 42% and performance is the same. Even if there is overhead > > of calculating checksum and comparison, large hit ratio compensate it since > > hit leads to less compression attempt. > > > > I checked the detailed reason of savings on kernel build workload and > > there are some cases that deduplication happens. > > > > 1) *.cmd > > Build command is usually similar in one directory so content of these file > > are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd > > and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file, > > respectively. > > > > 2) intermediate object files > > built-in.o and temporary object file have the similar contents. More than > > 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o. > > > > 3) vmlinux > > .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin > > have the similar contents. > > > > Android test has similar case that some of object files(.class > > and .so) are similar with another ones. > > (./host/linux-x86/lib/libartd.so and > > ./host/linux-x86-lib/libartd-comiler.so) > > > > Anyway, benefit seems to be largely dependent on the workload so > > following patch will make this feature optional. However, this feature > > can help some usecases so is deserved to be merged. > > > > [1]: MemScope: Analyzing Memory Duplication on Android Systems, > > dl.acm.org/citation.cfm?id=2797023 > > [2]: zswap: Optimize compressed pool memory utilization, > > lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2 > > > > Signed-off-by: Joonsoo Kim > > --- > > Documentation/blockdev/zram.txt | 2 + > > drivers/block/zram/Makefile | 2 +- > > drivers/block/zram/zram_dedup.c | 222 ++++++++++++++++++++++++++++++++++++++++ > > drivers/block/zram/zram_dedup.h | 25 +++++ > > drivers/block/zram/zram_drv.c | 62 ++++++++--- > > drivers/block/zram/zram_drv.h | 18 ++++ > > 6 files changed, 314 insertions(+), 17 deletions(-) > > create mode 100644 drivers/block/zram/zram_dedup.c > > create mode 100644 drivers/block/zram/zram_dedup.h > > > > diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt > > index 4fced8a..2cdc303 100644 > > --- a/Documentation/blockdev/zram.txt > > +++ b/Documentation/blockdev/zram.txt > > @@ -217,6 +217,8 @@ line of text and contains the following stats separated by whitespace: > > same_pages the number of same element filled pages written to this disk. > > No memory is allocated for such pages. > > pages_compacted the number of pages freed during compaction > > + dup_data_size deduplicated data size > > + meta_data_size the amount of metadata allocated for deduplication feature > > > > 9) Deactivate: > > swapoff /dev/zram0 > > diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile > > index 9e2b79e..29cb008 100644 > > --- a/drivers/block/zram/Makefile > > +++ b/drivers/block/zram/Makefile > > @@ -1,3 +1,3 @@ > > -zram-y := zcomp.o zram_drv.o > > +zram-y := zcomp.o zram_drv.o zram_dedup.o > > > > obj-$(CONFIG_ZRAM) += zram.o > > diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c > > new file mode 100644 > > index 0000000..d313fc8 > > --- /dev/null > > +++ b/drivers/block/zram/zram_dedup.c > > @@ -0,0 +1,222 @@ > > +/* > > + * Copyright (C) 2017 Joonsoo Kim. > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License > > + * as published by the Free Software Foundation; either version > > + * 2 of the License, or (at your option) any later version. > > + */ > > + > > +#include > > +#include > > + > > +#include "zram_drv.h" > > + > > +/* One slot will contain 128 pages theoretically */ > > +#define ZRAM_HASH_SHIFT 7 > > +#define ZRAM_HASH_SIZE_MIN (1 << 10) > > +#define ZRAM_HASH_SIZE_MAX (1 << 31) > > + > > +u64 zram_dedup_dup_size(struct zram *zram) > > +{ > > + return (u64)atomic64_read(&zram->stats.dup_data_size); > > +} > > + > > +u64 zram_dedup_meta_size(struct zram *zram) > > +{ > > + return (u64)atomic64_read(&zram->stats.meta_data_size); > > +} > > + > > +static u32 zram_dedup_checksum(unsigned char *mem) > > +{ > > + return jhash(mem, PAGE_SIZE, 0); > > +} > > + > > +void zram_dedup_insert(struct zram *zram, struct zram_entry *new, > > + u32 checksum) > > +{ > > + struct zram_meta *meta = zram->meta; > > + struct zram_hash *hash; > > + struct rb_root *rb_root; > > + struct rb_node **rb_node, *parent = NULL; > > + struct zram_entry *entry; > > + > > + new->checksum = checksum; > > + hash = &meta->hash[checksum % meta->hash_size]; > > + rb_root = &hash->rb_root; > > + > > + spin_lock(&hash->lock); > > + rb_node = &rb_root->rb_node; > > + while (*rb_node) { > > + parent = *rb_node; > > + entry = rb_entry(parent, struct zram_entry, rb_node); > > + if (checksum < entry->checksum) > > + rb_node = &parent->rb_left; > > + else if (checksum > entry->checksum) > > + rb_node = &parent->rb_right; > > + else > > + rb_node = &parent->rb_left; > > + } > > + > > + rb_link_node(&new->rb_node, parent, rb_node); > > + rb_insert_color(&new->rb_node, rb_root); > > + spin_unlock(&hash->lock); > > +} > > + > > +static bool zram_dedup_match(struct zram *zram, struct zram_entry *entry, > > + unsigned char *mem) > > +{ > > + bool match = false; > > + unsigned char *cmem; > > + struct zram_meta *meta = zram->meta; > > + struct zcomp_strm *zstrm; > > + > > + cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO); > > + if (entry->len == PAGE_SIZE) { > > + match = !memcmp(mem, cmem, PAGE_SIZE); > > + } else { > > + zstrm = zcomp_stream_get(zram->comp); > > + if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer)) > > + match = !memcmp(mem, zstrm->buffer, PAGE_SIZE); > > + zcomp_stream_put(zram->comp); > > + } > > + zs_unmap_object(meta->mem_pool, entry->handle); > > + > > + return match; > > +} > > + > > +static unsigned long zram_dedup_put(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry *entry) > > +{ > > + struct zram_hash *hash; > > + u32 checksum; > > + unsigned long refcount; > > + > > + checksum = entry->checksum; > > + hash = &meta->hash[checksum % meta->hash_size]; > > + > > + spin_lock(&hash->lock); > > + entry->refcount--; > > + refcount = entry->refcount; > > + if (!entry->refcount) { > > + rb_erase(&entry->rb_node, &hash->rb_root); > > + RB_CLEAR_NODE(&entry->rb_node); > > What's the purpose of this RB_CLEAR_NODE? > I think we don't need it. I thought that it is needed to initialize the rb entry but with re-check it seems to be unnecessary. I will remove it. > > + } else if (zram) { > > + atomic64_sub(entry->len, &zram->stats.dup_data_size); > > + } > > + spin_unlock(&hash->lock); > > + > > + return refcount; > > +} > > + > > +static struct zram_entry *zram_dedup_get(struct zram *zram, > > + unsigned char *mem, u32 checksum) > > +{ > > + struct zram_meta *meta = zram->meta; > > + struct zram_hash *hash; > > + struct zram_entry *entry; > > + struct rb_node *rb_node; > > + > > + hash = &meta->hash[checksum % meta->hash_size]; > > + > > + spin_lock(&hash->lock); > > + rb_node = hash->rb_root.rb_node; > > + while (rb_node) { > > + entry = rb_entry(rb_node, struct zram_entry, rb_node); > > + if (checksum == entry->checksum) { > > + entry->refcount++; > > + atomic64_add(entry->len, &zram->stats.dup_data_size); > > + spin_unlock(&hash->lock); > > + > > + if (zram_dedup_match(zram, entry, mem)) > > + return entry; > > + > > + zram_entry_free(zram, meta, entry); > > + > > + return NULL; > > + } > > + > > + if (checksum < entry->checksum) > > + rb_node = rb_node->rb_left; > > + else > > + rb_node = rb_node->rb_right; > > + } > > + spin_unlock(&hash->lock); > > + > > + return NULL; > > +} > > + > > +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem, > > + u32 *checksum) > > +{ > > + *checksum = zram_dedup_checksum(mem); > > + > > + return zram_dedup_get(zram, mem, *checksum); > > +} > > + > > +struct zram_entry *zram_dedup_alloc(struct zram *zram, > > + unsigned long handle, unsigned int len, > > + gfp_t flags) > > +{ > > + struct zram_entry *entry; > > + > > + entry = kzalloc(sizeof(*entry), > > + flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); > > zram_entry_alloc can mask off GFP_HIGHMEM|__GFP_MOVABLE and > it can pass filtered gfp flags to zs_malloc and zram_dedup_alloc. Okay. > > + if (!entry) > > + return NULL; > > + > > + entry->handle = handle; > > + RB_CLEAR_NODE(&entry->rb_node); > > Ditto. I think we don't need to clear rb_node. Okay. > > + entry->refcount = 1; > > + entry->len = len; > > + > > + atomic64_add(sizeof(*entry), &zram->stats.meta_data_size); > > + > > + return entry; > > +} > > + > > +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry *entry) > > +{ > > + unsigned long handle; > > + > > + if (zram_dedup_put(zram, meta, entry)) > > + return 0; > > + > > + handle = entry->handle; > > + kfree(entry); > > + > > + /* !zram happens when reset/fail and updating stat is useless */ > > With recent version, it's not true so your code would be more clean. :) Good! > > > + if (zram) > > + atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size); > > + > > + return handle; > > +} > > + > > +int zram_dedup_init(struct zram_meta *meta, size_t num_pages) > > +{ > > + int i; > > + struct zram_hash *hash; > > + > > + meta->hash_size = num_pages >> ZRAM_HASH_SHIFT; > > + meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size); > > + meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size); > > + meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash)); > > + if (!meta->hash) { > > + pr_err("Error allocating zram entry hash\n"); > > + return -ENOMEM; > > + } > > + > > + for (i = 0; i < meta->hash_size; i++) { > > + hash = &meta->hash[i]; > > + spin_lock_init(&hash->lock); > > + hash->rb_root = RB_ROOT; > > + } > > + > > + return 0; > > +} > > + > > +void zram_dedup_fini(struct zram_meta *meta) > > +{ > > + vfree(meta->hash); > > +} > > diff --git a/drivers/block/zram/zram_dedup.h b/drivers/block/zram/zram_dedup.h > > new file mode 100644 > > index 0000000..7071f32 > > --- /dev/null > > +++ b/drivers/block/zram/zram_dedup.h > > @@ -0,0 +1,25 @@ > > +#ifndef _ZRAM_DEDUP_H_ > > +#define _ZRAM_DEDUP_H_ > > + > > +struct zram; > > +struct zram_meta; > > +struct zram_entry; > > + > > +u64 zram_dedup_dup_size(struct zram *zram); > > +u64 zram_dedup_meta_size(struct zram *zram); > > + > > +void zram_dedup_insert(struct zram *zram, struct zram_entry *new, > > + u32 checksum); > > +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem, > > + u32 *checksum); > > + > > +struct zram_entry *zram_dedup_alloc(struct zram *zram, > > + unsigned long handle, unsigned int len, > > + gfp_t flags); > > +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry *entry); > > + > > +int zram_dedup_init(struct zram_meta *meta, size_t num_pages); > > +void zram_dedup_fini(struct zram_meta *meta); > > + > > +#endif /* _ZRAM_DEDUP_H_ */ > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c > > index f3949da..15cecd6 100644 > > --- a/drivers/block/zram/zram_drv.c > > +++ b/drivers/block/zram/zram_drv.c > > @@ -385,14 +385,16 @@ static ssize_t mm_stat_show(struct device *dev, > > max_used = atomic_long_read(&zram->stats.max_used_pages); > > > > ret = scnprintf(buf, PAGE_SIZE, > > - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n", > > + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n", > > orig_size << PAGE_SHIFT, > > (u64)atomic64_read(&zram->stats.compr_data_size), > > mem_used << PAGE_SHIFT, > > zram->limit_pages << PAGE_SHIFT, > > max_used << PAGE_SHIFT, > > (u64)atomic64_read(&zram->stats.same_pages), > > - pool_stats.pages_compacted); > > + pool_stats.pages_compacted, > > + zram_dedup_dup_size(zram), > > + zram_dedup_meta_size(zram)); > > up_read(&zram->init_lock); > > > > return ret; > > @@ -424,25 +426,29 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram, > > { > > struct zram_meta *meta = zram->meta; > > struct zram_entry *entry; > > + unsigned long handle; > > > > - entry = kzalloc(sizeof(*entry), flags); > > - if (!entry) > > + handle = zs_malloc(meta->mem_pool, len, flags); > > + if (!handle) > > return NULL; > > > > - entry->handle = zs_malloc(meta->mem_pool, len, flags); > > - if (!entry->handle) { > > - kfree(entry); > > + entry = zram_dedup_alloc(zram, handle, len, flags); > > I don't like to hide zram_entry allocation behind zram_dedup_alloc. > It's never wrong because zram_table allocation happens on only > with dedup feature but zram_table is owned by zram_drv so I want > for zram_drv to control zram_table alloc/free. Okay. > How about this? > > zram_entry_alloc > unsigned long handle; > struct zram_entry *entry; > > handle = zs_malloc; > if (zram_dedup_enabled()) > return handle; > > entry = kzalloc(); > zram_dedup_init(entry, handle, len); > > return entry; > > I will change it as you suggested. > > + if (!entry) { > > + zs_free(meta->mem_pool, handle); > > return NULL; > > } > > > > return entry; > > } > > > > -static inline void zram_entry_free(struct zram_meta *meta, > > - struct zram_entry *entry) > > +void zram_entry_free(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry *entry) > > { > > - zs_free(meta->mem_pool, entry->handle); > > - kfree(entry); > > + unsigned long handle; > > + > > + handle = zram_dedup_free(zram, meta, entry); > > Ditto. > unsigned long handle = entry->handle; > > if (zram_dedup_put(zram, meta, entry)) > kfree(entry); > > zs_free(meta->mem_pool, handle); > > With this, In [4/4], I want that zram_entry_handle > doesn't rely on zram_dedup_handle. Okay. Thanks.