2022-10-18 04:58:14

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 0/9] zram: Support multiple compression streams

Hello,

This series adds support for multiple (per-CPU)
compression streams (at point only 2). The main idea is that
different compression algorithms have different characteristics
and zram may benefit when it uses a combination of algorithms:
a default algorithm that is faster but have lower compression
rate and a secondary algorithm that can use higher compression
rate at a price of slower compression/decompression.

There are several use-case for this functionality:

- huge pages re-compression: zstd or deflate can successfully
compress huge pages (~50% of huge pages on my synthetic ChromeOS
tests), IOW pages that lzo was not able to compress.

- idle pages re-compression: idle/cold pages sit in the memory
and we may reduce zsmalloc memory usage if we recompress those
idle pages.

User-space has a number of ways to control the behavior
and impact of zram recompression: what type of pages should be
recompressed, size watermarks, etc. Please refer to documentation
patch.

v4:
-- added IS_ERR_VALUE patch (Andrew)
-- documented SIZE units (bytes) (Andrew)
-- re-phrased writeback BIO error comment (Andrew)
-- return zs_malloc() error code from zram_recompress()
-- do not lose zram_recompress() error in recompress_store()
-- corrected a typo
-- fixed previous rebase errors
-- rebased the series

v3:
-- conditionally reschedule during recompression loop so that
we don't stall RCU grace periods
-- fixed a false-positive WARN_ON

v2:
-- rebased
-- mark completely incompressible pages (neither default nor secondary
algorithm can compress them) with a new flag so that we don't attempt
to recompress them all the time

Sergey Senozhatsky (9):
zram: Preparation for multi-zcomp support
zram: Add recompression algorithm sysfs knob
zram: Factor out WB and non-WB zram read functions
zram: Introduce recompress sysfs knob
documentation: Add recompression documentation
zram: Add recompression algorithm choice to Kconfig
zram: Add recompress flag to read_block_state()
zram: Clarify writeback_store() comment
zram: Use IS_ERR_VALUE() to check for zs_malloc() errors

Documentation/admin-guide/blockdev/zram.rst | 64 ++-
drivers/block/zram/Kconfig | 55 +++
drivers/block/zram/zcomp.c | 6 +-
drivers/block/zram/zcomp.h | 2 +-
drivers/block/zram/zram_drv.c | 458 +++++++++++++++++---
drivers/block/zram/zram_drv.h | 16 +-
6 files changed, 526 insertions(+), 75 deletions(-)

--
2.38.0.413.g74048e4d9e-goog


2022-10-18 04:58:15

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 1/9] zram: Preparation for multi-zcomp support

The patch turns compression streams and compressor algorithm
name struct zram members into arrays, so that we can have
multiple compression streams support (in the next patches).

The patch uses a rather explicit API for compressor selection:

- Get primary (default) compression stream
zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP])
- Get secondary compression stream
zcomp_stream_get(zram->comps[ZRAM_SECONDARY_ZCOMP])

We use similar API for compression streams put().

At this point we always have just one compression stream,
since CONFIG_ZRAM_MULTI_COMP is not yet defined.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zcomp.c | 6 +--
drivers/block/zram/zcomp.h | 2 +-
drivers/block/zram/zram_drv.c | 87 ++++++++++++++++++++++++-----------
drivers/block/zram/zram_drv.h | 14 +++++-
4 files changed, 77 insertions(+), 32 deletions(-)

diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 0916de952e09..55af4efd7983 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -206,7 +206,7 @@ void zcomp_destroy(struct zcomp *comp)
* case of allocation error, or any other error potentially
* returned by zcomp_init().
*/
-struct zcomp *zcomp_create(const char *compress)
+struct zcomp *zcomp_create(const char *alg)
{
struct zcomp *comp;
int error;
@@ -216,14 +216,14 @@ struct zcomp *zcomp_create(const char *compress)
* is not loaded yet. We must do it here, otherwise we are about to
* call /sbin/modprobe under CPU hot-plug lock.
*/
- if (!zcomp_available_algorithm(compress))
+ if (!zcomp_available_algorithm(alg))
return ERR_PTR(-EINVAL);

comp = kzalloc(sizeof(struct zcomp), GFP_KERNEL);
if (!comp)
return ERR_PTR(-ENOMEM);

- comp->name = compress;
+ comp->name = alg;
error = zcomp_init(comp);
if (error) {
kfree(comp);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index 40f6420f4b2e..cdefdef93da8 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -27,7 +27,7 @@ int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node);
ssize_t zcomp_available_show(const char *comp, char *buf);
bool zcomp_available_algorithm(const char *comp);

-struct zcomp *zcomp_create(const char *comp);
+struct zcomp *zcomp_create(const char *alg);
void zcomp_destroy(struct zcomp *comp);

struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 966aab902d19..770ea3489eb6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1007,36 +1007,53 @@ static ssize_t comp_algorithm_show(struct device *dev,
struct zram *zram = dev_to_zram(dev);

down_read(&zram->init_lock);
- sz = zcomp_available_show(zram->compressor, buf);
+ sz = zcomp_available_show(zram->comp_algs[ZRAM_PRIMARY_ZCOMP], buf);
up_read(&zram->init_lock);

return sz;
}

+static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
+{
+ /* Do not kfree() algs that we didn't allocate, IOW the default ones */
+ if (zram->comp_algs[idx] != default_compressor)
+ kfree(zram->comp_algs[idx]);
+ zram->comp_algs[idx] = alg;
+}
+
static ssize_t comp_algorithm_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct zram *zram = dev_to_zram(dev);
- char compressor[ARRAY_SIZE(zram->compressor)];
+ char *compressor;
size_t sz;

- strscpy(compressor, buf, sizeof(compressor));
+ sz = strlen(buf);
+ if (sz >= CRYPTO_MAX_ALG_NAME)
+ return -E2BIG;
+
+ compressor = kstrdup(buf, GFP_KERNEL);
+ if (!compressor)
+ return -ENOMEM;
+
/* ignore trailing newline */
- sz = strlen(compressor);
if (sz > 0 && compressor[sz - 1] == '\n')
compressor[sz - 1] = 0x00;

- if (!zcomp_available_algorithm(compressor))
+ if (!zcomp_available_algorithm(compressor)) {
+ kfree(compressor);
return -EINVAL;
+ }

down_write(&zram->init_lock);
if (init_done(zram)) {
up_write(&zram->init_lock);
+ kfree(compressor);
pr_info("Can't change algorithm for initialized device\n");
return -EBUSY;
}

- strcpy(zram->compressor, compressor);
+ comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, compressor);
up_write(&zram->init_lock);
return len;
}
@@ -1284,7 +1301,7 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
size = zram_get_obj_size(zram, index);

if (size != PAGE_SIZE)
- zstrm = zcomp_stream_get(zram->comp);
+ zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);

src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
if (size == PAGE_SIZE) {
@@ -1296,7 +1313,7 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
dst = kmap_atomic(page);
ret = zcomp_decompress(zstrm, src, size, dst);
kunmap_atomic(dst);
- zcomp_stream_put(zram->comp);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
}
zs_unmap_object(zram->mem_pool, handle);
zram_slot_unlock(zram, index);
@@ -1363,13 +1380,13 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
kunmap_atomic(mem);

compress_again:
- zstrm = zcomp_stream_get(zram->comp);
+ zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
src = kmap_atomic(page);
ret = zcomp_compress(zstrm, src, &comp_len);
kunmap_atomic(src);

if (unlikely(ret)) {
- zcomp_stream_put(zram->comp);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
pr_err("Compression failed! err=%d\n", ret);
zs_free(zram->mem_pool, handle);
return ret;
@@ -1397,7 +1414,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR((void *)handle)) {
- zcomp_stream_put(zram->comp);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
@@ -1414,14 +1431,14 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
* zstrm buffer back. It is necessary that the dereferencing
* of the zstrm variable below occurs correctly.
*/
- zstrm = zcomp_stream_get(zram->comp);
+ zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
}

alloced_pages = zs_get_total_pages(zram->mem_pool);
update_used_max(zram, alloced_pages);

if (zram->limit_pages && alloced_pages > zram->limit_pages) {
- zcomp_stream_put(zram->comp);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
@@ -1435,7 +1452,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
if (comp_len == PAGE_SIZE)
kunmap_atomic(src);

- zcomp_stream_put(zram->comp);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
zs_unmap_object(zram->mem_pool, handle);
atomic64_add(comp_len, &zram->stats.compr_data_size);
out:
@@ -1710,6 +1727,20 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
return ret;
}

+static void zram_destroy_comps(struct zram *zram)
+{
+ u32 idx;
+
+ for (idx = 0; idx < ZRAM_MAX_ZCOMPS; idx++) {
+ struct zcomp *comp = zram->comps[idx];
+
+ zram->comps[idx] = NULL;
+ if (IS_ERR_OR_NULL(comp))
+ continue;
+ zcomp_destroy(comp);
+ }
+}
+
static void zram_reset_device(struct zram *zram)
{
down_write(&zram->init_lock);
@@ -1727,11 +1758,11 @@ static void zram_reset_device(struct zram *zram)
/* I/O operation under all of CPU are done so let's free */
zram_meta_free(zram, zram->disksize);
zram->disksize = 0;
+ zram_destroy_comps(zram);
memset(&zram->stats, 0, sizeof(zram->stats));
- zcomp_destroy(zram->comp);
- zram->comp = NULL;
reset_bdev(zram);

+ comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, default_compressor);
up_write(&zram->init_lock);
}

@@ -1742,6 +1773,7 @@ static ssize_t disksize_store(struct device *dev,
struct zcomp *comp;
struct zram *zram = dev_to_zram(dev);
int err;
+ u32 idx;

disksize = memparse(buf, NULL);
if (!disksize)
@@ -1760,22 +1792,25 @@ static ssize_t disksize_store(struct device *dev,
goto out_unlock;
}

- comp = zcomp_create(zram->compressor);
- if (IS_ERR(comp)) {
- pr_err("Cannot initialise %s compressing backend\n",
- zram->compressor);
- err = PTR_ERR(comp);
- goto out_free_meta;
- }
+ for (idx = 0; idx < ZRAM_MAX_ZCOMPS; idx++) {
+ comp = zcomp_create(zram->comp_algs[idx]);
+ if (IS_ERR(comp)) {
+ pr_err("Cannot initialise %s compressing backend\n",
+ zram->comp_algs[idx]);
+ err = PTR_ERR(comp);
+ goto out_free_comps;
+ }

- zram->comp = comp;
+ zram->comps[idx] = comp;
+ }
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);

return len;

-out_free_meta:
+out_free_comps:
+ zram_destroy_comps(zram);
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
@@ -1962,7 +1997,7 @@ static int zram_add(void)
if (ret)
goto out_cleanup_disk;

- strscpy(zram->compressor, default_compressor, sizeof(zram->compressor));
+ zram->comp_algs[ZRAM_PRIMARY_ZCOMP] = default_compressor;

zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index a2bda53020fd..4044ddbb2326 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -89,10 +89,20 @@ struct zram_stats {
#endif
};

+#ifdef CONFIG_ZRAM_MULTI_COMP
+#define ZRAM_PRIMARY_ZCOMP 0
+#define ZRAM_SECONDARY_ZCOMP 1
+#define ZRAM_MAX_ZCOMPS 2
+#else
+#define ZRAM_PRIMARY_ZCOMP 0
+#define ZRAM_SECONDARY_ZCOMP 0
+#define ZRAM_MAX_ZCOMPS 1
+#endif
+
struct zram {
struct zram_table_entry *table;
struct zs_pool *mem_pool;
- struct zcomp *comp;
+ struct zcomp *comps[ZRAM_MAX_ZCOMPS];
struct gendisk *disk;
/* Prevent concurrent execution of device init */
struct rw_semaphore init_lock;
@@ -107,7 +117,7 @@ struct zram {
* we can store in a disk.
*/
u64 disksize; /* bytes */
- char compressor[CRYPTO_MAX_ALG_NAME];
+ const char *comp_algs[ZRAM_MAX_ZCOMPS];
/*
* zram is claimed so open request will be failed
*/
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:58:23

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

Introduce recomp_algorithm sysfs knob that controls
secondary algorithm selection used for recompression.
This device attribute works in a similar way with
comp_algorithm attribute.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zram_drv.c | 111 +++++++++++++++++++++++++++-------
1 file changed, 90 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 770ea3489eb6..a8ef3c0c3dae 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -41,7 +41,12 @@ static DEFINE_IDR(zram_index_idr);
static DEFINE_MUTEX(zram_index_mutex);

static int zram_major;
-static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+static const char *default_comp_algs[ZRAM_MAX_ZCOMPS] = {
+ CONFIG_ZRAM_DEF_COMP,
+#ifdef CONFIG_ZRAM_MULTI_COMP
+ "zstd",
+#endif
+};

/* Module params (documentation at end) */
static unsigned int num_devices = 1;
@@ -1000,31 +1005,37 @@ static ssize_t max_comp_streams_store(struct device *dev,
return len;
}

-static ssize_t comp_algorithm_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
{
- size_t sz;
- struct zram *zram = dev_to_zram(dev);
+ bool default_alg = false;
+ int i;

- down_read(&zram->init_lock);
- sz = zcomp_available_show(zram->comp_algs[ZRAM_PRIMARY_ZCOMP], buf);
- up_read(&zram->init_lock);
+ /* Do not kfree() algs that we didn't allocate, IOW the default ones */
+ for (i = 0; i < ZRAM_MAX_ZCOMPS; i++) {
+ if (zram->comp_algs[idx] == default_comp_algs[i]) {
+ default_alg = true;
+ break;
+ }
+ }

- return sz;
+ if (!default_alg)
+ kfree(zram->comp_algs[idx]);
+ zram->comp_algs[idx] = alg;
}

-static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
+static ssize_t __comp_algorithm_show(struct zram *zram, u32 idx, char *buf)
{
- /* Do not kfree() algs that we didn't allocate, IOW the default ones */
- if (zram->comp_algs[idx] != default_compressor)
- kfree(zram->comp_algs[idx]);
- zram->comp_algs[idx] = alg;
+ ssize_t sz;
+
+ down_read(&zram->init_lock);
+ sz = zcomp_available_show(zram->comp_algs[idx], buf);
+ up_read(&zram->init_lock);
+
+ return sz;
}

-static ssize_t comp_algorithm_store(struct device *dev,
- struct device_attribute *attr, const char *buf, size_t len)
+static int __comp_algorithm_store(struct zram *zram, u32 idx, const char *buf)
{
- struct zram *zram = dev_to_zram(dev);
char *compressor;
size_t sz;

@@ -1053,11 +1064,55 @@ static ssize_t comp_algorithm_store(struct device *dev,
return -EBUSY;
}

- comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, compressor);
+ comp_algorithm_set(zram, idx, compressor);
up_write(&zram->init_lock);
- return len;
+ return 0;
+}
+
+static ssize_t comp_algorithm_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct zram *zram = dev_to_zram(dev);
+
+ return __comp_algorithm_show(zram, ZRAM_PRIMARY_ZCOMP, buf);
+}
+
+static ssize_t comp_algorithm_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf,
+ size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ int ret;
+
+ ret = __comp_algorithm_store(zram, ZRAM_PRIMARY_ZCOMP, buf);
+ return ret ? ret : len;
}

+#ifdef CONFIG_ZRAM_MULTI_COMP
+static ssize_t recomp_algorithm_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct zram *zram = dev_to_zram(dev);
+
+ return __comp_algorithm_show(zram, ZRAM_SECONDARY_ZCOMP, buf);
+}
+
+static ssize_t recomp_algorithm_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf,
+ size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ int ret;
+
+ ret = __comp_algorithm_store(zram, ZRAM_SECONDARY_ZCOMP, buf);
+ return ret ? ret : len;
+}
+#endif
+
static ssize_t compact_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
@@ -1762,7 +1817,11 @@ static void zram_reset_device(struct zram *zram)
memset(&zram->stats, 0, sizeof(zram->stats));
reset_bdev(zram);

- comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, default_compressor);
+ comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP,
+ default_comp_algs[ZRAM_PRIMARY_ZCOMP]);
+ if (IS_ENABLED(CONFIG_ZRAM_MULTI_COMP))
+ comp_algorithm_set(zram, ZRAM_SECONDARY_ZCOMP,
+ default_comp_algs[ZRAM_SECONDARY_ZCOMP]);
up_write(&zram->init_lock);
}

@@ -1895,6 +1954,9 @@ static DEVICE_ATTR_WO(writeback);
static DEVICE_ATTR_RW(writeback_limit);
static DEVICE_ATTR_RW(writeback_limit_enable);
#endif
+#ifdef CONFIG_ZRAM_MULTI_COMP
+static DEVICE_ATTR_RW(recomp_algorithm);
+#endif

static struct attribute *zram_disk_attrs[] = {
&dev_attr_disksize.attr,
@@ -1918,6 +1980,9 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_bd_stat.attr,
#endif
&dev_attr_debug_stat.attr,
+#ifdef CONFIG_ZRAM_MULTI_COMP
+ &dev_attr_recomp_algorithm.attr,
+#endif
NULL,
};

@@ -1997,7 +2062,11 @@ static int zram_add(void)
if (ret)
goto out_cleanup_disk;

- zram->comp_algs[ZRAM_PRIMARY_ZCOMP] = default_compressor;
+ zram->comp_algs[ZRAM_PRIMARY_ZCOMP] =
+ default_comp_algs[ZRAM_PRIMARY_ZCOMP];
+ if (IS_ENABLED(CONFIG_ZRAM_MULTI_COMP))
+ zram->comp_algs[ZRAM_SECONDARY_ZCOMP] =
+ default_comp_algs[ZRAM_SECONDARY_ZCOMP];

zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:58:45

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 3/9] zram: Factor out WB and non-WB zram read functions

We will use non-WB variant in ZRAM page recompression path.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zram_drv.c | 73 ++++++++++++++++++++++++-----------
1 file changed, 50 insertions(+), 23 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a8ef3c0c3dae..94c62d7ea818 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1314,8 +1314,30 @@ static void zram_free_page(struct zram *zram, size_t index)
~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
}

-static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
- struct bio *bio, bool partial_io)
+/*
+ * Reads a page from the writeback devices. Corresponding ZRAM slot
+ * should be unlocked.
+ */
+static int zram_read_from_writeback(struct zram *zram, struct page *page,
+ u32 index, struct bio *bio,
+ bool partial_io)
+{
+ struct bio_vec bvec;
+
+ bvec.bv_page = page;
+ bvec.bv_len = PAGE_SIZE;
+ bvec.bv_offset = 0;
+ return read_from_bdev(zram, &bvec,
+ zram_get_element(zram, index),
+ bio, partial_io);
+}
+
+/*
+ * Reads (decompresses if needed) a page from zspool (zsmalloc).
+ * Corresponding ZRAM slot should be locked.
+ */
+static int zram_read_from_zspool(struct zram *zram, struct page *page,
+ u32 index)
{
struct zcomp_strm *zstrm;
unsigned long handle;
@@ -1323,23 +1345,6 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
void *src, *dst;
int ret;

- zram_slot_lock(zram, index);
- if (zram_test_flag(zram, index, ZRAM_WB)) {
- struct bio_vec bvec;
-
- zram_slot_unlock(zram, index);
- /* A null bio means rw_page was used, we must fallback to bio */
- if (!bio)
- return -EOPNOTSUPP;
-
- bvec.bv_page = page;
- bvec.bv_len = PAGE_SIZE;
- bvec.bv_offset = 0;
- return read_from_bdev(zram, &bvec,
- zram_get_element(zram, index),
- bio, partial_io);
- }
-
handle = zram_get_handle(zram, index);
if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
unsigned long value;
@@ -1349,7 +1354,6 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
mem = kmap_atomic(page);
zram_fill_page(mem, PAGE_SIZE, value);
kunmap_atomic(mem);
- zram_slot_unlock(zram, index);
return 0;
}

@@ -1371,17 +1375,40 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
}
zs_unmap_object(zram->mem_pool, handle);
- zram_slot_unlock(zram, index);
+ return ret;
+}
+
+static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
+ struct bio *bio, bool partial_io)
+{
+ int ret;
+
+ zram_slot_lock(zram, index);
+ if (!zram_test_flag(zram, index, ZRAM_WB)) {
+ /* Slot should be locked through out the function call */
+ ret = zram_read_from_zspool(zram, page, index);
+ zram_slot_unlock(zram, index);
+ } else {
+ /* Slot should be unlocked before the function call */
+ zram_slot_unlock(zram, index);
+
+ /* A null bio means rw_page was used, we must fallback to bio */
+ if (!bio)
+ return -EOPNOTSUPP;
+
+ ret = zram_read_from_writeback(zram, page, index, bio,
+ partial_io);
+ }

/* Should NEVER happen. Return bio error if it does. */
- if (WARN_ON(ret))
+ if (WARN_ON(ret < 0))
pr_err("Decompression failed! err=%d, page=%u\n", ret, index);

return ret;
}

static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
- u32 index, int offset, struct bio *bio)
+ u32 index, int offset, struct bio *bio)
{
int ret;
struct page *page;
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:58:45

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

Allow zram to recompress (using secondary compression streams)
pages. We support three modes:

1) IDLE pages recompression is activated by `idle` mode

echo idle > /sys/block/zram0/recompress

2) Since there may be many idle pages user-space may pass a size
watermark value (in bytes) and we will recompress IDLE pages only
of equal or greater size:

echo 888 > /sys/block/zram0/recompress

3) HUGE pages recompression is activated by `huge` mode

echo huge > /sys/block/zram0/recompress

4) HUGE_IDLE pages recompression is activated by `huge_idle` mode

echo huge_idle > /sys/block/zram0/recompress

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/Kconfig | 15 +++
drivers/block/zram/zram_drv.c | 196 +++++++++++++++++++++++++++++++++-
drivers/block/zram/zram_drv.h | 2 +
3 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index d4100b0c083e..3e00656a6f8a 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -78,3 +78,18 @@ config ZRAM_MEMORY_TRACKING
/sys/kernel/debug/zram/zramX/block_state.

See Documentation/admin-guide/blockdev/zram.rst for more information.
+
+config ZRAM_MULTI_COMP
+ bool "Enable multiple per-CPU compression streams"
+ depends on ZRAM
+ help
+ This will enable per-CPU multi-compression streams, so that ZRAM
+ can re-compress IDLE/huge pages, using a potentially slower but
+ more effective compression algorithm. Note, that IDLE page support
+ requires ZRAM_MEMORY_TRACKING.
+
+ echo TIMEOUT > /sys/block/zramX/idle
+ echo SIZE > /sys/block/zramX/recompress
+
+ SIZE (in bytes) parameter sets the object size watermark: idle
+ objects that are of a smaller size will not get recompressed.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 94c62d7ea818..da11560ecf70 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1282,6 +1282,12 @@ static void zram_free_page(struct zram *zram, size_t index)
atomic64_dec(&zram->stats.huge_pages);
}

+ if (zram_test_flag(zram, index, ZRAM_RECOMP))
+ zram_clear_flag(zram, index, ZRAM_RECOMP);
+
+ if (zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
+ zram_clear_flag(zram, index, ZRAM_RECOMP_SKIP);
+
if (zram_test_flag(zram, index, ZRAM_WB)) {
zram_clear_flag(zram, index, ZRAM_WB);
free_block_bdev(zram, zram_get_element(zram, index));
@@ -1343,6 +1349,7 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
unsigned long handle;
unsigned int size;
void *src, *dst;
+ u32 idx;
int ret;

handle = zram_get_handle(zram, index);
@@ -1359,8 +1366,13 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,

size = zram_get_obj_size(zram, index);

- if (size != PAGE_SIZE)
- zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
+ if (size != PAGE_SIZE) {
+ idx = ZRAM_PRIMARY_ZCOMP;
+ if (zram_test_flag(zram, index, ZRAM_RECOMP))
+ idx = ZRAM_SECONDARY_ZCOMP;
+
+ zstrm = zcomp_stream_get(zram->comps[idx]);
+ }

src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
if (size == PAGE_SIZE) {
@@ -1372,7 +1384,7 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
dst = kmap_atomic(page);
ret = zcomp_decompress(zstrm, src, size, dst);
kunmap_atomic(dst);
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
+ zcomp_stream_put(zram->comps[idx]);
}
zs_unmap_object(zram->mem_pool, handle);
return ret;
@@ -1603,6 +1615,182 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
return ret;
}

+#ifdef CONFIG_ZRAM_MULTI_COMP
+/*
+ * This function will decompress (unless it's ZRAM_HUGE) the page and then
+ * attempt to compress it using secondary compression algorithm (which is
+ * potentially more effective).
+ *
+ * Corresponding ZRAM slot should be locked.
+ */
+static int zram_recompress(struct zram *zram, u32 index, struct page *page,
+ int size_watermark)
+{
+ unsigned long handle_prev;
+ unsigned long handle_next;
+ unsigned int comp_len_next;
+ unsigned int comp_len_prev;
+ struct zcomp_strm *zstrm;
+ void *src, *dst;
+ int ret;
+
+ handle_prev = zram_get_handle(zram, index);
+ if (!handle_prev)
+ return -EINVAL;
+
+ comp_len_prev = zram_get_obj_size(zram, index);
+ /*
+ * Do not recompress objects that are already "small enough".
+ */
+ if (comp_len_prev < size_watermark)
+ return 0;
+
+ ret = zram_read_from_zspool(zram, page, index);
+ if (ret)
+ return ret;
+
+ zstrm = zcomp_stream_get(zram->comps[ZRAM_SECONDARY_ZCOMP]);
+ src = kmap_atomic(page);
+ ret = zcomp_compress(zstrm, src, &comp_len_next);
+ kunmap_atomic(src);
+
+ /*
+ * Either a compression error or we failed to compressed the object
+ * in a way that will save us memory. Mark the object so that we
+ * don't attempt to re-compress it again (RECOMP_SKIP).
+ */
+ if (comp_len_next >= huge_class_size ||
+ comp_len_next >= comp_len_prev ||
+ ret) {
+ zram_set_flag(zram, index, ZRAM_RECOMP_SKIP);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
+ return ret;
+ }
+
+ /*
+ * No direct reclaim (slow path) for handle allocation and no
+ * re-compression attempt (unlike in __zram_bvec_write()) since
+ * we already have stored that object in zsmalloc. If we cannot
+ * alloc memory for recompressed object then we bail out and
+ * simply keep the old (existing) object in zsmalloc.
+ */
+ handle_next = zs_malloc(zram->mem_pool, comp_len_next,
+ __GFP_KSWAPD_RECLAIM |
+ __GFP_NOWARN |
+ __GFP_HIGHMEM |
+ __GFP_MOVABLE);
+ if (IS_ERR((void *)handle_next)) {
+ zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
+ return PTR_ERR((void *)handle_next);
+ }
+
+ dst = zs_map_object(zram->mem_pool, handle_next, ZS_MM_WO);
+ memcpy(dst, zstrm->buffer, comp_len_next);
+ zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
+
+ zs_unmap_object(zram->mem_pool, handle_next);
+
+ zram_free_page(zram, index);
+ zram_set_handle(zram, index, handle_next);
+ zram_set_obj_size(zram, index, comp_len_next);
+
+ zram_set_flag(zram, index, ZRAM_RECOMP);
+ atomic64_add(comp_len_next, &zram->stats.compr_data_size);
+ atomic64_inc(&zram->stats.pages_stored);
+
+ return 0;
+}
+
+#define RECOMPRESS_IDLE (1 << 0)
+#define RECOMPRESS_HUGE (1 << 1)
+
+static ssize_t recompress_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ unsigned long index;
+ struct page *page;
+ ssize_t ret;
+ int mode, size_watermark = 0;
+
+ if (sysfs_streq(buf, "idle")) {
+ mode = RECOMPRESS_IDLE;
+ } else if (sysfs_streq(buf, "huge")) {
+ mode = RECOMPRESS_HUGE;
+ } else if (sysfs_streq(buf, "huge_idle")) {
+ mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
+ } else {
+ /*
+ * We will re-compress only idle objects equal or greater
+ * in size than watermark.
+ */
+ ret = kstrtoint(buf, 10, &size_watermark);
+ if (ret)
+ return ret;
+ mode = RECOMPRESS_IDLE;
+ }
+
+ if (size_watermark > PAGE_SIZE)
+ return -EINVAL;
+
+ down_read(&zram->init_lock);
+ if (!init_done(zram)) {
+ ret = -EINVAL;
+ goto release_init_lock;
+ }
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page) {
+ ret = -ENOMEM;
+ goto release_init_lock;
+ }
+
+ ret = len;
+ for (index = 0; index < nr_pages; index++) {
+ int err = 0;
+
+ zram_slot_lock(zram, index);
+
+ if (!zram_allocated(zram, index))
+ goto next;
+
+ if (mode & RECOMPRESS_IDLE &&
+ !zram_test_flag(zram, index, ZRAM_IDLE))
+ goto next;
+
+ if (mode & RECOMPRESS_HUGE &&
+ !zram_test_flag(zram, index, ZRAM_HUGE))
+ goto next;
+
+ if (zram_test_flag(zram, index, ZRAM_WB) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_RECOMP) ||
+ zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
+ goto next;
+
+ err = zram_recompress(zram, index, page, size_watermark);
+next:
+ zram_slot_unlock(zram, index);
+ if (err) {
+ ret = err;
+ break;
+ }
+
+ cond_resched();
+ }
+
+ __free_page(page);
+
+release_init_lock:
+ up_read(&zram->init_lock);
+ return ret;
+}
+#endif
+
/*
* zram_bio_discard - handler on discard request
* @index: physical block index in PAGE_SIZE units
@@ -1983,6 +2171,7 @@ static DEVICE_ATTR_RW(writeback_limit_enable);
#endif
#ifdef CONFIG_ZRAM_MULTI_COMP
static DEVICE_ATTR_RW(recomp_algorithm);
+static DEVICE_ATTR_WO(recompress);
#endif

static struct attribute *zram_disk_attrs[] = {
@@ -2009,6 +2198,7 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_debug_stat.attr,
#ifdef CONFIG_ZRAM_MULTI_COMP
&dev_attr_recomp_algorithm.attr,
+ &dev_attr_recompress.attr,
#endif
NULL,
};
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 4044ddbb2326..09b9ceb5dfa3 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -49,6 +49,8 @@ enum zram_pageflags {
ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */
+ ZRAM_RECOMP, /* page was recompressed */
+ ZRAM_RECOMP_SKIP, /* secondary algorithm cannot compress this page */

__NR_ZRAM_PAGEFLAGS,
};
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:58:51

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 5/9] documentation: Add recompression documentation

Document user-space visible device attributes that
are enabled by ZRAM_MULTI_COMP.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
Documentation/admin-guide/blockdev/zram.rst | 55 +++++++++++++++++++++
1 file changed, 55 insertions(+)

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index c73b16930449..c916c2b9da55 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -401,6 +401,61 @@ budget in next setting is user's job.
If admin wants to measure writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.

+recompression
+-------------
+
+With CONFIG_ZRAM_MULTI_COMP, zram can recompress idle/huge pages using
+alternative (secondary) compression algorithm. The basic idea is that
+alternative compression algorithm can provide better compression ratio
+at a price of (potentially) slower compression/decompression speeds.
+Alternative compression algorithm can, for example, be more successful
+compressing huge pages (those that default algorithm failed to compress).
+Another application is idle pages recompression - pages that are cold and
+sit in the memory can be recompressed using more effective algorithm and,
+hence, reduce zsmalloc memory usage.
+
+With CONFIG_ZRAM_MULTI_COMP, zram will setup two compression algorithms
+per-CPU: primary and secondary ones. Primary zram compressor is explained
+in "3) Select compression algorithm", the secondary algorithm is configured
+in a similar way, using recomp_algorithm device attribute:
+
+Examples::
+
+ #show supported recompression algorithms
+ cat /sys/block/zramX/recomp_algorithm
+ zstd [lzo]
+
+ #select zstd recompression algorithm
+ echo zstd > /sys/block/zramX/recomp_algorithm
+
+Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
+which controls recompression:
+
+Examples::
+
+ #IDLE pages recompression is activated by `idle` mode
+ echo idle > /sys/block/zramX/recompress
+
+ #HUGE pages recompression is activated by `huge` mode
+ echo huge > /sys/block/zram0/recompress
+
+ #HUGE_IDLE pages recompression is activated by `huge_idle` mode
+ echo huge_idle > /sys/block/zramX/recompress
+
+The number of idle pages can be significant, so user-space can pass a size
+watermark value (in bytes) to the recompress knob, to filter out idle pages
+for recompression: zram will recompress only idle pages of equal or greater
+size:::
+
+ #recompress idle pages larger than 3000 bytes
+ echo 3000 > /sys/block/zramX/recompress
+
+ #recompress idle pages larger than 2000 bytes
+ echo 2000 > /sys/block/zramX/recompress
+
+Recompression is mostly focused on idle pages (except for huge pages
+recompression), so it works better in conjunction with memory tracking.
+
memory tracking
===============

--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:59:04

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 6/9] zram: Add recompression algorithm choice to Kconfig

Make (secondary) recompression algorithm selectable just like
we do it for the (primary) default one.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/Kconfig | 40 +++++++++++++++++++++++++++++++++++
drivers/block/zram/zram_drv.c | 2 +-
2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 3e00656a6f8a..076a76cd1664 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -93,3 +93,43 @@ config ZRAM_MULTI_COMP

SIZE (in bytes) parameter sets the object size watermark: idle
objects that are of a smaller size will not get recompressed.
+
+choice
+ prompt "Default zram recompression algorithm"
+ default ZRAM_DEF_RECOMP_ZSTD
+ depends on ZRAM && ZRAM_MULTI_COMP
+
+config ZRAM_DEF_RECOMP_LZORLE
+ bool "lzo-rle"
+ depends on CRYPTO_LZO
+
+config ZRAM_DEF_RECOMP_ZSTD
+ bool "zstd"
+ depends on CRYPTO_ZSTD
+
+config ZRAM_DEF_RECOMP_LZ4
+ bool "lz4"
+ depends on CRYPTO_LZ4
+
+config ZRAM_DEF_RECOMP_LZO
+ bool "lzo"
+ depends on CRYPTO_LZO
+
+config ZRAM_DEF_RECOMP_LZ4HC
+ bool "lz4hc"
+ depends on CRYPTO_LZ4HC
+
+config ZRAM_DEF_RECOMP_842
+ bool "842"
+ depends on CRYPTO_842
+
+endchoice
+
+config ZRAM_DEF_RECOMP
+ string
+ default "lzo-rle" if ZRAM_DEF_RECOMP_LZORLE
+ default "zstd" if ZRAM_DEF_RECOMP_ZSTD
+ default "lz4" if ZRAM_DEF_RECOMP_LZ4
+ default "lzo" if ZRAM_DEF_RECOMP_LZO
+ default "lz4hc" if ZRAM_DEF_RECOMP_LZ4HC
+ default "842" if ZRAM_DEF_RECOMP_842
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index da11560ecf70..1e9561217466 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,7 +44,7 @@ static int zram_major;
static const char *default_comp_algs[ZRAM_MAX_ZCOMPS] = {
CONFIG_ZRAM_DEF_COMP,
#ifdef CONFIG_ZRAM_MULTI_COMP
- "zstd",
+ CONFIG_ZRAM_DEF_RECOMP,
#endif
};

--
2.38.0.413.g74048e4d9e-goog

2022-10-18 04:59:51

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 7/9] zram: Add recompress flag to read_block_state()

Add a new flag to zram block state that shows if the page
was recompressed (using alternative compression algorithm).

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
Documentation/admin-guide/blockdev/zram.rst | 9 ++++++---
drivers/block/zram/zram_drv.c | 5 +++--
2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index c916c2b9da55..010fb05a5999 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -466,9 +466,10 @@ pages of the process with*pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows::

- 300 75.033841 .wh.
- 301 63.806904 s...
- 302 63.806919 ..hi
+ 300 75.033841 .wh..
+ 301 63.806904 s....
+ 302 63.806919 ..hi.
+ 303 62.801919 ....r

First column
zram's block index.
@@ -485,6 +486,8 @@ Third column
huge page
i:
idle page
+ r:
+ recompressed page (secondary compression algorithm)

First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1e9561217466..5d760467e0bc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -925,13 +925,14 @@ static ssize_t read_block_state(struct file *file, char __user *buf,

ts = ktime_to_timespec64(zram->table[index].ac_time);
copied = snprintf(kbuf + written, count,
- "%12zd %12lld.%06lu %c%c%c%c\n",
+ "%12zd %12lld.%06lu %c%c%c%c%c\n",
index, (s64)ts.tv_sec,
ts.tv_nsec / NSEC_PER_USEC,
zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
- zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.');
+ zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.',
+ zram_test_flag(zram, index, ZRAM_RECOMP) ? 'r' : '.');

if (count <= copied) {
zram_slot_unlock(zram, index);
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 06:03:45

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 9/9] zram: Use IS_ERR_VALUE() to check for zs_malloc() errors

Avoid type casts that are needed for IS_ERR() and use
IS_ERR_VALUE() instead.

Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zram_drv.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6b56f6f2ce82..422e3cfd2ddc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1506,19 +1506,19 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
* if we have a 'non-null' handle here then we are coming
* from the slow path and handle has already been allocated.
*/
- if (IS_ERR((void *)handle))
+ if (IS_ERR_VALUE(handle))
handle = zs_malloc(zram->mem_pool, comp_len,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
__GFP_HIGHMEM |
__GFP_MOVABLE);
- if (IS_ERR((void *)handle)) {
+ if (IS_ERR_VALUE(handle)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
__GFP_MOVABLE);
- if (IS_ERR((void *)handle))
+ if (IS_ERR_VALUE(handle))
return PTR_ERR((void *)handle);

if (comp_len != PAGE_SIZE)
@@ -1685,7 +1685,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
__GFP_NOWARN |
__GFP_HIGHMEM |
__GFP_MOVABLE);
- if (IS_ERR((void *)handle_next)) {
+ if (IS_ERR_VALUE(handle_next)) {
zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
return PTR_ERR((void *)handle_next);
}
--
2.38.0.413.g74048e4d9e-goog

2022-10-18 06:04:05

by Sergey Senozhatsky

[permalink] [raw]
Subject: [PATCHv4 8/9] zram: Clarify writeback_store() comment

Re-phrase writeback BIO error comment.

Reported-by: Andrew Morton <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zram_drv.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5d760467e0bc..6b56f6f2ce82 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -758,8 +758,12 @@ static ssize_t writeback_store(struct device *dev,
zram_clear_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
/*
- * Return last IO error unless every IO were
- * not suceeded.
+ * BIO errors are not fatal, we continue and simply
+ * attempt to writeback the remaining objects (pages).
+ * At the same time we need to signal user-space that
+ * some writes (at least one, but also could be all of
+ * them) were not successful and we do so by returning
+ * the most recent BIO error.
*/
ret = err;
continue;
--
2.38.0.413.g74048e4d9e-goog

2022-11-02 20:21:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 1/9] zram: Preparation for multi-zcomp support

On Tue, Oct 18, 2022 at 01:55:25PM +0900, Sergey Senozhatsky wrote:
> The patch turns compression streams and compressor algorithm
> name struct zram members into arrays, so that we can have
> multiple compression streams support (in the next patches).
>
> The patch uses a rather explicit API for compressor selection:
>
> - Get primary (default) compression stream
> zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP])
> - Get secondary compression stream
> zcomp_stream_get(zram->comps[ZRAM_SECONDARY_ZCOMP])
>
> We use similar API for compression streams put().
>
> At this point we always have just one compression stream,
> since CONFIG_ZRAM_MULTI_COMP is not yet defined.
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> drivers/block/zram/zcomp.c | 6 +--
> drivers/block/zram/zcomp.h | 2 +-
> drivers/block/zram/zram_drv.c | 87 ++++++++++++++++++++++++-----------
> drivers/block/zram/zram_drv.h | 14 +++++-
> 4 files changed, 77 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index 0916de952e09..55af4efd7983 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -206,7 +206,7 @@ void zcomp_destroy(struct zcomp *comp)
> * case of allocation error, or any other error potentially
> * returned by zcomp_init().
> */
> -struct zcomp *zcomp_create(const char *compress)
> +struct zcomp *zcomp_create(const char *alg)
> {
> struct zcomp *comp;
> int error;
> @@ -216,14 +216,14 @@ struct zcomp *zcomp_create(const char *compress)
> * is not loaded yet. We must do it here, otherwise we are about to
> * call /sbin/modprobe under CPU hot-plug lock.
> */
> - if (!zcomp_available_algorithm(compress))
> + if (!zcomp_available_algorithm(alg))
> return ERR_PTR(-EINVAL);
>
> comp = kzalloc(sizeof(struct zcomp), GFP_KERNEL);
> if (!comp)
> return ERR_PTR(-ENOMEM);
>
> - comp->name = compress;
> + comp->name = alg;
> error = zcomp_init(comp);
> if (error) {
> kfree(comp);
> diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
> index 40f6420f4b2e..cdefdef93da8 100644
> --- a/drivers/block/zram/zcomp.h
> +++ b/drivers/block/zram/zcomp.h
> @@ -27,7 +27,7 @@ int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node);
> ssize_t zcomp_available_show(const char *comp, char *buf);
> bool zcomp_available_algorithm(const char *comp);
>
> -struct zcomp *zcomp_create(const char *comp);
> +struct zcomp *zcomp_create(const char *alg);
> void zcomp_destroy(struct zcomp *comp);
>
> struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 966aab902d19..770ea3489eb6 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1007,36 +1007,53 @@ static ssize_t comp_algorithm_show(struct device *dev,
> struct zram *zram = dev_to_zram(dev);
>
> down_read(&zram->init_lock);
> - sz = zcomp_available_show(zram->compressor, buf);
> + sz = zcomp_available_show(zram->comp_algs[ZRAM_PRIMARY_ZCOMP], buf);
> up_read(&zram->init_lock);
>
> return sz;
> }
>
> +static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
> +{
> + /* Do not kfree() algs that we didn't allocate, IOW the default ones */
> + if (zram->comp_algs[idx] != default_compressor)
> + kfree(zram->comp_algs[idx]);
> + zram->comp_algs[idx] = alg;
> +}
> +
> static ssize_t comp_algorithm_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
> {
> struct zram *zram = dev_to_zram(dev);
> - char compressor[ARRAY_SIZE(zram->compressor)];
> + char *compressor;
> size_t sz;
>
> - strscpy(compressor, buf, sizeof(compressor));
> + sz = strlen(buf);
> + if (sz >= CRYPTO_MAX_ALG_NAME)
> + return -E2BIG;
> +
> + compressor = kstrdup(buf, GFP_KERNEL);
> + if (!compressor)
> + return -ENOMEM;
> +
> /* ignore trailing newline */
> - sz = strlen(compressor);
> if (sz > 0 && compressor[sz - 1] == '\n')
> compressor[sz - 1] = 0x00;
>
> - if (!zcomp_available_algorithm(compressor))
> + if (!zcomp_available_algorithm(compressor)) {
> + kfree(compressor);
> return -EINVAL;
> + }
>
> down_write(&zram->init_lock);
> if (init_done(zram)) {
> up_write(&zram->init_lock);
> + kfree(compressor);
> pr_info("Can't change algorithm for initialized device\n");
> return -EBUSY;
> }
>
> - strcpy(zram->compressor, compressor);
> + comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, compressor);
> up_write(&zram->init_lock);
> return len;
> }
> @@ -1284,7 +1301,7 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> size = zram_get_obj_size(zram, index);
>
> if (size != PAGE_SIZE)
> - zstrm = zcomp_stream_get(zram->comp);
> + zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
>
> src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
> if (size == PAGE_SIZE) {
> @@ -1296,7 +1313,7 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> dst = kmap_atomic(page);
> ret = zcomp_decompress(zstrm, src, size, dst);
> kunmap_atomic(dst);
> - zcomp_stream_put(zram->comp);
> + zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> }
> zs_unmap_object(zram->mem_pool, handle);
> zram_slot_unlock(zram, index);
> @@ -1363,13 +1380,13 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
> kunmap_atomic(mem);
>
> compress_again:
> - zstrm = zcomp_stream_get(zram->comp);
> + zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> src = kmap_atomic(page);
> ret = zcomp_compress(zstrm, src, &comp_len);
> kunmap_atomic(src);
>
> if (unlikely(ret)) {
> - zcomp_stream_put(zram->comp);
> + zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> pr_err("Compression failed! err=%d\n", ret);
> zs_free(zram->mem_pool, handle);
> return ret;
> @@ -1397,7 +1414,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
> __GFP_HIGHMEM |
> __GFP_MOVABLE);
> if (IS_ERR((void *)handle)) {
> - zcomp_stream_put(zram->comp);
> + zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> atomic64_inc(&zram->stats.writestall);
> handle = zs_malloc(zram->mem_pool, comp_len,
> GFP_NOIO | __GFP_HIGHMEM |
> @@ -1414,14 +1431,14 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
> * zstrm buffer back. It is necessary that the dereferencing
> * of the zstrm variable below occurs correctly.
> */
> - zstrm = zcomp_stream_get(zram->comp);
> + zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> }
>
> alloced_pages = zs_get_total_pages(zram->mem_pool);
> update_used_max(zram, alloced_pages);
>
> if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> - zcomp_stream_put(zram->comp);
> + zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> zs_free(zram->mem_pool, handle);
> return -ENOMEM;
> }
> @@ -1435,7 +1452,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
> if (comp_len == PAGE_SIZE)
> kunmap_atomic(src);
>
> - zcomp_stream_put(zram->comp);
> + zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> zs_unmap_object(zram->mem_pool, handle);
> atomic64_add(comp_len, &zram->stats.compr_data_size);
> out:
> @@ -1710,6 +1727,20 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
> return ret;
> }
>
> +static void zram_destroy_comps(struct zram *zram)
> +{
> + u32 idx;
> +
> + for (idx = 0; idx < ZRAM_MAX_ZCOMPS; idx++) {
> + struct zcomp *comp = zram->comps[idx];
> +
> + zram->comps[idx] = NULL;
> + if (IS_ERR_OR_NULL(comp))

nit:

Why don't you use just NULL check? I don't see any error setting
for zram->comps(Maybe later patch? Will keep check)?

2022-11-02 20:45:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 0/9] zram: Support multiple compression streams

On Tue, Oct 18, 2022 at 01:55:24PM +0900, Sergey Senozhatsky wrote:
> Hello,
>
> This series adds support for multiple (per-CPU)
> compression streams (at point only 2). The main idea is that
> different compression algorithms have different characteristics
> and zram may benefit when it uses a combination of algorithms:
> a default algorithm that is faster but have lower compression
> rate and a secondary algorithm that can use higher compression
> rate at a price of slower compression/decompression.
>
> There are several use-case for this functionality:
>
> - huge pages re-compression: zstd or deflate can successfully
> compress huge pages (~50% of huge pages on my synthetic ChromeOS
> tests), IOW pages that lzo was not able to compress.
>
> - idle pages re-compression: idle/cold pages sit in the memory
> and we may reduce zsmalloc memory usage if we recompress those
> idle pages.
>
> User-space has a number of ways to control the behavior
> and impact of zram recompression: what type of pages should be
> recompressed, size watermarks, etc. Please refer to documentation
> patch.

Hi Sergey,

First of all, I am really sorry to attend the party too late.

I absolutely agree the feature is really useful and even I am
thinking to support multiple comression trials on the fly for
future. So I'd like to introduce the feature more general shape
to be extended later so review will go.

Thanks!

2022-11-02 20:49:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 3/9] zram: Factor out WB and non-WB zram read functions

On Tue, Oct 18, 2022 at 01:55:27PM +0900, Sergey Senozhatsky wrote:
> We will use non-WB variant in ZRAM page recompression path.
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 73 ++++++++++++++++++++++++-----------
> 1 file changed, 50 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a8ef3c0c3dae..94c62d7ea818 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1314,8 +1314,30 @@ static void zram_free_page(struct zram *zram, size_t index)
> ~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
> }
>
> -static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> - struct bio *bio, bool partial_io)
> +/*
> + * Reads a page from the writeback devices. Corresponding ZRAM slot
> + * should be unlocked.
> + */
> +static int zram_read_from_writeback(struct zram *zram, struct page *page,

How about zram_read_from_bdev?

2022-11-02 21:16:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On Tue, Oct 18, 2022 at 01:55:28PM +0900, Sergey Senozhatsky wrote:
> Allow zram to recompress (using secondary compression streams)
> pages. We support three modes:
>
> 1) IDLE pages recompression is activated by `idle` mode
>
> echo idle > /sys/block/zram0/recompress
>
> 2) Since there may be many idle pages user-space may pass a size
> watermark value (in bytes) and we will recompress IDLE pages only
> of equal or greater size:
>
> echo 888 > /sys/block/zram0/recompress

Hmm, how about having seperate knob for threshold?

recompress_threshold

With that, we could make rescompress 888 and idle/huge
as well as only 888.

echo 888 > /sys/block/zram0/recompress_threshold
echo 1 > /sys/block/zram0/recompress

or

echo 888 > /sys/block/zram0/recompress_threshold
echo idle > /sys/block/zram0/recompress

or we can introduce the threshold with action item.

echo "idle 888" > /sys/block/zram0/recompress
echo "huge 888" > /sys/block/zram0/recompress
echo "normal 888" > /sys/block/zram0/recompress

>
> 3) HUGE pages recompression is activated by `huge` mode
>
> echo huge > /sys/block/zram0/recompress
>
> 4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
>
> echo huge_idle > /sys/block/zram0/recompress
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> drivers/block/zram/Kconfig | 15 +++
> drivers/block/zram/zram_drv.c | 196 +++++++++++++++++++++++++++++++++-
> drivers/block/zram/zram_drv.h | 2 +
> 3 files changed, 210 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
> index d4100b0c083e..3e00656a6f8a 100644
> --- a/drivers/block/zram/Kconfig
> +++ b/drivers/block/zram/Kconfig
> @@ -78,3 +78,18 @@ config ZRAM_MEMORY_TRACKING
> /sys/kernel/debug/zram/zramX/block_state.
>
> See Documentation/admin-guide/blockdev/zram.rst for more information.
> +
> +config ZRAM_MULTI_COMP
> + bool "Enable multiple per-CPU compression streams"

per-CPU is implementation detail. Let's do not mention it.

> + depends on ZRAM
> + help
> + This will enable per-CPU multi-compression streams, so that ZRAM

indentation

> + can re-compress IDLE/huge pages, using a potentially slower but
> + more effective compression algorithm. Note, that IDLE page support
> + requires ZRAM_MEMORY_TRACKING.
> +
> + echo TIMEOUT > /sys/block/zramX/idle
> + echo SIZE > /sys/block/zramX/recompress
> +
> + SIZE (in bytes) parameter sets the object size watermark: idle
> + objects that are of a smaller size will not get recompressed.
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 94c62d7ea818..da11560ecf70 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1282,6 +1282,12 @@ static void zram_free_page(struct zram *zram, size_t index)
> atomic64_dec(&zram->stats.huge_pages);
> }
>
> + if (zram_test_flag(zram, index, ZRAM_RECOMP))
> + zram_clear_flag(zram, index, ZRAM_RECOMP);
> +
> + if (zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
> + zram_clear_flag(zram, index, ZRAM_RECOMP_SKIP);

Let's squeeze the comp algo index into meta area since we have
some rooms for the bits. Then can we could remove the specific
recomp two flags?

I am thinking the feature to work incoming pages on the fly,
not only for recompress manual knob so it would be good
if we could make the interface abstract to work something
like this(I may miss something why we couldn't. I need more
time to look into then)

zram_bvec_write:

for (i = 0; i < MAX_COMP_ALGO; i++) {
zstrm = zcomp_stream_get(i);
zcomp_compress(src, &comp_len);
if (comp_len > threshold) {
zcomp_stream_put(i);
continue;
}
break;
}

zram_bvec_read:
algo_idx = zram_get_algo_idx(zram, index);
zstrm = zcomp_stream_get(zram, algo_idx);
zcomp_decompress(zstrm);
zcomp_stream_put(zram, algo_idx);


> +
> if (zram_test_flag(zram, index, ZRAM_WB)) {
> zram_clear_flag(zram, index, ZRAM_WB);
> free_block_bdev(zram, zram_get_element(zram, index));
> @@ -1343,6 +1349,7 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
> unsigned long handle;
> unsigned int size;
> void *src, *dst;
> + u32 idx;
> int ret;
>
> handle = zram_get_handle(zram, index);
> @@ -1359,8 +1366,13 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
>
> size = zram_get_obj_size(zram, index);
>
> - if (size != PAGE_SIZE)
> - zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> + if (size != PAGE_SIZE) {
> + idx = ZRAM_PRIMARY_ZCOMP;
> + if (zram_test_flag(zram, index, ZRAM_RECOMP))
> + idx = ZRAM_SECONDARY_ZCOMP;
> +
> + zstrm = zcomp_stream_get(zram->comps[idx]);
> + }
>
> src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
> if (size == PAGE_SIZE) {
> @@ -1372,7 +1384,7 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
> dst = kmap_atomic(page);
> ret = zcomp_decompress(zstrm, src, size, dst);
> kunmap_atomic(dst);
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_ZCOMP]);
> + zcomp_stream_put(zram->comps[idx]);
> }
> zs_unmap_object(zram->mem_pool, handle);
> return ret;
> @@ -1603,6 +1615,182 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
> return ret;
> }
>
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> +/*
> + * This function will decompress (unless it's ZRAM_HUGE) the page and then
> + * attempt to compress it using secondary compression algorithm (which is
> + * potentially more effective).
> + *
> + * Corresponding ZRAM slot should be locked.
> + */
> +static int zram_recompress(struct zram *zram, u32 index, struct page *page,
> + int size_watermark)
> +{
> + unsigned long handle_prev;
> + unsigned long handle_next;
> + unsigned int comp_len_next;
> + unsigned int comp_len_prev;

How about orig_handle and new_nandle with orig_comp_len and new_comp_len?

> + struct zcomp_strm *zstrm;
> + void *src, *dst;
> + int ret;
> +
> + handle_prev = zram_get_handle(zram, index);
> + if (!handle_prev)
> + return -EINVAL;
> +
> + comp_len_prev = zram_get_obj_size(zram, index);
> + /*
> + * Do not recompress objects that are already "small enough".
> + */
> + if (comp_len_prev < size_watermark)
> + return 0;
> +
> + ret = zram_read_from_zspool(zram, page, index);
> + if (ret)
> + return ret;
> +
> + zstrm = zcomp_stream_get(zram->comps[ZRAM_SECONDARY_ZCOMP]);
> + src = kmap_atomic(page);
> + ret = zcomp_compress(zstrm, src, &comp_len_next);
> + kunmap_atomic(src);
> +
> + /*
> + * Either a compression error or we failed to compressed the object
> + * in a way that will save us memory. Mark the object so that we
> + * don't attempt to re-compress it again (RECOMP_SKIP).
> + */
> + if (comp_len_next >= huge_class_size ||
> + comp_len_next >= comp_len_prev ||
> + ret) {
> + zram_set_flag(zram, index, ZRAM_RECOMP_SKIP);
> + zram_clear_flag(zram, index, ZRAM_IDLE);
> + zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
> + return ret;
> + }
> +
> + /*
> + * No direct reclaim (slow path) for handle allocation and no
> + * re-compression attempt (unlike in __zram_bvec_write()) since
> + * we already have stored that object in zsmalloc. If we cannot
> + * alloc memory for recompressed object then we bail out and
> + * simply keep the old (existing) object in zsmalloc.
> + */
> + handle_next = zs_malloc(zram->mem_pool, comp_len_next,
> + __GFP_KSWAPD_RECLAIM |
> + __GFP_NOWARN |
> + __GFP_HIGHMEM |
> + __GFP_MOVABLE);
> + if (IS_ERR((void *)handle_next)) {
> + zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
> + return PTR_ERR((void *)handle_next);
> + }
> +
> + dst = zs_map_object(zram->mem_pool, handle_next, ZS_MM_WO);
> + memcpy(dst, zstrm->buffer, comp_len_next);
> + zcomp_stream_put(zram->comps[ZRAM_SECONDARY_ZCOMP]);
> +
> + zs_unmap_object(zram->mem_pool, handle_next);
> +
> + zram_free_page(zram, index);
> + zram_set_handle(zram, index, handle_next);
> + zram_set_obj_size(zram, index, comp_len_next);
> +
> + zram_set_flag(zram, index, ZRAM_RECOMP);
> + atomic64_add(comp_len_next, &zram->stats.compr_data_size);
> + atomic64_inc(&zram->stats.pages_stored);
> +
> + return 0;
> +}
> +
> +#define RECOMPRESS_IDLE (1 << 0)
> +#define RECOMPRESS_HUGE (1 << 1)
> +
> +static ssize_t recompress_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct zram *zram = dev_to_zram(dev);
> + unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> + unsigned long index;
> + struct page *page;
> + ssize_t ret;
> + int mode, size_watermark = 0;
> +
> + if (sysfs_streq(buf, "idle")) {
> + mode = RECOMPRESS_IDLE;
> + } else if (sysfs_streq(buf, "huge")) {
> + mode = RECOMPRESS_HUGE;
> + } else if (sysfs_streq(buf, "huge_idle")) {
> + mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
> + } else {
> + /*
> + * We will re-compress only idle objects equal or greater
> + * in size than watermark.
> + */
> + ret = kstrtoint(buf, 10, &size_watermark);
> + if (ret)
> + return ret;
> + mode = RECOMPRESS_IDLE;
> + }
> +
> + if (size_watermark > PAGE_SIZE)

nit: How about threshold instead of watermark?

> + return -EINVAL;
> +
> + down_read(&zram->init_lock);
> + if (!init_done(zram)) {
> + ret = -EINVAL;
> + goto release_init_lock;
> + }
> +
> + page = alloc_page(GFP_KERNEL);
> + if (!page) {
> + ret = -ENOMEM;
> + goto release_init_lock;
> + }
> +
> + ret = len;
> + for (index = 0; index < nr_pages; index++) {
> + int err = 0;
> +
> + zram_slot_lock(zram, index);
> +
> + if (!zram_allocated(zram, index))
> + goto next;
> +
> + if (mode & RECOMPRESS_IDLE &&
> + !zram_test_flag(zram, index, ZRAM_IDLE))
> + goto next;
> +
> + if (mode & RECOMPRESS_HUGE &&
> + !zram_test_flag(zram, index, ZRAM_HUGE))
> + goto next;
> +
> + if (zram_test_flag(zram, index, ZRAM_WB) ||
> + zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
> + zram_test_flag(zram, index, ZRAM_SAME) ||
> + zram_test_flag(zram, index, ZRAM_RECOMP) ||
> + zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
> + goto next;
> +
> + err = zram_recompress(zram, index, page, size_watermark);
> +next:
> + zram_slot_unlock(zram, index);
> + if (err) {
> + ret = err;
> + break;
> + }
> +
> + cond_resched();
> + }
> +
> + __free_page(page);
> +
> +release_init_lock:
> + up_read(&zram->init_lock);
> + return ret;
> +}
> +#endif
> +
> /*
> * zram_bio_discard - handler on discard request
> * @index: physical block index in PAGE_SIZE units
> @@ -1983,6 +2171,7 @@ static DEVICE_ATTR_RW(writeback_limit_enable);
> #endif
> #ifdef CONFIG_ZRAM_MULTI_COMP
> static DEVICE_ATTR_RW(recomp_algorithm);
> +static DEVICE_ATTR_WO(recompress);
> #endif
>
> static struct attribute *zram_disk_attrs[] = {
> @@ -2009,6 +2198,7 @@ static struct attribute *zram_disk_attrs[] = {
> &dev_attr_debug_stat.attr,
> #ifdef CONFIG_ZRAM_MULTI_COMP
> &dev_attr_recomp_algorithm.attr,
> + &dev_attr_recompress.attr,
> #endif
> NULL,
> };
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 4044ddbb2326..09b9ceb5dfa3 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -49,6 +49,8 @@ enum zram_pageflags {
> ZRAM_UNDER_WB, /* page is under writeback */
> ZRAM_HUGE, /* Incompressible page */
> ZRAM_IDLE, /* not accessed page since last idle marking */
> + ZRAM_RECOMP, /* page was recompressed */
> + ZRAM_RECOMP_SKIP, /* secondary algorithm cannot compress this page */
>
> __NR_ZRAM_PAGEFLAGS,
> };
> --
> 2.38.0.413.g74048e4d9e-goog
>
>

2022-11-02 21:26:12

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Tue, Oct 18, 2022 at 01:55:26PM +0900, Sergey Senozhatsky wrote:
> Introduce recomp_algorithm sysfs knob that controls
> secondary algorithm selection used for recompression.
> This device attribute works in a similar way with
> comp_algorithm attribute.
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 111 +++++++++++++++++++++++++++-------
> 1 file changed, 90 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 770ea3489eb6..a8ef3c0c3dae 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,7 +41,12 @@ static DEFINE_IDR(zram_index_idr);
> static DEFINE_MUTEX(zram_index_mutex);
>
> static int zram_major;
> -static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
> +static const char *default_comp_algs[ZRAM_MAX_ZCOMPS] = {
> + CONFIG_ZRAM_DEF_COMP,
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> + "zstd",
> +#endif
> +};
>
> /* Module params (documentation at end) */
> static unsigned int num_devices = 1;
> @@ -1000,31 +1005,37 @@ static ssize_t max_comp_streams_store(struct device *dev,
> return len;
> }
>
> -static ssize_t comp_algorithm_show(struct device *dev,
> - struct device_attribute *attr, char *buf)

Do you have any reason to change show and set placement? Otherwise,
please keep the function order to reduce unnecesssary churns.

> +static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
> {
> - size_t sz;
> - struct zram *zram = dev_to_zram(dev);
> + bool default_alg = false;
> + int i;
>
> - down_read(&zram->init_lock);
> - sz = zcomp_available_show(zram->comp_algs[ZRAM_PRIMARY_ZCOMP], buf);
> - up_read(&zram->init_lock);
> + /* Do not kfree() algs that we didn't allocate, IOW the default ones */
> + for (i = 0; i < ZRAM_MAX_ZCOMPS; i++) {
> + if (zram->comp_algs[idx] == default_comp_algs[i]) {
> + default_alg = true;
> + break;
> + }
> + }
>
> - return sz;
> + if (!default_alg)
> + kfree(zram->comp_algs[idx]);
> + zram->comp_algs[idx] = alg;
> }
>
> -static void comp_algorithm_set(struct zram *zram, u32 idx, const char *alg)
> +static ssize_t __comp_algorithm_show(struct zram *zram, u32 idx, char *buf)
> {
> - /* Do not kfree() algs that we didn't allocate, IOW the default ones */
> - if (zram->comp_algs[idx] != default_compressor)
> - kfree(zram->comp_algs[idx]);
> - zram->comp_algs[idx] = alg;
> + ssize_t sz;
> +
> + down_read(&zram->init_lock);
> + sz = zcomp_available_show(zram->comp_algs[idx], buf);
> + up_read(&zram->init_lock);
> +
> + return sz;
> }
>
> -static ssize_t comp_algorithm_store(struct device *dev,
> - struct device_attribute *attr, const char *buf, size_t len)
> +static int __comp_algorithm_store(struct zram *zram, u32 idx, const char *buf)
> {
> - struct zram *zram = dev_to_zram(dev);
> char *compressor;
> size_t sz;
>
> @@ -1053,11 +1064,55 @@ static ssize_t comp_algorithm_store(struct device *dev,
> return -EBUSY;
> }
>
> - comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, compressor);
> + comp_algorithm_set(zram, idx, compressor);
> up_write(&zram->init_lock);
> - return len;
> + return 0;
> +}
> +
> +static ssize_t comp_algorithm_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct zram *zram = dev_to_zram(dev);
> +
> + return __comp_algorithm_show(zram, ZRAM_PRIMARY_ZCOMP, buf);
> +}
> +
> +static ssize_t comp_algorithm_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf,
> + size_t len)
> +{
> + struct zram *zram = dev_to_zram(dev);
> + int ret;
> +
> + ret = __comp_algorithm_store(zram, ZRAM_PRIMARY_ZCOMP, buf);
> + return ret ? ret : len;
> }
>
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> +static ssize_t recomp_algorithm_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct zram *zram = dev_to_zram(dev);
> +
> + return __comp_algorithm_show(zram, ZRAM_SECONDARY_ZCOMP, buf);
> +}

Just open question(I might be too paranoid?)

I am thinking someone want to add third comp algorithm in future
to balance decompression and memory efficiency.

If it's not too crazy idea, let's think about the interface.
Maybe, could we make the recomp knobs works like list?

# A primary comp
echo "A" > /zram/comp_algo

# Multiple secondary comps
echo "B threshold" > /zram/add_recomp_algo
echo "C threshold" > /zram/add_recomp_algo
echo "D threshold" > /zram/add_recomp_algo

"cat /zram/recomp_algo" shows the list

echo "C" > /zram/remove_recomp_algo
will remove the C algorithm in stack.

My point is that we don't need to implement it atm but makes the
interface to open the possibility for future extension.

What do you think?

> +
> +static ssize_t recomp_algorithm_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf,
> + size_t len)
> +{
> + struct zram *zram = dev_to_zram(dev);
> + int ret;
> +
> + ret = __comp_algorithm_store(zram, ZRAM_SECONDARY_ZCOMP, buf);
> + return ret ? ret : len;
> +}
> +#endif
> +
> static ssize_t compact_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
> {
> @@ -1762,7 +1817,11 @@ static void zram_reset_device(struct zram *zram)
> memset(&zram->stats, 0, sizeof(zram->stats));
> reset_bdev(zram);
>
> - comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, default_compressor);
> + comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP,
> + default_comp_algs[ZRAM_PRIMARY_ZCOMP]);
> + if (IS_ENABLED(CONFIG_ZRAM_MULTI_COMP))

Dumb question:

Why do you use IS_ENABLED instead of ifdef?


> + comp_algorithm_set(zram, ZRAM_SECONDARY_ZCOMP,
> + default_comp_algs[ZRAM_SECONDARY_ZCOMP]);
> up_write(&zram->init_lock);
> }
>
> @@ -1895,6 +1954,9 @@ static DEVICE_ATTR_WO(writeback);
> static DEVICE_ATTR_RW(writeback_limit);
> static DEVICE_ATTR_RW(writeback_limit_enable);
> #endif
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> +static DEVICE_ATTR_RW(recomp_algorithm);
> +#endif
>
> static struct attribute *zram_disk_attrs[] = {
> &dev_attr_disksize.attr,
> @@ -1918,6 +1980,9 @@ static struct attribute *zram_disk_attrs[] = {
> &dev_attr_bd_stat.attr,
> #endif
> &dev_attr_debug_stat.attr,
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> + &dev_attr_recomp_algorithm.attr,
> +#endif
> NULL,
> };
>
> @@ -1997,7 +2062,11 @@ static int zram_add(void)
> if (ret)
> goto out_cleanup_disk;
>
> - zram->comp_algs[ZRAM_PRIMARY_ZCOMP] = default_compressor;
> + zram->comp_algs[ZRAM_PRIMARY_ZCOMP] =
> + default_comp_algs[ZRAM_PRIMARY_ZCOMP];
> + if (IS_ENABLED(CONFIG_ZRAM_MULTI_COMP))
> + zram->comp_algs[ZRAM_SECONDARY_ZCOMP] =
> + default_comp_algs[ZRAM_SECONDARY_ZCOMP];
>
> zram_debugfs_register(zram);
> pr_info("Added device: %s\n", zram->disk->disk_name);
> --
> 2.38.0.413.g74048e4d9e-goog
>

2022-11-03 03:02:37

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 3/9] zram: Factor out WB and non-WB zram read functions

On (22/11/02 13:20), Minchan Kim wrote:
> On Tue, Oct 18, 2022 at 01:55:27PM +0900, Sergey Senozhatsky wrote:
> > We will use non-WB variant in ZRAM page recompression path.
> >
> > Signed-off-by: Sergey Senozhatsky <[email protected]>
> > ---
> > drivers/block/zram/zram_drv.c | 73 ++++++++++++++++++++++++-----------
> > 1 file changed, 50 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index a8ef3c0c3dae..94c62d7ea818 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1314,8 +1314,30 @@ static void zram_free_page(struct zram *zram, size_t index)
> > ~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
> > }
> >
> > -static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> > - struct bio *bio, bool partial_io)
> > +/*
> > + * Reads a page from the writeback devices. Corresponding ZRAM slot
> > + * should be unlocked.
> > + */
> > +static int zram_read_from_writeback(struct zram *zram, struct page *page,
>
> How about zram_read_from_bdev?

As far as I can see, we already have that one, so that name is already taken.

We have read_from_bdev, read_from_bdev_sync, read_from_bdev_async, and
probably some more.

2022-11-03 03:15:07

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 1/9] zram: Preparation for multi-zcomp support

On (22/11/02 13:13), Minchan Kim wrote:
[..]
> >
> > +static void zram_destroy_comps(struct zram *zram)
> > +{
> > + u32 idx;
> > +
> > + for (idx = 0; idx < ZRAM_MAX_ZCOMPS; idx++) {
> > + struct zcomp *comp = zram->comps[idx];
> > +
> > + zram->comps[idx] = NULL;
> > + if (IS_ERR_OR_NULL(comp))
>
> nit:
>
> Why don't you use just NULL check? I don't see any error setting
> for zram->comps(Maybe later patch? Will keep check)?

A defense measure. zcomp_create() returns err pointer, so here
I check for err and nil just in case (if somehow someday we
accidentally have err values stored in comps). It's cheap and
safer than NULL.

2022-11-03 03:23:26

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/02 13:15), Minchan Kim wrote:
[..]
> > /* Module params (documentation at end) */
> > static unsigned int num_devices = 1;
> > @@ -1000,31 +1005,37 @@ static ssize_t max_comp_streams_store(struct device *dev,
> > return len;
> > }
> >
> > -static ssize_t comp_algorithm_show(struct device *dev,
> > - struct device_attribute *attr, char *buf)
>
> Do you have any reason to change show and set placement? Otherwise,
> please keep the function order to reduce unnecesssary churns.

I don't change their placement. It's just show and store for primary and
secondary algorithms use the same __store and __show functions, which
are static and are placed ahead of store and show.

[..]
> Just open question(I might be too paranoid?)
>
> I am thinking someone want to add third comp algorithm in future
> to balance decompression and memory efficiency.
>
> If it's not too crazy idea, let's think about the interface.
> Maybe, could we make the recomp knobs works like list?
>
> # A primary comp
> echo "A" > /zram/comp_algo
>
> # Multiple secondary comps
> echo "B threshold" > /zram/add_recomp_algo
> echo "C threshold" > /zram/add_recomp_algo
> echo "D threshold" > /zram/add_recomp_algo

What is the threshold here? My design approach is that ZRAM doesn't do
recompression on its own, so no magic is happening automatically. It's
the user-space that triggers recompression for selected pages when
user-space thinks it's time to. This allows us to have various flexible
policies and consider things that ZRAM is not even aware of: battery level,
free memory, CPU load average, etc. E.g. no recompression when all CPUs
are busy rendering video game, or when we are draining battery too fast,
etc.

> "cat /zram/recomp_algo" shows the list
>
> echo "C" > /zram/remove_recomp_algo
> will remove the C algorithm in stack.

What is the use case for removal of a secondary algorithm?

> My point is that we don't need to implement it atm but makes the
> interface to open the possibility for future extension.
>
> What do you think?

So, as far as I understand, we don't have reason to add remove_recomp_algo
right now. And existing recomp_algo does not enforce any particular format,
it can be extended. Right now we accept "$name" but can do something like
"$name:$priority". The only thing that we probably need to do is rename
recomp_algo to either add_recomp_algo or register_recomp_algo?

> > +static ssize_t recomp_algorithm_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf,
> > + size_t len)
> > +{
> > + struct zram *zram = dev_to_zram(dev);
> > + int ret;
> > +
> > + ret = __comp_algorithm_store(zram, ZRAM_SECONDARY_ZCOMP, buf);
> > + return ret ? ret : len;
> > +}
> > +#endif
> > +
> > static ssize_t compact_store(struct device *dev,
> > struct device_attribute *attr, const char *buf, size_t len)
> > {
> > @@ -1762,7 +1817,11 @@ static void zram_reset_device(struct zram *zram)
> > memset(&zram->stats, 0, sizeof(zram->stats));
> > reset_bdev(zram);
> >
> > - comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP, default_compressor);
> > + comp_algorithm_set(zram, ZRAM_PRIMARY_ZCOMP,
> > + default_comp_algs[ZRAM_PRIMARY_ZCOMP]);
> > + if (IS_ENABLED(CONFIG_ZRAM_MULTI_COMP))
>
> Dumb question:
>
> Why do you use IS_ENABLED instead of ifdef?

#ifdef-s are banned in the new C-code, as far as I know. IS_ENABLED is
what we should use.

2022-11-03 03:39:48

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/02 14:06), Minchan Kim wrote:
> On Tue, Oct 18, 2022 at 01:55:28PM +0900, Sergey Senozhatsky wrote:
> > Allow zram to recompress (using secondary compression streams)
> > pages. We support three modes:
> >
> > 1) IDLE pages recompression is activated by `idle` mode
> >
> > echo idle > /sys/block/zram0/recompress
> >
> > 2) Since there may be many idle pages user-space may pass a size
> > watermark value (in bytes) and we will recompress IDLE pages only
> > of equal or greater size:
> >
> > echo 888 > /sys/block/zram0/recompress
>
> Hmm, how about having seperate knob for threshold?

Per-my understanding this threshold can change quite often,
depending on memory pressure and so on. So we may force
user-space to issues more syscalls, without any gain in
simplicity.

> recompress_threshold
>
> With that, we could make rescompress 888 and idle/huge
> as well as only 888.
>
> echo 888 > /sys/block/zram0/recompress_threshold
> echo 1 > /sys/block/zram0/recompress
>
> or
>
> echo 888 > /sys/block/zram0/recompress_threshold
> echo idle > /sys/block/zram0/recompress
>
> or we can introduce the threshold with action item.
>
> echo "idle 888" > /sys/block/zram0/recompress
> echo "huge 888" > /sys/block/zram0/recompress
> echo "normal 888" > /sys/block/zram0/recompress

I like the latter one, when threshold is an optional argument.
I probably would even go a bit further and add keywords:

type=STRING threshold=INT

> > 3) HUGE pages recompression is activated by `huge` mode
> >
> > echo huge > /sys/block/zram0/recompress
> >
> > 4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
> >
> > echo huge_idle > /sys/block/zram0/recompress
> >
> > Signed-off-by: Sergey Senozhatsky <[email protected]>
> > ---
> > drivers/block/zram/Kconfig | 15 +++
> > drivers/block/zram/zram_drv.c | 196 +++++++++++++++++++++++++++++++++-
> > drivers/block/zram/zram_drv.h | 2 +
> > 3 files changed, 210 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
> > index d4100b0c083e..3e00656a6f8a 100644
> > --- a/drivers/block/zram/Kconfig
> > +++ b/drivers/block/zram/Kconfig
> > @@ -78,3 +78,18 @@ config ZRAM_MEMORY_TRACKING
> > /sys/kernel/debug/zram/zramX/block_state.
> >
> > See Documentation/admin-guide/blockdev/zram.rst for more information.
> > +
> > +config ZRAM_MULTI_COMP
> > + bool "Enable multiple per-CPU compression streams"
>
> per-CPU is implementation detail. Let's do not mention it.

OK.

> > + depends on ZRAM
> > + help
> > + This will enable per-CPU multi-compression streams, so that ZRAM
>
> indentation

OK. A question: does this matter? I don't see any problems in menuconfig.

> > + can re-compress IDLE/huge pages, using a potentially slower but
> > + more effective compression algorithm. Note, that IDLE page support
> > + requires ZRAM_MEMORY_TRACKING.
> > +
> > + echo TIMEOUT > /sys/block/zramX/idle
> > + echo SIZE > /sys/block/zramX/recompress
> > +
> > + SIZE (in bytes) parameter sets the object size watermark: idle
> > + objects that are of a smaller size will not get recompressed.
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 94c62d7ea818..da11560ecf70 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1282,6 +1282,12 @@ static void zram_free_page(struct zram *zram, size_t index)
> > atomic64_dec(&zram->stats.huge_pages);
> > }
> >
> > + if (zram_test_flag(zram, index, ZRAM_RECOMP))
> > + zram_clear_flag(zram, index, ZRAM_RECOMP);
> > +
> > + if (zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
> > + zram_clear_flag(zram, index, ZRAM_RECOMP_SKIP);
>
> Let's squeeze the comp algo index into meta area since we have
> some rooms for the bits. Then can we could remove the specific
> recomp two flags?

What is meta area?

> I am thinking the feature to work incoming pages on the fly,
> not only for recompress manual knob so it would be good
> if we could make the interface abstract to work something
> like this(I may miss something why we couldn't. I need more
> time to look into then)
>
> zram_bvec_write:
>
> for (i = 0; i < MAX_COMP_ALGO; i++) {
> zstrm = zcomp_stream_get(i);
> zcomp_compress(src, &comp_len);
> if (comp_len > threshold) {
> zcomp_stream_put(i);
> continue;
> }
> break;
> }
>
> zram_bvec_read:
> algo_idx = zram_get_algo_idx(zram, index);
> zstrm = zcomp_stream_get(zram, algo_idx);
> zcomp_decompress(zstrm);
> zcomp_stream_put(zram, algo_idx);

Hmm. This is something that should not be enabled by default.
N compressions per every stored page is very very CPU and
power intensive. We definitely want a way to have recompression
as a user-space event, which gives all sorts of flexibility and
extensibility. ZRAM doesn't (and should not) know about too many
things, so ZRAM can't make good decisions (and probably should not
try). User-space can make good decisions on the other hand.

So recompression for us is not something that happens all the time,
unconditionally. It's something that happens sometimes, depending on
the situation on the host.

[..]
> > +static int zram_recompress(struct zram *zram, u32 index, struct page *page,
> > + int size_watermark)
> > +{
> > + unsigned long handle_prev;
> > + unsigned long handle_next;
> > + unsigned int comp_len_next;
> > + unsigned int comp_len_prev;
>
> How about orig_handle and new_nandle with orig_comp_len and new_comp_len?

No opinion. Can we have prev and next? :)

[..]
> > + if (sysfs_streq(buf, "idle")) {
> > + mode = RECOMPRESS_IDLE;
> > + } else if (sysfs_streq(buf, "huge")) {
> > + mode = RECOMPRESS_HUGE;
> > + } else if (sysfs_streq(buf, "huge_idle")) {
> > + mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
> > + } else {
> > + /*
> > + * We will re-compress only idle objects equal or greater
> > + * in size than watermark.
> > + */
> > + ret = kstrtoint(buf, 10, &size_watermark);
> > + if (ret)
> > + return ret;
> > + mode = RECOMPRESS_IDLE;
> > + }
> > +
> > + if (size_watermark > PAGE_SIZE)
>
> nit: How about threshold instead of watermark?

OK. MM uses watermarks everywhere so I just used the same term.
Can change to threshold.

2022-11-03 03:49:21

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 0/9] zram: Support multiple compression streams

Hi Minchan,

On (22/11/02 13:07), Minchan Kim wrote:
[..]
> Hi Sergey,
>
> First of all, I am really sorry to attend the party too late.
>
> I absolutely agree the feature is really useful and even I am
> thinking to support multiple comression trials on the fly for
> future. So I'd like to introduce the feature more general shape
> to be extended later so review will go.

On the fly recompression (from the same context) was what I had as
a first version (which was internal and was never published), and
we didn't like it at all. It's too limited, has zero flexibility,
zero extensibility and has too high of a price tag attached to it
(in terms of CPU cycles, power and time).

So we moved it to user-space (deferred context) and this unlocked
numerous possibilities: recompress only when we really need to and,
more importantly, can afford it; wire in numerous metrics: battery,
CPU load etc. And many more.

2022-11-03 04:29:34

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/03 12:05), Sergey Senozhatsky wrote:
> What is the use case for removal of a secondary algorithm?
>
> > My point is that we don't need to implement it atm but makes the
> > interface to open the possibility for future extension.
> >
> > What do you think?
>
> So, as far as I understand, we don't have reason to add remove_recomp_algo
> right now. And existing recomp_algo does not enforce any particular format,
> it can be extended. Right now we accept "$name" but can do something like
> "$name:$priority".

Or with keywords:

name=STRING priority=INT

and the only legal priority for now is 1.

2022-11-03 04:30:54

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/03 12:05), Sergey Senozhatsky wrote:
> [..]
> > Just open question(I might be too paranoid?)
> >
> > I am thinking someone want to add third comp algorithm in future
> > to balance decompression and memory efficiency.
> >
> > If it's not too crazy idea, let's think about the interface.
> > Maybe, could we make the recomp knobs works like list?
> >
> > # A primary comp
> > echo "A" > /zram/comp_algo
> >
> > # Multiple secondary comps
> > echo "B threshold" > /zram/add_recomp_algo
> > echo "C threshold" > /zram/add_recomp_algo
> > echo "D threshold" > /zram/add_recomp_algo

As a side note:
The way it's implemented currently is that comps is an array, so we
can store more comps there. I sort of was thinking that we probably
can have more than two algos at some point the in the future (hence
the MULTI_COMPRESS config option).

2022-11-03 05:54:46

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/03 13:09), Sergey Senozhatsky wrote:
> On (22/11/03 12:05), Sergey Senozhatsky wrote:
> > What is the use case for removal of a secondary algorithm?
> >
> > > My point is that we don't need to implement it atm but makes the
> > > interface to open the possibility for future extension.
> > >
> > > What do you think?
> >
> > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > right now. And existing recomp_algo does not enforce any particular format,
> > it can be extended. Right now we accept "$name" but can do something like
> > "$name:$priority".
>
> Or with keywords:
>
> name=STRING priority=INT
>
> and the only legal priority for now is 1.


E.g. recomp_algorithm support for algorithms name= and optional
integer priority=.

I sort of like the recomp_algorithm name so far, but we can change it.

---

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cf9d3474b80c..9a614253de07 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1102,9 +1102,38 @@ static ssize_t recomp_algorithm_store(struct device *dev,
size_t len)
{
struct zram *zram = dev_to_zram(dev);
+ int prio = ZRAM_SECONDARY_ZCOMP;
+ char *args, *param, *val;
+ char *alg = NULL;
int ret;

- ret = __comp_algorithm_store(zram, ZRAM_SECONDARY_ZCOMP, buf);
+ args = skip_spaces(buf);
+ while (*args) {
+ args = next_arg(args, &param, &val);
+
+ if (!*val)
+ return -EINVAL;
+
+ if (!strcmp(param, "name")) {
+ alg = val;
+ continue;
+ }
+
+ if (!strcmp(param, "priority")) {
+ ret = kstrtoint(val, 10, &prio);
+ if (ret)
+ return ret;
+ continue;
+ }
+ }
+
+ if (!alg)
+ return -EINVAL;
+
+ if (prio < ZRAM_SECONDARY_ZCOMP || prio >= ZRAM_MAX_ZCOMPS)
+ return -EINVAL;
+
+ ret = __comp_algorithm_store(zram, prio, alg);
return ret ? ret : len;
}
#endif

2022-11-03 06:51:28

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/03 12:25), Sergey Senozhatsky wrote:
> > or we can introduce the threshold with action item.
> >
> > echo "idle 888" > /sys/block/zram0/recompress
> > echo "huge 888" > /sys/block/zram0/recompress
> > echo "normal 888" > /sys/block/zram0/recompress
>
> I like the latter one, when threshold is an optional argument.
> I probably would even go a bit further and add keywords:
>
> type=STRING threshold=INT

E.g. recompress support for type= and optional threshold=

We kind of don't have a use case of type=normal, as it is an equivalent
of no type. So we have huge, idle, huge_idle and no param means all
pages (which is sort of logical). threshold is optional.

---
drivers/block/zram/zram_drv.c | 55 ++++++++++++++++++++++-------------
1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9a614253de07..12f03745baf9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1688,7 +1688,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
* Corresponding ZRAM slot should be locked.
*/
static int zram_recompress(struct zram *zram, u32 index, struct page *page,
- int size_watermark)
+ int size_threshold)
{
unsigned long handle_prev;
unsigned long handle_next;
@@ -1708,7 +1708,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
/*
* Do not recompress objects that are already "small enough".
*/
- if (comp_len_prev < size_watermark)
+ if (comp_len_prev < size_threshold)
return 0;

ret = zram_read_from_zspool(zram, page, index);
@@ -1780,29 +1780,42 @@ static ssize_t recompress_store(struct device *dev,
{
struct zram *zram = dev_to_zram(dev);
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ char *args, *param, *val;
unsigned long index;
struct page *page;
ssize_t ret;
- int mode, size_watermark = 0;
-
- if (sysfs_streq(buf, "idle")) {
- mode = RECOMPRESS_IDLE;
- } else if (sysfs_streq(buf, "huge")) {
- mode = RECOMPRESS_HUGE;
- } else if (sysfs_streq(buf, "huge_idle")) {
- mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
- } else {
- /*
- * We will re-compress only idle objects equal or greater
- * in size than watermark.
- */
- ret = kstrtoint(buf, 10, &size_watermark);
- if (ret)
- return ret;
- mode = RECOMPRESS_IDLE;
+ int mode = 0, size_threshold = 0;
+
+ args = skip_spaces(buf);
+ while (*args) {
+ args = next_arg(args, &param, &val);
+
+ if (!*val)
+ return -EINVAL;
+
+ if (!strcmp(param, "type")) {
+ if (!strcmp(val, "idle"))
+ mode = RECOMPRESS_IDLE;
+ if (!strcmp(val, "huge"))
+ mode = RECOMPRESS_HUGE;
+ if (!strcmp(val, "huge_idle"))
+ mode = RECOMPRESS_IDLE | RECOMPRESS_HUGE;
+ continue;
+ }
+
+ if (!strcmp(param, "threshold")) {
+ /*
+ * We will re-compress only idle objects equal or
+ * greater in size than watermark.
+ */
+ ret = kstrtoint(val, 10, &size_threshold);
+ if (ret)
+ return ret;
+ continue;
+ }
}

- if (size_watermark > PAGE_SIZE)
+ if (size_threshold > PAGE_SIZE)
return -EINVAL;

down_read(&zram->init_lock);
@@ -1841,7 +1854,7 @@ static ssize_t recompress_store(struct device *dev,
zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
goto next;

- err = zram_recompress(zram, index, page, size_watermark);
+ err = zram_recompress(zram, index, page, size_threshold);
next:
zram_slot_unlock(zram, index);
if (err) {
--
2.38.1.273.g43a17bfeac-goog


2022-11-03 16:47:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Thu, Nov 03, 2022 at 12:05:06PM +0900, Sergey Senozhatsky wrote:

< snip >

> > Just open question(I might be too paranoid?)
> >
> > I am thinking someone want to add third comp algorithm in future
> > to balance decompression and memory efficiency.
> >
> > If it's not too crazy idea, let's think about the interface.
> > Maybe, could we make the recomp knobs works like list?
> >
> > # A primary comp
> > echo "A" > /zram/comp_algo
> >
> > # Multiple secondary comps
> > echo "B threshold" > /zram/add_recomp_algo
> > echo "C threshold" > /zram/add_recomp_algo
> > echo "D threshold" > /zram/add_recomp_algo
>
> What is the threshold here? My design approach is that ZRAM doesn't do

As your term, watermark but yeah, priority you suggested would be good
for me.

> recompression on its own, so no magic is happening automatically. It's
> the user-space that triggers recompression for selected pages when
> user-space thinks it's time to. This allows us to have various flexible
> policies and consider things that ZRAM is not even aware of: battery level,
> free memory, CPU load average, etc. E.g. no recompression when all CPUs
> are busy rendering video game, or when we are draining battery too fast,
> etc.
>
> > "cat /zram/recomp_algo" shows the list
> >
> > echo "C" > /zram/remove_recomp_algo
> > will remove the C algorithm in stack.
>
> What is the use case for removal of a secondary algorithm?

Without the interface, How can we modify the selection if admin want to
change the order of second algorithms?

>
> > My point is that we don't need to implement it atm but makes the
> > interface to open the possibility for future extension.
> >
> > What do you think?
>
> So, as far as I understand, we don't have reason to add remove_recomp_algo
> right now. And existing recomp_algo does not enforce any particular format,
> it can be extended. Right now we accept "$name" but can do something like
> "$name:$priority". The only thing that we probably need to do is rename
> recomp_algo to either add_recomp_algo or register_recomp_algo?

Yeah, I like the name and priority format.

Only question is how we could support algorithm selection change
under considering multiple secondary algorithms.

2022-11-03 17:15:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Thu, Nov 03, 2022 at 01:09:49PM +0900, Sergey Senozhatsky wrote:
> On (22/11/03 12:05), Sergey Senozhatsky wrote:
> > What is the use case for removal of a secondary algorithm?
> >
> > > My point is that we don't need to implement it atm but makes the
> > > interface to open the possibility for future extension.
> > >
> > > What do you think?
> >
> > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > right now. And existing recomp_algo does not enforce any particular format,
> > it can be extended. Right now we accept "$name" but can do something like
> > "$name:$priority".
>
> Or with keywords:
>
> name=STRING priority=INT
>
> and the only legal priority for now is 1.

I like it.

2022-11-03 17:26:10

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On Thu, Nov 03, 2022 at 12:25:43PM +0900, Sergey Senozhatsky wrote:
> On (22/11/02 14:06), Minchan Kim wrote:
> > On Tue, Oct 18, 2022 at 01:55:28PM +0900, Sergey Senozhatsky wrote:
> > > Allow zram to recompress (using secondary compression streams)
> > > pages. We support three modes:
> > >
> > > 1) IDLE pages recompression is activated by `idle` mode
> > >
> > > echo idle > /sys/block/zram0/recompress
> > >
> > > 2) Since there may be many idle pages user-space may pass a size
> > > watermark value (in bytes) and we will recompress IDLE pages only
> > > of equal or greater size:
> > >
> > > echo 888 > /sys/block/zram0/recompress
> >
> > Hmm, how about having seperate knob for threshold?
>
> Per-my understanding this threshold can change quite often,
> depending on memory pressure and so on. So we may force
> user-space to issues more syscalls, without any gain in
> simplicity.

Sorry, didn't understand your point. Let me clarify my idea.
If we have separate knob for recompress thresh hold, we could
work like this.

# recompress any compressed pages which is greater than 888 bytes.
echo 888 > /sys/block/zram0/recompress_threshold

# try to compress any pages greather than threshold with following
# algorithm.

echo "type=lzo priority=1" > /sys/block/zram0/recompress_algo
echo "type=zstd priority=2" > /sys/block/zram0/recompress_algo
echo "type=deflate priority=3" > /sys/block/zram0/recompress_algo

>
> > recompress_threshold
> >
> > With that, we could make rescompress 888 and idle/huge
> > as well as only 888.
> >
> > echo 888 > /sys/block/zram0/recompress_threshold
> > echo 1 > /sys/block/zram0/recompress
> >
> > or
> >
> > echo 888 > /sys/block/zram0/recompress_threshold
> > echo idle > /sys/block/zram0/recompress
> >
> > or we can introduce the threshold with action item.
> >
> > echo "idle 888" > /sys/block/zram0/recompress
> > echo "huge 888" > /sys/block/zram0/recompress
> > echo "normal 888" > /sys/block/zram0/recompress
>
> I like the latter one, when threshold is an optional argument.
> I probably would even go a bit further and add keywords:
>
> type=STRING threshold=INT

Yeah, kerwords would be better. Let's discuss we need a threshold
optional argument for each algo or just have one threshold
for every secondary algorithm or both.

>
> > > 3) HUGE pages recompression is activated by `huge` mode
> > >
> > > echo huge > /sys/block/zram0/recompress
> > >
> > > 4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
> > >
> > > echo huge_idle > /sys/block/zram0/recompress
> > >
> > > Signed-off-by: Sergey Senozhatsky <[email protected]>
> > > ---
> > > drivers/block/zram/Kconfig | 15 +++
> > > drivers/block/zram/zram_drv.c | 196 +++++++++++++++++++++++++++++++++-
> > > drivers/block/zram/zram_drv.h | 2 +
> > > 3 files changed, 210 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
> > > index d4100b0c083e..3e00656a6f8a 100644
> > > --- a/drivers/block/zram/Kconfig
> > > +++ b/drivers/block/zram/Kconfig
> > > @@ -78,3 +78,18 @@ config ZRAM_MEMORY_TRACKING
> > > /sys/kernel/debug/zram/zramX/block_state.
> > >
> > > See Documentation/admin-guide/blockdev/zram.rst for more information.
> > > +
> > > +config ZRAM_MULTI_COMP
> > > + bool "Enable multiple per-CPU compression streams"
> >
> > per-CPU is implementation detail. Let's do not mention it.
>
> OK.
>
> > > + depends on ZRAM
> > > + help
> > > + This will enable per-CPU multi-compression streams, so that ZRAM
> >
> > indentation
>
> OK. A question: does this matter? I don't see any problems in menuconfig.
>
> > > + can re-compress IDLE/huge pages, using a potentially slower but
> > > + more effective compression algorithm. Note, that IDLE page support
> > > + requires ZRAM_MEMORY_TRACKING.
> > > +
> > > + echo TIMEOUT > /sys/block/zramX/idle
> > > + echo SIZE > /sys/block/zramX/recompress
> > > +
> > > + SIZE (in bytes) parameter sets the object size watermark: idle
> > > + objects that are of a smaller size will not get recompressed.
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index 94c62d7ea818..da11560ecf70 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1282,6 +1282,12 @@ static void zram_free_page(struct zram *zram, size_t index)
> > > atomic64_dec(&zram->stats.huge_pages);
> > > }
> > >
> > > + if (zram_test_flag(zram, index, ZRAM_RECOMP))
> > > + zram_clear_flag(zram, index, ZRAM_RECOMP);
> > > +
> > > + if (zram_test_flag(zram, index, ZRAM_RECOMP_SKIP))
> > > + zram_clear_flag(zram, index, ZRAM_RECOMP_SKIP);
> >
> > Let's squeeze the comp algo index into meta area since we have
> > some rooms for the bits. Then can we could remove the specific
> > recomp two flags?
>
> What is meta area?

zram->table[index].flags

If we squeeze the algorithm index, we could work like this
without ZRAM_RECOMP_SKIP.

read_block_state
zram_algo_idx(zram, index) > 0 ? 'r' : '.');

zram_read_from_zpool
if (zram_algo_idx(zram, idx) != 0)
idx = 1;

zram_recompress
..
we don't need to set ZRAM_RECOMP since every meta will have the algo
idx.

zram_free_page
zram_clear_algo(zram, index);

recompress_store
int algo_idx = zram_algo_idx(zram, index);

if (algo_idx == max_algo_idx)
goto next

>
> > I am thinking the feature to work incoming pages on the fly,
> > not only for recompress manual knob so it would be good
> > if we could make the interface abstract to work something
> > like this(I may miss something why we couldn't. I need more
> > time to look into then)
> >
> > zram_bvec_write:
> >
> > for (i = 0; i < MAX_COMP_ALGO; i++) {
> > zstrm = zcomp_stream_get(i);
> > zcomp_compress(src, &comp_len);
> > if (comp_len > threshold) {
> > zcomp_stream_put(i);
> > continue;
> > }
> > break;
> > }
> >
> > zram_bvec_read:
> > algo_idx = zram_get_algo_idx(zram, index);
> > zstrm = zcomp_stream_get(zram, algo_idx);
> > zcomp_decompress(zstrm);
> > zcomp_stream_put(zram, algo_idx);
>
> Hmm. This is something that should not be enabled by default.

Exactly. I don't mean to enable by default, either.

> N compressions per every stored page is very very CPU and
> power intensive. We definitely want a way to have recompression
> as a user-space event, which gives all sorts of flexibility and
> extensibility. ZRAM doesn't (and should not) know about too many
> things, so ZRAM can't make good decisions (and probably should not
> try). User-space can make good decisions on the other hand.
>
> So recompression for us is not something that happens all the time,
> unconditionally. It's something that happens sometimes, depending on
> the situation on the host.

Totally agree. I am not saying we should enable the feature by default
but at lesat consider it for the future. I have something in mind to
be useful later.

>
> [..]
> > > +static int zram_recompress(struct zram *zram, u32 index, struct page *page,
> > > + int size_watermark)
> > > +{
> > > + unsigned long handle_prev;
> > > + unsigned long handle_next;
> > > + unsigned int comp_len_next;
> > > + unsigned int comp_len_prev;
> >
> > How about orig_handle and new_nandle with orig_comp_len and new_comp_len?
>
> No opinion. Can we have prev and next? :)

prev and next gives the impression position something like list.
orig and new gives the impression stale and fresh.

We are doing latter here.

2022-11-03 17:29:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Thu, Nov 03, 2022 at 12:54:43PM +0900, Sergey Senozhatsky wrote:
> On (22/11/03 12:05), Sergey Senozhatsky wrote:
> > [..]
> > > Just open question(I might be too paranoid?)
> > >
> > > I am thinking someone want to add third comp algorithm in future
> > > to balance decompression and memory efficiency.
> > >
> > > If it's not too crazy idea, let's think about the interface.
> > > Maybe, could we make the recomp knobs works like list?
> > >
> > > # A primary comp
> > > echo "A" > /zram/comp_algo
> > >
> > > # Multiple secondary comps
> > > echo "B threshold" > /zram/add_recomp_algo
> > > echo "C threshold" > /zram/add_recomp_algo
> > > echo "D threshold" > /zram/add_recomp_algo
>
> As a side note:
> The way it's implemented currently is that comps is an array, so we
> can store more comps there. I sort of was thinking that we probably
> can have more than two algos at some point the in the future (hence
> the MULTI_COMPRESS config option).

Sure.

2022-11-04 03:33:05

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/03 09:34), Minchan Kim wrote:
> > > My point is that we don't need to implement it atm but makes the
> > > interface to open the possibility for future extension.
> > >
> > > What do you think?
> >
> > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > right now. And existing recomp_algo does not enforce any particular format,
> > it can be extended. Right now we accept "$name" but can do something like
> > "$name:$priority". The only thing that we probably need to do is rename
> > recomp_algo to either add_recomp_algo or register_recomp_algo?
>
> Yeah, I like the name and priority format.
>
> Only question is how we could support algorithm selection change
> under considering multiple secondary algorithms.

So what I was thinking about, and I'm still in the mental model that
re-compression is a user-space event, just like writeback, extension
of recompress sysfs knob with "algo_index" (or something similar) which
will mirror algorithm priority.

Example:

Configure 2 alternative algos, with priority 1 and 2

echo "name=lz4 priority=1" > recomp_algo
echo "name=lz5 priority=2" > recomp_algo

Recompress pages using algo 1 and algo 2

echo "type=huge threshold=3000 algo_idx=1" > recompress
echo "type=idle threshold=2000 algo_idx=2" > recompress

Maybe we can even pass algo name instead of idx.

2022-11-04 03:56:15

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/03 10:00), Minchan Kim wrote:
[..]
> > Per-my understanding this threshold can change quite often,
> > depending on memory pressure and so on. So we may force
> > user-space to issues more syscalls, without any gain in
> > simplicity.
>
> Sorry, didn't understand your point. Let me clarify my idea.
> If we have separate knob for recompress thresh hold, we could
> work like this.
>
> # recompress any compressed pages which is greater than 888 bytes.
> echo 888 > /sys/block/zram0/recompress_threshold
>
> # try to compress any pages greather than threshold with following
> # algorithm.
>
> echo "type=lzo priority=1" > /sys/block/zram0/recompress_algo
> echo "type=zstd priority=2" > /sys/block/zram0/recompress_algo
> echo "type=deflate priority=3" > /sys/block/zram0/recompress_algo

OK. We can always add more sysfs knobs and make threshold a global
per-device value.

I think I prefer the approach when threshold is part of the current
recompress context, not something derived form another context. That
is, when all values (page type, threshold, possibly algorithm index)
are submitted by user-space for this particular recompression

echo "type=huge threshold=3000 ..." > recompress

If threshold is a global value that is applied to all recompress calls
then how does user-space say no-threshold? For instance, when it wants
to recompress only huge pages. It probably still needs to supply something
like threshold=0. So my personal preference for now - keep threshold
as a context dependent value.

Another thing that I like about threshold= being context dependent
is that then we don't need to protect recompression against concurrent
global threshold modifications with lock and so on. It keeps things
simpler.

[..]
> > > Let's squeeze the comp algo index into meta area since we have
> > > some rooms for the bits. Then can we could remove the specific
> > > recomp two flags?
> >
> > What is meta area?
>
> zram->table[index].flags
>
> If we squeeze the algorithm index, we could work like this
> without ZRAM_RECOMP_SKIP.

We still need ZRAM_RECOMP_SKIP. Recompression may fail to compress
object further: sometimes we can get recompressed object that is larger
than the original one, sometimes of the same size, sometimes of a smaller
size but still belonging to the same size class, which doesn't save us
any memory. Without ZRAM_RECOMP_SKIP we will continue re-compressing
objects that are in-compressible (in a way that saves us memory in
zsmalloc) by any of the ZRAM's algorithms.

> read_block_state
> zram_algo_idx(zram, index) > 0 ? 'r' : '.');
>
> zram_read_from_zpool
> if (zram_algo_idx(zram, idx) != 0)
> idx = 1;

As an idea, maybe we can store everything re-compression related
in a dedicated meta field? SKIP flag, algorithm ID, etc.

We don't have too many bits left in ->flags on 32-bit systems. We
currently probably need at least 3 bits - one for RECOMP_SKIP and at
least 2 for algorithm ID. 2 bits for algorithm ID put us into situation
that we can have only 00, 01, 10, 11 as IDs, that is maximum 3 recompress
algorithms: 00 is the primary one and the rest are alternative ones.
Maximum 3 re-compression algorithms sounds like a reasonable max value to
me. Yeah, maybe we can use flags bits for it.

[..]
> > > zram_bvec_read:
> > > algo_idx = zram_get_algo_idx(zram, index);
> > > zstrm = zcomp_stream_get(zram, algo_idx);
> > > zcomp_decompress(zstrm);
> > > zcomp_stream_put(zram, algo_idx);
> >
> > Hmm. This is something that should not be enabled by default.
>
> Exactly. I don't mean to enable by default, either.

OK.

> > N compressions per every stored page is very very CPU and
> > power intensive. We definitely want a way to have recompression
> > as a user-space event, which gives all sorts of flexibility and
> > extensibility. ZRAM doesn't (and should not) know about too many
> > things, so ZRAM can't make good decisions (and probably should not
> > try). User-space can make good decisions on the other hand.
> >
> > So recompression for us is not something that happens all the time,
> > unconditionally. It's something that happens sometimes, depending on
> > the situation on the host.
>
> Totally agree. I am not saying we should enable the feature by default
> but at lesat consider it for the future. I have something in mind to
> be useful later.

OK.

> > [..]
> > > > +static int zram_recompress(struct zram *zram, u32 index, struct page *page,
> > > > + int size_watermark)
> > > > +{
> > > > + unsigned long handle_prev;
> > > > + unsigned long handle_next;
> > > > + unsigned int comp_len_next;
> > > > + unsigned int comp_len_prev;
> > >
> > > How about orig_handle and new_nandle with orig_comp_len and new_comp_len?
> >
> > No opinion. Can we have prev and next? :)
>
> prev and next gives the impression position something like list.
> orig and new gives the impression stale and fresh.
>
> We are doing latter here.

Yeah, like I said in internal email, this will make rebasing harder on
my side, because this breaks a patch from Alexey and then breaks a higher
order zspages patch series. It's an very old series and we already have
quite a bit of patches depending on it.

2022-11-04 05:01:50

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/04 12:18), Sergey Senozhatsky wrote:
> On (22/11/03 09:34), Minchan Kim wrote:
> > Yeah, I like the name and priority format.
> >
> > Only question is how we could support algorithm selection change
> > under considering multiple secondary algorithms.
>
> So what I was thinking about, and I'm still in the mental model that
> re-compression is a user-space event, just like writeback, extension
> of recompress sysfs knob with "algo_index" (or something similar) which
> will mirror algorithm priority.
>
> Example:
>
> Configure 2 alternative algos, with priority 1 and 2
>
> echo "name=lz4 priority=1" > recomp_algo
> echo "name=lz5 priority=2" > recomp_algo
>
> Recompress pages using algo 1 and algo 2
>
> echo "type=huge threshold=3000 algo_idx=1" > recompress
> echo "type=idle threshold=2000 algo_idx=2" > recompress
>
> Maybe we can even pass algo name instead of idx.

Or pass priority= so that interface that uses algorithms has the
same keyword that the interface that configures those algorithms.

I still don't see many use-cases for "delete algorithm", to be honest.
ZRAM is configured by scripts in 99.99999% of cases and it is
quite static once it has been configured. So we probably can use
the "don't setup algorithms that you don't need" approach, to keep
things simpler.

2022-11-04 08:02:17

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/04 12:48), Sergey Senozhatsky wrote:
> > read_block_state
> > zram_algo_idx(zram, index) > 0 ? 'r' : '.');
> >
> > zram_read_from_zpool
> > if (zram_algo_idx(zram, idx) != 0)
> > idx = 1;
>
> As an idea, maybe we can store everything re-compression related
> in a dedicated meta field? SKIP flag, algorithm ID, etc.

That's just an idea.

Something like this

---
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index bdfc9bf0bdd5..c011d0f145f6 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -49,8 +49,6 @@ enum zram_pageflags {
ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */
- ZRAM_RECOMP, /* page was recompressed */
- ZRAM_RECOMP_SKIP, /* secondary algorithm cannot compress this page */

__NR_ZRAM_PAGEFLAGS,
};
@@ -64,6 +62,11 @@ struct zram_table_entry {
unsigned long element;
};
unsigned long flags;
+#ifdef CONFIG_ZRAM_MULTI_COMP
+ unsigned int incompressible:1;
+ unsigned int priority:2;
+#endif
+
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
ktime_t ac_time;
#endif
---

The reason I'm thinking about it is that we have flags bits that are
used only when particular .config options are enabled. Without those
options we just waste bits.

Recompression is one such thing. Another one is writeback.

---
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index c011d0f145f6..7076ec209e79 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -45,8 +45,6 @@ enum zram_pageflags {
/* zram slot is locked */
ZRAM_LOCK = ZRAM_FLAG_SHIFT,
ZRAM_SAME, /* Page consists the same element */
- ZRAM_WB, /* page is stored on backing_device */
- ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */

@@ -68,6 +66,8 @@ struct zram_table_entry {
#endif

#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+ unsigned int under_writeback:1;
+ unsigned int written_back:1;
ktime_t ac_time;
#endif
};
---

So we can use ->flags only for things that are independent of .config

2022-11-04 08:17:07

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/03 10:00), Minchan Kim wrote:
> zram->table[index].flags
>
> If we squeeze the algorithm index, we could work like this
> without ZRAM_RECOMP_SKIP.

Something like this?

Allocate two ->flags bits for priority and one bit for RECOMP_SKIP.
Two priority bits let us to have 3 alternative algorithms (01 10 11)
plus one default (00). So 4 in total.

---
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 12f03745baf9..af0ff58087ca 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -127,6 +127,19 @@ static size_t zram_get_obj_size(struct zram *zram, u32 index)
return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
}

+static inline void zram_set_priority(struct zram *zram, u32 index, u32 prio)
+{
+ prio &= ZRAM_COMP_PRIORITY_MASK;
+ zram->table[index].flags &= (prio << ZRAM_COMP_PRIORITY_1);
+}
+
+static inline u32 zram_get_priority(struct zram *zram, u32 index)
+{
+ u32 prio = zram->table[index].flags;
+
+ return prio & ZRAM_COMP_PRIORITY_MASK;
+}
+
static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
{
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index bdfc9bf0bdd5..33e52c5a9a90 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -40,6 +40,8 @@
*/
#define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)

+#define ZRAM_COMP_PRIORITY_MASK 0x3
+
/* Flags for zram pages (table[page_no].flags) */
enum zram_pageflags {
/* zram slot is locked */
@@ -49,8 +51,9 @@ enum zram_pageflags {
ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */
- ZRAM_RECOMP, /* page was recompressed */
ZRAM_RECOMP_SKIP, /* secondary algorithm cannot compress this page */
+ ZRAM_COMP_PRIORITY_1,
+ ZRAM_COMP_PRIORITY_2,

__NR_ZRAM_PAGEFLAGS,
};

2022-11-04 08:56:05

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/04 16:53), Sergey Senozhatsky wrote:
[..]
> +static inline void zram_set_priority(struct zram *zram, u32 index, u32 prio)
> +{
> + prio &= ZRAM_COMP_PRIORITY_MASK;
> + zram->table[index].flags &= (prio << ZRAM_COMP_PRIORITY_1);
> +}

Uh... Something like this, sorry.

+static inline void zram_set_priority(struct zram *zram, u32 index, u32 prio)
+{
+ prio &= ZRAM_RECOMP_PRIO_MASK;
+ zram->table[index].flags &= ~(ZRAM_RECOMP_PRIO_MASK <<
+ ZRAM_RECOMP_PRIORITY_1);
+ zram->table[index].flags |= (prio << ZRAM_RECOMP_PRIORITY_1);
+}

2022-11-04 17:29:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Fri, Nov 04, 2022 at 12:18:43PM +0900, Sergey Senozhatsky wrote:
> On (22/11/03 09:34), Minchan Kim wrote:
> > > > My point is that we don't need to implement it atm but makes the
> > > > interface to open the possibility for future extension.
> > > >
> > > > What do you think?
> > >
> > > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > > right now. And existing recomp_algo does not enforce any particular format,
> > > it can be extended. Right now we accept "$name" but can do something like
> > > "$name:$priority". The only thing that we probably need to do is rename
> > > recomp_algo to either add_recomp_algo or register_recomp_algo?
> >
> > Yeah, I like the name and priority format.
> >
> > Only question is how we could support algorithm selection change
> > under considering multiple secondary algorithms.
>
> So what I was thinking about, and I'm still in the mental model that
> re-compression is a user-space event, just like writeback, extension
> of recompress sysfs knob with "algo_index" (or something similar) which
> will mirror algorithm priority.
>
> Example:
>
> Configure 2 alternative algos, with priority 1 and 2
>
> echo "name=lz4 priority=1" > recomp_algo
> echo "name=lz5 priority=2" > recomp_algo
>
> Recompress pages using algo 1 and algo 2
>
> echo "type=huge threshold=3000 algo_idx=1" > recompress
> echo "type=idle threshold=2000 algo_idx=2" > recompress
>
> Maybe we can even pass algo name instead of idx.

Let's use name rather than index.

2022-11-04 17:37:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On Fri, Nov 04, 2022 at 12:48:42PM +0900, Sergey Senozhatsky wrote:
> On (22/11/03 10:00), Minchan Kim wrote:
> [..]
> > > Per-my understanding this threshold can change quite often,
> > > depending on memory pressure and so on. So we may force
> > > user-space to issues more syscalls, without any gain in
> > > simplicity.
> >
> > Sorry, didn't understand your point. Let me clarify my idea.
> > If we have separate knob for recompress thresh hold, we could
> > work like this.
> >
> > # recompress any compressed pages which is greater than 888 bytes.
> > echo 888 > /sys/block/zram0/recompress_threshold
> >
> > # try to compress any pages greather than threshold with following
> > # algorithm.
> >
> > echo "type=lzo priority=1" > /sys/block/zram0/recompress_algo
> > echo "type=zstd priority=2" > /sys/block/zram0/recompress_algo
> > echo "type=deflate priority=3" > /sys/block/zram0/recompress_algo
>
> OK. We can always add more sysfs knobs and make threshold a global
> per-device value.
>
> I think I prefer the approach when threshold is part of the current
> recompress context, not something derived form another context. That
> is, when all values (page type, threshold, possibly algorithm index)
> are submitted by user-space for this particular recompression
>
> echo "type=huge threshold=3000 ..." > recompress
>
> If threshold is a global value that is applied to all recompress calls
> then how does user-space say no-threshold? For instance, when it wants
> to recompress only huge pages. It probably still needs to supply something
> like threshold=0. So my personal preference for now - keep threshold
> as a context dependent value.
>
> Another thing that I like about threshold= being context dependent
> is that then we don't need to protect recompression against concurrent
> global threshold modifications with lock and so on. It keeps things
> simpler.

Sure. Let's go with per-algo threshold.

>
> [..]
> > > > Let's squeeze the comp algo index into meta area since we have
> > > > some rooms for the bits. Then can we could remove the specific
> > > > recomp two flags?
> > >
> > > What is meta area?
> >
> > zram->table[index].flags
> >
> > If we squeeze the algorithm index, we could work like this
> > without ZRAM_RECOMP_SKIP.
>
> We still need ZRAM_RECOMP_SKIP. Recompression may fail to compress
> object further: sometimes we can get recompressed object that is larger
> than the original one, sometimes of the same size, sometimes of a smaller
> size but still belonging to the same size class, which doesn't save us
> any memory. Without ZRAM_RECOMP_SKIP we will continue re-compressing

Indeed.

> objects that are in-compressible (in a way that saves us memory in
> zsmalloc) by any of the ZRAM's algorithms.
>
> > read_block_state
> > zram_algo_idx(zram, index) > 0 ? 'r' : '.');
> >
> > zram_read_from_zpool
> > if (zram_algo_idx(zram, idx) != 0)
> > idx = 1;
>
> As an idea, maybe we can store everything re-compression related
> in a dedicated meta field? SKIP flag, algorithm ID, etc.
>
> We don't have too many bits left in ->flags on 32-bit systems. We
> currently probably need at least 3 bits - one for RECOMP_SKIP and at
> least 2 for algorithm ID. 2 bits for algorithm ID put us into situation
> that we can have only 00, 01, 10, 11 as IDs, that is maximum 3 recompress
> algorithms: 00 is the primary one and the rest are alternative ones.
> Maximum 3 re-compression algorithms sounds like a reasonable max value to
> me. Yeah, maybe we can use flags bits for it.

If possbile, let's go with those three bits into flags since we could
factor them out into dedicated field, anytime later since it's not ABI.

2022-11-04 18:20:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On Fri, Nov 04, 2022 at 04:53:13PM +0900, Sergey Senozhatsky wrote:
> On (22/11/03 10:00), Minchan Kim wrote:
> > zram->table[index].flags
> >
> > If we squeeze the algorithm index, we could work like this
> > without ZRAM_RECOMP_SKIP.
>
> Something like this?
>
> Allocate two ->flags bits for priority and one bit for RECOMP_SKIP.
> Two priority bits let us to have 3 alternative algorithms (01 10 11)
> plus one default (00). So 4 in total.

Looks good!

2022-11-04 18:53:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On Fri, Nov 04, 2022 at 04:12:13PM +0900, Sergey Senozhatsky wrote:
> On (22/11/04 12:48), Sergey Senozhatsky wrote:
> > > read_block_state
> > > zram_algo_idx(zram, index) > 0 ? 'r' : '.');
> > >
> > > zram_read_from_zpool
> > > if (zram_algo_idx(zram, idx) != 0)
> > > idx = 1;
> >
> > As an idea, maybe we can store everything re-compression related
> > in a dedicated meta field? SKIP flag, algorithm ID, etc.
>
> That's just an idea.
>
> Something like this
>
> ---
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index bdfc9bf0bdd5..c011d0f145f6 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -49,8 +49,6 @@ enum zram_pageflags {
> ZRAM_UNDER_WB, /* page is under writeback */
> ZRAM_HUGE, /* Incompressible page */
> ZRAM_IDLE, /* not accessed page since last idle marking */
> - ZRAM_RECOMP, /* page was recompressed */
> - ZRAM_RECOMP_SKIP, /* secondary algorithm cannot compress this page */
>
> __NR_ZRAM_PAGEFLAGS,
> };
> @@ -64,6 +62,11 @@ struct zram_table_entry {
> unsigned long element;
> };
> unsigned long flags;
> +#ifdef CONFIG_ZRAM_MULTI_COMP
> + unsigned int incompressible:1;
> + unsigned int priority:2;
> +#endif
> +
> #ifdef CONFIG_ZRAM_MEMORY_TRACKING
> ktime_t ac_time;
> #endif
> ---
>
> The reason I'm thinking about it is that we have flags bits that are
> used only when particular .config options are enabled. Without those
> options we just waste bits.
>
> Recompression is one such thing. Another one is writeback.
>
> ---
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index c011d0f145f6..7076ec209e79 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -45,8 +45,6 @@ enum zram_pageflags {
> /* zram slot is locked */
> ZRAM_LOCK = ZRAM_FLAG_SHIFT,
> ZRAM_SAME, /* Page consists the same element */
> - ZRAM_WB, /* page is stored on backing_device */
> - ZRAM_UNDER_WB, /* page is under writeback */
> ZRAM_HUGE, /* Incompressible page */
> ZRAM_IDLE, /* not accessed page since last idle marking */
>
> @@ -68,6 +66,8 @@ struct zram_table_entry {
> #endif
>
> #ifdef CONFIG_ZRAM_MEMORY_TRACKING
> + unsigned int under_writeback:1;
> + unsigned int written_back:1;
> ktime_t ac_time;
> #endif
> };
> ---
>
> So we can use ->flags only for things that are independent of .config

Couldn't we move them into a dedicated field to introduce one more field
overhead in the meta area when we run out the extra field in the flag?
So I'd like to squeeze the bits into flag.

2022-11-04 19:11:04

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Fri, Nov 04, 2022 at 01:53:11PM +0900, Sergey Senozhatsky wrote:
> On (22/11/04 12:18), Sergey Senozhatsky wrote:
> > On (22/11/03 09:34), Minchan Kim wrote:
> > > Yeah, I like the name and priority format.
> > >
> > > Only question is how we could support algorithm selection change
> > > under considering multiple secondary algorithms.
> >
> > So what I was thinking about, and I'm still in the mental model that
> > re-compression is a user-space event, just like writeback, extension
> > of recompress sysfs knob with "algo_index" (or something similar) which
> > will mirror algorithm priority.
> >
> > Example:
> >
> > Configure 2 alternative algos, with priority 1 and 2
> >
> > echo "name=lz4 priority=1" > recomp_algo
> > echo "name=lz5 priority=2" > recomp_algo
> >
> > Recompress pages using algo 1 and algo 2
> >
> > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > echo "type=idle threshold=2000 algo_idx=2" > recompress
> >
> > Maybe we can even pass algo name instead of idx.
>
> Or pass priority= so that interface that uses algorithms has the
> same keyword that the interface that configures those algorithms.

Hmm, why do we need algo_idx here if we already set up every
fields at algorithm setup time?

My understaind(assuming default(i.e., primary) algo is lzo) is

echo "name=lz4 priority=1" > recomp_algo
echo "name=lz5 priority=2" > recomp_algo

echo "type=huge threshold=3000" > recompress

It will try compress every objects which greater than 3000B with lz4 first
and then lz5 if it's stillgreater or equal than 3000(or same size class).

>
> I still don't see many use-cases for "delete algorithm", to be honest.
> ZRAM is configured by scripts in 99.99999% of cases and it is

For the development time in the local side, people usually type in
until they will have solid script version. If we asks resetting
to zram to modify it, it's not good and consistent with other
sysfs knobs we could overwrite it to change it. How about supporting
overwritting to chage it over priority?

echo "name=lz4 priority=1" > recomp_algo
echo "name=lz5 priority=2" > recomp_algo

# or I realized to change lz5 to lz7 so
echo "name=lz6 priority=2" > recomp_algo

> quite static once it has been configured. So we probably can use
> the "don't setup algorithms that you don't need" approach, to keep
> things simpler.

2022-11-04 23:40:26

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 4/9] zram: Introduce recompress sysfs knob

On (22/11/04 10:27), Minchan Kim wrote:
> > objects that are in-compressible (in a way that saves us memory in
> > zsmalloc) by any of the ZRAM's algorithms.
> >
> > > read_block_state
> > > zram_algo_idx(zram, index) > 0 ? 'r' : '.');
> > >
> > > zram_read_from_zpool
> > > if (zram_algo_idx(zram, idx) != 0)
> > > idx = 1;
> >
> > As an idea, maybe we can store everything re-compression related
> > in a dedicated meta field? SKIP flag, algorithm ID, etc.
> >
> > We don't have too many bits left in ->flags on 32-bit systems. We
> > currently probably need at least 3 bits - one for RECOMP_SKIP and at
> > least 2 for algorithm ID. 2 bits for algorithm ID put us into situation
> > that we can have only 00, 01, 10, 11 as IDs, that is maximum 3 recompress
> > algorithms: 00 is the primary one and the rest are alternative ones.
> > Maximum 3 re-compression algorithms sounds like a reasonable max value to
> > me. Yeah, maybe we can use flags bits for it.
>
> If possbile, let's go with those three bits into flags since we could
> factor them out into dedicated field, anytime later since it's not ABI.

Ack.

2022-11-04 23:40:54

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/04 09:34), Minchan Kim wrote:
> > > > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > > > right now. And existing recomp_algo does not enforce any particular format,
> > > > it can be extended. Right now we accept "$name" but can do something like
> > > > "$name:$priority". The only thing that we probably need to do is rename
> > > > recomp_algo to either add_recomp_algo or register_recomp_algo?
> > >
> > > Yeah, I like the name and priority format.
> > >
> > > Only question is how we could support algorithm selection change
> > > under considering multiple secondary algorithms.
> >
> > So what I was thinking about, and I'm still in the mental model that
> > re-compression is a user-space event, just like writeback, extension
> > of recompress sysfs knob with "algo_index" (or something similar) which
> > will mirror algorithm priority.
> >
> > Example:
> >
> > Configure 2 alternative algos, with priority 1 and 2
> >
> > echo "name=lz4 priority=1" > recomp_algo
> > echo "name=lz5 priority=2" > recomp_algo
> >
> > Recompress pages using algo 1 and algo 2
> >
> > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > echo "type=idle threshold=2000 algo_idx=2" > recompress
> >
> > Maybe we can even pass algo name instead of idx.
>
> Let's use name rather than index.

OK. Any preference on the keyword? "name="? "algo="? "algorithm="?
"compressor="? "comp="?

I want use the same keyword for recomp_algo. I sort of like "algo=",
but not sure.

2022-11-04 23:46:51

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/04 10:43), Minchan Kim wrote:
> > > Configure 2 alternative algos, with priority 1 and 2
> > >
> > > echo "name=lz4 priority=1" > recomp_algo
> > > echo "name=lz5 priority=2" > recomp_algo
> > >
> > > Recompress pages using algo 1 and algo 2
> > >
> > > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > > echo "type=idle threshold=2000 algo_idx=2" > recompress
> > >
> > > Maybe we can even pass algo name instead of idx.
> >
> > Or pass priority= so that interface that uses algorithms has the
> > same keyword that the interface that configures those algorithms.
>
> Hmm, why do we need algo_idx here if we already set up every
> fields at algorithm setup time?
>
> My understaind(assuming default(i.e., primary) algo is lzo) is
>
> echo "name=lz4 priority=1" > recomp_algo
> echo "name=lz5 priority=2" > recomp_algo
>
> echo "type=huge threshold=3000" > recompress
>
> It will try compress every objects which greater than 3000B with lz4 first
> and then lz5 if it's stillgreater or equal than 3000(or same size class).

One can be SW one can be HW. So I thought about having flexibility here.
Instead of doing

for (idx = 1; idx < MAX_IDX; idx++) {
len = zcomp_compress(zram->comps[idx]);
if (len <= threshold)
break;
}

We would just directly use the suggested algo.

But we probably don't need that param at all and can use
the loop instead?

[..]
> echo "name=lz4 priority=1" > recomp_algo
> echo "name=lz5 priority=2" > recomp_algo
>
> # or I realized to change lz5 to lz7 so
> echo "name=lz6 priority=2" > recomp_algo

So the latter should delete lz5 at idx 2 and put lz6 there?
I can add that.

2022-11-05 00:00:52

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/04 16:40), Minchan Kim wrote:
> > > > Configure 2 alternative algos, with priority 1 and 2
> > > >
> > > > echo "name=lz4 priority=1" > recomp_algo
> > > > echo "name=lz5 priority=2" > recomp_algo
> > > >
> > > > Recompress pages using algo 1 and algo 2
> > > >
> > > > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > > > echo "type=idle threshold=2000 algo_idx=2" > recompress
> > > >
> > > > Maybe we can even pass algo name instead of idx.
> > >
> > > Let's use name rather than index.
> >
> > OK. Any preference on the keyword? "name="? "algo="? "algorithm="?
> > "compressor="? "comp="?
> >
> > I want use the same keyword for recomp_algo. I sort of like "algo=",
> > but not sure.
>
> +1 with algo

Minchan, I'm sorry I'm getting a bit confused (didn't sleep last night).
I just saw your other email and you suggested there that we don't need
any idx or name in recompress. Or did I misunderstand it?

2022-11-05 00:01:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Sat, Nov 05, 2022 at 08:25:44AM +0900, Sergey Senozhatsky wrote:
> On (22/11/04 09:34), Minchan Kim wrote:
> > > > > So, as far as I understand, we don't have reason to add remove_recomp_algo
> > > > > right now. And existing recomp_algo does not enforce any particular format,
> > > > > it can be extended. Right now we accept "$name" but can do something like
> > > > > "$name:$priority". The only thing that we probably need to do is rename
> > > > > recomp_algo to either add_recomp_algo or register_recomp_algo?
> > > >
> > > > Yeah, I like the name and priority format.
> > > >
> > > > Only question is how we could support algorithm selection change
> > > > under considering multiple secondary algorithms.
> > >
> > > So what I was thinking about, and I'm still in the mental model that
> > > re-compression is a user-space event, just like writeback, extension
> > > of recompress sysfs knob with "algo_index" (or something similar) which
> > > will mirror algorithm priority.
> > >
> > > Example:
> > >
> > > Configure 2 alternative algos, with priority 1 and 2
> > >
> > > echo "name=lz4 priority=1" > recomp_algo
> > > echo "name=lz5 priority=2" > recomp_algo
> > >
> > > Recompress pages using algo 1 and algo 2
> > >
> > > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > > echo "type=idle threshold=2000 algo_idx=2" > recompress
> > >
> > > Maybe we can even pass algo name instead of idx.
> >
> > Let's use name rather than index.
>
> OK. Any preference on the keyword? "name="? "algo="? "algorithm="?
> "compressor="? "comp="?
>
> I want use the same keyword for recomp_algo. I sort of like "algo=",
> but not sure.

+1 with algo

2022-11-05 00:19:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Sat, Nov 05, 2022 at 08:44:46AM +0900, Sergey Senozhatsky wrote:
> On (22/11/04 16:40), Minchan Kim wrote:
> > > > > Configure 2 alternative algos, with priority 1 and 2
> > > > >
> > > > > echo "name=lz4 priority=1" > recomp_algo
> > > > > echo "name=lz5 priority=2" > recomp_algo
> > > > >
> > > > > Recompress pages using algo 1 and algo 2
> > > > >
> > > > > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > > > > echo "type=idle threshold=2000 algo_idx=2" > recompress
> > > > >
> > > > > Maybe we can even pass algo name instead of idx.
> > > >
> > > > Let's use name rather than index.
> > >
> > > OK. Any preference on the keyword? "name="? "algo="? "algorithm="?
> > > "compressor="? "comp="?
> > >
> > > I want use the same keyword for recomp_algo. I sort of like "algo=",
> > > but not sure.
> >
> > +1 with algo
>
> Minchan, I'm sorry I'm getting a bit confused (didn't sleep last night).
> I just saw your other email and you suggested there that we don't need
> any idx or name in recompress. Or did I misunderstand it?
>

I should have more clear. Sorry for that.

I meant if you need some reason, I prefer "algo=' to make review
proceed. If you agree we don't need it, then, yeah, we are all good.

2022-11-05 00:21:24

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Sat, Nov 05, 2022 at 08:41:37AM +0900, Sergey Senozhatsky wrote:
> On (22/11/04 10:43), Minchan Kim wrote:
> > > > Configure 2 alternative algos, with priority 1 and 2
> > > >
> > > > echo "name=lz4 priority=1" > recomp_algo
> > > > echo "name=lz5 priority=2" > recomp_algo
> > > >
> > > > Recompress pages using algo 1 and algo 2
> > > >
> > > > echo "type=huge threshold=3000 algo_idx=1" > recompress
> > > > echo "type=idle threshold=2000 algo_idx=2" > recompress
> > > >
> > > > Maybe we can even pass algo name instead of idx.
> > >
> > > Or pass priority= so that interface that uses algorithms has the
> > > same keyword that the interface that configures those algorithms.
> >
> > Hmm, why do we need algo_idx here if we already set up every
> > fields at algorithm setup time?
> >
> > My understaind(assuming default(i.e., primary) algo is lzo) is
> >
> > echo "name=lz4 priority=1" > recomp_algo
> > echo "name=lz5 priority=2" > recomp_algo
> >
> > echo "type=huge threshold=3000" > recompress
> >
> > It will try compress every objects which greater than 3000B with lz4 first
> > and then lz5 if it's stillgreater or equal than 3000(or same size class).
>
> One can be SW one can be HW. So I thought about having flexibility here.
> Instead of doing
>
> for (idx = 1; idx < MAX_IDX; idx++) {
> len = zcomp_compress(zram->comps[idx]);
> if (len <= threshold)
> break;
> }
>
> We would just directly use the suggested algo.
>
> But we probably don't need that param at all and can use
> the loop instead?

I don't understand what param you are saying. I expected
the zram->comps array already has sorted algoritm based on the
priority so the loop will try compression as expected so loop is fine.
Are we on same page?

>
> [..]
> > echo "name=lz4 priority=1" > recomp_algo
> > echo "name=lz5 priority=2" > recomp_algo
> >
> > # or I realized to change lz5 to lz7 so
> > echo "name=lz6 priority=2" > recomp_algo
>
> So the latter should delete lz5 at idx 2 and put lz6 there?
> I can add that.

Yub.

2022-11-05 00:44:03

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/05 08:41), Sergey Senozhatsky wrote:
> One can be SW one can be HW. So I thought about having flexibility here.
> Instead of doing
>
> for (idx = 1; idx < MAX_IDX; idx++) {
> len = zcomp_compress(zram->comps[idx]);
> if (len <= threshold)
> break;
> }
>
> We would just directly use the suggested algo.
>
> But we probably don't need that param at all and can use
> the loop instead?

My idea was that recompress does not loop through the algos (what
on the fly recompression can do for instance), but instead
recompress only does one thing.

Because we can have, for instance, something like this

algo=zstd priority=1
algo=deflate priority=2

And we recompress with algo=zstd only first. Without going
into the slowest, most CPU and power consuming one. If the
memory pressure keeps increasing then we might do algo=deflate
recompress, as the last resort before we do writeback. But we
may skip algo=deflate altogether and go straight to writeback,
for instance because we have less than 30% of battery left.

So the reason I suggested algo= in recompress was, basically,
for that for having exact control of what recompression does.

2022-11-05 01:55:21

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/04 17:01), Minchan Kim wrote:
> > One can be SW one can be HW. So I thought about having flexibility here.
> > Instead of doing
> >
> > for (idx = 1; idx < MAX_IDX; idx++) {
> > len = zcomp_compress(zram->comps[idx]);
> > if (len <= threshold)
> > break;
> > }
> >
> > We would just directly use the suggested algo.
> >
> > But we probably don't need that param at all and can use
> > the loop instead?
>
> I don't understand what param you are saying. I expected
> the zram->comps array already has sorted algoritm based on the
> priority so the loop will try compression as expected so loop is fine.
> Are we on same page?

I'll try to implement both loop and a specific algorithm selection.

2022-11-07 19:52:55

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On Sat, Nov 05, 2022 at 09:00:33AM +0900, Sergey Senozhatsky wrote:
> On (22/11/05 08:41), Sergey Senozhatsky wrote:
> > One can be SW one can be HW. So I thought about having flexibility here.
> > Instead of doing
> >
> > for (idx = 1; idx < MAX_IDX; idx++) {
> > len = zcomp_compress(zram->comps[idx]);
> > if (len <= threshold)
> > break;
> > }
> >
> > We would just directly use the suggested algo.
> >
> > But we probably don't need that param at all and can use
> > the loop instead?
>
> My idea was that recompress does not loop through the algos (what
> on the fly recompression can do for instance), but instead
> recompress only does one thing.
>
> Because we can have, for instance, something like this
>
> algo=zstd priority=1
> algo=deflate priority=2
>
> And we recompress with algo=zstd only first. Without going
> into the slowest, most CPU and power consuming one. If the
> memory pressure keeps increasing then we might do algo=deflate
> recompress, as the last resort before we do writeback. But we
> may skip algo=deflate altogether and go straight to writeback,
> for instance because we have less than 30% of battery left.
>
> So the reason I suggested algo= in recompress was, basically,
> for that for having exact control of what recompression does.


I am thinking like this:

* Without recomp_algo setup, user can do whatever they want on the fly


echo "type=idle threshold=3000 algo=zstd" > recompress

Later they could do

echo "type=idle threshold=3000 algo=deflate" > recompress

or

writeback to backing device

* With recomp_algo setup like this,

echo "algo=zstd priority=1" > recomp_algo
echo "algo=deflate priority=2" > recomp_algo
..
..
echo "type=idle threshold=3000" > recompress

If zstd fails, it will continue to work with deflate transparently.

IOW, "algo=" in *recompress* interface will override the global policy
in the *recomp_algo* setup. Otherwise, the recomp_algo's set up will
work.

2022-11-08 01:06:38

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCHv4 2/9] zram: Add recompression algorithm sysfs knob

On (22/11/07 11:08), Minchan Kim wrote:
[..]
> I am thinking like this:
>
> * Without recomp_algo setup, user can do whatever they want on the fly
>
>
> echo "type=idle threshold=3000 algo=zstd" > recompress
>
> Later they could do
>
> echo "type=idle threshold=3000 algo=deflate" > recompress

By "without recomp_algo setup" you mean that user doesn't configure
anything before `echo XG > zramX/disksize`? Currently algorithm and
recomp algorithm need to be selected at the same time - before zram
device is initialised, because we use the same code and same approaches:
we need to have zcomp back-ends in per-CPU data in zram read/write/recompress.

Creating per-CPU zcomps on the fly is probably going to be a little bit
intrusive to zram.



What I currently have is as follows.

A copy paste from my test script:

- init device
echo "algo=lz4 priority=1" > /sys/block/zram0/recomp_algorithm
echo "algo=zstd priority=2" > /sys/block/zram0/recomp_algorithm
echo "algo=deflate priority=3" > /sys/block/zram0/recomp_algorithm
echo 5G > /sys/block/zram0/disksize


Various recompression use cases:

- recompress huge pages using all secondary algos in order of priority
echo "type=huge" > /sys/block/zram0/recompress

- recompress huge pages using zstd only
echo "type=huge algo=zstd" > /sys/block/zram0/recompress

- recompress all pages using lz4
echo "algo=lz4" > /sys/block/zram0/recompress

- recompress idle pages, use all algos in priority order, with threshold
echo "type=idle threshold=3000" > /sys/block/zram0/recompress

- recompress idle pages, using zstd only, with threshold
echo "algo=zstd type=idle threshold=2000" > /sys/block/zram0/recompress