2018-11-26 08:29:30

by Minchan Kim

Subject: [PATCH v2 0/7] zram idle page writeback

A swap device inherently accumulates many idle pages that are rarely
touched after they are allocated. That is never a problem when a real
storage device backs the swap, but for zram-swap it is simply wasted memory.

This patchset supports zram idle page writeback feature.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write those pages back
* Admin can define when zram should stop writeback to prevent wearout

Details are in each patch's description.

The first two patches below are -stable material, so they could go in
first, separately from the rest of the series.

zram: fix lockdep warning of free block handling
zram: fix double free backing device

* from v1
- add fix for double free of backing device - minchan
- change writeback/idle interface - minchan
- remove direct incompressible page writeback - sergey

Minchan Kim (7):
zram: fix lockdep warning of free block handling
zram: fix double free backing device
zram: refactoring flags and writeback stuff
zram: introduce ZRAM_IDLE flag
zram: support idle/huge page writeback
zram: add bd_stat statistics
zram: writeback throttle

Documentation/ABI/testing/sysfs-block-zram | 32 ++
Documentation/blockdev/zram.txt | 51 +-
drivers/block/zram/Kconfig | 5 +-
drivers/block/zram/zram_drv.c | 516 +++++++++++++++------
drivers/block/zram/zram_drv.h | 18 +-
5 files changed, 463 insertions(+), 159 deletions(-)

--
2.20.0.rc0.387.gc7a69e6b6c-goog



2018-11-26 08:29:33

by Minchan Kim

Subject: [PATCH v2 1/7] zram: fix lockdep warning of free block handling

[ 254.519728] ================================
[ 254.520311] WARNING: inconsistent lock state
[ 254.520898] 4.19.0+ #390 Not tainted
[ 254.521387] --------------------------------
[ 254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 254.521732] 00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
[ 254.521732] {SOFTIRQ-ON-W} state was registered at:
[ 254.521732] _raw_spin_lock+0x2c/0x40
[ 254.521732] zram_make_request+0x755/0xdc9
[ 254.521732] generic_make_request+0x373/0x6a0
[ 254.521732] submit_bio+0x6c/0x140
[ 254.521732] __swap_writepage+0x3a8/0x480
[ 254.521732] shrink_page_list+0x1102/0x1a60
[ 254.521732] shrink_inactive_list+0x21b/0x3f0
[ 254.521732] shrink_node_memcg.constprop.99+0x4f8/0x7e0
[ 254.521732] shrink_node+0x7d/0x2f0
[ 254.521732] do_try_to_free_pages+0xe0/0x300
[ 254.521732] try_to_free_pages+0x116/0x2b0
[ 254.521732] __alloc_pages_slowpath+0x3f4/0xf80
[ 254.521732] __alloc_pages_nodemask+0x2a2/0x2f0
[ 254.521732] __handle_mm_fault+0x42e/0xb50
[ 254.521732] handle_mm_fault+0x55/0xb0
[ 254.521732] __do_page_fault+0x235/0x4b0
[ 254.521732] page_fault+0x1e/0x30
[ 254.521732] irq event stamp: 228412
[ 254.521732] hardirqs last enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
[ 254.521732] hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
[ 254.521732] softirqs last enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
[ 254.521732] softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
[ 254.521732]
[ 254.521732] other info that might help us debug this:
[ 254.521732] Possible unsafe locking scenario:
[ 254.521732]
[ 254.521732] CPU0
[ 254.521732] ----
[ 254.521732] lock(&(&zram->bitmap_lock)->rlock);
[ 254.521732] <Interrupt>
[ 254.521732] lock(&(&zram->bitmap_lock)->rlock);
[ 254.521732]
[ 254.521732] *** DEADLOCK ***
[ 254.521732]
[ 254.521732] no locks held by zram_verify/2095.
[ 254.521732]
[ 254.521732] stack backtrace:
[ 254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[ 254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 254.521732] Call Trace:
[ 254.521732] <IRQ>
[ 254.521732] dump_stack+0x67/0x9b
[ 254.521732] print_usage_bug+0x1bd/0x1d3
[ 254.521732] mark_lock+0x4aa/0x540
[ 254.521732] ? check_usage_backwards+0x160/0x160
[ 254.521732] __lock_acquire+0x51d/0x1300
[ 254.521732] ? free_debug_processing+0x24e/0x400
[ 254.521732] ? bio_endio+0x6d/0x1a0
[ 254.521732] ? lockdep_hardirqs_on+0x9b/0x180
[ 254.521732] ? lock_acquire+0x90/0x180
[ 254.521732] lock_acquire+0x90/0x180
[ 254.521732] ? put_entry_bdev+0x1e/0x50
[ 254.521732] _raw_spin_lock+0x2c/0x40
[ 254.521732] ? put_entry_bdev+0x1e/0x50
[ 254.521732] put_entry_bdev+0x1e/0x50
[ 254.521732] zram_free_page+0xf6/0x110
[ 254.521732] zram_slot_free_notify+0x42/0xa0
[ 254.521732] end_swap_bio_read+0x5b/0x170
[ 254.521732] blk_update_request+0x8f/0x340
[ 254.521732] scsi_end_request+0x2c/0x1e0
[ 254.521732] scsi_io_completion+0x98/0x650
[ 254.521732] blk_done_softirq+0x9e/0xd0
[ 254.521732] __do_softirq+0xcc/0x427
[ 254.521732] irq_exit+0xd1/0xe0
[ 254.521732] do_IRQ+0x93/0x120
[ 254.521732] common_interrupt+0xf/0xf
[ 254.521732] </IRQ>

With the writeback feature, zram_slot_free_notify can be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not IRQ-aware, so lockdep complains.

The problem affects not only bitmap_lock but also zram_slot_lock,
so the straightforward solution would be to disable IRQs while holding
zram_slot_lock, which would cover every bitmap_lock as well.
Although the IRQ-disabled window would be short in most places where
zram_slot_lock is held, one place (ie, decompression) is not fast
enough to run under an irqlock, depending on the compression
algorithm, so that is not an option.

The approach in this patch is "best effort": it does not guarantee
that an orphan zpage is freed. If zram_slot_lock contention occurs,
the kernel cannot free the zpage until it recycles the block. However,
such contention between zram_slot_free_notify and the other
zram_slot_lock holders should be very rare in practice.
To see how often it happens, this patch adds a new debug stat,
"miss_free".

It also adds an IRQ-safe lock in get/put_entry_bdev to prevent the
deadlock lockdep reported. The reason I disable IRQs rather than
bottom halves is that swap_slot_free_notify can be called with IRQs
disabled, which would break local_bh_enable's rules. The irqlock is
taken only for written-back zram slot entries, so it should not be a
frequently contended lock.
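For readers outside the kernel tree, the trylock idea above can be sketched in plain userspace C (a pthread mutex standing in for the bit spinlock, and all the names here are hypothetical, not the zram ones): the free-notify path must never block, so on contention it just bumps a miss counter and leaves the slot to be reclaimed when the block is recycled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct slot {
	pthread_mutex_t lock;
	unsigned long handle;	/* 0 == already free */
};

static atomic_long miss_free;

/* Returns true if the slot was freed, false if we lost the race. */
static bool slot_free_notify(struct slot *s)
{
	if (pthread_mutex_trylock(&s->lock) != 0) {
		/* best effort only: count the miss instead of blocking */
		atomic_fetch_add(&miss_free, 1);
		return false;
	}
	s->handle = 0;
	pthread_mutex_unlock(&s->lock);
	return true;
}
```

The point is the shape of the fast path, not the locking primitive: the notify side never sleeps or spins, mirroring why a trylock is safe in softirq context.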

Cc: [email protected] # 4.14+
Signed-off-by: Minchan Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 56 +++++++++++++++++++++++++----------
drivers/block/zram/zram_drv.h | 1 +
2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..472027eaed60 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -53,6 +53,11 @@ static size_t huge_class_size;

static void zram_free_page(struct zram *zram, size_t index);

+static int zram_slot_trylock(struct zram *zram, u32 index)
+{
+ return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+}
+
static void zram_slot_lock(struct zram *zram, u32 index)
{
bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
@@ -443,29 +448,45 @@ static ssize_t backing_dev_store(struct device *dev,

static unsigned long get_entry_bdev(struct zram *zram)
{
- unsigned long entry;
+ unsigned long blk_idx;
+ unsigned long ret = 0;

- spin_lock(&zram->bitmap_lock);
/* skip 0 bit to confuse zram.handle = 0 */
- entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
- if (entry == zram->nr_pages) {
- spin_unlock(&zram->bitmap_lock);
- return 0;
+ blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+ if (blk_idx == zram->nr_pages)
+ goto retry;
+
+ spin_lock_irq(&zram->bitmap_lock);
+ if (test_bit(blk_idx, zram->bitmap)) {
+ spin_unlock_irq(&zram->bitmap_lock);
+ goto retry;
}

- set_bit(entry, zram->bitmap);
- spin_unlock(&zram->bitmap_lock);
+ set_bit(blk_idx, zram->bitmap);
+ ret = blk_idx;
+ goto out;
+retry:
+ spin_lock_irq(&zram->bitmap_lock);
+ blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+ if (blk_idx == zram->nr_pages)
+ goto out;
+
+ set_bit(blk_idx, zram->bitmap);
+ ret = blk_idx;
+out:
+ spin_unlock_irq(&zram->bitmap_lock);

- return entry;
+ return ret;
}

static void put_entry_bdev(struct zram *zram, unsigned long entry)
{
int was_set;
+ unsigned long flags;

- spin_lock(&zram->bitmap_lock);
+ spin_lock_irqsave(&zram->bitmap_lock, flags);
was_set = test_and_clear_bit(entry, zram->bitmap);
- spin_unlock(&zram->bitmap_lock);
+ spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
}

@@ -886,9 +907,10 @@ static ssize_t debug_stat_show(struct device *dev,

down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu\n",
+ "version: %d\n%8llu %8llu\n",
version,
- (u64)atomic64_read(&zram->stats.writestall));
+ (u64)atomic64_read(&zram->stats.writestall),
+ (u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);

return ret;
@@ -1400,10 +1422,14 @@ static void zram_slot_free_notify(struct block_device *bdev,

zram = bdev->bd_disk->private_data;

- zram_slot_lock(zram, index);
+ atomic64_inc(&zram->stats.notify_free);
+ if (!zram_slot_trylock(zram, index)) {
+ atomic64_inc(&zram->stats.miss_free);
+ return;
+ }
+
zram_free_page(zram, index);
zram_slot_unlock(zram, index);
- atomic64_inc(&zram->stats.notify_free);
}

static int zram_rw_page(struct block_device *bdev, sector_t sector,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 72c8584b6dff..89da501292ff 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -79,6 +79,7 @@ struct zram_stats {
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
atomic64_t writestall; /* no. of write slow paths */
+ atomic64_t miss_free; /* no. of missed free */
};

struct zram {
--
2.20.0.rc0.387.gc7a69e6b6c-goog
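The reworked get_entry_bdev() in the patch above first scans the bitmap without the lock, then takes bitmap_lock only to confirm the bit (or rescan if it lost a race). A minimal userspace sketch of that double-check pattern, with a pthread mutex standing in for the spinlock and all names hypothetical:

```c
#include <assert.h>
#include <pthread.h>

#define NR_PAGES 64UL	/* illustrative backing-device size, in pages */

static unsigned long bitmap;	/* bit i set => block i in use */
static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long find_zero_from(unsigned long map, unsigned long start)
{
	for (unsigned long i = start; i < NR_PAGES; i++)
		if (!(map & (1UL << i)))
			return i;
	return NR_PAGES;
}

/* Scan lockless first, then confirm (or rescan) under the lock.
 * Bit 0 is skipped so a return value of 0 can mean "no free block". */
static unsigned long alloc_block(void)
{
	unsigned long blk = find_zero_from(bitmap, 1);

	pthread_mutex_lock(&bitmap_lock);
	if (blk == NR_PAGES || (bitmap & (1UL << blk)))
		blk = find_zero_from(bitmap, 1);	/* lost a race: rescan */
	if (blk < NR_PAGES)
		bitmap |= 1UL << blk;
	else
		blk = 0;
	pthread_mutex_unlock(&bitmap_lock);
	return blk;
}
```

The optimistic pass keeps the common case off the lock; the locked double check is what makes the result safe against a concurrent allocator.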


2018-11-26 08:29:37

by Minchan Kim

Subject: [PATCH v2 2/7] zram: fix double free backing device

If blkdev_get fails, we should not call blkdev_put on the device;
otherwise the kernel emits the warning below. This patch fixes it.

[ 31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
[ 31.075104] Modules linked in:
[ 31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[ 31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 31.079589] RIP: 0010:blkdev_put+0x105/0x120
[ 31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[ 31.085080] RSP: 0018:ffffb409005c7ed0 EFLAGS: 00010297
[ 31.086383] RAX: ffff9779fe5a8040 RBX: ffff9779fbc17300 RCX: 00000000b9fc37a4
[ 31.088105] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9640e740
[ 31.089850] RBP: ffff9779fbc17318 R08: ffffffff95499a89 R09: 0000000000000004
[ 31.091201] R10: ffffb409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0000000000000083
[ 31.092276] R13: ffff9779fe607b98 R14: 0000000000000000 R15: ffff9779fe607a38
[ 31.093355] FS: 00007fc118d9b840(0000) GS:ffff9779fc600000(0000) knlGS:0000000000000000
[ 31.094582] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 31.095541] CR2: 00007fc11894b8dc CR3: 00000000339f6001 CR4: 0000000000160ef0
[ 31.096781] Call Trace:
[ 31.097212] __x64_sys_swapoff+0x46d/0x490
[ 31.097914] do_syscall_64+0x5a/0x190
[ 31.098550] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 31.099402] RIP: 0033:0x7fc11843ec27
[ 31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[ 31.103149] RSP: 002b:00007ffdf69be648 EFLAGS: 00000206 ORIG_RAX: 00000000000000a8
[ 31.104425] RAX: ffffffffffffffda RBX: 00000000011d98c0 RCX: 00007fc11843ec27
[ 31.105627] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000011d98c0
[ 31.106847] RBP: 0000000000000001 R08: 00007ffdf69be690 R09: 0000000000000001
[ 31.108038] R10: 00000000000002b1 R11: 0000000000000206 R12: 0000000000000001
[ 31.109231] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 31.110433] irq event stamp: 4466
[ 31.111001] hardirqs last enabled at (4465): [<ffffffff953ebd43>] __free_pages_ok+0x1e3/0x490
[ 31.112437] hardirqs last disabled at (4466): [<ffffffff95201b7a>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 31.113973] softirqs last enabled at (3420): [<ffffffff95e00333>] __do_softirq+0x333/0x446
[ 31.115364] softirqs last disabled at (3407): [<ffffffff9527aee1>] irq_exit+0xd1/0xe0

Cc: [email protected] # 4.14+
Signed-off-by: Minchan Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 472027eaed60..514f5aaf6eff 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,

bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
- if (err < 0)
+ if (err < 0) {
+ bdev = NULL;
goto out;
+ }

nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
--
2.20.0.rc0.387.gc7a69e6b6c-goog
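A hypothetical userspace model of the bug this one-liner fixes: blkdev_get() consumes the reference that bdgrab() took when it fails, so the shared "out:" cleanup must not put the device again. NULLing the pointer on the failure path keeps the refcount balanced (names and refcount plumbing below are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bdev { int refs; };

static struct bdev *grab(struct bdev *b) { b->refs++; return b; }

static int dev_get(struct bdev *b, bool fail)
{
	if (fail) {
		b->refs--;	/* like blkdev_get(): drops the ref on error */
		return -1;
	}
	return 0;
}

static void put(struct bdev *b) { b->refs--; }

static int setup_backing(struct bdev *dev, bool fail_get)
{
	struct bdev *bdev = grab(dev);
	int err = dev_get(bdev, fail_get);

	if (err < 0) {
		bdev = NULL;	/* the fix: skip the put in cleanup */
		goto out;
	}
	/* ... rest of the setup elided ... */
out:
	if (bdev)
		put(bdev);
	return err;
}
```

Without the `bdev = NULL` line, the error path would decrement the count twice, which is exactly the WARNING in the log above.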


2018-11-26 08:29:38

by Minchan Kim

Subject: [PATCH v2 3/7] zram: refactoring flags and writeback stuff

This patch renames some variables and restructures some code for
better readability in the writeback and zram_free_page paths.

Signed-off-by: Minchan Kim <[email protected]>
---
drivers/block/zram/zram_drv.c | 105 +++++++++++++---------------------
drivers/block/zram/zram_drv.h | 8 +--
2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 514f5aaf6eff..fee7e67c750d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);

static int zram_slot_trylock(struct zram *zram, u32 index)
{
- return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+ return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
}

static void zram_slot_lock(struct zram *zram, u32 index)
{
- bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+ bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
}

static void zram_slot_unlock(struct zram *zram, u32 index)
{
- bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+ bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
}

static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
static inline bool zram_allocated(struct zram *zram, u32 index)
{

- return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+ return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
}

@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
- return zram->table[index].value & BIT(flag);
+ return zram->table[index].flags & BIT(flag);
}

static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
- zram->table[index].value |= BIT(flag);
+ zram->table[index].flags |= BIT(flag);
}

static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
- zram->table[index].value &= ~BIT(flag);
+ zram->table[index].flags &= ~BIT(flag);
}

static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, u32 index)

static size_t zram_get_obj_size(struct zram *zram, u32 index)
{
- return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+ return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
}

static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
{
- unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+ unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;

- zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+ zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
}

#if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
}

#ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
- return zram->backing_dev;
-}
-
static void reset_bdev(struct zram *zram)
{
struct block_device *bdev;

- if (!zram_wb_enabled(zram))
+ if (!zram->backing_dev)
return;

bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;

down_read(&zram->init_lock);
- if (!zram_wb_enabled(zram)) {
+ if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -448,7 +443,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
}

-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
{
unsigned long blk_idx;
unsigned long ret = 0;
@@ -481,13 +476,13 @@ static unsigned long get_entry_bdev(struct zram *zram)
return ret;
}

-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
{
int was_set;
unsigned long flags;

spin_lock_irqsave(&zram->bitmap_lock, flags);
- was_set = test_and_clear_bit(entry, zram->bitmap);
+ was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
}
@@ -601,7 +596,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
if (!bio)
return -ENOMEM;

- entry = get_entry_bdev(zram);
+ entry = alloc_block_bdev(zram);
if (!entry) {
bio_put(bio);
return -ENOSPC;
@@ -612,7 +607,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
bvec->bv_offset)) {
bio_put(bio);
- put_entry_bdev(zram, entry);
+ free_block_bdev(zram, entry);
return -EIO;
}

@@ -630,18 +625,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
return 0;
}

-static void zram_wb_clear(struct zram *zram, u32 index)
-{
- unsigned long entry;
-
- zram_clear_flag(zram, index, ZRAM_WB);
- entry = zram_get_element(zram, index);
- zram_set_element(zram, index, 0);
- put_entry_bdev(zram, entry);
-}
-
#else
-static bool zram_wb_enabled(struct zram *zram) { return false; }
static inline void reset_bdev(struct zram *zram) {};
static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
u32 index, struct bio *parent,
@@ -656,7 +640,8 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
{
return -EIO;
}
-static void zram_wb_clear(struct zram *zram, u32 index) {}
+
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx) {};
#endif

#ifdef CONFIG_ZRAM_MEMORY_TRACKING
@@ -678,11 +663,6 @@ static void zram_accessed(struct zram *zram, u32 index)
zram->table[index].ac_time = ktime_get_boottime();
}

-static void zram_reset_access(struct zram *zram, u32 index)
-{
- zram->table[index].ac_time = 0;
-}
-
static ssize_t read_block_state(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
@@ -763,7 +743,6 @@ static void zram_debugfs_unregister(struct zram *zram)
static void zram_debugfs_create(void) {};
static void zram_debugfs_destroy(void) {};
static void zram_accessed(struct zram *zram, u32 index) {};
-static void zram_reset_access(struct zram *zram, u32 index) {};
static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {};
#endif
@@ -964,17 +943,18 @@ static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;

- zram_reset_access(zram, index);
-
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+ zram->table[index].ac_time = 0;
+#endif
if (zram_test_flag(zram, index, ZRAM_HUGE)) {
zram_clear_flag(zram, index, ZRAM_HUGE);
atomic64_dec(&zram->stats.huge_pages);
}

- if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) {
- zram_wb_clear(zram, index);
- atomic64_dec(&zram->stats.pages_stored);
- return;
+ if (zram_test_flag(zram, index, ZRAM_WB)) {
+ zram_clear_flag(zram, index, ZRAM_WB);
+ free_block_bdev(zram, zram_get_element(zram, index));
+ goto out;
}

/*
@@ -983,10 +963,8 @@ static void zram_free_page(struct zram *zram, size_t index)
*/
if (zram_test_flag(zram, index, ZRAM_SAME)) {
zram_clear_flag(zram, index, ZRAM_SAME);
- zram_set_element(zram, index, 0);
atomic64_dec(&zram->stats.same_pages);
- atomic64_dec(&zram->stats.pages_stored);
- return;
+ goto out;
}

handle = zram_get_handle(zram, index);
@@ -997,10 +975,11 @@ static void zram_free_page(struct zram *zram, size_t index)

atomic64_sub(zram_get_obj_size(zram, index),
&zram->stats.compr_data_size);
+out:
atomic64_dec(&zram->stats.pages_stored);
-
zram_set_handle(zram, index, 0);
zram_set_obj_size(zram, index, 0);
+ WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
}

static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1011,24 +990,20 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
unsigned int size;
void *src, *dst;

- if (zram_wb_enabled(zram)) {
- zram_slot_lock(zram, index);
- if (zram_test_flag(zram, index, ZRAM_WB)) {
- struct bio_vec bvec;
-
- zram_slot_unlock(zram, index);
+ zram_slot_lock(zram, index);
+ if (zram_test_flag(zram, index, ZRAM_WB)) {
+ struct bio_vec bvec;

- bvec.bv_page = page;
- bvec.bv_len = PAGE_SIZE;
- bvec.bv_offset = 0;
- return read_from_bdev(zram, &bvec,
- zram_get_element(zram, index),
- bio, partial_io);
- }
zram_slot_unlock(zram, index);
+
+ bvec.bv_page = page;
+ bvec.bv_len = PAGE_SIZE;
+ bvec.bv_offset = 0;
+ return read_from_bdev(zram, &bvec,
+ zram_get_element(zram, index),
+ bio, partial_io);
}

- zram_slot_lock(zram, index);
handle = zram_get_handle(zram, index);
if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
unsigned long value;
@@ -1140,7 +1115,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,

if (unlikely(comp_len >= huge_class_size)) {
comp_len = PAGE_SIZE;
- if (zram_wb_enabled(zram) && allow_wb) {
+ if (zram->backing_dev && allow_wb) {
zcomp_stream_put(zram->comp);
ret = write_to_bdev(zram, bvec, index, bio, &element);
if (!ret) {
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 89da501292ff..d75bf190f262 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -30,7 +30,7 @@


/*
- * The lower ZRAM_FLAG_SHIFT bits of table.value is for
+ * The lower ZRAM_FLAG_SHIFT bits of table.flags is for
* object size (excluding header), the higher bits is for
* zram_pageflags.
*
@@ -41,7 +41,7 @@
*/
#define ZRAM_FLAG_SHIFT 24

-/* Flags for zram pages (table[page_no].value) */
+/* Flags for zram pages (table[page_no].flags) */
enum zram_pageflags {
/* zram slot is locked */
ZRAM_LOCK = ZRAM_FLAG_SHIFT,
@@ -60,7 +60,7 @@ struct zram_table_entry {
unsigned long handle;
unsigned long element;
};
- unsigned long value;
+ unsigned long flags;
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
ktime_t ac_time;
#endif
@@ -105,8 +105,8 @@ struct zram {
* zram is claimed so open request will be failed
*/
bool claim; /* Protected by bdev->bd_mutex */
-#ifdef CONFIG_ZRAM_WRITEBACK
struct file *backing_dev;
+#ifdef CONFIG_ZRAM_WRITEBACK
struct block_device *bdev;
unsigned int old_block_size;
unsigned long *bitmap;
--
2.20.0.rc0.387.gc7a69e6b6c-goog
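The table[].flags word this patch renames packs two things into one unsigned long: the compressed object size in the low ZRAM_FLAG_SHIFT bits, and the zram_pageflags bits above them. A self-contained userspace sketch of that encoding (the flag names are abbreviated stand-ins, not the kernel enum):

```c
#include <assert.h>

#define ZRAM_FLAG_SHIFT 24

enum pageflags {
	F_LOCK = ZRAM_FLAG_SHIFT,	/* slot is locked */
	F_SAME,				/* same-element filled page */
	F_WB,				/* page written back to backing dev */
};

/* Low ZRAM_FLAG_SHIFT bits hold the object size. */
static unsigned long get_obj_size(unsigned long flags)
{
	return flags & ((1UL << ZRAM_FLAG_SHIFT) - 1);
}

/* Updating the size must preserve the flag bits above it. */
static unsigned long set_obj_size(unsigned long flags, unsigned long size)
{
	unsigned long high = flags >> ZRAM_FLAG_SHIFT;

	return (high << ZRAM_FLAG_SHIFT) | size;
}

static int test_flag(unsigned long flags, enum pageflags f)
{
	return !!(flags & (1UL << f));
}

static unsigned long set_flag(unsigned long flags, enum pageflags f)
{
	return flags | (1UL << f);
}
```

This is why the rename from `value` to `flags` helps: the field really is a flags-plus-size word, and the WARN_ON_ONCE in zram_free_page checks that only ZRAM_LOCK survives a free.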


2018-11-26 08:29:46

by Minchan Kim

Subject: [PATCH v2 6/7] zram: add bd_stat statistics

bd_stat represents activity on the backing device. Currently, it
reports bd_count, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/ABI/testing/sysfs-block-zram | 8 ++++++
Documentation/blockdev/zram.txt | 11 ++++++++
drivers/block/zram/zram_drv.c | 30 ++++++++++++++++++++++
drivers/block/zram/zram_drv.h | 5 ++++
4 files changed, 54 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..65fc33b2f53b 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact: Minchan Kim <[email protected]>
Description:
The writeback file is write-only and triggers idle and/or
huge page writeback to the backing device.
+
+What: /sys/block/zram<id>/bd_stat
+Date: November 2018
+Contact: Minchan Kim <[email protected]>
+Description:
+ The bd_stat file is read-only and represents backing device's
+ statistics (bd_count, bd_reads, bd_writes) in a format
+ similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 806cdaabac83..550bca77d322 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It consists of a single
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages

+File /sys/block/zram<id>/bd_stat
+
+The bd_stat file represents the device's backing device statistics. It consists
+of a single line of text and contains the following stats separated by whitespace:
+ bd_count the number of pages currently stored in the backing device
+ Unit: pages
+ bd_reads the number of reads from backing device
+ Unit: pages
+ bd_writes the number of writes to backing device
+ Unit: pages
+
9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3d069b2328f8..cceaa10301e8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -518,6 +518,8 @@ static unsigned long alloc_block_bdev(struct zram *zram)
ret = blk_idx;
out:
spin_unlock_irq(&zram->bitmap_lock);
+ if (ret != 0)
+ atomic64_inc(&zram->stats.bd_count);

return ret;
}
@@ -531,6 +533,7 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
+ atomic64_dec(&zram->stats.bd_count);
}

static void zram_page_end_io(struct bio *bio)
@@ -686,6 +689,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}

+ atomic64_inc(&zram->stats.bd_writes);
/*
* We released zram_slot_lock so need to check if the slot was
* changed. If there is freeing for the slot, we can catch it
@@ -775,6 +779,7 @@ static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
{
+ atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1031,6 +1036,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
}

+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct zram *zram = dev_to_zram(dev);
+ ssize_t ret;
+
+ down_read(&zram->init_lock);
+ ret = scnprintf(buf, PAGE_SIZE,
+ "%8llu %8llu %8llu\n",
+ (u64)atomic64_read(&zram->stats.bd_count),
+ (u64)atomic64_read(&zram->stats.bd_reads),
+ (u64)atomic64_read(&zram->stats.bd_writes));
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+#endif
+
static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1051,6 +1075,9 @@ static ssize_t debug_stat_show(struct device *dev,

static DEVICE_ATTR_RO(io_stat);
static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
static DEVICE_ATTR_RO(debug_stat);

static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1777,6 +1804,9 @@ static struct attribute *zram_disk_attrs[] = {
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+ &dev_attr_bd_stat.attr,
+#endif
&dev_attr_debug_stat.attr,
NULL,
};
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 07695fe70e17..7cbb134fc0c9 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -82,6 +82,11 @@ struct zram_stats {
atomic_long_t max_used_pages; /* no. of maximum pages stored */
atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
+#ifdef CONFIG_ZRAM_WRITEBACK
+ atomic64_t bd_count; /* no. of pages in backing device */
+ atomic64_t bd_reads; /* no. of reads from backing device */
+ atomic64_t bd_writes; /* no. of writes to backing device */
+#endif
};

struct zram {
--
2.20.0.rc0.387.gc7a69e6b6c-goog
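A minimal userspace sketch of the counter scheme above, with C11 atomics in place of atomic64_t. The counter names mirror the patch; the show function and its `%8llu`-per-field format follow the other zram stat files, but the harness around them is hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

struct bd_stats {
	atomic_ullong bd_count;		/* pages currently in backing dev */
	atomic_ullong bd_reads;		/* pages read back in */
	atomic_ullong bd_writes;	/* pages written out */
};

/* Emit the three counters in the whitespace-separated, page-unit
 * format the sysfs bd_stat file uses. */
static int bd_stat_show(struct bd_stats *s, char *buf, size_t len)
{
	return snprintf(buf, len, "%8llu %8llu %8llu\n",
			(unsigned long long)atomic_load(&s->bd_count),
			(unsigned long long)atomic_load(&s->bd_reads),
			(unsigned long long)atomic_load(&s->bd_writes));
}
```

Note that bd_count goes both up (alloc_block_bdev) and down (free_block_bdev), so it is a gauge of current occupancy, while bd_reads/bd_writes only ever grow.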


2018-11-26 08:30:14

by Minchan Kim

Subject: [PATCH v2 5/7] zram: support idle/huge page writeback

This patch adds a new feature, "zram idle/huge page writeback".
In the zram-swap use case, zram usually has many idle/huge swap pages.
It is pointless to keep them in memory (ie, zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

The normal sequence for using the idle/huge page writeback feature is
as follows:

    while (1) {
        # mark allocated zram slots as idle
        echo all > /sys/block/zram0/idle
        # leave the system working for several hours
        # Unless a block on zram has been accessed meanwhile,
        # it is still an IDLE-marked page.

        echo "idle" > /sys/block/zram0/writeback
        and/or
        echo "huge" > /sys/block/zram0/writeback
        # write the IDLE- and/or huge-marked slots to the backing
        # device and free the memory.
    }

Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,

this patch removes the direct incompressible page writeback feature
(d2afd25114f4, "zram: write incompressible pages to backing device"),
so it could be regarded as a regression because incompressible pages
no longer go to backing storage automatically. Instead, the user
should trigger it manually via "echo huge > /sys/block/zramX/writeback".

If we hear of a regression, we could restore the function.
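The idle cycle described above can be sketched in a few lines of userspace C (a toy slot table, with names that are illustrative rather than the kernel's): marking sets IDLE on every allocated slot, any access clears it, and a later writeback pass evicts only the slots still marked.

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_SLOTS = 4 };

struct slot {
	bool allocated;
	bool idle;
	bool written_back;
};

/* "echo all > idle": mark every allocated slot idle. */
static void mark_all_idle(struct slot *tab)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (tab[i].allocated)
			tab[i].idle = true;
}

/* Any read/write of a slot clears its IDLE mark. */
static void access_slot(struct slot *s)
{
	s->idle = false;
}

/* "echo idle > writeback": evict only the still-idle slots,
 * returning how many went out. */
static int writeback_idle(struct slot *tab)
{
	int n = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		if (tab[i].allocated && tab[i].idle) {
			tab[i].written_back = true;
			tab[i].idle = false;
			n++;
		}
	return n;
}
```

The gap between marking and writing back is what turns "no access since X time ago" into a concrete working-set filter.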

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/ABI/testing/sysfs-block-zram | 7 +
Documentation/blockdev/zram.txt | 28 ++-
drivers/block/zram/Kconfig | 5 +-
drivers/block/zram/zram_drv.c | 245 ++++++++++++++-------
drivers/block/zram/zram_drv.h | 1 +
5 files changed, 207 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact: Minchan Kim <[email protected]>
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram<id>/block_state
+
+What: /sys/block/zram<id>/writeback
+Date: November 2018
+Contact: Minchan Kim <[email protected]>
+Description:
+ The writeback file is write-only and triggers idle and/or
+ huge page writeback to the backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It consists of a single

= writeback

-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+ "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before disksize setting. Only a partition is supported at the moment.
+If the admin wants to use incompressible page writeback, they can do so via
+
+ "echo huge > /sys/block/zramX/writeback"
+
+To use idle page writeback, the user first needs to declare zram pages
+as idle.
+
+ "echo all > /sys/block/zramX/idle"
+
+From now on, all pages on zram are idle pages. The idle mark
+is removed as soon as someone requests access to the block.
+IOW, unless there is an access request, those pages remain idle.
+
+Admin can request writeback of those idle pages at the right time via
+
+ "echo idle > /sys/block/zramX/writeback"
+
+With this command, zram writes back idle pages from memory to the storage.

= memory tracking

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
See Documentation/blockdev/zram.txt for more information.

config ZRAM_WRITEBACK
- bool "Write back incompressible page to backing device"
+ bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
For this feature, admin should set up backing device via
/sys/block/zramX/backing_dev.

+ With /sys/block/zramX/{idle,writeback}, applications can request
+ writeback of idle pages to the backing device to save memory.
+
See Documentation/blockdev/zram.txt for more information.

config ZRAM_MEMORY_TRACKING
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 59f78011d2d9..3d069b2328f8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
static size_t huge_class_size;

static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+ u32 index, int offset, struct bio *bio);
+

static int zram_slot_trylock(struct zram *zram, u32 index)
{
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
return zram->disksize;
}

-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
- return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
- zram->table[index].handle;
-}
-
static inline struct zram *dev_to_zram(struct device *dev)
{
return (struct zram *)dev_to_disk(dev)->private_data;
@@ -138,6 +134,13 @@ static void zram_set_obj_size(struct zram *zram,
zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
}

+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+ return zram_get_obj_size(zram, index) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_WB);
+}
+
#if PAGE_SIZE != 4096
static inline bool is_partial_io(struct bio_vec *bvec)
{
@@ -306,10 +309,14 @@ static ssize_t idle_store(struct device *dev,
}

for (index = 0; index < nr_pages; index++) {
+ /*
+ * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
+ * See the comment in writeback_store.
+ */
zram_slot_lock(zram, index);
- if (!zram_allocated(zram, index))
+ if (!zram_allocated(zram, index) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB))
goto next;
-
zram_set_flag(zram, index, ZRAM_IDLE);
next:
zram_slot_unlock(zram, index);
@@ -566,6 +573,156 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
return 1;
}

+#define HUGE_WRITEBACK 0x1
+#define IDLE_WRITEBACK 0x2
+
+static ssize_t writeback_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ unsigned long index;
+ struct bio bio;
+ struct bio_vec bio_vec;
+ struct page *page;
+ ssize_t ret, sz;
+ char mode_buf[64];
+ unsigned long mode = -1UL;
+ unsigned long blk_idx = 0;
+
+ strlcpy(mode_buf, buf, sizeof(mode_buf));
+ /* ignore trailing newline */
+ sz = strlen(mode_buf);
+ if (sz > 0 && mode_buf[sz - 1] == '\n')
+ mode_buf[sz - 1] = 0x00;
+
+ if (!strcmp(mode_buf, "idle"))
+ mode = IDLE_WRITEBACK;
+ if (!strcmp(mode_buf, "huge"))
+ mode = HUGE_WRITEBACK;
+
+ if (mode == -1UL)
+ return -EINVAL;
+
+ down_read(&zram->init_lock);
+ if (!init_done(zram)) {
+ ret = -EINVAL;
+ goto release_init_lock;
+ }
+
+ if (!zram->backing_dev) {
+ ret = -ENODEV;
+ goto release_init_lock;
+ }
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page) {
+ ret = -ENOMEM;
+ goto release_init_lock;
+ }
+
+ for (index = 0; index < nr_pages; index++) {
+ struct bio_vec bvec;
+
+ bvec.bv_page = page;
+ bvec.bv_len = PAGE_SIZE;
+ bvec.bv_offset = 0;
+
+ if (!blk_idx) {
+ blk_idx = alloc_block_bdev(zram);
+ if (!blk_idx) {
+ ret = -ENOSPC;
+ break;
+ }
+ }
+
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index))
+ goto next;
+
+ if (zram_test_flag(zram, index, ZRAM_WB) ||
+ zram_test_flag(zram, index, ZRAM_SAME) ||
+ zram_test_flag(zram, index, ZRAM_UNDER_WB))
+ goto next;
+
+ if ((mode & IDLE_WRITEBACK &&
+ !zram_test_flag(zram, index, ZRAM_IDLE)) &&
+ (mode & HUGE_WRITEBACK &&
+ !zram_test_flag(zram, index, ZRAM_HUGE)))
+ goto next;
+ /*
+ * Clearing ZRAM_UNDER_WB is duty of caller.
+ * IOW, zram_free_page never clear it.
+ */
+ zram_set_flag(zram, index, ZRAM_UNDER_WB);
+ /* Need for hugepage writeback racing */
+ zram_set_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
+ zram_slot_lock(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ continue;
+ }
+
+ bio_init(&bio, &bio_vec, 1);
+ bio_set_dev(&bio, zram->bdev);
+ bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
+ bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
+
+ bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
+ bvec.bv_offset);
+ /*
+ * XXX: A single page IO would be inefficient for write
+ * but it would be not bad as starter.
+ */
+ ret = submit_bio_wait(&bio);
+ if (ret) {
+ zram_slot_lock(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ zram_slot_unlock(zram, index);
+ continue;
+ }
+
+ /*
+ * We released zram_slot_lock so need to check if the slot was
+ * changed. If there is freeing for the slot, we can catch it
+ * easily by zram_allocated.
+ * A subtle case is the slot is freed/reallocated/marked as
+ * ZRAM_IDLE again. To close the race, idle_store doesn't
+ * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB.
+ * Thus, we could close the race by checking ZRAM_IDLE bit.
+ */
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index) ||
+ !zram_test_flag(zram, index, ZRAM_IDLE)) {
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+ goto next;
+ }
+
+ zram_free_page(zram, index);
+ zram_clear_flag(zram, index, ZRAM_UNDER_WB);
+ zram_set_flag(zram, index, ZRAM_WB);
+ zram_set_element(zram, index, blk_idx);
+ blk_idx = 0;
+ atomic64_inc(&zram->stats.pages_stored);
+next:
+ zram_slot_unlock(zram, index);
+ }
+
+ if (blk_idx)
+ free_block_bdev(zram, blk_idx);
+ ret = len;
+ __free_page(page);
+release_init_lock:
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+
struct zram_work {
struct work_struct work;
struct zram *zram;
@@ -623,57 +780,8 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
else
return read_from_bdev_async(zram, bvec, entry, parent);
}
-
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
- u32 index, struct bio *parent,
- unsigned long *pentry)
-{
- struct bio *bio;
- unsigned long entry;
-
- bio = bio_alloc(GFP_ATOMIC, 1);
- if (!bio)
- return -ENOMEM;
-
- entry = alloc_block_bdev(zram);
- if (!entry) {
- bio_put(bio);
- return -ENOSPC;
- }
-
- bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
- bio_set_dev(bio, zram->bdev);
- if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset)) {
- bio_put(bio);
- free_block_bdev(zram, entry);
- return -EIO;
- }
-
- if (!parent) {
- bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
- bio->bi_end_io = zram_page_end_io;
- } else {
- bio->bi_opf = parent->bi_opf;
- bio_chain(bio, parent);
- }
-
- submit_bio(bio);
- *pentry = entry;
-
- return 0;
-}
-
#else
static inline void reset_bdev(struct zram *zram) {};
-static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
- u32 index, struct bio *parent,
- unsigned long *pentry)
-
-{
- return -EIO;
-}
-
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
{
@@ -1026,7 +1134,8 @@ static void zram_free_page(struct zram *zram, size_t index)
atomic64_dec(&zram->stats.pages_stored);
zram_set_handle(zram, index, 0);
zram_set_obj_size(zram, index, 0);
- WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK));
+ WARN_ON_ONCE(zram->table[index].flags &
+ ~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
}

static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
@@ -1135,7 +1244,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
struct page *page = bvec->bv_page;
unsigned long element = 0;
enum zram_pageflags flags = 0;
- bool allow_wb = true;

mem = kmap_atomic(page);
if (page_same_filled(mem, &element)) {
@@ -1160,21 +1268,8 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
return ret;
}

- if (unlikely(comp_len >= huge_class_size)) {
+ if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
- if (zram->backing_dev && allow_wb) {
- zcomp_stream_put(zram->comp);
- ret = write_to_bdev(zram, bvec, index, bio, &element);
- if (!ret) {
- flags = ZRAM_WB;
- ret = 1;
- goto out;
- }
- allow_wb = false;
- goto compress_again;
- }
- }
-
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
@@ -1663,6 +1758,7 @@ static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
+static DEVICE_ATTR_WO(writeback);
#endif

static struct attribute *zram_disk_attrs[] = {
@@ -1677,6 +1773,7 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
+ &dev_attr_writeback.attr,
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 214fa4bb46b9..07695fe70e17 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -47,6 +47,7 @@ enum zram_pageflags {
ZRAM_LOCK = ZRAM_FLAG_SHIFT,
ZRAM_SAME, /* Page consists the same element */
ZRAM_WB, /* page is stored on backing_device */
+ ZRAM_UNDER_WB, /* page is under writeback */
ZRAM_HUGE, /* Incompressible page */
ZRAM_IDLE, /* not accessed page since last idle marking */

--
2.20.0.rc0.387.gc7a69e6b6c-goog


2018-11-26 08:30:32

by Minchan Kim

Subject: [PATCH v2 4/7] zram: introduce ZRAM_IDLE flag

To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.

300 75.033841 ...i
301 63.806904 s..i
302 63.806919 ..hi

Once there is IO for the slot, the mark disappears.

300 75.033841 ...
301 63.806904 s..i
302 63.806919 ..hi

Therefore, the 300th block is no longer an idle zpage. With this feature,
users can see how many idle pages zram has, which are a waste of memory.

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/ABI/testing/sysfs-block-zram | 8 ++++
Documentation/blockdev/zram.txt | 10 ++--
drivers/block/zram/zram_drv.c | 55 ++++++++++++++++++++--
drivers/block/zram/zram_drv.h | 1 +
4 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim <[email protected]>
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What: /sys/block/zram<id>/idle
+Date: November 2018
+Contact: Minchan Kim <[email protected]>
+Description:
+ idle file is write-only and marks zram slots as idle.
+ If system has mounted debugfs, user can see which slots
+ are idle via /sys/kernel/debug/zram/zram<id>/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
+idle WO mark allocated slot as idle


User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with *pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows,

- 300 75.033841 .wh
- 301 63.806904 s..
- 302 63.806919 ..h
+ 300 75.033841 .wh.
+ 301 63.806904 s...
+ 302 63.806919 ..hi

First column is zram's block index.
Second column is access time since the system was booted
Third column is state of the block.
(s: same page
w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)

First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fee7e67c750d..59f78011d2d9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,45 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
}

+static ssize_t idle_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ int index;
+ char mode_buf[64];
+ ssize_t sz;
+
+ strlcpy(mode_buf, buf, sizeof(mode_buf));
+ /* ignore trailing new line */
+ sz = strlen(mode_buf);
+ if (sz > 0 && mode_buf[sz - 1] == '\n')
+ mode_buf[sz - 1] = 0x00;
+
+ if (strcmp(mode_buf, "all"))
+ return -EINVAL;
+
+ down_read(&zram->init_lock);
+ if (!init_done(zram)) {
+ up_read(&zram->init_lock);
+ return -EINVAL;
+ }
+
+ for (index = 0; index < nr_pages; index++) {
+ zram_slot_lock(zram, index);
+ if (!zram_allocated(zram, index))
+ goto next;
+
+ zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+ zram_slot_unlock(zram, index);
+ }
+
+ up_read(&zram->init_lock);
+
+ return len;
+}
+
#ifdef CONFIG_ZRAM_WRITEBACK
static void reset_bdev(struct zram *zram)
{
@@ -660,6 +699,7 @@ static void zram_debugfs_destroy(void)

static void zram_accessed(struct zram *zram, u32 index)
{
+ zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
}

@@ -692,12 +732,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf,

ts = ktime_to_timespec64(zram->table[index].ac_time);
copied = snprintf(kbuf + written, count,
- "%12zd %12lld.%06lu %c%c%c\n",
+ "%12zd %12lld.%06lu %c%c%c%c\n",
index, (s64)ts.tv_sec,
ts.tv_nsec / NSEC_PER_USEC,
zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
- zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.');
+ zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
+ zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.');

if (count < copied) {
zram_slot_unlock(zram, index);
@@ -742,7 +783,10 @@ static void zram_debugfs_unregister(struct zram *zram)
#else
static void zram_debugfs_create(void) {};
static void zram_debugfs_destroy(void) {};
-static void zram_accessed(struct zram *zram, u32 index) {};
+static void zram_accessed(struct zram *zram, u32 index)
+{
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+};
static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {};
#endif
@@ -946,6 +990,9 @@ static void zram_free_page(struct zram *zram, size_t index)
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
zram->table[index].ac_time = 0;
#endif
+ if (zram_test_flag(zram, index, ZRAM_IDLE))
+ zram_clear_flag(zram, index, ZRAM_IDLE);
+
if (zram_test_flag(zram, index, ZRAM_HUGE)) {
zram_clear_flag(zram, index, ZRAM_HUGE);
atomic64_dec(&zram->stats.huge_pages);
@@ -1611,6 +1658,7 @@ static DEVICE_ATTR_RO(initstate);
static DEVICE_ATTR_WO(reset);
static DEVICE_ATTR_WO(mem_limit);
static DEVICE_ATTR_WO(mem_used_max);
+static DEVICE_ATTR_WO(idle);
static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
@@ -1624,6 +1672,7 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_compact.attr,
&dev_attr_mem_limit.attr,
&dev_attr_mem_used_max.attr,
+ &dev_attr_idle.attr,
&dev_attr_max_comp_streams.attr,
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index d75bf190f262..214fa4bb46b9 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -48,6 +48,7 @@ enum zram_pageflags {
ZRAM_SAME, /* Page consists the same element */
ZRAM_WB, /* page is stored on backing_device */
ZRAM_HUGE, /* Incompressible page */
+ ZRAM_IDLE, /* not accessed page since last idle marking */

__NR_ZRAM_PAGEFLAGS,
};
--
2.20.0.rc0.387.gc7a69e6b6c-goog


2018-11-26 08:30:55

by Minchan Kim

Subject: [PATCH v2 7/7] zram: writeback throttle

On small memory systems, there is a lot of write IO, so if we use a
flash device as swap, there would be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product life.

This patch creates a new knob "writeback_limit" on zram. With that,
if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limitation, zram stops further writeback until the admin resets
the limit.

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/ABI/testing/sysfs-block-zram | 9 +++++
Documentation/blockdev/zram.txt | 2 +
drivers/block/zram/zram_drv.c | 47 +++++++++++++++++++++-
drivers/block/zram/zram_drv.h | 2 +
4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 65fc33b2f53b..9d2339a485c8 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact: Minchan Kim <[email protected]>
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What: /sys/block/zram<id>/writeback_limit
+Date: November 2018
+Contact: Minchan Kim <[email protected]>
+Description:
+ The writeback_limit file is read-write and specifies the maximum
+ amount of writeback ZRAM can do. The limit can be changed
+ at runtime and "0" means the limit is disabled.
+ No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 550bca77d322..41748d52712d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset WO trigger device reset
mem_used_max WO reset the `mem_used_max' counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can use
to store the compressed data
+writeback_limit RW specifies the maximum amount of write IO zram can
+ write out to backing device
max_comp_streams RW the number of possible concurrent compress operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cceaa10301e8..07c0847b7c0f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -328,6 +328,40 @@ static ssize_t idle_store(struct device *dev,
}

#ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct zram *zram = dev_to_zram(dev);
+ u64 val;
+ ssize_t ret = -EINVAL;
+
+ if (kstrtoull(buf, 10, &val))
+ return ret;
+
+ down_read(&zram->init_lock);
+ atomic64_set(&zram->stats.bd_wb_limit, val);
+ if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+ zram->stop_writeback = false;
+ up_read(&zram->init_lock);
+ ret = len;
+
+ return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u64 val;
+ struct zram *zram = dev_to_zram(dev);
+
+ down_read(&zram->init_lock);
+ val = atomic64_read(&zram->stats.bd_wb_limit);
+ up_read(&zram->init_lock);
+
+ return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
static void reset_bdev(struct zram *zram)
{
struct block_device *bdev;
@@ -592,6 +626,7 @@ static ssize_t writeback_store(struct device *dev,
char mode_buf[64];
unsigned long mode = -1UL;
unsigned long blk_idx = 0;
+ u64 wb_count, wb_limit;

strlcpy(mode_buf, buf, sizeof(mode_buf));
/* ignore trailing newline */
@@ -631,6 +666,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;

+ if (zram->stop_writeback) {
+ ret = -EIO;
+ break;
+ }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -689,7 +729,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}

- atomic64_inc(&zram->stats.bd_writes);
+ wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
* We released zram_slot_lock so need to check if the slot was
* changed. If there is freeing for the slot, we can catch it
@@ -713,6 +753,9 @@ static ssize_t writeback_store(struct device *dev,
zram_set_element(zram, index, blk_idx);
blk_idx = 0;
atomic64_inc(&zram->stats.pages_stored);
+ wb_limit = atomic64_read(&zram->stats.bd_wb_limit);
+ if (wb_limit != 0 && wb_count >= wb_limit)
+ zram->stop_writeback = true;
next:
zram_slot_unlock(zram, index);
}
@@ -1786,6 +1829,7 @@ static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
static DEVICE_ATTR_WO(writeback);
+static DEVICE_ATTR_RW(writeback_limit);
#endif

static struct attribute *zram_disk_attrs[] = {
@@ -1801,6 +1845,7 @@ static struct attribute *zram_disk_attrs[] = {
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
&dev_attr_writeback.attr,
+ &dev_attr_writeback_limit.attr,
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 7cbb134fc0c9..024044e82c9b 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -86,6 +86,7 @@ struct zram_stats {
atomic64_t bd_count; /* no. of pages in backing device */
atomic64_t bd_reads; /* no. of reads from backing device */
atomic64_t bd_writes; /* no. of writes from backing device */
+ atomic64_t bd_wb_limit; /* writeback limit of backing device */
#endif
};

@@ -113,6 +114,7 @@ struct zram {
*/
bool claim; /* Protected by bdev->bd_mutex */
struct file *backing_dev;
+ bool stop_writeback;
#ifdef CONFIG_ZRAM_WRITEBACK
struct block_device *bdev;
unsigned int old_block_size;
--
2.20.0.rc0.387.gc7a69e6b6c-goog


2018-11-26 09:49:48

by Joey Pabalinas

Subject: Re: [PATCH v2 5/7] zram: support idle/huge page writeback

On Mon, Nov 26, 2018 at 05:28:11PM +0900, Minchan Kim wrote:
> + strlcpy(mode_buf, buf, sizeof(mode_buf));
> + /* ignore trailing newline */
> + sz = strlen(mode_buf);

One possible idea would be to use strscpy() instead and directly assign
the return value to sz, avoiding an extra strlen() call (though you would
have to check if `sz == -E2BIG` and do `sz = sizeof(mode_buf) - 1` in that
case).

> + if (!strcmp(mode_buf, "idle"))
> + mode = IDLE_WRITEBACK;
> + if (!strcmp(mode_buf, "huge"))
> + mode = HUGE_WRITEBACK;

Maybe using `else if (!strcmp(mode_buf, "huge"))` would be slightly
better here, avoiding a second strcmp() if mode_buf has already
matched "idle".

> + if ((mode & IDLE_WRITEBACK &&
> + !zram_test_flag(zram, index, ZRAM_IDLE)) &&
> + (mode & HUGE_WRITEBACK &&
> + !zram_test_flag(zram, index, ZRAM_HUGE)))
> + goto next;

Wouldn't writing this as `mode & (IDLE_WRITEBACK | HUGE_WRITEBACK)`
be a bit easier to read as well as slightly more compact?

> + ret = len;
> + __free_page(page);
> +release_init_lock:
> + up_read(&zram->init_lock);
> + return ret;

Hm, I noticed that this function either returns an error or just the passed
in len on success, and I'm left wondering if there might be other useful
information which could be passed back to the caller instead. I can't
immediately think of any such information, though, so it's possible I'm
just daydreaming :)

--
Cheers,
Joey Pabalinas



2018-11-26 13:46:39

by Joey Pabalinas

Subject: Re: [PATCH v2 5/7] zram: support idle/huge page writeback

On Sun, Nov 25, 2018 at 11:47:37PM -1000, Joey Pabalinas wrote:
> > + if ((mode & IDLE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_IDLE)) &&
> > + (mode & HUGE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_HUGE)))
> > + goto next;
>
> Wouldn't writing this as `mode & (IDLE_WRITEBACK | HUGE_WRITEBACK)`
> be a bit easier to read as well as slightly more compact?

Scratch this comment, it just dawned on me that this is not an
equivalent expression, heh.

--
Cheers,
Joey Pabalinas



2018-11-26 20:51:53

by Andrew Morton

Subject: Re: [PATCH v2 1/7] zram: fix lockdep warning of free block handling

On Mon, 26 Nov 2018 17:28:07 +0900 Minchan Kim <[email protected]> wrote:

>
> ...
>
> With the writeback feature, zram_slot_free_notify could be called
> in softirq context by end_swap_bio_read. However, bitmap_lock
> is not aware of that, so lockdep yells out. Thanks.
>
> The problem is not only bitmap_lock but also zram_slot_lock,
> so a straightforward solution would be to disable irqs in
> zram_slot_lock, which covers every bitmap_lock, too.
> Although the duration of disabling the irq is short in many places
> where zram_slot_lock is used, one place (ie, decompress) is not fast
> enough to hold the irqlock, depending on the compression algorithm,
> so it's not an option.
>
> The approach in this patch is just "best effort", not a guarantee
> of "freeing orphan zpages". If zram_slot_lock contention happens,
> the kernel cannot free the zpage until it recycles the block. However,
> such contention between zram_slot_free_notify and other places that
> hold zram_slot_lock should be very rare in real practice.
> To see how often it happens, this patch adds a new debug stat
> "miss_free".
>
> It also adds an irq lock in get/put_block_bdev to prevent the deadlock
> lockdep reported. The reason I used irq disable rather than bottom
> half is that swap_slot_free_notify could be called with irq disabled,
> so it breaks local_bh_enable's rule. The irqlock covers only
> written-back zram slot entries, so it should not be a frequent lock.
>
> Cc: [email protected] # 4.14+
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 56 +++++++++++++++++++++++++----------
> drivers/block/zram/zram_drv.h | 1 +
> 2 files changed, 42 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 4879595200e1..472027eaed60 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -53,6 +53,11 @@ static size_t huge_class_size;
>
> static void zram_free_page(struct zram *zram, size_t index);
>
> +static int zram_slot_trylock(struct zram *zram, u32 index)
> +{
> + return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
> +}
> +
> static void zram_slot_lock(struct zram *zram, u32 index)
> {
> bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
> @@ -443,29 +448,45 @@ static ssize_t backing_dev_store(struct device *dev,
>
> static unsigned long get_entry_bdev(struct zram *zram)
> {
> - unsigned long entry;
> + unsigned long blk_idx;
> + unsigned long ret = 0;
>
> - spin_lock(&zram->bitmap_lock);
> /* skip 0 bit to confuse zram.handle = 0 */
> - entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> - if (entry == zram->nr_pages) {
> - spin_unlock(&zram->bitmap_lock);
> - return 0;
> + blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> + if (blk_idx == zram->nr_pages)
> + goto retry;
> +
> + spin_lock_irq(&zram->bitmap_lock);
> + if (test_bit(blk_idx, zram->bitmap)) {
> + spin_unlock_irq(&zram->bitmap_lock);
> + goto retry;
> }
>
> - set_bit(entry, zram->bitmap);
> - spin_unlock(&zram->bitmap_lock);
> + set_bit(blk_idx, zram->bitmap);

Here we could do

if (test_and_set_bit(...)) {
spin_unlock(...);
goto retry;

But it's weird to take the spinlock on behalf of bitops which are
already atomic!

It seems rather suspicious to me. Why are we doing this?

> + ret = blk_idx;
> + goto out;
> +retry:
> + spin_lock_irq(&zram->bitmap_lock);
> + blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> + if (blk_idx == zram->nr_pages)
> + goto out;
> +
> + set_bit(blk_idx, zram->bitmap);
> + ret = blk_idx;
> +out:
> + spin_unlock_irq(&zram->bitmap_lock);
>
> - return entry;
> + return ret;
> }
>
> static void put_entry_bdev(struct zram *zram, unsigned long entry)
> {
> int was_set;
> + unsigned long flags;
>
> - spin_lock(&zram->bitmap_lock);
> + spin_lock_irqsave(&zram->bitmap_lock, flags);
> was_set = test_and_clear_bit(entry, zram->bitmap);
> - spin_unlock(&zram->bitmap_lock);
> + spin_unlock_irqrestore(&zram->bitmap_lock, flags);

Here's another one. Surely that locking is unnecessary.

> WARN_ON_ONCE(!was_set);
> }
>


2018-11-26 20:56:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] zram: writeback throttle

On Mon, 26 Nov 2018 17:28:13 +0900 Minchan Kim <[email protected]> wrote:

> On small memory system, there are lots of write IO so if we use
> flash device as swap, there would be serious flash wearout.
> To overcome the problem, system developers need to design write
> limitation strategy to guarantee flash health for entire product life.
>
> This patch creates a new konb "writeback_limit" on zram. With that,
> if current writeback IO count(/sys/block/zramX/io_stat) excceds
> the limitation, zram stops further writeback until admin can reset
> the limit.
>
> +++ b/Documentation/ABI/testing/sysfs-block-zram
> @@ -121,3 +121,12 @@ Contact: Minchan Kim <[email protected]>
> The bd_stat file is read-only and represents backing device's
> statistics (bd_count, bd_reads, bd_writes) in a format
> similar to block layer statistics file format.
> +
> +What: /sys/block/zram<id>/writeback_limit
> +Date: November 2018
> +Contact: Minchan Kim <[email protected]>
> +Description:
> + The writeback_limit file is read-write and specifies the maximum
> + amount of writeback ZRAM can do. The limit could be changed
> + in run time and "0" means disable the limit.
> + No limit is the initial state.
> diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> index 550bca77d322..41748d52712d 100644
> --- a/Documentation/blockdev/zram.txt
> +++ b/Documentation/blockdev/zram.txt
> @@ -164,6 +164,8 @@ reset WO trigger device reset
> mem_used_max WO reset the `mem_used_max' counter (see later)
> mem_limit WO specifies the maximum amount of memory ZRAM can use
> to store the compressed data
> +writeback_limit WO specifies the maximum amount of write IO zram can
> + write out to backing device

Neither the changelog nor the Documentation specify the units of
writeback_limit. Bytes? Pages? Blocks?

This gets so confusing that in many /proc/sys/vm files we actually put
the units into the filenames.


2018-11-26 20:59:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] zram: add bd_stat statistics

On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim <[email protected]> wrote:

> +File /sys/block/zram<id>/bd_stat
> +
> +The bd_stat file represents the backing device's statistics. It consists of
> +a single line of text and contains the following stats separated by whitespace:
> + bd_count size of data written in backing device.
> + Unit: pages
> + bd_reads the number of reads from backing device
> + Unit: pages
> + bd_writes the number of writes to backing device
> + Unit: pages

Using `pages' is a bad choice. And I assume this means that
writeback_limit is in pages as well, which is worse.

Page sizes are not constant! We want userspace which was developed on
4k pagesize to work the same on 64k pagesize.

Arguably, we could require that well-written userspace remember to use
getpagesize(). However we have traditionally tried to avoid that by
performing the pagesize normalization within the kernel.



2018-11-27 02:10:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] zram: add bd_stat statistics

On Mon, Nov 26, 2018 at 12:58:33PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim <[email protected]> wrote:
>
> > +File /sys/block/zram<id>/bd_stat
> > +
> > +The stat file represents device's backing device statistics. It consists of
> > +a single line of text and contains the following stats separated by whitespace:
> > + bd_count size of data written in backing device.
> > + Unit: pages
> > + bd_reads the number of reads from backing device
> > + Unit: pages
> > + bd_writes the number of writes to backing device
> > + Unit: pages
>
> Using `pages' is a bad choice. And I assume this means that
> writeback_limit is in pages as well, which is worse.
>
> Page sizes are not constant! We want userspace which was developed on
> 4k pagesize to work the same on 64k pagesize.
>
> Arguably, we could require that well-written userspace remember to use
> getpagesize(). However we have traditionally tried to avoid that by
> performing the pagesize normalization within the kernel.

zram works in units of pages, so I used that term, but I agree it's
rather vague. If there is no objection, I will use (Unit: 4K) instead
of (Unit: pages).


2018-11-27 02:14:24

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] zram: support idle/huge page writeback

On Sun, Nov 25, 2018 at 11:47:37PM -1000, Joey Pabalinas wrote:
> On Mon, Nov 26, 2018 at 05:28:11PM +0900, Minchan Kim wrote:
> > + strlcpy(mode_buf, buf, sizeof(mode_buf));
> > + /* ignore trailing newline */
> > + sz = strlen(mode_buf);
>
> One possible idea would be to use strscpy() instead and directly assign
> the return value to sz, avoiding an extra strlen() call (though you would
> have to check if `sz == -E2BIG` and do `sz = sizeof(mode_buf) - 1` in that
> case).

Thanks for the suggestion.
If I make the destination buffer smaller, couldn't I avoid -E2BIG?

>
> > + if (!strcmp(mode_buf, "idle"))
> > + mode = IDLE_WRITEBACK;
> > + if (!strcmp(mode_buf, "huge"))
> > + mode = HUGE_WRITEBACK;
>
> Maybe using `else if (!strcmp(mode_buf, "huge"))` would be slightly
> better here, avoiding a second strcmp() if mode_buf has already
> matched "idle".

I considered supporting "huge|idle" as an option. In that case, the
assignment would need to be "mode |= ". For now, yes, let's use "else if",
since I don't feel strongly about supporting "idle|huge".

>
> > + if ((mode & IDLE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_IDLE)) &&
> > + (mode & HUGE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_HUGE)))
> > + goto next;
>
> Wouldn't writing this as `mode & (IDLE_WRITEBACK | HUGE_WRITEBACK)`
> be a bit easier to read as well as slightly more compact?
>
> > + ret = len;
> > + __free_page(page);
> > +release_init_lock:
> > + up_read(&zram->init_lock);
> > + return ret;
>
> Hm, I noticed that this function either returns an error or just the passed
> in len on success, and I'm left wondering if there might be other useful
> information which could be passed back to the caller instead. I can't
> immediately think of any such information, though, so it's possible I'm
> just daydreaming :)

This follows the write() syscall semantics of sysfs, so I'm not sure it's
possible to pass any other value back to userspace.

>
> --
> Cheers,
> Joey Pabalinas



2018-11-27 02:41:28

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] zram: writeback throttle

On Mon, Nov 26, 2018 at 12:54:46PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:13 +0900 Minchan Kim <[email protected]> wrote:
>
> > On small memory systems there is a lot of write IO, so if we use a
> > flash device as swap, there would be serious flash wearout.
> > To overcome the problem, system developers need to design a write
> > limitation strategy to guarantee flash health over the entire product life.
> >
> > This patch creates a new knob "writeback_limit" on zram. With that,
> > if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
> > the limit, zram stops further writeback until the admin resets
> > the limit.
> >
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -121,3 +121,12 @@ Contact: Minchan Kim <[email protected]>
> > The bd_stat file is read-only and represents backing device's
> > statistics (bd_count, bd_reads, bd_writes) in a format
> > similar to block layer statistics file format.
> > +
> > +What: /sys/block/zram<id>/writeback_limit
> > +Date: November 2018
> > +Contact: Minchan Kim <[email protected]>
> > +Description:
> > + The writeback_limit file is read-write and specifies the maximum
> > + amount of writeback ZRAM can do. The limit can be changed
> > + at runtime, and "0" disables the limit.
> > + No limit is the initial state.
> > diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> > index 550bca77d322..41748d52712d 100644
> > --- a/Documentation/blockdev/zram.txt
> > +++ b/Documentation/blockdev/zram.txt
> > @@ -164,6 +164,8 @@ reset WO trigger device reset
> > mem_used_max WO reset the `mem_used_max' counter (see later)
> > mem_limit WO specifies the maximum amount of memory ZRAM can use
> > to store the compressed data
> > +writeback_limit WO specifies the maximum amount of write IO zram can
> > + write out to backing device
>
> Neither the changelog nor the Documentation specify the units of
> writeback_limit. Bytes? Pages? Blocks?
>
> This gets so confusing that in many /proc/sys/vm files we actually put
> the units into the filenames.
>

I will use 4K as the unit.

Thanks.

2018-11-27 02:45:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 1/7] zram: fix lockdep warning of free block handling

On Mon, Nov 26, 2018 at 12:49:28PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:07 +0900 Minchan Kim <[email protected]> wrote:
>
> >
> > ...
> >
> > With the writeback feature, zram_slot_free_notify can be called
> > in softirq context by end_swap_bio_read. However, bitmap_lock
> > is not aware of that, so lockdep complains.
> >
> > The problem is not only bitmap_lock but also zram_slot_lock,
> > so the straightforward solution would be to disable irqs around
> > zram_slot_lock, which covers every bitmap_lock, too.
> > Although the irq-disabled duration is short in many places where
> > zram_slot_lock is used, one place (ie, decompress) is not fast
> > enough to hold an irqlock, since it depends on the compression
> > algorithm, so that is not an option.
> >
> > The approach in this patch is just "best effort", not a guarantee
> > of "freeing the orphan zpage". If zram_slot_lock contention happens,
> > the kernel cannot free the zpage until it recycles the block. However,
> > such contention between zram_slot_free_notify and other places that
> > hold zram_slot_lock should be very rare in practice.
> > To see how often it happens, this patch adds a new debug stat,
> > "miss_free".
> >
> > It also adds an irq lock in get/put_entry_bdev to prevent the deadlock
> > lockdep reported. The reason I used irq disabling rather than bottom
> > halves is that swap_slot_free_notify can be called with irqs disabled,
> > so a bottom half would break local_bh_enable's rules. The irqlock covers
> > only written-back zram slot entries, so it should not be a hot lock.
> >
> > Cc: [email protected] # 4.14+
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > drivers/block/zram/zram_drv.c | 56 +++++++++++++++++++++++++----------
> > drivers/block/zram/zram_drv.h | 1 +
> > 2 files changed, 42 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 4879595200e1..472027eaed60 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -53,6 +53,11 @@ static size_t huge_class_size;
> >
> > static void zram_free_page(struct zram *zram, size_t index);
> >
> > +static int zram_slot_trylock(struct zram *zram, u32 index)
> > +{
> > + return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
> > +}
> > +
> > static void zram_slot_lock(struct zram *zram, u32 index)
> > {
> > bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
> > @@ -443,29 +448,45 @@ static ssize_t backing_dev_store(struct device *dev,
> >
> > static unsigned long get_entry_bdev(struct zram *zram)
> > {
> > - unsigned long entry;
> > + unsigned long blk_idx;
> > + unsigned long ret = 0;
> >
> > - spin_lock(&zram->bitmap_lock);
> > /* skip 0 bit to confuse zram.handle = 0 */
> > - entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > - if (entry == zram->nr_pages) {
> > - spin_unlock(&zram->bitmap_lock);
> > - return 0;
> > + blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > + if (blk_idx == zram->nr_pages)
> > + goto retry;
> > +
> > + spin_lock_irq(&zram->bitmap_lock);
> > + if (test_bit(blk_idx, zram->bitmap)) {
> > + spin_unlock_irq(&zram->bitmap_lock);
> > + goto retry;
> > }
> >
> > - set_bit(entry, zram->bitmap);
> > - spin_unlock(&zram->bitmap_lock);
> > + set_bit(blk_idx, zram->bitmap);
>
> Here we could do
>
> if (test_and_set_bit(...)) {
> spin_unlock(...);
> goto retry;
>
> But it's weird to take the spinlock on behalf of bitops which are
> already atomic!
>
> It seems rather suspicious to me. Why are we doing this?

What I need is a look-up-and-set operation, and I don't see an atomic
operation for that. But I want to minimize the irq-disabled area, so
the code first scans the bitmap locklessly, and if a race happens it
retries under the lock.

It seems __set_bit is enough under the lock.

>
> > + ret = blk_idx;
> > + goto out;
> > +retry:
> > + spin_lock_irq(&zram->bitmap_lock);
> > + blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > + if (blk_idx == zram->nr_pages)
> > + goto out;
> > +
> > + set_bit(blk_idx, zram->bitmap);
> > + ret = blk_idx;
> > +out:
> > + spin_unlock_irq(&zram->bitmap_lock);
> >
> > - return entry;
> > + return ret;
> > }
> >
> > static void put_entry_bdev(struct zram *zram, unsigned long entry)
> > {
> > int was_set;
> > + unsigned long flags;
> >
> > - spin_lock(&zram->bitmap_lock);
> > + spin_lock_irqsave(&zram->bitmap_lock, flags);
> > was_set = test_and_clear_bit(entry, zram->bitmap);
> > - spin_unlock(&zram->bitmap_lock);
> > + spin_unlock_irqrestore(&zram->bitmap_lock, flags);
>
> Here's another one. Surely that locking is unnecessary.

Indeed! Although the get_entry_bdev side could miss some bits, that's not
a critical problem. The benefit is that we could drop the irq disabling
added for the lockdep problem. Yes, I will cook a patch and test it.

Thanks, Andrew.

2018-11-27 02:55:26

by Joey Pabalinas

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] zram: support idle/huge page writeback

On Tue, Nov 27, 2018 at 11:13:27AM +0900, Minchan Kim wrote:
> On Sun, Nov 25, 2018 at 11:47:37PM -1000, Joey Pabalinas wrote:
> > On Mon, Nov 26, 2018 at 05:28:11PM +0900, Minchan Kim wrote:
> > > + strlcpy(mode_buf, buf, sizeof(mode_buf));
> > > + /* ignore trailing newline */
> > > + sz = strlen(mode_buf);
> >
> > One possible idea would be to use strscpy() instead and directly assign
> > the return value to sz, avoiding an extra strlen() call (though you would
> > have to check if `sz == -E2BIG` and do `sz = sizeof(mode_buf) - 1` in that
> > case).
>
> Thanks for the suggestion.
> If I make the destination buffer smaller, couldn't I avoid -E2BIG?

-E2BIG return from strscpy() just means the string was longer than would
fit in the length passed as the third argument. This means only
`sizeof(mode_buf) - 1` bytes were copied into the dest, with the last
byte being the \0 terminator.

You can find the function source in lib/string.c.

> > > + ret = len;
> > > + __free_page(page);
> > > +release_init_lock:
> > > + up_read(&zram->init_lock);
> > > + return ret;
> >
> > Hm, I noticed that this function either returns an error or just the passed
> > in len on success, and I'm left wondering if there might be other useful
> > information which could be passed back to the caller instead. I can't
> > immediately think of any such information, though, so it's possible I'm
> > just daydreaming :)
>
> This follows the write() syscall semantics of sysfs, so I'm not sure it's
> possible to pass any other value back to userspace.

Well, with the write() system call you can have partial writes, in which
case it returns the (short) number of bytes written. But even if some
sort of partial write were possible here, this function would still only
ever return len or an error.

Neither of these are that important though, so Ack from me with or
without these two suggestions.

Reviewed-by: Joey Pabalinas <[email protected]>

--
Cheers,
Joey Pabalinas



2018-11-28 23:31:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] zram: add bd_stat statistics

On Tue, 27 Nov 2018 11:07:54 +0900 Minchan Kim <[email protected]> wrote:

> On Mon, Nov 26, 2018 at 12:58:33PM -0800, Andrew Morton wrote:
> > On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim <[email protected]> wrote:
> >
> > > +File /sys/block/zram<id>/bd_stat
> > > +
> > > +The stat file represents device's backing device statistics. It consists of
> > > +a single line of text and contains the following stats separated by whitespace:
> > > + bd_count size of data written in backing device.
> > > + Unit: pages
> > > + bd_reads the number of reads from backing device
> > > + Unit: pages
> > > + bd_writes the number of writes to backing device
> > > + Unit: pages
> >
> > Using `pages' is a bad choice. And I assume this means that
> > writeback_limit is in pages as well, which is worse.
> >
> > Page sizes are not constant! We want userspace which was developed on
> > 4k pagesize to work the same on 64k pagesize.
> >
> > Arguably, we could require that well-written userspace remember to use
> > getpagesize(). However we have traditionally tried to avoid that by
> > performing the pagesize normalization within the kernel.
>
> zram works in units of pages, so I used that term, but I agree it's
> rather vague. If there is no objection, I will use (Unit: 4K) instead
> of (Unit: pages).

Is that still true if PAGE_SIZE=64k?

2018-11-29 01:46:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] zram: add bd_stat statistics

On Wed, Nov 28, 2018 at 03:30:21PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2018 11:07:54 +0900 Minchan Kim <[email protected]> wrote:
>
> > On Mon, Nov 26, 2018 at 12:58:33PM -0800, Andrew Morton wrote:
> > > On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim <[email protected]> wrote:
> > >
> > > > +File /sys/block/zram<id>/bd_stat
> > > > +
> > > > +The stat file represents device's backing device statistics. It consists of
> > > > +a single line of text and contains the following stats separated by whitespace:
> > > > + bd_count size of data written in backing device.
> > > > + Unit: pages
> > > > + bd_reads the number of reads from backing device
> > > > + Unit: pages
> > > > + bd_writes the number of writes to backing device
> > > > + Unit: pages
> > >
> > > Using `pages' is a bad choice. And I assume this means that
> > > writeback_limit is in pages as well, which is worse.
> > >
> > > Page sizes are not constant! We want userspace which was developed on
> > > 4k pagesize to work the same on 64k pagesize.
> > >
> > > Arguably, we could require that well-written userspace remember to use
> > > getpagesize(). However we have traditionally tried to avoid that by
> > > performing the pagesize normalization within the kernel.
> >
> > zram works in units of pages, so I used that term, but I agree it's
> > rather vague. If there is no objection, I will use (Unit: 4K) instead
> > of (Unit: pages).
>
> Is that still true if PAGE_SIZE=64k?

Oops, right. I will fix that.