2019-03-29 07:58:35

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 0/7] Extend write-hint for in-kernel use

Towards extending write-hints/streams infrastucture so that file-systems
and other kernel-mode components can use exclusive hints. This patchset also
modifies Ext4/JBD2 to use new in-kernel, write-hint with journal.

Here is the history and changelog -

Changes since v2:
- Introduce API in block layer so that drivers can register stream info. Added
new limit in request queue for this purpose.
- Block layer does the conversion from write-hint to stream-id.
- Any write-hint beyond registered limit turn to 0.
- New macro "WRITE_LIFE_KERN_MIN" can be used as base by kernel mode components.

Changes since v1:
- introduce four more hints for in-kernel use, as recommended by Dave chinner
& Jens axboe. This isolates kernel-mode hints from user-mode ones.
- remove mount-option to specify write-hint, as recommended by Jan kara &
Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
if device does not support stream.
- Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)

V2 patch:
https://patchwork.kernel.org/cover/10754405/
V1 patch:
https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2

Note: I am sorry about the fact this patchset is against 5.0.0-rc4, rather than
against linux-next.

Kanchan Joshi (7):
block: extend stream count for in-kernel use
block: introduce API to register stream information with block layer
block: add write-hint to stream-id conversion
nvme: register stream info with block layer
fs: introduce APIs to enable sending write-hint with buffer-head
fs: introduce write-hint start point for in-kernel hints
fs/ext4,jbd2: add support for passing write-hint with journal

block/blk-core.c | 20 ++++++++++++++++++++
block/blk-settings.c | 12 ++++++++++++
drivers/nvme/host/core.c | 19 ++++++-------------
fs/buffer.c | 18 ++++++++++++++++--
fs/ext4/ext4_jbd2.h | 1 +
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/blkdev.h | 9 +++++++--
include/linux/buffer_head.h | 3 +++
include/linux/fs.h | 2 ++
include/linux/jbd2.h | 8 ++++++++
13 files changed, 88 insertions(+), 23 deletions(-)

--
2.7.4



2019-03-29 07:58:42

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 1/7] block: extend stream count for in-kernel use

This patch bumps up stream count to suppor in-kernel hints. It also
adds 'streamid' member in 'request' and declares new API for driver
to register stream-info with block layer.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/blkdev.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604d..4088e21 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -214,6 +214,7 @@ struct request {
#endif

unsigned short write_hint;
+ unsigned short streamid;
unsigned short ioprio;

void *special; /* opaque pointer available for LLD use */
@@ -341,6 +342,8 @@ struct queue_limits {
unsigned char misaligned;
unsigned char discard_misaligned;
unsigned char raid_partial_stripes_expensive;
+
+ unsigned short nr_streams;
enum blk_zoned_model zoned;
};

@@ -567,8 +570,9 @@ struct request_queue {
size_t cmd_size;

struct work_struct release_work;
-
-#define BLK_MAX_WRITE_HINTS 5
+#define BLK_MAX_USER_HINTS (WRITE_LIFE_KERN_MIN - 2)
+#define BLK_MAX_KERN_HINTS 4
+#define BLK_MAX_WRITE_HINTS (1 + BLK_MAX_USER_HINTS + BLK_MAX_KERN_HINTS)
u64 write_hints[BLK_MAX_WRITE_HINTS];
};

@@ -1071,6 +1075,7 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern void blk_queue_stream_limits(struct request_queue *, unsigned short);

/*
* Number of physical segments as sent to the device.
--
2.7.4


2019-03-29 07:58:47

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 2/7] block: introduce API to register stream information with block layer

This associates stream limit (count of streams exposed by driver) with
request queue.

Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-settings.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3e7038e..bb0da61 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -231,6 +231,18 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
EXPORT_SYMBOL(blk_queue_max_discard_sectors);

/**
+ * blk_queue_stream_limits - set stream parameters
+ * @q: the request queue for the device
+ * @nr_streams: number of streams supported by device
+ **/
+void blk_queue_stream_limits(struct request_queue *q,
+ unsigned short nr_streams)
+{
+ q->limits.nr_streams = nr_streams;
+}
+EXPORT_SYMBOL(blk_queue_stream_limits);
+
+/**
* blk_queue_max_write_same_sectors - set max sectors for a single write same
* @q: the request queue for the device
* @max_write_same_sectors: maximum number of sectors to write per command
--
2.7.4


2019-03-29 07:58:58

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 3/7] block: add write-hint to stream-id conversion

Earlier this conversion was done by driver (nvme). Current conversion
is of the form "streamid = write-hint - 1", for both user and kernel
streams (note that existing infra takes care that user-streams do not
bump into kernel ones). Conversion takes stream limit (maintained in
request queue) into account. Write-hints beyond the queue-limit turn
to 0.
New field 'streamid' has been added in request. While 'write-hint' field
continues to exist. It keeps original value passed from upper layer, and
used during merging checks.

Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-core.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c5f61c..c86daed 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -727,6 +727,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
return false;
}

+enum rw_hint blk_write_hint_to_streamid(struct request *req,
+ struct bio *bio)
+{
+ enum rw_hint streamid, nr_streams;
+ struct request_queue *q = req->q;
+ nr_streams = q->limits.nr_streams;
+
+ streamid = bio->bi_write_hint;
+ if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
+ streamid == WRITE_LIFE_NONE)
+ streamid = 0;
+ else {
+ --streamid;
+ if(streamid > nr_streams)
+ streamid = 0;
+ }
+ return streamid;
+}
+
void blk_init_request_from_bio(struct request *req, struct bio *bio)
{
if (bio->bi_opf & REQ_RAHEAD)
@@ -735,6 +754,7 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
req->__sector = bio->bi_iter.bi_sector;
req->ioprio = bio_prio(bio);
req->write_hint = bio->bi_write_hint;
+ req->streamid = blk_write_hint_to_streamid(req, bio);
blk_rq_bio_prep(req->q, req, bio);
}
EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
--
2.7.4


2019-03-29 07:59:02

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 4/7] nvme: register stream info with block layer

nvme registes number of streams with block layer, which will use that
for write-hint to streamid conversion. Registration is done for each
namespace. Since NVMe spec allow all available streams (within subsystem)
to be used by all namespaces, no attempt has been made to add reservation
at namespace level.

Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/core.c | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 150e497..e34386b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -513,14 +513,7 @@ static int nvme_configure_directives(struct nvme_ctrl *ctrl)
return ret;

ctrl->nssa = le16_to_cpu(s.nssa);
- if (ctrl->nssa < BLK_MAX_WRITE_HINTS - 1) {
- dev_info(ctrl->device, "too few streams (%u) available\n",
- ctrl->nssa);
- nvme_disable_streams(ctrl);
- return 0;
- }
-
- ctrl->nr_streams = min_t(unsigned, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1);
+ ctrl->nr_streams = ctrl->nssa;
dev_info(ctrl->device, "Using %u streams\n", ctrl->nr_streams);
return 0;
}
@@ -533,12 +526,9 @@ static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
struct request *req, u16 *control,
u32 *dsmgmt)
{
- enum rw_hint streamid = req->write_hint;
+ enum rw_hint streamid = req->streamid;

- if (streamid == WRITE_LIFE_NOT_SET || streamid == WRITE_LIFE_NONE)
- streamid = 0;
- else {
- streamid--;
+ if (streamid != 0) {
if (WARN_ON_ONCE(streamid > ctrl->nr_streams))
return;

@@ -3138,6 +3128,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
{
struct streams_directive_params s;
int ret;
+ u16 nr_streams;

if (!ctrl->nr_streams)
return 0;
@@ -3149,6 +3140,8 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
ns->sws = le32_to_cpu(s.sws);
ns->sgs = le16_to_cpu(s.sgs);

+ nr_streams = min_t(unsigned, ctrl->nr_streams, BLK_MAX_WRITE_HINTS - 1);
+ blk_queue_stream_limits(ns->queue, nr_streams);
if (ns->sws) {
unsigned int bs = 1 << ns->lba_shift;

--
2.7.4


2019-03-29 07:59:08

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 6/7] fs: introduce write-hint start point for in-kernel hints

kernel-mode components can define own write-hints using
"WRITE_LIFE_KERN_MIN" as base.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2c..6a2673e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -291,6 +291,8 @@ enum rw_hint {
WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
+/* Kernel should use write-hint starting from this */
+ WRITE_LIFE_KERN_MIN,
};

#define IOCB_EVENTFD (1 << 0)
--
2.7.4


2019-03-29 07:59:15

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 5/7] fs: introduce APIs to enable sending write-hint with buffer-head

submit_bh and write_dirty_buffer do not take write-hint as parameter.
This introduces variants which do, and pass write-hint down.

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/buffer.c | 18 ++++++++++++++++--
include/linux/buffer_head.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 52d024b..19cf910 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3094,6 +3094,13 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
}
EXPORT_SYMBOL(submit_bh);

+int submit_bh_write_hint(int op, int op_flags, struct buffer_head *bh,
+ enum rw_hint hint)
+{
+ return submit_bh_wbc(op, op_flags, bh, hint, NULL);
+}
+EXPORT_SYMBOL(submit_bh_write_hint);
+
/**
* ll_rw_block: low-level access to block devices (DEPRECATED)
* @op: whether to %READ or %WRITE
@@ -3151,6 +3158,13 @@ EXPORT_SYMBOL(ll_rw_block);

void write_dirty_buffer(struct buffer_head *bh, int op_flags)
{
+ write_dirty_buffer_with_hint(bh, op_flags, 0);
+}
+EXPORT_SYMBOL(write_dirty_buffer);
+
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint)
+{
lock_buffer(bh);
if (!test_clear_buffer_dirty(bh)) {
unlock_buffer(bh);
@@ -3158,9 +3172,9 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
}
bh->b_end_io = end_buffer_write_sync;
get_bh(bh);
- submit_bh(REQ_OP_WRITE, op_flags, bh);
+ submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
}
-EXPORT_SYMBOL(write_dirty_buffer);
+EXPORT_SYMBOL(write_dirty_buffer_with_hint);

/*
* For a data-integrity writeout, we need to wait upon any in-progress I/O
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7..3d682ac 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -200,7 +200,10 @@ void ll_rw_block(int, int, int, struct buffer_head * bh[]);
int sync_dirty_buffer(struct buffer_head *bh);
int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
void write_dirty_buffer(struct buffer_head *bh, int op_flags);
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint);
int submit_bh(int, int, struct buffer_head *);
+int submit_bh_write_hint(int, int, struct buffer_head *, enum rw_hint hint);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
int bh_uptodate_or_lock(struct buffer_head *bh);
--
2.7.4


2019-03-29 07:59:18

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint with journal

For NAND based SSDs, mixing of data with different life-time reduces
efficiency of internal garbage-collection. During FS operations, series
of journal updates will follow/precede series of data/meta updates, causing
intermixing inside SSD. By passing a write-hint with journal, its write
can be isolated from other data/meta writes, leading to endurance/performance
benefit on SSD.

This patch introduces "j_writehint" member in JBD2 journal, using which
Ext4 specifies write-hint (as SHORT) for journal

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/ext4/ext4_jbd2.h | 1 +
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/jbd2.h | 8 ++++++++
6 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 15b6dd7..b589ca4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -16,6 +16,7 @@
#include <linux/jbd2.h>
#include "ext4.h"

+#define EXT4_JOURNAL_WRITE_HINT (WRITE_LIFE_KERN_MIN)
#define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)

/* Define the number of blocks we need to account to a transaction to
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index fb12d3c..9c2c73e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4289,6 +4289,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

+ sbi->s_journal->j_writehint = EXT4_JOURNAL_WRITE_HINT;
+
sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;

no_journal:
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2eb55c3..6da4c28 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,

if (journal->j_flags & JBD2_BARRIER &&
!jbd2_has_feature_async_commit(journal))
- ret = submit_bh(REQ_OP_WRITE,
- REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE,
+ REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
+ journal->j_writehint);
else
- ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
+ journal->j_writehint);

*cbh = bh;
return ret;
@@ -711,7 +713,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;
- submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC,
+ bh, journal->j_writehint);
}
cond_resched();
stats.run.rs_blocks_logged += bufs;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ef6b6d..804dc2c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1384,7 +1384,8 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
jbd2_superblock_csum_set(journal, sb);
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
- ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh,
+ journal->j_writehint);
wait_on_buffer(bh);
if (buffer_write_io_error(bh)) {
clear_buffer_write_io_error(bh);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index a1143e5..376b1d8 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -642,7 +642,8 @@ static void flush_descriptor(journal_t *journal,
set_buffer_jwrite(descriptor);
BUFFER_TRACE(descriptor, "write");
set_buffer_dirty(descriptor);
- write_dirty_buffer(descriptor, REQ_SYNC);
+ write_dirty_buffer_with_hint(descriptor, REQ_SYNC,
+ journal->j_writehint);
}
#endif

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0f919d5..918f21e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1139,6 +1139,14 @@ struct journal_s
*/
__u32 j_csum_seed;

+ /**
+ * @j_writehint:
+ *
+ * write-hint for journal (set by FS).
+ */
+ enum rw_hint j_writehint;
+
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/**
* @j_trans_commit_map:
--
2.7.4


2019-03-29 16:59:20

by Heitke, Kenneth

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] nvme: register stream info with block layer


On 3/29/2019 1:53 AM, Kanchan Joshi wrote:
> nvme registes number of streams with block layer, which will use that

s/registes/registers/ ??

> for write-hint to streamid conversion. Registration is done for each
> namespace. Since NVMe spec allow all available streams (within subsystem)
> to be used by all namespaces, no attempt has been made to add reservation
> at namespace level.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> drivers/nvme/host/core.c | 19 ++++++-------------
> 1 file changed, 6 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 150e497..e34386b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -513,14 +513,7 @@ static int nvme_configure_directives(struct nvme_ctrl *ctrl)
> return ret;
>
> ctrl->nssa = le16_to_cpu(s.nssa);
> - if (ctrl->nssa < BLK_MAX_WRITE_HINTS - 1) {
> - dev_info(ctrl->device, "too few streams (%u) available\n",
> - ctrl->nssa);
> - nvme_disable_streams(ctrl);
> - return 0;
> - }
> -
> - ctrl->nr_streams = min_t(unsigned, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1);
> + ctrl->nr_streams = ctrl->nssa;
> dev_info(ctrl->device, "Using %u streams\n", ctrl->nr_streams);
> return 0;
> }
> @@ -533,12 +526,9 @@ static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
> struct request *req, u16 *control,
> u32 *dsmgmt)
> {
> - enum rw_hint streamid = req->write_hint;
> + enum rw_hint streamid = req->streamid;
>
> - if (streamid == WRITE_LIFE_NOT_SET || streamid == WRITE_LIFE_NONE)
> - streamid = 0;
> - else {
> - streamid--;
> + if (streamid != 0) {
> if (WARN_ON_ONCE(streamid > ctrl->nr_streams))
> return;
>
> @@ -3138,6 +3128,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
> {
> struct streams_directive_params s;
> int ret;
> + u16 nr_streams;
>
> if (!ctrl->nr_streams)
> return 0;
> @@ -3149,6 +3140,8 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
> ns->sws = le32_to_cpu(s.sws);
> ns->sgs = le16_to_cpu(s.sgs);
>
> + nr_streams = min_t(unsigned, ctrl->nr_streams, BLK_MAX_WRITE_HINTS - 1);
> + blk_queue_stream_limits(ns->queue, nr_streams);
> if (ns->sws) {
> unsigned int bs = 1 << ns->lba_shift;
>

2019-03-30 17:49:37

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] block: extend stream count for in-kernel use

On Mar 29, 2019, at 1:53 AM, Kanchan Joshi <[email protected]> wrote:
>
> This patch bumps up stream count to suppor in-kernel hints. It also
> adds 'streamid' member in 'request' and declares new API for driver
> to register stream-info with block layer.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> include/linux/blkdev.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 338604d..4088e21 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -214,6 +214,7 @@ struct request {
> #endif
>
> unsigned short write_hint;
> + unsigned short streamid;
> unsigned short ioprio;
>
> void *special; /* opaque pointer available for LLD use */
> @@ -341,6 +342,8 @@ struct queue_limits {
> unsigned char misaligned;
> unsigned char discard_misaligned;
> unsigned char raid_partial_stripes_expensive;
> +
> + unsigned short nr_streams;
> enum blk_zoned_model zoned;
> };
>
> @@ -567,8 +570,9 @@ struct request_queue {
> size_t cmd_size;
>
> struct work_struct release_work;
> -
> -#define BLK_MAX_WRITE_HINTS 5
> +#define BLK_MAX_USER_HINTS (WRITE_LIFE_KERN_MIN - 2)
> +#define BLK_MAX_KERN_HINTS 4
> +#define BLK_MAX_WRITE_HINTS (1 + BLK_MAX_USER_HINTS + BLK_MAX_KERN_HINTS)
> u64 write_hints[BLK_MAX_WRITE_HINTS];
> };
>
> @@ -1071,6 +1075,7 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
> extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
> extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
> extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
> +extern void blk_queue_stream_limits(struct request_queue *, unsigned short);

This declaration should be in the 2/7 patch where the function is added.


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2019-03-30 17:51:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint with journal

On Mar 29, 2019, at 1:53 AM, Kanchan Joshi <[email protected]> wrote:
>
> For NAND based SSDs, mixing of data with different life-time reduces
> efficiency of internal garbage-collection. During FS operations, series
> of journal updates will follow/precede series of data/meta updates, causing
> intermixing inside SSD. By passing a write-hint with journal, its write
> can be isolated from other data/meta writes, leading to endurance/performance
> benefit on SSD.
>
> This patch introduces "j_writehint" member in JBD2 journal, using which
> Ext4 specifies write-hint (as SHORT) for journal

The comment here says the "WRITE_LIFE_SHORT" hint is used for the journal,
but the code uses WRITE_LIFE_KERN_MIN. However, it seems that "MIN" will
be mapped to "NONE" if it exceeds the number of streams available in the
underlying device. It would be better to use "SHORT" if there are not
enough streams available.

It should call blk_queue_stream_limits() to see if there are extra stream
IDs available, and fall back to WRITE_LIFE_SHORT if not.

Cheers, Andreas

> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> fs/ext4/ext4_jbd2.h | 1 +
> fs/ext4/super.c | 2 ++
> fs/jbd2/commit.c | 11 +++++++----
> fs/jbd2/journal.c | 3 ++-
> fs/jbd2/revoke.c | 3 ++-
> include/linux/jbd2.h | 8 ++++++++
> 6 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 15b6dd7..b589ca4 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -16,6 +16,7 @@
> #include <linux/jbd2.h>
> #include "ext4.h"
>
> +#define EXT4_JOURNAL_WRITE_HINT (WRITE_LIFE_KERN_MIN)
> #define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)
>
> /* Define the number of blocks we need to account to a transaction to
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index fb12d3c..9c2c73e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4289,6 +4289,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
> set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
>
> + sbi->s_journal->j_writehint = EXT4_JOURNAL_WRITE_HINT;
> +
> sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>
> no_journal:
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 2eb55c3..6da4c28 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,
>
> if (journal->j_flags & JBD2_BARRIER &&
> !jbd2_has_feature_async_commit(journal))
> - ret = submit_bh(REQ_OP_WRITE,
> - REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
> + ret = submit_bh_write_hint(REQ_OP_WRITE,
> + REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
> + journal->j_writehint);
> else
> - ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> + ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
> + journal->j_writehint);
>
> *cbh = bh;
> return ret;
> @@ -711,7 +713,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> clear_buffer_dirty(bh);
> set_buffer_uptodate(bh);
> bh->b_end_io = journal_end_buffer_io_sync;
> - submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> + submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC,
> + bh, journal->j_writehint);
> }
> cond_resched();
> stats.run.rs_blocks_logged += bufs;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 8ef6b6d..804dc2c 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1384,7 +1384,8 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
> jbd2_superblock_csum_set(journal, sb);
> get_bh(bh);
> bh->b_end_io = end_buffer_write_sync;
> - ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
> + ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh,
> + journal->j_writehint);
> wait_on_buffer(bh);
> if (buffer_write_io_error(bh)) {
> clear_buffer_write_io_error(bh);
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index a1143e5..376b1d8 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -642,7 +642,8 @@ static void flush_descriptor(journal_t *journal,
> set_buffer_jwrite(descriptor);
> BUFFER_TRACE(descriptor, "write");
> set_buffer_dirty(descriptor);
> - write_dirty_buffer(descriptor, REQ_SYNC);
> + write_dirty_buffer_with_hint(descriptor, REQ_SYNC,
> + journal->j_writehint);
> }
> #endif
>
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 0f919d5..918f21e 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1139,6 +1139,14 @@ struct journal_s
> */
> __u32 j_csum_seed;
>
> + /**
> + * @j_writehint:
> + *
> + * write-hint for journal (set by FS).
> + */
> + enum rw_hint j_writehint;
> +
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> /**
> * @j_trans_commit_map:
> --
> 2.7.4
>


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2019-04-01 05:03:55

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] block: extend stream count for in-kernel use

On Fri, Mar 29, 2019 at 01:23:46PM +0530, Kanchan Joshi wrote:
> This patch bumps up stream count to suppor in-kernel hints. It also
> adds 'streamid' member in 'request' and declares new API for driver
> to register stream-info with block layer.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> include/linux/blkdev.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 338604d..4088e21 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -214,6 +214,7 @@ struct request {
> #endif
>
> unsigned short write_hint;
> + unsigned short streamid;
> unsigned short ioprio;
>
> void *special; /* opaque pointer available for LLD use */
> @@ -341,6 +342,8 @@ struct queue_limits {
> unsigned char misaligned;
> unsigned char discard_misaligned;
> unsigned char raid_partial_stripes_expensive;
> +
> + unsigned short nr_streams;
> enum blk_zoned_model zoned;
> };
>
> @@ -567,8 +570,9 @@ struct request_queue {
> size_t cmd_size;
>
> struct work_struct release_work;
> -
> -#define BLK_MAX_WRITE_HINTS 5
> +#define BLK_MAX_USER_HINTS (WRITE_LIFE_KERN_MIN - 2)

WRITE_LIFE_KERN_MIN is undefined.

-Dave.

> +#define BLK_MAX_KERN_HINTS 4
> +#define BLK_MAX_WRITE_HINTS (1 + BLK_MAX_USER_HINTS + BLK_MAX_KERN_HINTS)
> u64 write_hints[BLK_MAX_WRITE_HINTS];
> };
>
> @@ -1071,6 +1075,7 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
> extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
> extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
> extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
> +extern void blk_queue_stream_limits(struct request_queue *, unsigned short);
>
> /*
> * Number of physical segments as sent to the device.
> --
> 2.7.4
>
>

--
Dave Chinner
[email protected]

2019-04-01 05:09:19

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] block: add write-hint to stream-id conversion

On Fri, Mar 29, 2019 at 01:23:48PM +0530, Kanchan Joshi wrote:
> Earlier this conversion was done by driver (nvme). Current conversion
> is of the form "streamid = write-hint - 1", for both user and kernel
> streams (note that existing infra takes care that user-streams do not
> bump into kernel ones).

Unless we add new user streams, then all the kernel streams change
ID. I'll deal with this in more detail in a later patch.

> Conversion takes stream limit (maintained in
> request queue) into account. Write-hints beyond the queue-limit turn
> to 0.
> New field 'streamid' has been added in request. While 'write-hint' field
> continues to exist. It keeps original value passed from upper layer, and
> used during merging checks.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> block/blk-core.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3c5f61c..c86daed 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -727,6 +727,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> return false;
> }
>
> +enum rw_hint blk_write_hint_to_streamid(struct request *req,
> + struct bio *bio)
> +{
> + enum rw_hint streamid, nr_streams;
> + struct request_queue *q = req->q;
> + nr_streams = q->limits.nr_streams;
> +
> + streamid = bio->bi_write_hint;
> + if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
> + streamid == WRITE_LIFE_NONE)
> + streamid = 0;
> + else {
> + --streamid;

What's this magic thing do?

> + if(streamid > nr_streams)
> + streamid = 0;

So, basically, we'll compress all the kernel hints down to "no hint"
if there are more user streams than the device supports?

Surely we should be reserving a stream for the kernel hints separate
from the user and "none" streams when we have limited device streams
available...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-04-01 05:13:52

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: introduce write-hint start point for in-kernel hints

On Fri, Mar 29, 2019 at 01:23:51PM +0530, Kanchan Joshi wrote:
> kernel-mode components can define own write-hints using
> "WRITE_LIFE_KERN_MIN" as base.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> include/linux/fs.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 29d8e2c..6a2673e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -291,6 +291,8 @@ enum rw_hint {
> WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
> WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
> WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
> +/* Kernel should use write-hint starting from this */
> + WRITE_LIFE_KERN_MIN,

Which means that when a new userspace hint is defined, all the
kernel hints change numbers and, AIUI, that changes how the kernel
hints are mapped to the underlying device.

The kernel hints need to be mapped to the highest supported number a
work down, while userspace starts at the lowest and works up. The
"kernel to device stream id" needs to translate the kernel hints
down to the upper range of the device hints.

I think the mapping range the code uses should be:

HINT Type device
0 USER 0 0
1 USER 1 1
......
n USER MAX n

{n,65535-m} UNUSED {n,dev_max-m}

65535 - m KERN_MIN, dev_max - m
......
65532 KERN 3 dev_max - 3
65533 KERN 2 dev_max - 2
65534 KERN 1 dev_max - 1
65535 KERN 0 dev_max

i.e. if you look at the mapping as a signed short, >= 0 are user
hints, < 0 are kernel hints. This provides an obvious, simple way
to map the kernel hints to the upper range of the device hint
range. It also provides a simple way to compress both user and
kernel hints into a limited device hint range - kernel always uses
the top device hint, user is limited to the rest of the range....

This means the ranges don't overlap or change at either the
code or the device level as we add more user and kernel hint
channels in the future.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-04-02 09:09:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint with journal

On Sat 30-03-19 11:49:54, Andreas Dilger wrote:
> On Mar 29, 2019, at 1:53 AM, Kanchan Joshi <[email protected]> wrote:
> >
> > For NAND based SSDs, mixing of data with different life-time reduces
> > efficiency of internal garbage-collection. During FS operations, series
> > of journal updates will follow/precede series of data/meta updates, causing
> > intermixing inside SSD. By passing a write-hint with journal, its write
> > can be isolated from other data/meta writes, leading to endurance/performance
> > benefit on SSD.
> >
> > This patch introduces "j_writehint" member in JBD2 journal, using which
> > Ext4 specifies write-hint (as SHORT) for journal
>
> The comment here says the "WRITE_LIFE_SHORT" hint is used for the journal,
> but the code uses WRITE_LIFE_KERN_MIN. However, it seems that "MIN" will
> be mapped to "NONE" if it exceeds the number of streams available in the
> underlying device. It would be better to use "SHORT" if there are not
> enough streams available.
>
> It should call blk_queue_stream_limits() to see if there are extra stream
> IDs available, and fall back to WRITE_LIFE_SHORT if not.

I disagree. I'd first keep the behavior implemented in this patch to keep
things simple. Later if we decide more smarts are needed when SSDs don't
have enough hints available, we can always add them. But this patch either
keeps the current behavior (i.e., no hint) or improves the situation by
providing a special hint. So it is a clear win. I'm not so convinced using
WRITE_LIFE_SHORT is always a win when userspace's idea of "short" is
different from the kernel's idea of "short"...

Honza
>
> Cheers, Andreas
>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > ---
> > fs/ext4/ext4_jbd2.h | 1 +
> > fs/ext4/super.c | 2 ++
> > fs/jbd2/commit.c | 11 +++++++----
> > fs/jbd2/journal.c | 3 ++-
> > fs/jbd2/revoke.c | 3 ++-
> > include/linux/jbd2.h | 8 ++++++++
> > 6 files changed, 22 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 15b6dd7..b589ca4 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -16,6 +16,7 @@
> > #include <linux/jbd2.h>
> > #include "ext4.h"
> >
> > +#define EXT4_JOURNAL_WRITE_HINT (WRITE_LIFE_KERN_MIN)
> > #define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)
> >
> > /* Define the number of blocks we need to account to a transaction to
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index fb12d3c..9c2c73e 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -4289,6 +4289,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >
> > set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
> >
> > + sbi->s_journal->j_writehint = EXT4_JOURNAL_WRITE_HINT;
> > +
> > sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
> >
> > no_journal:
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 2eb55c3..6da4c28 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,
> >
> > if (journal->j_flags & JBD2_BARRIER &&
> > !jbd2_has_feature_async_commit(journal))
> > - ret = submit_bh(REQ_OP_WRITE,
> > - REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
> > + ret = submit_bh_write_hint(REQ_OP_WRITE,
> > + REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
> > + journal->j_writehint);
> > else
> > - ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> > + ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
> > + journal->j_writehint);
> >
> > *cbh = bh;
> > return ret;
> > @@ -711,7 +713,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > clear_buffer_dirty(bh);
> > set_buffer_uptodate(bh);
> > bh->b_end_io = journal_end_buffer_io_sync;
> > - submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> > + submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC,
> > + bh, journal->j_writehint);
> > }
> > cond_resched();
> > stats.run.rs_blocks_logged += bufs;
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 8ef6b6d..804dc2c 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -1384,7 +1384,8 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
> > jbd2_superblock_csum_set(journal, sb);
> > get_bh(bh);
> > bh->b_end_io = end_buffer_write_sync;
> > - ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
> > + ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh,
> > + journal->j_writehint);
> > wait_on_buffer(bh);
> > if (buffer_write_io_error(bh)) {
> > clear_buffer_write_io_error(bh);
> > diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> > index a1143e5..376b1d8 100644
> > --- a/fs/jbd2/revoke.c
> > +++ b/fs/jbd2/revoke.c
> > @@ -642,7 +642,8 @@ static void flush_descriptor(journal_t *journal,
> > set_buffer_jwrite(descriptor);
> > BUFFER_TRACE(descriptor, "write");
> > set_buffer_dirty(descriptor);
> > - write_dirty_buffer(descriptor, REQ_SYNC);
> > + write_dirty_buffer_with_hint(descriptor, REQ_SYNC,
> > + journal->j_writehint);
> > }
> > #endif
> >
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 0f919d5..918f21e 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1139,6 +1139,14 @@ struct journal_s
> > */
> > __u32 j_csum_seed;
> >
> > + /**
> > + * @j_writehint:
> > + *
> > + * write-hint for journal (set by FS).
> > + */
> > + enum rw_hint j_writehint;
> > +
> > +
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > /**
> > * @j_trans_commit_map:
> > --
> > 2.7.4
> >
>
>
> Cheers, Andreas
>
>
>
>
>


--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-02 09:38:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] block: add write-hint to stream-id conversion

On Mon 01-04-19 16:08:21, Dave Chinner wrote:
> On Fri, Mar 29, 2019 at 01:23:48PM +0530, Kanchan Joshi wrote:
> > + if(streamid > nr_streams)
> > + streamid = 0;
>
> So, basically, we'll compress all the kernel hints down to "no hint"
> if there are more user streams than the device supports?
>
> Surely we should be reserving a stream for the kernel hints separate
> from the user and "none" streams when we have limited device streams
> available...

The question is what to do in a situation when the device has exactly as
many hints as we currently offer to userspace. Because currently either the
device has enough hints for all userspace hint values or we disable the
feature altogether. If we always mandated some hints are available for the
kernel, we'd have to regress some fuctionality currently available to
userspace. So I think that the option that the kernel won't get any hints
is the least painful solution. Later when people would like to extend hints
available to userspace, we could make sure kernel's batch of hints has
priority over these "extended userspace hints".

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-02 20:36:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] block: add write-hint to stream-id conversion

On Tue, Apr 02, 2019 at 11:20:44AM +0200, Jan Kara wrote:
> On Mon 01-04-19 16:08:21, Dave Chinner wrote:
> > On Fri, Mar 29, 2019 at 01:23:48PM +0530, Kanchan Joshi wrote:
> > > + if(streamid > nr_streams)
> > > + streamid = 0;
> >
> > So, basically, we'll compress all the kernel hints down to "no hint"
> > if there are more user streams than the device supports?
> >
> > Surely we should be reserving a stream for the kernel hints separate
> > from the user and "none" streams when we have limited device streams
> > available...
>
> The question is what to do in a situation when the device has exactly as
> many hints as we currently offer to userspace.

Then do what we do now for that case. For every other case, the
kernel should have reserved space and not get intermingled with
userspace hints.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-04-03 04:08:39

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint with journal


Kanchan,

> For NAND based SSDs, mixing of data with different life-time reduces
> efficiency of internal garbage-collection. During FS operations,
> series of journal updates will follow/precede series of data/meta
> updates, causing intermixing inside SSD. By passing a write-hint with
> journal, its write can be isolated from other data/meta writes,
> leading to endurance/performance benefit on SSD.

Why not just introduce REQ_JOURNAL and let the device driver decide how
to turn that into something appropriate for the device?

That's what I'll need for SCSI. Existing SCSI streams are not a good
fit.

--
Martin K. Petersen Oracle Linux Engineering

2019-04-03 09:37:05

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] block: add write-hint to stream-id conversion

On Wed 03-04-19 07:35:08, Dave Chinner wrote:
> On Tue, Apr 02, 2019 at 11:20:44AM +0200, Jan Kara wrote:
> > On Mon 01-04-19 16:08:21, Dave Chinner wrote:
> > > On Fri, Mar 29, 2019 at 01:23:48PM +0530, Kanchan Joshi wrote:
> > > > + if(streamid > nr_streams)
> > > > + streamid = 0;
> > >
> > > So, basically, we'll compress all the kernel hints down to "no hint"
> > > if there are more user streams than the device supports?
> > >
> > > Surely we should be reserving a stream for the kernel hints separate
> > > from the user and "none" streams when we have limited device streams
> > > available...
> >
> > The question is what to do in a situation when the device has exactly as
> > many hints as we currently offer to userspace.
>
> Then do what we do now for that case. For every other case, the
> kernel should have reserved space and not get intermingled with
> userspace hints.

Yup, we are on the same page then.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-03 13:44:10

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint with journal

Hi Martin,

> Why not just introduce REQ_JOURNAL and let the device driver decide how to
turn that into something appropriate for the device?

It began with that kind of thought/goal i.e. introduce something just for
FS journal. But it seems to have evolved for good.
Current approach extends write-hint infra so that whole thing becomes
extensible for other kind of use-cases (than FS journal) as well.
Also in this approach, driver will do little, while block-layer will do
majority of the work.

> That's what I'll need for SCSI. Existing SCSI streams are not a good fit.

Do you see that it's difficult for SCSI to use write-hint infrastructure for
streams?



-----Original Message-----
From: Martin K. Petersen [mailto:[email protected]]
Sent: Wednesday, April 03, 2019 8:28 AM
To: Kanchan Joshi <[email protected]>
Cc: [email protected]; [email protected];
[email protected]; [email protected];
[email protected]; [email protected]; [email protected];
[email protected]; [email protected]
Subject: Re: [PATCH v3 7/7] fs/ext4,jbd2: add support for passing write-hint
with journal


Kanchan,

> For NAND based SSDs, mixing of data with different life-time reduces
> efficiency of internal garbage-collection. During FS operations,
> series of journal updates will follow/precede series of data/meta
> updates, causing intermixing inside SSD. By passing a write-hint with
> journal, its write can be isolated from other data/meta writes,
> leading to endurance/performance benefit on SSD.

Why not just introduce REQ_JOURNAL and let the device driver decide how to
turn that into something appropriate for the device?

That's what I'll need for SCSI. Existing SCSI streams are not a good fit.

--
Martin K. Petersen Oracle Linux Engineering

2019-04-03 14:31:58

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v3 6/7] fs: introduce write-hint start point for in-kernel hints

> Which means that when a new userspace hint is defined, all the kernel
hints change numbers and, AIUI, that changes how the kernel hints are mapped
to the underlying device.

Currently adding a new user-space hint requires modifying code and
installing modified kernel. So I felt it would be less probable to encounter
that situation while in production workload.


>The kernel hints need to be mapped to the highest supported number a work
down, while userspace starts at the lowest and works up.

Actually, I initially implemented "blk_write_hint_to_streamid" function like
that i.e. as per the table you've put. But that code involved more
checks/branches (condition checks) than the current one.
Also, request queue contained this statically defined array called
"write_hints", which nvme driver updated to gather stream stats.
Snippet below -

if (streamid < ARRAY_SIZE(req->q->write_hints))
req->q->write_hints[streamid] += blk_rq_bytes(req) >> 9;

That requires nvme driver doing a reverse conversion from streamid to
array-index(some more conditional checks) if kernel-hints get mapped to
highest possible stream numbers.


Overall, will it not be about adding additional run-time checks in I/O path
(which we will always execute) for the condition which will happen only if
one chooses to extend user-space hint count in between?


Thanks,

-----Original Message-----
From: Dave Chinner [mailto:[email protected]]
Sent: Monday, April 01, 2019 10:43 AM
To: Kanchan Joshi <[email protected]>
Cc: [email protected]; [email protected];
[email protected]; [email protected];
[email protected]; [email protected]; [email protected];
[email protected]; [email protected]
Subject: Re: [PATCH v3 6/7] fs: introduce write-hint start point for
in-kernel hints

On Fri, Mar 29, 2019 at 01:23:51PM +0530, Kanchan Joshi wrote:
> kernel-mode components can define own write-hints using
> "WRITE_LIFE_KERN_MIN" as base.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> include/linux/fs.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h index
> 29d8e2c..6a2673e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -291,6 +291,8 @@ enum rw_hint {
> WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
> WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
> WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
> +/* Kernel should use write-hint starting from this */
> + WRITE_LIFE_KERN_MIN,

Which means that when a new userspace hint is defined, all the kernel hints
change numbers and, AIUI, that changes how the kernel hints are mapped to
the underlying device.

The kernel hints need to be mapped to the highest supported number a work
down, while userspace starts at the lowest and works up. The "kernel to
device stream id" needs to translate the kernel hints down to the upper
range of the device hints.

I think the mapping range the code uses should be:

HINT Type device
0 USER 0 0
1 USER 1 1
......
n USER MAX n

{n,65535-m} UNUSED {n,dev_max-m}

65535 - m KERN_MIN, dev_max - m
......
65532 KERN 3 dev_max - 3
65533 KERN 2 dev_max - 2
65534 KERN 1 dev_max - 1
65535 KERN 0 dev_max

i.e. if you look at the mapping as a signed short, >= 0 are user hints, < 0
are kernel hints. This provides an obvious, simple way to map the kernel
hints to the upper range of the device hint range. It also provides a simple
way to compress both user and kernel hints into a limited device hint range
- kernel always uses the top device hint, user is limited to the rest of the
range....

This means the ranges don't overlap or change at either the code or the
device level as we add more user and kernel hint channels in the future.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-04-03 14:50:49

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v3 3/7] block: add write-hint to stream-id conversion

> Then do what we do now for that case. For every other case, the kernel
> should have reserved space and not get intermingled with userspace
> hints.

I hope this means that we're fine with the current conversion approach.
As you would have noticed, current approach does not disable stream feature
based on dearth of streams.
It either exposes 8 streams (if device has equal or more than 8 streams) or
N streams (if N is less than 8).
For less than 8 streams case, user-space hints get priority over
kernel-space hints. But at any point of time, there is no intermingling.


Thanks,
Kanchan

-----Original Message-----
From: Jan Kara [mailto:[email protected]]
Sent: Wednesday, April 03, 2019 3:06 PM
To: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>; Kanchan Joshi <[email protected]>;
[email protected]; [email protected];
[email protected]; [email protected];
[email protected]; [email protected]; [email protected];
[email protected]; [email protected]
Subject: Re: [PATCH v3 3/7] block: add write-hint to stream-id conversion

On Wed 03-04-19 07:35:08, Dave Chinner wrote:
> On Tue, Apr 02, 2019 at 11:20:44AM +0200, Jan Kara wrote:
> > On Mon 01-04-19 16:08:21, Dave Chinner wrote:
> > > On Fri, Mar 29, 2019 at 01:23:48PM +0530, Kanchan Joshi wrote:
> > > > + if(streamid > nr_streams)
> > > > + streamid = 0;
> > >
> > > So, basically, we'll compress all the kernel hints down to "no hint"
> > > if there are more user streams than the device supports?
> > >
> > > Surely we should be reserving a stream for the kernel hints
> > > separate from the user and "none" streams when we have limited
> > > device streams available...
> >
> > The question is what to do in a situation when the device has
> > exactly as many hints as we currently offer to userspace.
>
> Then do what we do now for that case. For every other case, the kernel
> should have reserved space and not get intermingled with userspace
> hints.

Yup, we are on the same page then.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR